In this video, we will talk about tricks that will make training of new neural networks much faster. The first one is transfer learning. Remember that deep networks learn a complex feature extractor, but we need lots of data to train it from scratch.

Let's look at the ImageNet classification architecture. There are lots of convolutional layers, and we call them a feature extractor, because the last convolutional layer extracts features that are useful for classification with an MLP, the last pink layers. This architecture is trained on the ImageNet dataset. But what if we could reuse the existing feature extractor, the blue convolutions on the slide, for a new task? How do we do that? We add a new classifier on top of those features, and those orange weights are all we need to train. You need less data, because you train only the final MLP layers. It works if the domain of the new task is similar to ImageNet's. It won't work for human emotion classification, because ImageNet doesn't have human faces in the dataset, so it doesn't know the concept of a human face.

But what if we need to classify human emotions? Maybe we can partially reuse the ImageNet feature extractor. Let's look at the perfect feature extractor that we would need. It looks like a bunch of convolutional layers, so let's look at what activation stimuli we actually want to get. The first convolutional layers will have the highest activations for edge detectors with different rotations. If we go deeper, the convolutional layers learn the concept of a human eye, nose, or mouth. And if we go deeper than that, we have layers that learn the representation of a whole human face. That is the perfect feature extractor that we want; let's compare it with the ImageNet feature extractor. ImageNet definitely has those edge detectors as well, but, given that it doesn't have human faces in the dataset, it doesn't know the concept of a nose or a mouth, so we will need to train those layers ourselves.
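Before moving on, here is a minimal sketch of the plain transfer learning setup described above, assuming the Keras applications API with a VGG16 backbone; the 224x224 input size, the layer widths, and the 10-class target task are illustrative assumptions, not part of the lecture.

```python
# Transfer learning sketch: reuse a pre-trained ImageNet feature extractor
# (the "blue convolutions") and train only a new MLP head (the "orange weights").
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# Pre-trained feature extractor, without the original ImageNet classifier.
extractor = VGG16(weights='imagenet', include_top=False,
                  input_shape=(224, 224, 3))
extractor.trainable = False  # freeze: reuse the ImageNet features as-is

# New classifier on top; these are the only weights that get trained.
model = models.Sequential([
    extractor,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(10, activation='softmax'),  # hypothetical 10-class task
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, ...)  # only the Dense layers receive gradients
```

Because the frozen extractor contributes no trainable parameters, far less data is needed than when training the whole network from scratch.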
Let's look at how an architecture for human emotion classification might look. What we do is we actually reuse the first convolutional layers, which are in green, and we add new convolutional layers and a new multi-layer perceptron to train for our new task. All we need to train are those blue convolutions and orange fully connected layers. It works much better, because we have to train far fewer parameters.

What if we don't start from scratch and don't initialize those blue convolutions with random numbers, but rather use the initialization from a pre-trained ImageNet network? That leads us to the so-called fine-tuning technique: you don't start with a random initialization, but rather reuse those complex representations that are suitable for ImageNet classification. What is the intuition behind this? We don't start with random features; we start with features that are useful for some other task. They are not perfect for our task, but they might be much better than random. What we do next is propagate all the gradients, but with a smaller learning rate, so that we don't lose the initialization that we got from ImageNet.

Fine-tuning is very frequently used thanks to the wide spectrum of ImageNet classes. Keras, a deep learning framework, has the weights of pre-trained VGG, Inception, and ResNet architectures. What is so special about that is that you can fine-tune a bunch of different architectures and make an ensemble out of them. And you don't have to wait two or three weeks to train your network on the ImageNet dataset.

Let's summarize a little bit. If you have a small dataset and it is from the ImageNet domain, which means that you have objects that are somewhat similar to those seen in the ImageNet dataset, then all you need to do is use transfer learning and train the last MLP layers. If you have a bigger dataset, then it makes sense to fine-tune deeper layers, so that you squeeze a little bit more quality out of the network. If you have a big dataset but it's not similar to the ImageNet domain, then it makes sense to train from scratch, because most likely you can't reuse the features from ImageNet. But if you have a small dataset which is not similar to ImageNet, then you're out of luck, and most likely you will have to collect more data.
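To make the fine-tuning recipe concrete, here is a sketch that continues the Keras example above; unfreezing only the last VGG16 convolutional block and the 1e-5 learning rate are illustrative choices for this sketch, not fixed rules.

```python
# Fine-tuning sketch: start from the ImageNet initialization and propagate
# gradients into the deeper convolutions, but with a small learning rate
# so the pre-trained representation is refined rather than destroyed.
from tensorflow.keras.optimizers import Adam

extractor.trainable = True
for layer in extractor.layers:
    # Keep the early edge-detector layers frozen; adapt only the deepest block.
    layer.trainable = layer.name.startswith('block5')

model.compile(optimizer=Adam(learning_rate=1e-5),  # small step size
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, ...)  # block5 convolutions + the MLP are updated
```

With more data you can unfreeze more blocks; with less data, keep more of the network frozen, matching the summary above.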
In the next video, we will take a look at other computer vision problems that utilize convolutional networks.