If you're building a computer vision application, rather than training the weights from scratch, from random initialization, you often make much faster progress if you download weights that someone else has already trained on a network architecture, and use that as pre-training to transfer to a new task that you might be interested in. The computer vision research community has been pretty good at posting lots of data sets on the Internet, so if you hear of things like ImageNet, or MS COCO, or PASCAL types of data sets, these are the names of different data sets that people have posted online and that a lot of computer vision researchers have trained their algorithms on. Sometimes this training takes several weeks and might require many GPUs, and the fact that someone else has done this and gone through the painful hyperparameter search process means that you can often download open-source weights that took someone else many weeks or months to figure out, and use those as a very good initialization for your own neural network. In other words, use transfer learning to transfer knowledge from some of these very large public data sets to your own problem. Let's take a deeper look at how to do this.

Let's start with an example. Say you're building a cat detector to recognize your own pet cat. According to the Internet, Tigger is a common cat name and Misty is another common cat name. Let's say your cats are called Tigger and Misty, and there's also the case where a picture is neither. So you have a classification problem with three classes: is this picture Tigger, is it Misty, or is it neither? (Let's ignore the case of both of your cats appearing in the same picture.) Now, you probably don't have a lot of pictures of Tigger or Misty, so your training set will be small. What can you do? I recommend you go online and download some open-source implementation of a neural network, and download not just the code but also the weights. There are a lot of networks you can download that have been trained on, for example, the ImageNet data set, which has a thousand different classes, so the network might have a softmax unit that outputs one of a thousand possible classes. What you can do is get rid of that softmax layer and create your own softmax unit that outputs Tigger, Misty, or neither.
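As a rough illustration of that last step: the lecture doesn't name a framework, so the sketch below assumes Keras/TensorFlow, and the ResNet-50 architecture and the three-way output are just stand-ins for whatever open-source network and classes you actually use.

```python
import tensorflow as tf

# Download an open-source architecture together with weights pre-trained on ImageNet,
# dropping its original 1000-way softmax (include_top=False).
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")

# Replace it with our own 3-way softmax: Tigger, Misty, or neither.
outputs = tf.keras.layers.Dense(3, activation="softmax")(base.output)
model = tf.keras.Model(inputs=base.input, outputs=outputs)
```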
In terms of the network, I'd encourage you to think of all of these earlier layers as frozen: you freeze the parameters in all of those layers and then train just the parameters associated with your softmax layer, which is the softmax layer with three possible outputs, Tigger, Misty, or neither. By using someone else's pre-trained weights, you can probably get pretty good performance on this even with a small data set. Fortunately, a lot of deep learning frameworks support this mode of operation. Depending on the framework, it might have something like a trainable parameter that you set to zero for some of these early layers; in other frameworks you just tell it not to train those weights; or sometimes you have a parameter like freeze = 1. These are the different ways different deep learning frameworks let you specify whether or not to train the weights associated with a particular layer. In this case, you would train only the softmax layer's weights and freeze all of the earlier layers' weights.

One other neat trick that may help in some implementations is this: because all of these early layers are frozen, they amount to a fixed function that doesn't change (you're not training it) and that takes the input image X and maps it to some set of activations in that layer. So one trick that can speed up training is to pre-compute that layer, that is, the features or activations from that layer, and save them to disk. What you're doing is using this fixed function, the first part of the neural network, to take any input image X and compute a feature vector for it, and then training a shallow softmax model from that feature vector to make a prediction. One step that helps your computation is to pre-compute that layer's activations for all the examples in the training set and save them to disk, and then just train the softmax classifier on top of that. The advantage of this pre-compute-and-save-to-disk method is that you don't need to recompute those activations every time you take an epoch, or a pass, through your training set. So that's what you can do if you have a pretty small training set for your task.
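Again as a hedged sketch, continuing the hypothetical Keras setup from the previous snippet: `base` and `model` are the objects defined there, and `train_images` / `train_labels` are placeholder names for your small Tigger/Misty/neither data set.

```python
import numpy as np

# Freeze every pre-trained layer (the analogue of "trainable parameter = 0" or
# "freeze = 1") so that only the new 3-way softmax layer gets trained.
base.trainable = False
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=10)

# Optional speed-up: the frozen layers are a fixed function of the input image, so you
# can pre-compute their activations once, save them to disk, and train a shallow softmax
# on those cached feature vectors instead of re-running the frozen layers every epoch.
# features = base.predict(train_images)       # image -> feature vector
# np.save("train_features.npy", features)     # reused on every pass through the data
```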
What if you have a larger training set? One rule of thumb is that if you have a larger labeled data set, say a ton of pictures of Tigger, of Misty, and of neither of them, one thing you could do is freeze fewer layers. Maybe you freeze just the earlier layers and then train the later layers. Although if the output layer has different classes, you still need your own output unit anyway: Tigger, Misty, or neither. There are a couple of ways to do this. You could take the last few layers' weights and just use those as initialization and do gradient descent from there, or you could blow away those last few layers and use your own new hidden units and your own final softmax output. Either of these methods could be worth trying. But maybe one pattern is that the more data you have, the fewer layers you freeze and the more layers you train on top. The idea is that with a bigger data set you may have enough data not just to train a single softmax unit, but to train some modest-sized neural network comprising the last few layers of the final network you end up using.

Finally, if you have a lot of data, one thing you might do is take the open-source network and its weights and use the whole thing just as initialization, and train the whole network. Although, again, if this network had a thousand-way softmax and you have just three outputs, you need your own softmax output, the output layer with the labels you care about. The more labeled data you have for your task, that is, the more pictures you have of Tigger, Misty, and neither, the more layers you can train, and in the extreme case you can use the weights you downloaded just as a replacement for random initialization and then do gradient descent, training and updating all the weights in all the layers of the network. That's transfer learning for the training of ConvNets.
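One more hedged sketch in the same hypothetical Keras setup, showing the "more data, freeze fewer layers" rule of thumb; the cutoff of 20 layers is an arbitrary illustration, not a recommendation.

```python
import tensorflow as tf

# With a larger labeled data set, freeze fewer layers: keep the early layers fixed and
# fine-tune the later ones along with the new softmax output.
base.trainable = True
for layer in base.layers[:-20]:   # arbitrary cutoff, tune it to how much data you have
    layer.trainable = False

# With a very large data set, skip the loop above and leave every layer trainable, so the
# downloaded weights serve purely as the initialization for full gradient-descent training.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),   # smaller learning rate for fine-tuning
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```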
In practice, because the open data sets on the Internet are so big, and because the weights you can download, which someone else may have spent weeks training, have learned from so much data, you'll find that for a lot of computer vision applications you just do much better if you download someone else's open-source weights and use them as the initialization for your problem. Of all the different disciplines, all the different applications of deep learning, I think computer vision is one where transfer learning is something you should almost always do, unless you have an exceptionally large data set to train everything from scratch yourself. So transfer learning is very much worth seriously considering, unless you have an exceptionally large data set and a very large computation budget to train everything from scratch by yourself.