1
00:00:00,000 --> 00:00:04,650
One way to learn the parameters of the neural network so that it gives you

2
00:00:04,650 --> 00:00:07,875
a good encoding for your pictures of faces is to

3
00:00:07,875 --> 00:00:11,930
define an applied gradient descent on the triplet loss function.

4
00:00:11,930 --> 00:00:13,425
Let's see what that means.

5
00:00:13,425 --> 00:00:15,341
To apply the triplet loss,

6
00:00:15,341 --> 00:00:18,630
you need to compare pairs of images.

7
00:00:18,630 --> 00:00:21,365
For example, given this picture,

8
00:00:21,365 --> 00:00:24,075
to learn the parameters of the neural network,

9
00:00:24,075 --> 00:00:27,605
you have to look at several pictures at the same time.

10
00:00:27,605 --> 00:00:29,685
For example, given this pair of images,

11
00:00:29,685 --> 00:00:33,895
you want their encodings to be similar because these are the same person.

12
00:00:33,895 --> 00:00:35,925
Whereas, given this pair of images,

13
00:00:35,925 --> 00:00:41,500
you want their encodings to be quite different because these are different persons.

14
00:00:41,500 --> 00:00:44,770
In the terminology of the triplet loss,

15
00:00:44,770 --> 00:00:49,290
what you're going do is always look at one anchor image and

16
00:00:49,290 --> 00:00:53,820
then you want to distance between the anchor and the positive image,

17
00:00:53,820 --> 00:00:55,200
really a positive example,

18
00:00:55,200 --> 00:00:58,025
meaning as the same person to be similar.

19
00:00:58,025 --> 00:01:01,470
Whereas, you want the anchor when pairs are compared to

20
00:01:01,470 --> 00:01:07,455
the negative example for their distances to be much further apart.

21
00:01:07,455 --> 00:01:10,755
So, this is what gives rise to the term triplet loss,

22
00:01:10,755 --> 00:01:15,145
which is that you'll always be looking at three images at a time.

23
00:01:15,145 --> 00:01:17,415
You'll be looking at an anchor image,

24
00:01:17,415 --> 00:01:22,185
a positive image, as well as a negative image.

25
00:01:22,185 --> 00:01:25,993
And I'm going to abbreviate anchor positive and negative as A,

26
00:01:25,993 --> 00:01:29,561
P, and N. So to formalize this,

27
00:01:29,561 --> 00:01:32,700
what you want is for the parameters of

28
00:01:32,700 --> 00:01:36,060
your neural network of your encodings to have the following property,

29
00:01:36,060 --> 00:01:39,735
which is that you want the encoding

30
00:01:39,735 --> 00:01:44,445
between the anchor minus the encoding of the positive example,

31
00:01:44,445 --> 00:01:47,865
you want this to be small and in particular,

32
00:01:47,865 --> 00:01:52,950
you want this to be less than or equal to the distance of the squared norm

33
00:01:52,950 --> 00:01:58,900
between the encoding of the anchor and the encoding of the negative,

34
00:01:58,900 --> 00:02:02,485
where of course, this is d of A,

35
00:02:02,485 --> 00:02:06,610
P and this is d of A,

36
00:02:06,610 --> 00:02:11,280
N. And you can think of d as a distance function,

37
00:02:11,280 --> 00:02:14,700
which is why we named it with the alphabet d. Now,

38
00:02:14,700 --> 00:02:18,825
if we move to term from the right side of this equation to the left side,

39
00:02:18,825 --> 00:02:25,140
what you end up with is f of A minus f of P squared minus,

40
00:02:25,140 --> 00:02:27,675
let's take the right-hand side now,

41
00:02:27,675 --> 00:02:31,404
minus F of N squared,

42
00:02:31,404 --> 00:02:34,800
you want this to be less than or equal to zero.

43
00:02:34,800 --> 00:02:38,280
But now, we're going to make a slight change to this expression,

44
00:02:38,280 --> 00:02:41,420
which is one trivial way to make sure this is satisfied,

45
00:02:41,420 --> 00:02:44,245
is to just learn everything equals zero.

46
00:02:44,245 --> 00:02:46,275
If f always equals zero,

47
00:02:46,275 --> 00:02:47,955
then this is zero minus zero,

48
00:02:47,955 --> 00:02:50,545
which is zero, this is zero minus zero which is zero.

49
00:02:50,545 --> 00:02:57,140
And so, well, by saying f of any image equals a vector of all zeroes,

50
00:02:57,140 --> 00:03:01,410
you can almost trivially satisfy this equation.

51
00:03:01,410 --> 00:03:07,465
So, to make sure that the neural network doesn't just output zero for all the encoding,

52
00:03:07,465 --> 00:03:12,480
so to make sure that it doesn't set all the encodings equal to each other.

53
00:03:12,480 --> 00:03:16,285
Another way for the neural network to give a trivial output is if

54
00:03:16,285 --> 00:03:20,246
the encoding for every image was identical to the encoding to every other image,

55
00:03:20,246 --> 00:03:25,400
in which case, you again get zero minus zero.

56
00:03:25,400 --> 00:03:28,075
So to prevent a neural network from doing that,

57
00:03:28,075 --> 00:03:32,370
what we're going to do is modify this objective to say that,

58
00:03:32,370 --> 00:03:36,985
this doesn't need to be just less than or equal to zero,

59
00:03:36,985 --> 00:03:40,575
it needs to be quite a bit smaller than zero.

60
00:03:40,575 --> 00:03:45,755
So, in particular, if we say this needs to be less than negative alpha,

61
00:03:45,755 --> 00:03:49,235
where alpha is another hyperparameter,

62
00:03:49,235 --> 00:03:53,475
then this prevents a neural network from outputting the trivial solutions.

63
00:03:53,475 --> 00:03:55,230
And by convention, usually,

64
00:03:55,230 --> 00:03:59,550
we write plus alpha instead of negative alpha there.

65
00:03:59,550 --> 00:04:02,763
And this is also called, a margin,

66
00:04:02,763 --> 00:04:05,190
which is terminology that you'd be familiar

67
00:04:05,190 --> 00:04:10,320
with if you've also seen the literature on support vector machines,

68
00:04:10,320 --> 00:04:12,675
but don't worry about it if you haven't.

69
00:04:12,675 --> 00:04:17,995
And we can also modify this equation on top by adding this margin parameter.

70
00:04:17,995 --> 00:04:19,395
So to give an example,

71
00:04:19,395 --> 00:04:23,260
let's say the margin is set to 0.2.

72
00:04:23,260 --> 00:04:24,855
If in this example,

73
00:04:24,855 --> 00:04:29,550
d of the anchor and the positive is equal to 0.5,

74
00:04:29,550 --> 00:04:32,580
then you won't be satisfied if d between

75
00:04:32,580 --> 00:04:37,087
the anchor and the negative was just a little bit bigger, say 0.51.

76
00:04:37,087 --> 00:04:41,610
Even though 0.51 is bigger than 0.5,

77
00:04:41,610 --> 00:04:43,917
you're saying, that's not good enough, we want a dfA,

78
00:04:43,917 --> 00:04:47,610
N to be much bigger than dfA,

79
00:04:47,610 --> 00:04:49,245
P and in particular,

80
00:04:49,245 --> 00:04:53,520
you want this to be at least 0.7 or higher.

81
00:04:53,520 --> 00:04:58,740
Alternatively, to achieve this margin or this gap of at least 0.2,

82
00:04:58,740 --> 00:05:02,330
you could either push this up or push this down so

83
00:05:02,330 --> 00:05:06,305
that there is at least this gap of this alpha,

84
00:05:06,305 --> 00:05:09,350
hyperparameter alpha 0.2 between

85
00:05:09,350 --> 00:05:14,530
the distance between the anchor and the positive versus the anchor and the negative.

86
00:05:14,530 --> 00:05:17,491
So that's what having a margin parameter here does,

87
00:05:17,491 --> 00:05:19,055
which is it pushes

88
00:05:19,055 --> 00:05:25,720
the anchor positive pair and the anchor negative pair further away from each other.

89
00:05:25,720 --> 00:05:29,435
So, let's take this equation we have here at the bottom,

90
00:05:29,435 --> 00:05:31,160
and on the next slide,

91
00:05:31,160 --> 00:05:35,225
formalize it, and define the triplet loss function.

92
00:05:35,225 --> 00:05:40,810
So, the triplet loss function is defined on triples of images.

93
00:05:40,810 --> 00:05:43,465
So, given three images,

94
00:05:43,465 --> 00:05:46,350
A, P, and N,

95
00:05:46,350 --> 00:05:48,990
the anchor positive and negative examples.

96
00:05:48,990 --> 00:05:53,245
So the positive examples is of the same person as the anchor,

97
00:05:53,245 --> 00:05:58,040
but the negative is of a different person than the anchor.

98
00:05:58,040 --> 00:06:01,515
We're going to define the loss as follows.

99
00:06:01,515 --> 00:06:03,055
The loss on this example,

100
00:06:03,055 --> 00:06:06,465
which is really defined on a triplet of images is,

101
00:06:06,465 --> 00:06:10,169
let me first copy over what we had on the previous slide.

102
00:06:10,169 --> 00:06:16,045
So, that was fA minus fP squared

103
00:06:16,045 --> 00:06:23,790
minus fA minus fN squared,

104
00:06:23,790 --> 00:06:26,755
and then plus alpha, the margin parameter.

105
00:06:26,755 --> 00:06:31,600
And what you want is for this to be less than or equal to zero.

106
00:06:31,600 --> 00:06:34,365
So, to define the loss function,

107
00:06:34,365 --> 00:06:39,040
let's take the max between this and zero.

108
00:06:39,040 --> 00:06:42,030
So, the effect of taking the max here is that,

109
00:06:42,030 --> 00:06:45,225
so long as this is less than zero,

110
00:06:45,225 --> 00:06:47,070
then the loss is zero,

111
00:06:47,070 --> 00:06:49,847
because the max is something less than equal to zero,

112
00:06:49,847 --> 00:06:52,705
when zero is going to be zero.

113
00:06:52,705 --> 00:06:56,370
So, so long as you achieve the goal of making this thing I've underlined in green,

114
00:06:56,370 --> 00:07:00,950
so long as you've achieved the objective of making that less than or equal to zero,

115
00:07:00,950 --> 00:07:04,150
then the loss on this example is equals to zero.

116
00:07:04,150 --> 00:07:05,355
But if on the other hand,

117
00:07:05,355 --> 00:07:07,650
if this is greater than zero,

118
00:07:07,650 --> 00:07:09,120
then if you take the max,

119
00:07:09,120 --> 00:07:10,665
the max we end up selecting,

120
00:07:10,665 --> 00:07:12,343
this thing I've underlined in green,

121
00:07:12,343 --> 00:07:15,455
and so you would have a positive loss.

122
00:07:15,455 --> 00:07:17,475
So by trying to minimize this,

123
00:07:17,475 --> 00:07:22,130
this has the effect of trying to send this thing to be zero,

124
00:07:22,130 --> 00:07:23,450
less than or equal to zero.

125
00:07:23,450 --> 00:07:26,550
And then, so long as there's zero or less than or equal to zero,

126
00:07:26,550 --> 00:07:31,980
the neural network doesn't care how much further negative it is.

127
00:07:31,980 --> 00:07:33,990
So, this is how you define the loss on

128
00:07:33,990 --> 00:07:38,970
a single triplet and the overall cost function for your neural network can

129
00:07:38,970 --> 00:07:47,575
be sum over a training set of these individual losses on different triplets.

130
00:07:47,575 --> 00:07:55,080
So, if you have a training set of say 10,000 pictures with 1,000 different persons,

131
00:07:55,080 --> 00:08:00,360
what you'd have to do is take your 10,000 pictures and use it to generate,

132
00:08:00,360 --> 00:08:03,720
to select triplets like this and then train

133
00:08:03,720 --> 00:08:08,005
your learning algorithm using gradient descent on this type of cost function,

134
00:08:08,005 --> 00:08:14,145
which is really defined on triplets of images drawn from your training set.

135
00:08:14,145 --> 00:08:19,635
Notice that in order to define this dataset of triplets,

136
00:08:19,635 --> 00:08:25,405
you do need some pairs of A and P. Pairs of pictures of the same person.

137
00:08:25,405 --> 00:08:27,510
So the purpose of training your system,

138
00:08:27,510 --> 00:08:32,600
you do need a dataset where you have multiple pictures of the same person.

139
00:08:32,600 --> 00:08:33,960
That's why in this example,

140
00:08:33,960 --> 00:08:38,040
I said if you have 10,000 pictures of 1,000 different person,

141
00:08:38,040 --> 00:08:40,965
so maybe have 10 pictures on average

142
00:08:40,965 --> 00:08:44,310
of each of your 1,000 persons to make up your entire dataset.

143
00:08:44,310 --> 00:08:47,085
If you had just one picture of each person,

144
00:08:47,085 --> 00:08:49,725
then you can't actually train this system.

145
00:08:49,725 --> 00:08:52,080
But of course after training,

146
00:08:52,080 --> 00:08:54,205
if you're applying this,

147
00:08:54,205 --> 00:08:56,565
but of course after having trained the system,

148
00:08:56,565 --> 00:08:58,200
you can then apply it to

149
00:08:58,200 --> 00:09:02,685
your one shot learning problem where for your face recognition system,

150
00:09:02,685 --> 00:09:06,945
maybe you have only a single picture of someone you might be trying to recognize.

151
00:09:06,945 --> 00:09:08,335
But for your training set,

152
00:09:08,335 --> 00:09:12,780
you do need to make sure you have multiple images of the same person at least for

153
00:09:12,780 --> 00:09:14,940
some people in your training set so that you can

154
00:09:14,940 --> 00:09:19,000
have pairs of anchor and positive images.

155
00:09:19,000 --> 00:09:25,240
Now, how do you actually choose these triplets to form your training set?

156
00:09:25,240 --> 00:09:28,635
One of the problems if you choose A, P,

157
00:09:28,635 --> 00:09:34,260
and N randomly from your training set subject to A and P being from the same person,

158
00:09:34,260 --> 00:09:36,270
and A and N being different persons,

159
00:09:36,270 --> 00:09:40,140
one of the problems is that if you choose them so that they're at random,

160
00:09:40,140 --> 00:09:44,160
then this constraint is very easy to satisfy.

161
00:09:44,160 --> 00:09:48,405
Because given two randomly chosen pictures of people,

162
00:09:48,405 --> 00:09:51,945
chances are A and N are much different than

163
00:09:51,945 --> 00:09:55,740
A and P. I hope you still recognize this notation,

164
00:09:55,740 --> 00:10:02,710
this d(A, P) was what we had written on the last year's slides as this encoding.

165
00:10:02,710 --> 00:10:06,190
So this is just equal to this

166
00:10:06,190 --> 00:10:11,040
squared known distance between the encodings that we have on the previous slide.

167
00:10:11,040 --> 00:10:14,640
But if A and N are two randomly chosen different persons,

168
00:10:14,640 --> 00:10:17,610
then there is a very high chance that this will be much

169
00:10:17,610 --> 00:10:21,630
bigger more than the margin alpha that that term on the left.

170
00:10:21,630 --> 00:10:24,405
And so, the neural network won't learn much from it.

171
00:10:24,405 --> 00:10:25,940
So to construct a training set,

172
00:10:25,940 --> 00:10:28,340
what you want to do is to choose triplets A,

173
00:10:28,340 --> 00:10:31,280
P, and N that are hard to train on.

174
00:10:31,280 --> 00:10:38,685
So in particular, what you want is for all triplets that this constraint be satisfied.

175
00:10:38,685 --> 00:10:44,995
So, a triplet that is hard will be if you choose values for A, P,

176
00:10:44,995 --> 00:10:47,454
and N so that maybe d(A,

177
00:10:47,454 --> 00:10:52,550
P) is actually quite close to d(A,N).

178
00:10:52,550 --> 00:10:54,020
So in that case,

179
00:10:54,020 --> 00:10:57,230
the learning algorithm has to try extra hard to take

180
00:10:57,230 --> 00:11:00,740
this thing on the right and try to push it up or take this thing on

181
00:11:00,740 --> 00:11:04,030
the left and try to push it down so that there is

182
00:11:04,030 --> 00:11:08,430
at least a margin of alpha between the left side and the right side.

183
00:11:08,430 --> 00:11:11,900
And the effect of choosing these triplets is that it

184
00:11:11,900 --> 00:11:16,460
increases the computational efficiency of your learning algorithm.

185
00:11:16,460 --> 00:11:18,500
If you choose your triplets randomly,

186
00:11:18,500 --> 00:11:21,725
then too many triplets would be really easy, and so,

187
00:11:21,725 --> 00:11:25,970
gradient descent won't do anything because your neural network will just get them right,

188
00:11:25,970 --> 00:11:27,090
pretty much all the time.

189
00:11:27,090 --> 00:11:32,270
And it's only by using hard triplets that the gradient descent procedure

190
00:11:32,270 --> 00:11:38,700
has to do some work to try to push these quantities further away from those quantities.

191
00:11:38,700 --> 00:11:40,155
And if you're interested,

192
00:11:40,155 --> 00:11:46,295
the details are presented in this paper by Florian Schroff, Dmitry Kalinichenko,

193
00:11:46,295 --> 00:11:51,305
and James Philbin, where they have a system called FaceNet,

194
00:11:51,305 --> 00:11:55,860
which is where a lot of the ideas I'm presenting in this video come from.

195
00:11:55,860 --> 00:11:58,220
By the way, this is also a fun fact about

196
00:11:58,220 --> 00:12:02,030
how algorithms are often named in the deep learning world,

197
00:12:02,030 --> 00:12:05,810
which is if you work in a certain domain, then we call that blank.

198
00:12:05,810 --> 00:12:10,710
You often have a system called blank net or deep blank.

199
00:12:10,710 --> 00:12:13,095
So, we've been talking about face recognition.

200
00:12:13,095 --> 00:12:16,123
So this paper is called FaceNet,

201
00:12:16,123 --> 00:12:17,465
and in the last video,

202
00:12:17,465 --> 00:12:19,910
you just saw deep face.

203
00:12:19,910 --> 00:12:23,600
But this idea of a blank net or deep blank is

204
00:12:23,600 --> 00:12:28,370
a very popular way of naming algorithms in the deep learning world.

205
00:12:28,370 --> 00:12:32,780
And you should feel free to take a look at that paper if you want to learn some of

206
00:12:32,780 --> 00:12:34,940
these other details for speeding up your algorithm

207
00:12:34,940 --> 00:12:38,745
by choosing the most useful triplets to train on,

208
00:12:38,745 --> 00:12:40,025
it is a nice paper.

209
00:12:40,025 --> 00:12:41,240
So, just to wrap up,

210
00:12:41,240 --> 00:12:42,670
to train on triplet loss,

211
00:12:42,670 --> 00:12:47,060
you need to take your training set and map it to a lot of triples.

212
00:12:47,060 --> 00:12:50,550
So, here is our triple with an anchor and a positive,

213
00:12:50,550 --> 00:12:54,375
both for the same person and the negative of a different person.

214
00:12:54,375 --> 00:12:58,445
Here's another one where the anchor and positive are of

215
00:12:58,445 --> 00:13:04,315
the same person but the anchor and negative are of different persons and so on.

216
00:13:04,315 --> 00:13:07,000
And what you do having defined this training sets

217
00:13:07,000 --> 00:13:09,920
of anchor positive and negative triples is use

218
00:13:09,920 --> 00:13:16,740
gradient descent to try to minimize the cost function J we defined on an earlier slide,

219
00:13:16,740 --> 00:13:20,090
and that will have the effect of that propagating to

220
00:13:20,090 --> 00:13:23,640
all of the parameters of the neural network in order to learn

221
00:13:23,640 --> 00:13:27,435
an encoding so that d of

222
00:13:27,435 --> 00:13:33,395
two images will be small when these two images are of the same person,

223
00:13:33,395 --> 00:13:40,286
and they'll be large when these are two images of different persons.

224
00:13:40,286 --> 00:13:43,805
So, that's it for the triplet loss and how you can train

225
00:13:43,805 --> 00:13:48,200
a neural network for learning and an encoding for face recognition.

226
00:13:48,200 --> 00:13:49,715
Now, it turns out that

227
00:13:49,715 --> 00:13:54,556
commercial face recognition systems are trained on fairly large datasets at this point.

228
00:13:54,556 --> 00:13:56,630
Often, north of the million images

229
00:13:56,630 --> 00:14:00,275
sometimes not in frequently north of 10 million images.

230
00:14:00,275 --> 00:14:05,210
And there are some commercial companies talking about using over 100 million images.

231
00:14:05,210 --> 00:14:09,300
So these are very large datasets by models that even balm.

232
00:14:09,300 --> 00:14:12,800
So that's it for the triplet loss and how you can use it to

233
00:14:12,800 --> 00:14:18,230
train a neural network to operate a good encoding for face recognition.

234
00:14:18,230 --> 00:14:21,500
Now, it turns out that today's face recognition systems

235
00:14:21,500 --> 00:14:24,830
especially the loss cure commercial face recognition systems

236
00:14:24,830 --> 00:14:27,360
are trained on very large datasets.

237
00:14:27,360 --> 00:14:30,350
Datasets north of a million images is not uncommon,

238
00:14:30,350 --> 00:14:34,040
some companies are using north of 10 million images and some companies

239
00:14:34,040 --> 00:14:38,135
have north of 100 million images with which to try to train these systems.

240
00:14:38,135 --> 00:14:41,730
So these are very large datasets even by modern standards,

241
00:14:41,730 --> 00:14:45,230
these dataset assets are not easy to acquire.

242
00:14:45,230 --> 00:14:48,140
Fortunately, some of these companies have trained

243
00:14:48,140 --> 00:14:51,875
these large networks and posted parameters online.

244
00:14:51,875 --> 00:14:54,790
So, rather than trying to train one of these networks from scratch,

245
00:14:54,790 --> 00:14:59,280
this is one domain where because of the share data volume sizes,

246
00:14:59,280 --> 00:15:02,390
this is one domain where often it might be useful for you

247
00:15:02,390 --> 00:15:05,313
to download someone else's pre-train model,

248
00:15:05,313 --> 00:15:07,685
rather than do everything from scratch yourself.

249
00:15:07,685 --> 00:15:10,130
But even if you do download someone else's pre-train model,

250
00:15:10,130 --> 00:15:14,195
I think it's still useful to know how these algorithms were trained or

251
00:15:14,195 --> 00:15:19,225
in case you need to apply these ideas from scratch yourself for some application.

252
00:15:19,225 --> 00:15:21,405
So that's it for the triplet loss.

253
00:15:21,405 --> 00:15:22,640
In the next video,

254
00:15:22,640 --> 00:15:25,280
I want to show you also some other variations

255
00:15:25,280 --> 00:15:28,510
on siamese networks and how to train these systems.

256
00:15:28,510 --> 00:15:30,000
Let's go onto the next video.