1 00:00:00,000 --> 00:00:04,650 One way to learn the parameters of the neural network so that it gives you 2 00:00:04,650 --> 00:00:07,875 a good encoding for your pictures of faces is to 3 00:00:07,875 --> 00:00:11,930 define and apply gradient descent on the triplet loss function. 4 00:00:11,930 --> 00:00:13,425 Let's see what that means. 5 00:00:13,425 --> 00:00:15,341 To apply the triplet loss, 6 00:00:15,341 --> 00:00:18,630 you need to compare pairs of images. 7 00:00:18,630 --> 00:00:21,365 For example, given this picture, 8 00:00:21,365 --> 00:00:24,075 to learn the parameters of the neural network, 9 00:00:24,075 --> 00:00:27,605 you have to look at several pictures at the same time. 10 00:00:27,605 --> 00:00:29,685 For example, given this pair of images, 11 00:00:29,685 --> 00:00:33,895 you want their encodings to be similar because these are the same person. 12 00:00:33,895 --> 00:00:35,925 Whereas, given this pair of images, 13 00:00:35,925 --> 00:00:41,500 you want their encodings to be quite different because these are different persons. 14 00:00:41,500 --> 00:00:44,770 In the terminology of the triplet loss, 15 00:00:44,770 --> 00:00:49,290 what you're going to do is always look at one anchor image and 16 00:00:49,290 --> 00:00:53,820 then you want the distance between the anchor and the positive image, 17 00:00:53,820 --> 00:00:55,200 really a positive example, 18 00:00:55,200 --> 00:00:58,025 meaning the same person, to be small. 19 00:00:58,025 --> 00:01:01,470 Whereas, when the anchor is compared to 20 00:01:01,470 --> 00:01:07,455 the negative example, you want their distance to be much larger. 21 00:01:07,455 --> 00:01:10,755 So, this is what gives rise to the term triplet loss, 22 00:01:10,755 --> 00:01:15,145 which is that you'll always be looking at three images at a time. 23 00:01:15,145 --> 00:01:17,415 You'll be looking at an anchor image, 24 00:01:17,415 --> 00:01:22,185 a positive image, as well as a negative image. 
25 00:01:22,185 --> 00:01:25,993 And I'm going to abbreviate anchor, positive, and negative as A, 26 00:01:25,993 --> 00:01:29,561 P, and N. So to formalize this, 27 00:01:29,561 --> 00:01:32,700 what you want is for the parameters of 28 00:01:32,700 --> 00:01:36,060 your neural network, of your encodings, to have the following property, 29 00:01:36,060 --> 00:01:39,735 which is that you want the squared norm of the encoding 30 00:01:39,735 --> 00:01:44,445 of the anchor minus the encoding of the positive example, 31 00:01:44,445 --> 00:01:47,865 you want this to be small and in particular, 32 00:01:47,865 --> 00:01:52,950 you want this to be less than or equal to the squared norm of the distance 33 00:01:52,950 --> 00:01:58,900 between the encoding of the anchor and the encoding of the negative, 34 00:01:58,900 --> 00:02:02,485 where of course, this is d of A, 35 00:02:02,485 --> 00:02:06,610 P and this is d of A, 36 00:02:06,610 --> 00:02:11,280 N. And you can think of d as a distance function, 37 00:02:11,280 --> 00:02:14,700 which is why we named it with the letter d. Now, 38 00:02:14,700 --> 00:02:18,825 if we move the term from the right side of this equation to the left side, 39 00:02:18,825 --> 00:02:25,140 what you end up with is f of A minus f of P squared minus, 40 00:02:25,140 --> 00:02:27,675 let's take the right-hand side now, 41 00:02:27,675 --> 00:02:31,404 f of A minus f of N squared, 42 00:02:31,404 --> 00:02:34,800 you want this to be less than or equal to zero. 43 00:02:34,800 --> 00:02:38,280 But now, we're going to make a slight change to this expression, 44 00:02:38,280 --> 00:02:41,420 because one trivial way to make sure this is satisfied 45 00:02:41,420 --> 00:02:44,245 is to just learn everything equals zero. 46 00:02:44,245 --> 00:02:46,275 If f always equals zero, 47 00:02:46,275 --> 00:02:47,955 then this is zero minus zero, 48 00:02:47,955 --> 00:02:50,545 which is zero, this is zero minus zero which is zero.
49 00:02:50,545 --> 00:02:57,140 And so, well, by saying f of any image equals a vector of all zeroes, 50 00:02:57,140 --> 00:03:01,410 you can almost trivially satisfy this equation. 51 00:03:01,410 --> 00:03:07,465 So, to make sure that the neural network doesn't just output zero for all the encodings, 52 00:03:07,465 --> 00:03:12,480 or set all the encodings equal to each other, we need to modify the objective. 53 00:03:12,480 --> 00:03:16,285 Another way for the neural network to give a trivial output is if 54 00:03:16,285 --> 00:03:20,246 the encoding for every image was identical to the encoding of every other image, 55 00:03:20,246 --> 00:03:25,400 in which case, you again get zero minus zero. 56 00:03:25,400 --> 00:03:28,075 So to prevent a neural network from doing that, 57 00:03:28,075 --> 00:03:32,370 what we're going to do is modify this objective to say that, 58 00:03:32,370 --> 00:03:36,985 this doesn't need to be just less than or equal to zero, 59 00:03:36,985 --> 00:03:40,575 it needs to be quite a bit smaller than zero. 60 00:03:40,575 --> 00:03:45,755 So, in particular, if we say this needs to be less than negative alpha, 61 00:03:45,755 --> 00:03:49,235 where alpha is another hyperparameter, 62 00:03:49,235 --> 00:03:53,475 then this prevents a neural network from outputting the trivial solutions. 63 00:03:53,475 --> 00:03:55,230 And by convention, usually, 64 00:03:55,230 --> 00:03:59,550 we write plus alpha on the left instead of negative alpha on the right. 65 00:03:59,550 --> 00:04:02,763 And this is also called a margin, 66 00:04:02,763 --> 00:04:05,190 which is terminology that you'd be familiar 67 00:04:05,190 --> 00:04:10,320 with if you've also seen the literature on support vector machines, 68 00:04:10,320 --> 00:04:12,675 but don't worry about it if you haven't. 69 00:04:12,675 --> 00:04:17,995 And we can also modify this equation on top by adding this margin parameter.
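Written out, the condition described above, first without and then with the margin, looks roughly like this. The notation follows the lecture (f is the encoding, alpha is the margin hyperparameter); the arrow is only shorthand for "is modified to".

```latex
\|f(A) - f(P)\|^2 - \|f(A) - f(N)\|^2 \le 0
\quad\longrightarrow\quad
\|f(A) - f(P)\|^2 - \|f(A) - f(N)\|^2 + \alpha \le 0
```

The second form is the same as requiring d(A, P) + alpha to be at most d(A, N). With every encoding set to zero, the left side becomes alpha, which is greater than zero, so the trivial solution no longer satisfies it.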
70 00:04:17,995 --> 00:04:19,395 So to give an example, 71 00:04:19,395 --> 00:04:23,260 let's say the margin is set to 0.2. 72 00:04:23,260 --> 00:04:24,855 If in this example, 73 00:04:24,855 --> 00:04:29,550 d of the anchor and the positive is equal to 0.5, 74 00:04:29,550 --> 00:04:32,580 then you won't be satisfied if d between 75 00:04:32,580 --> 00:04:37,087 the anchor and the negative was just a little bit bigger, say 0.51. 76 00:04:37,087 --> 00:04:41,610 Even though 0.51 is bigger than 0.5, 77 00:04:41,610 --> 00:04:43,917 you're saying, that's not good enough, we want d of A, 78 00:04:43,917 --> 00:04:47,610 N to be much bigger than d of A, 79 00:04:47,610 --> 00:04:49,245 P and in particular, 80 00:04:49,245 --> 00:04:53,520 you want this to be at least 0.7 or higher. 81 00:04:53,520 --> 00:04:58,740 Alternatively, to achieve this margin or this gap of at least 0.2, 82 00:04:58,740 --> 00:05:02,330 you could either push this up or push this down so 83 00:05:02,330 --> 00:05:06,305 that there is at least this gap of alpha, 84 00:05:06,305 --> 00:05:09,350 the hyperparameter alpha equal to 0.2, between 85 00:05:09,350 --> 00:05:14,530 the distance between the anchor and the positive versus the anchor and the negative. 86 00:05:14,530 --> 00:05:17,491 So that's what having a margin parameter here does, 87 00:05:17,491 --> 00:05:19,055 which is it pushes 88 00:05:19,055 --> 00:05:25,720 the anchor positive pair and the anchor negative pair further away from each other. 89 00:05:25,720 --> 00:05:29,435 So, let's take this equation we have here at the bottom, 90 00:05:29,435 --> 00:05:31,160 and on the next slide, 91 00:05:31,160 --> 00:05:35,225 formalize it, and define the triplet loss function. 92 00:05:35,225 --> 00:05:40,810 So, the triplet loss function is defined on triples of images. 93 00:05:40,810 --> 00:05:43,465 So, given three images, 94 00:05:43,465 --> 00:05:46,350 A, P, and N, 95 00:05:46,350 --> 00:05:48,990 the anchor, positive, and negative examples.
96 00:05:48,990 --> 00:05:53,245 So the positive example is of the same person as the anchor, 97 00:05:53,245 --> 00:05:58,040 but the negative is of a different person than the anchor. 98 00:05:58,040 --> 00:06:01,515 We're going to define the loss as follows. 99 00:06:01,515 --> 00:06:03,055 The loss on this example, 100 00:06:03,055 --> 00:06:06,465 which is really defined on a triplet of images is, 101 00:06:06,465 --> 00:06:10,169 let me first copy over what we had on the previous slide. 102 00:06:10,169 --> 00:06:16,045 So, that was f of A minus f of P squared 103 00:06:16,045 --> 00:06:23,790 minus f of A minus f of N squared, 104 00:06:23,790 --> 00:06:26,755 and then plus alpha, the margin parameter. 105 00:06:26,755 --> 00:06:31,600 And what you want is for this to be less than or equal to zero. 106 00:06:31,600 --> 00:06:34,365 So, to define the loss function, 107 00:06:34,365 --> 00:06:39,040 let's take the max between this and zero. 108 00:06:39,040 --> 00:06:42,030 So, the effect of taking the max here is that, 109 00:06:42,030 --> 00:06:45,225 so long as this is less than zero, 110 00:06:45,225 --> 00:06:47,070 then the loss is zero, 111 00:06:47,070 --> 00:06:49,847 because the max of something less than or equal to zero 112 00:06:49,847 --> 00:06:52,705 and zero is going to be zero. 113 00:06:52,705 --> 00:06:56,370 So, so long as you achieve the goal of making this thing I've underlined in green, 114 00:06:56,370 --> 00:07:00,950 so long as you've achieved the objective of making that less than or equal to zero, 115 00:07:00,950 --> 00:07:04,150 then the loss on this example is equal to zero.
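Here is a minimal sketch of the loss on a single triplet as just described, max(d(A, P) - d(A, N) + alpha, 0). It is not code from the lecture: the function names are made up for illustration, and encodings are assumed to be plain lists of floats.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two encodings (lists of floats)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def loss_from_distances(d_ap, d_an, alpha=0.2):
    """Triplet loss given d(A, P) and d(A, N): max(d_ap - d_an + alpha, 0)."""
    return max(d_ap - d_an + alpha, 0.0)


def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss computed from the three encodings f(A), f(P), f(N)."""
    return loss_from_distances(
        squared_distance(f_a, f_p), squared_distance(f_a, f_n), alpha
    )


# The numbers from the lecture: d(A, P) = 0.5, d(A, N) = 0.51, margin 0.2.
# The negative is only slightly further away, so the loss is still positive.
print(round(loss_from_distances(0.5, 0.51, alpha=0.2), 2))  # 0.19

# An all-zero encoding makes both distances zero, so the loss equals alpha.
print(triplet_loss([0.0, 0.0], [0.0, 0.0], [0.0, 0.0]))  # 0.2
```

The second call shows why the margin matters: the trivial all-zero encoding now pays a loss of alpha on every triplet.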
116 00:07:04,150 --> 00:07:05,355 But if on the other hand, 117 00:07:05,355 --> 00:07:07,650 this is greater than zero, 118 00:07:07,650 --> 00:07:09,120 then if you take the max, 119 00:07:09,120 --> 00:07:10,665 the max will end up selecting 120 00:07:10,665 --> 00:07:12,343 this thing I've underlined in green, 121 00:07:12,343 --> 00:07:15,455 and so you would have a positive loss. 122 00:07:15,455 --> 00:07:17,475 So by trying to minimize this, 123 00:07:17,475 --> 00:07:22,130 this has the effect of trying to send this thing to be zero, 124 00:07:22,130 --> 00:07:23,450 or less than or equal to zero. 125 00:07:23,450 --> 00:07:26,550 And then, so long as it's zero or less than zero, 126 00:07:26,550 --> 00:07:31,980 the neural network doesn't care how much further negative it is. 127 00:07:31,980 --> 00:07:33,990 So, this is how you define the loss on 128 00:07:33,990 --> 00:07:38,970 a single triplet and the overall cost function for your neural network can 129 00:07:38,970 --> 00:07:47,575 be the sum over a training set of these individual losses on different triplets. 130 00:07:47,575 --> 00:07:55,080 So, if you have a training set of say 10,000 pictures with 1,000 different persons, 131 00:07:55,080 --> 00:08:00,360 what you'd have to do is take your 10,000 pictures and use them to generate, 132 00:08:00,360 --> 00:08:03,720 to select triplets like this and then train 133 00:08:03,720 --> 00:08:08,005 your learning algorithm using gradient descent on this type of cost function, 134 00:08:08,005 --> 00:08:14,145 which is really defined on triplets of images drawn from your training set. 135 00:08:14,145 --> 00:08:19,635 Notice that in order to define this dataset of triplets, 136 00:08:19,635 --> 00:08:25,405 you do need some pairs of A and P. Pairs of pictures of the same person.
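The overall cost described above, the sum of the per-triplet losses over the training triplets, can be sketched as follows. This is an illustration only: `triplet_cost` and the toy encodings are hypothetical, and in a real system the encodings would come from the neural network rather than being written by hand.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two encodings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Loss on one triplet: max(d(A, P) - d(A, N) + alpha, 0)."""
    return max(squared_distance(f_a, f_p) - squared_distance(f_a, f_n) + alpha, 0.0)


def triplet_cost(triplets, alpha=0.2):
    """Cost J: sum of the individual losses over all (A, P, N) triplets."""
    return sum(triplet_loss(a, p, n, alpha) for a, p, n in triplets)


triplets = [
    # Negative is far away: margin satisfied, contributes 0.
    ([0.0, 0.0], [0.1, 0.0], [1.0, 0.0]),
    # Positive and negative equally far: contributes exactly alpha.
    ([0.0, 0.0], [1.0, 0.0], [1.0, 0.0]),
]
print(triplet_cost(triplets))  # 0.2
```

Only the second triplet contributes to the cost, which is the sense in which the network "doesn't care" about triplets that already satisfy the margin.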
137 00:08:25,405 --> 00:08:27,510 So for the purpose of training your system, 138 00:08:27,510 --> 00:08:32,600 you do need a dataset where you have multiple pictures of the same person. 139 00:08:32,600 --> 00:08:33,960 That's why in this example, 140 00:08:33,960 --> 00:08:38,040 I said if you have 10,000 pictures of 1,000 different persons, 141 00:08:38,040 --> 00:08:40,965 so maybe you have 10 pictures on average 142 00:08:40,965 --> 00:08:44,310 of each of your 1,000 persons to make up your entire dataset. 143 00:08:44,310 --> 00:08:47,085 If you had just one picture of each person, 144 00:08:47,085 --> 00:08:49,725 then you can't actually train this system. 145 00:08:49,725 --> 00:08:52,080 But of course, 146 00:08:52,080 --> 00:08:54,205 once you're applying this, 147 00:08:54,205 --> 00:08:56,565 after having trained the system, 148 00:08:56,565 --> 00:08:58,200 you can then apply it to 149 00:08:58,200 --> 00:09:02,685 your one-shot learning problem where for your face recognition system, 150 00:09:02,685 --> 00:09:06,945 maybe you have only a single picture of someone you might be trying to recognize. 151 00:09:06,945 --> 00:09:08,335 But for your training set, 152 00:09:08,335 --> 00:09:12,780 you do need to make sure you have multiple images of the same person at least for 153 00:09:12,780 --> 00:09:14,940 some people in your training set so that you can 154 00:09:14,940 --> 00:09:19,000 have pairs of anchor and positive images. 155 00:09:19,000 --> 00:09:25,240 Now, how do you actually choose these triplets to form your training set?
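As a starting point, the data requirement above (an anchor and a positive need two pictures of the same person) can be sketched with the simplest possible strategy, picking a triplet at random. The dictionary layout and the name `sample_triplet` are assumptions made for this illustration, not something from the lecture.

```python
import random


def sample_triplet(images_by_person, rng):
    """Pick a random (anchor, positive, negative) triplet.

    images_by_person is a hypothetical dict mapping a person id to a list
    of that person's images. The anchor and positive come from one person
    with at least two images; the negative comes from a different person.
    """
    eligible = [p for p, imgs in images_by_person.items() if len(imgs) >= 2]
    if not eligible:
        raise ValueError("need at least one person with two or more images")
    person = rng.choice(eligible)
    others = [p for p, imgs in images_by_person.items() if p != person and imgs]
    if not others:
        raise ValueError("need at least one image of a different person")
    anchor, positive = rng.sample(images_by_person[person], 2)
    negative = rng.choice(images_by_person[rng.choice(others)])
    return anchor, positive, negative


dataset = {"alice": ["alice_1.jpg", "alice_2.jpg"], "bob": ["bob_1.jpg"]}
anchor, positive, negative = sample_triplet(dataset, random.Random(0))
print(negative)  # bob_1.jpg
```

With only one picture per person the function raises an error, which mirrors the point that such a dataset cannot be used for training.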
156 00:09:25,240 --> 00:09:28,635 One of the problems is that if you choose A, P, 157 00:09:28,635 --> 00:09:34,260 and N randomly from your training set, subject to A and P being from the same person 158 00:09:34,260 --> 00:09:36,270 and A and N being different persons, 159 00:09:36,270 --> 00:09:40,140 if you choose them at random like that, 160 00:09:40,140 --> 00:09:44,160 then this constraint is very easy to satisfy. 161 00:09:44,160 --> 00:09:48,405 Because given two randomly chosen pictures of people, 162 00:09:48,405 --> 00:09:51,945 chances are A and N are much more different than 163 00:09:51,945 --> 00:09:55,740 A and P. I hope you still recognize this notation, 164 00:09:55,740 --> 00:10:02,710 this d(A, P) was what we had written on the last few slides as this encoding. 165 00:10:02,710 --> 00:10:06,190 So this is just equal to this 166 00:10:06,190 --> 00:10:11,040 squared norm distance between the encodings that we had on the previous slide. 167 00:10:11,040 --> 00:10:14,640 But if A and N are two randomly chosen different persons, 168 00:10:14,640 --> 00:10:17,610 then there is a very high chance that this will be much 169 00:10:17,610 --> 00:10:21,630 bigger, by more than the margin alpha, than that term on the left. 170 00:10:21,630 --> 00:10:24,405 And so, the neural network won't learn much from it. 171 00:10:24,405 --> 00:10:25,940 So to construct a training set, 172 00:10:25,940 --> 00:10:28,340 what you want to do is to choose triplets A, 173 00:10:28,340 --> 00:10:31,280 P, and N that are hard to train on. 174 00:10:31,280 --> 00:10:38,685 So in particular, what you want is for this constraint to be satisfied for all triplets. 175 00:10:38,685 --> 00:10:44,995 So, a triplet that is hard will be if you choose values for A, P, 176 00:10:44,995 --> 00:10:47,454 and N so that maybe d(A, 177 00:10:47,454 --> 00:10:52,550 P) is actually quite close to d(A,N).
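One simple way to picture easy versus hard triplets is to check the constraint d(A, P) + alpha <= d(A, N) for each candidate. The sketch below makes that assumption explicit: a triplet counts as hard whenever the constraint is not yet met. The function name is hypothetical, and this is only an illustration, not necessarily the selection procedure used in the paper mentioned later in this video.

```python
def split_by_difficulty(distance_pairs, alpha=0.2):
    """Split (d_ap, d_an) pairs into easy and hard triplets.

    Assumption for this sketch: a triplet is easy when
    d_ap + alpha <= d_an, so its loss is already zero and gradient
    descent has nothing to learn from it. Everything else is hard.
    """
    easy, hard = [], []
    for d_ap, d_an in distance_pairs:
        if d_ap + alpha <= d_an:
            easy.append((d_ap, d_an))
        else:
            hard.append((d_ap, d_an))
    return easy, hard


# Two randomly chosen negatives tend to be far away (easy); the middle
# triplet has d(A, P) close to d(A, N), so it is the hard one.
pairs = [(0.5, 2.0), (0.5, 0.51), (0.3, 1.5)]
easy, hard = split_by_difficulty(pairs)
print(hard)  # [(0.5, 0.51)]
```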
178 00:10:52,550 --> 00:10:54,020 So in that case, 179 00:10:54,020 --> 00:10:57,230 the learning algorithm has to try extra hard to take 180 00:10:57,230 --> 00:11:00,740 this thing on the right and try to push it up or take this thing on 181 00:11:00,740 --> 00:11:04,030 the left and try to push it down so that there is 182 00:11:04,030 --> 00:11:08,430 at least a margin of alpha between the left side and the right side. 183 00:11:08,430 --> 00:11:11,900 And the effect of choosing these triplets is that it 184 00:11:11,900 --> 00:11:16,460 increases the computational efficiency of your learning algorithm. 185 00:11:16,460 --> 00:11:18,500 If you choose your triplets randomly, 186 00:11:18,500 --> 00:11:21,725 then too many triplets would be really easy, and so, 187 00:11:21,725 --> 00:11:25,970 gradient descent won't do anything because your neural network will just get them right, 188 00:11:25,970 --> 00:11:27,090 pretty much all the time. 189 00:11:27,090 --> 00:11:32,270 And it's only by using hard triplets that the gradient descent procedure 190 00:11:32,270 --> 00:11:38,700 has to do some work to try to push these quantities further away from those quantities. 191 00:11:38,700 --> 00:11:40,155 And if you're interested, 192 00:11:40,155 --> 00:11:46,295 the details are presented in this paper by Florian Schroff, Dmitry Kalinichenko, 193 00:11:46,295 --> 00:11:51,305 and James Philbin, where they have a system called FaceNet, 194 00:11:51,305 --> 00:11:55,860 which is where a lot of the ideas I'm presenting in this video come from. 195 00:11:55,860 --> 00:11:58,220 By the way, this is also a fun fact about 196 00:11:58,220 --> 00:12:02,030 how algorithms are often named in the deep learning world, 197 00:12:02,030 --> 00:12:05,810 which is if you work in a certain domain, and we call that blank, 198 00:12:05,810 --> 00:12:10,710 you often have a system called blank net or deep blank.
199 00:12:10,710 --> 00:12:13,095 So, we've been talking about face recognition. 200 00:12:13,095 --> 00:12:16,123 So this paper is called FaceNet, 201 00:12:16,123 --> 00:12:17,465 and in the last video, 202 00:12:17,465 --> 00:12:19,910 you just saw DeepFace. 203 00:12:19,910 --> 00:12:23,600 But this idea of a blank net or deep blank is 204 00:12:23,600 --> 00:12:28,370 a very popular way of naming algorithms in the deep learning world. 205 00:12:28,370 --> 00:12:32,780 And you should feel free to take a look at that paper if you want to learn some of 206 00:12:32,780 --> 00:12:34,940 these other details for speeding up your algorithm 207 00:12:34,940 --> 00:12:38,745 by choosing the most useful triplets to train on. 208 00:12:38,745 --> 00:12:40,025 It is a nice paper. 209 00:12:40,025 --> 00:12:41,240 So, just to wrap up, 210 00:12:41,240 --> 00:12:42,670 to train on triplet loss, 211 00:12:42,670 --> 00:12:47,060 you need to take your training set and map it to a lot of triples. 212 00:12:47,060 --> 00:12:50,550 So, here is a triple with an anchor and a positive, 213 00:12:50,550 --> 00:12:54,375 both of the same person, and the negative of a different person. 214 00:12:54,375 --> 00:12:58,445 Here's another one where the anchor and positive are of 215 00:12:58,445 --> 00:13:04,315 the same person but the anchor and negative are of different persons and so on.
216 00:13:04,315 --> 00:13:07,000 And what you do having defined this training set 217 00:13:07,000 --> 00:13:09,920 of anchor, positive, and negative triples is use 218 00:13:09,920 --> 00:13:16,740 gradient descent to try to minimize the cost function J we defined on an earlier slide, 219 00:13:16,740 --> 00:13:20,090 and that will have the effect of backpropagating to 220 00:13:20,090 --> 00:13:23,640 all of the parameters of the neural network in order to learn 221 00:13:23,640 --> 00:13:27,435 an encoding so that d of 222 00:13:27,435 --> 00:13:33,395 two images will be small when these two images are of the same person, 223 00:13:33,395 --> 00:13:40,286 and it'll be large when these are two images of different persons. 224 00:13:40,286 --> 00:13:43,805 So, that's it for the triplet loss and how you can train 225 00:13:43,805 --> 00:13:48,200 a neural network for learning an encoding for face recognition. 226 00:13:48,200 --> 00:13:49,715 Now, it turns out that 227 00:13:49,715 --> 00:13:54,556 commercial face recognition systems are trained on fairly large datasets at this point. 228 00:13:54,556 --> 00:13:56,630 Often north of a million images, 229 00:13:56,630 --> 00:14:00,275 sometimes not infrequently north of 10 million images. 230 00:14:00,275 --> 00:14:05,210 And there are some commercial companies talking about using over 100 million images. 231 00:14:05,210 --> 00:14:09,300 So these are very large datasets, even by modern standards. 232 00:14:09,300 --> 00:14:12,800 So that's it for the triplet loss and how you can use it to 233 00:14:12,800 --> 00:14:18,230 train a neural network to output a good encoding for face recognition. 234 00:14:18,230 --> 00:14:21,500 Now, it turns out that today's face recognition systems, 235 00:14:21,500 --> 00:14:24,830 especially the large-scale commercial face recognition systems, 236 00:14:24,830 --> 00:14:27,360 are trained on very large datasets.
237 00:14:27,360 --> 00:14:30,350 Datasets north of a million images are not uncommon, 238 00:14:30,350 --> 00:14:34,040 some companies are using north of 10 million images and some companies 239 00:14:34,040 --> 00:14:38,135 have north of 100 million images with which to try to train these systems. 240 00:14:38,135 --> 00:14:41,730 So these are very large datasets even by modern standards, 241 00:14:41,730 --> 00:14:45,230 and these dataset assets are not easy to acquire. 242 00:14:45,230 --> 00:14:48,140 Fortunately, some of these companies have trained 243 00:14:48,140 --> 00:14:51,875 these large networks and posted parameters online. 244 00:14:51,875 --> 00:14:54,790 So, rather than trying to train one of these networks from scratch, 245 00:14:54,790 --> 00:14:59,280 this is one domain where because of the sheer data volume, 246 00:14:59,280 --> 00:15:02,390 this is one domain where often it might be useful for you 247 00:15:02,390 --> 00:15:05,313 to download someone else's pre-trained model, 248 00:15:05,313 --> 00:15:07,685 rather than do everything from scratch yourself. 249 00:15:07,685 --> 00:15:10,130 But even if you do download someone else's pre-trained model, 250 00:15:10,130 --> 00:15:14,195 I think it's still useful to know how these algorithms were trained, 251 00:15:14,195 --> 00:15:19,225 in case you need to apply these ideas from scratch yourself for some application. 252 00:15:19,225 --> 00:15:21,405 So that's it for the triplet loss. 253 00:15:21,405 --> 00:15:22,640 In the next video, 254 00:15:22,640 --> 00:15:25,280 I want to show you also some other variations 255 00:15:25,280 --> 00:15:28,510 on Siamese networks and how to train these systems. 256 00:15:28,510 --> 00:15:30,000 Let's go on to the next video.