1 00:00:00,300 --> 00:00:04,300 The job of the function d, which you learned about in the last video, 2 00:00:04,300 --> 00:00:08,860 is to input two faces and tell you how similar or how different they are. 3 00:00:08,860 --> 00:00:12,605 A good way to do this is to use a Siamese network. 4 00:00:12,605 --> 00:00:13,105 Let's take a look. 5 00:00:15,117 --> 00:00:19,098 You're used to seeing pictures of confidence like these where you input 6 00:00:19,098 --> 00:00:20,635 an image, let's say x1. 7 00:00:20,635 --> 00:00:24,165 And through a sequence of convolutional and pulling and 8 00:00:24,165 --> 00:00:28,465 fully connected layers, end up with a feature vector like that. 9 00:00:30,260 --> 00:00:35,940 And sometimes this is fed to a softmax unit to make a classification. 10 00:00:35,940 --> 00:00:39,000 We're not going to use that in this video. 11 00:00:39,000 --> 00:00:44,090 Instead, we're going to focus on this vector of let's say 128 12 00:00:44,090 --> 00:00:49,900 numbers computed by some fully connected layer that is deeper in the network. 13 00:00:50,930 --> 00:00:54,850 And I'm going to give this list of 128 numbers a name. 14 00:00:54,850 --> 00:01:00,220 I'm going to call this f of x1, 15 00:01:00,220 --> 00:01:07,476 and you should think of f of x1 as an encoding of the input image x1. 16 00:01:07,476 --> 00:01:12,653 So it's taken the input image, here this picture of Kian, 17 00:01:12,653 --> 00:01:17,750 and is re-representing it as a vector of 128 numbers. 18 00:01:17,750 --> 00:01:22,840 The way you can build a face recognition system is then that if you want to compare 19 00:01:22,840 --> 00:01:27,890 two pictures, let's say this first picture with this second picture here. 20 00:01:27,890 --> 00:01:31,860 What you can do is feed this second picture to the same neural network with 21 00:01:31,860 --> 00:01:37,625 the same parameters and get a different vector of 128 numbers, 22 00:01:37,625 --> 00:01:41,740 which encodes this second picture. 23 00:01:41,740 --> 00:01:43,560 So I'm going to call this second picture. 24 00:01:44,570 --> 00:01:50,480 So I'm going to call this encoding of this second picture f of x2, and 25 00:01:50,480 --> 00:01:55,680 here I'm using x1 and x2 just to denote two input images. 26 00:01:55,680 --> 00:01:58,030 They don't necessarily have to be the first and 27 00:01:58,030 --> 00:02:00,010 second examples in your training sets. 28 00:02:00,010 --> 00:02:02,230 It can be any two pictures. 29 00:02:02,230 --> 00:02:07,320 Finally, if you believe that these encodings are a good representation 30 00:02:07,320 --> 00:02:11,430 of these two images, what you can do is then define the image 31 00:02:12,670 --> 00:02:16,463 d of distance between x1 and 32 00:02:16,463 --> 00:02:21,820 x2 as the norm of the difference 33 00:02:21,820 --> 00:02:25,750 between the encodings of these two images. 34 00:02:26,990 --> 00:02:29,434 So this idea of running two identical, 35 00:02:29,434 --> 00:02:34,544 convolutional neural networks on two different inputs and then comparing them, 36 00:02:34,544 --> 00:02:39,380 sometimes that's called a Siamese neural network architecture. 37 00:02:39,380 --> 00:02:43,530 And a lot of the ideas I'm presenting here came from 38 00:02:43,530 --> 00:02:48,490 this paper due to Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and 39 00:02:48,490 --> 00:02:53,690 Lior Wolf in the research system that they developed called DeepFace. 40 00:02:54,890 --> 00:03:00,741 And many of the ideas I'm presenting here came from a paper due to Yaniv Taigman, 41 00:03:00,741 --> 00:03:02,920 Ming Yang, Marc'Aurelio Ranzato, and 42 00:03:02,920 --> 00:03:07,155 Lior Wolf in a system that they developed called DeepFace. 43 00:03:08,520 --> 00:03:11,830 So how do you train this Siamese neural network? 44 00:03:11,830 --> 00:03:15,380 Remember that these two neural networks have the same parameters. 45 00:03:16,730 --> 00:03:19,600 So what you want to do is really train the neural network so 46 00:03:19,600 --> 00:03:24,250 that the encoding that it computes results in a function d 47 00:03:24,250 --> 00:03:27,420 that tells you when two pictures are of the same person. 48 00:03:27,420 --> 00:03:33,350 So more formally, the parameters of the neural network define an encoding f of xi. 49 00:03:33,350 --> 00:03:35,548 So given any input image xi, 50 00:03:35,548 --> 00:03:40,968 the neural network outputs this 128 dimensional encoding f of xi. 51 00:03:40,968 --> 00:03:45,655 So more formally, what you want to do is learn parameters so 52 00:03:45,655 --> 00:03:50,152 that if two pictures, xi and xj, are of the same person, 53 00:03:50,152 --> 00:03:55,347 then you want that distance between their encodings to be small. 54 00:03:55,347 --> 00:03:59,583 And in the previous slide, l was using x1 and x2, but 55 00:03:59,583 --> 00:04:03,841 it's really any pair xi and xj from your training set. 56 00:04:03,841 --> 00:04:07,959 And in contrast, if xi and xj are of different persons, 57 00:04:07,959 --> 00:04:13,340 then you want that distance between their encodings to be large. 58 00:04:13,340 --> 00:04:18,160 So as you vary the parameters in all of these layers of the neural network, 59 00:04:18,160 --> 00:04:20,665 you end up with different encodings. 60 00:04:20,665 --> 00:04:23,639 And what you can do is use back propagation and 61 00:04:23,639 --> 00:04:29,590 vary all those parameters in order to make sure these conditions are satisfied. 62 00:04:29,590 --> 00:04:33,460 So you've learned about the Siamese network architecture and 63 00:04:33,460 --> 00:04:36,890 have a sense of what you want the neural network to output for 64 00:04:36,890 --> 00:04:39,830 you in terms of what would make a good encoding. 65 00:04:39,830 --> 00:04:42,790 But how do you actually define an objective function 66 00:04:42,790 --> 00:04:46,700 to make a neural network learn to do what we just discussed here? 67 00:04:46,700 --> 00:04:50,940 Let's see how you can do that in the next video using the triplet loss function.