1 00:00:00,000 --> 00:00:01,440 In the previous video, 2 00:00:01,440 --> 00:00:06,410 you saw how you can get a neural network to output four numbers of bx, 3 00:00:06,410 --> 00:00:08,850 by, bh, and bw to specify 4 00:00:08,850 --> 00:00:12,840 the bounding box of an object you want a neural network to localize. 5 00:00:12,840 --> 00:00:14,753 In more general cases, 6 00:00:14,753 --> 00:00:17,400 you can have a neural network just output X and 7 00:00:17,400 --> 00:00:20,700 Y coordinates of important points and image, 8 00:00:20,700 --> 00:00:24,960 sometimes called landmarks, that you want the neural networks to recognize. 9 00:00:24,960 --> 00:00:26,451 Let me show you a few examples. 10 00:00:26,451 --> 00:00:31,095 Let's say you're building a face recognition application and for some reason, 11 00:00:31,095 --> 00:00:36,835 you want the algorithm to tell you where is the corner of someone's eye. 12 00:00:36,835 --> 00:00:40,530 So that point has an X and Y coordinate, 13 00:00:40,530 --> 00:00:43,310 so you can just have a neural network have 14 00:00:43,310 --> 00:00:48,235 its final layer and have it just output two more numbers which 15 00:00:48,235 --> 00:00:51,804 I'm going to call our lx and ly 16 00:00:51,804 --> 00:00:56,730 to just tell you the coordinates of that corner of the person's eye. 17 00:00:56,730 --> 00:01:01,680 Now, what if you want it to tell you all four corners of the eye, 18 00:01:01,680 --> 00:01:03,595 really of both eyes. 19 00:01:03,595 --> 00:01:06,360 So, if we call the points, the first, second, 20 00:01:06,360 --> 00:01:08,640 third and fourth points going from left to right, 21 00:01:08,640 --> 00:01:13,585 then you could modify the neural network now to output l1x, 22 00:01:13,585 --> 00:01:17,975 l1y for the first point and l2x, 23 00:01:17,975 --> 00:01:22,140 l2y for the second point and so on, 24 00:01:22,140 --> 00:01:25,015 so that the neural network can output 25 00:01:25,015 --> 00:01:29,590 the estimated position of all those four points of the person's face. 26 00:01:29,590 --> 00:01:31,500 But what if you don't want just those four points? 27 00:01:31,500 --> 00:01:33,195 What do you want to output this point, 28 00:01:33,195 --> 00:01:36,885 and this point and this point and this point along the eye? 29 00:01:36,885 --> 00:01:39,945 Maybe I'll put some key points along the mouth, 30 00:01:39,945 --> 00:01:44,205 so you can extract the mouth shape and tell if the person is smiling or frowning, 31 00:01:44,205 --> 00:01:47,250 maybe extract a few key points along the edges 32 00:01:47,250 --> 00:01:50,175 of the nose but you could define some number, 33 00:01:50,175 --> 00:01:57,080 for the sake of argument, let's say 64 points or 64 landmarks on the face. 34 00:01:57,080 --> 00:02:02,020 Maybe even some points that help you define the edge of the face, 35 00:02:02,020 --> 00:02:06,480 defines the jaw line but by selecting a number of landmarks 36 00:02:06,480 --> 00:02:11,415 and generating a label training sets that contains all of these landmarks, 37 00:02:11,415 --> 00:02:14,240 you can then have the neural network to tell you where are 38 00:02:14,240 --> 00:02:19,045 all the key positions or the key landmarks on a face. 39 00:02:19,045 --> 00:02:21,305 So what you do is you have this image, 40 00:02:21,305 --> 00:02:23,755 a person's face as input, 41 00:02:23,755 --> 00:02:28,564 have it go through a convnet and have a convnet, 42 00:02:28,564 --> 00:02:32,160 then have some set of features, 43 00:02:32,160 --> 00:02:34,285 maybe have it output 0 or 1, 44 00:02:34,285 --> 00:02:40,605 like zero face changes or not and then have it also output l1x, 45 00:02:40,605 --> 00:02:48,675 l1y and so on down to l64x, l64y. 46 00:02:48,675 --> 00:02:52,065 And here I'm using l to stand for a landmark. 47 00:02:52,065 --> 00:02:59,905 So this example would have 129 output units, 48 00:02:59,905 --> 00:03:02,005 one for is your face or not? 49 00:03:02,005 --> 00:03:04,020 And then if you have 64 landmarks, 50 00:03:04,020 --> 00:03:05,520 that's sixty-four times two, 51 00:03:05,520 --> 00:03:09,600 so 128 plus one output units and 52 00:03:09,600 --> 00:03:14,450 this can tell you if there's a face as well as where all the key landmarks on the face. 53 00:03:14,450 --> 00:03:19,225 So, this is a basic building block for recognizing 54 00:03:19,225 --> 00:03:25,020 emotions from faces and if you played with the Snapchat and the other entertainment, 55 00:03:25,020 --> 00:03:28,635 also AR augmented reality filters like 56 00:03:28,635 --> 00:03:33,635 the Snapchat photos can draw a crown on the face and have other special effects. 57 00:03:33,635 --> 00:03:36,565 Being able to detect these landmarks on the face, 58 00:03:36,565 --> 00:03:41,949 there's also a key building block for the computer graphics effects that warp 59 00:03:41,949 --> 00:03:48,125 the face or drawing various special effects like putting a crown or a hat on the person. 60 00:03:48,125 --> 00:03:50,600 Of course, in order to treat a network like this, 61 00:03:50,600 --> 00:03:52,995 you will need a label training set. 62 00:03:52,995 --> 00:03:57,455 We have a set of images as well as labels Y where people, 63 00:03:57,455 --> 00:04:00,350 where someone will have had to go through and 64 00:04:00,350 --> 00:04:04,925 laboriously annotate all of these landmarks. 65 00:04:04,925 --> 00:04:11,870 One last example, if you are interested in people pose detection, 66 00:04:11,870 --> 00:04:17,110 you could also define a few key positions like the midpoint of the chest, 67 00:04:17,110 --> 00:04:21,807 the left shoulder, left elbow, the wrist, 68 00:04:21,807 --> 00:04:24,936 and so on, and just have a neural network to annotate 69 00:04:24,936 --> 00:04:34,195 key positions in the person's pose as well and by having a neural network output, 70 00:04:34,195 --> 00:04:36,743 all of those points I'm annotating, 71 00:04:36,743 --> 00:04:42,685 you could also have the neural network output the pose of the person. 72 00:04:42,685 --> 00:04:47,325 And of course, to do that you also need to specify on 73 00:04:47,325 --> 00:04:50,775 these key landmarks like maybe l1x and l1y 74 00:04:50,775 --> 00:04:55,395 is the midpoint of the chest down to maybe l32x, 75 00:04:55,395 --> 00:05:01,840 l32y, if you use 32 coordinates to specify the pose of the person. 76 00:05:01,840 --> 00:05:06,690 So, this idea might seem quite simple of just adding a bunch of output units 77 00:05:06,690 --> 00:05:12,200 to output the X,Y coordinates of different landmarks you want to recognize. 78 00:05:12,200 --> 00:05:16,290 To be clear, the identity of landmark one must be 79 00:05:16,290 --> 00:05:18,525 consistent across different images like maybe 80 00:05:18,525 --> 00:05:21,368 landmark one is always this corner of the eye, 81 00:05:21,368 --> 00:05:23,535 landmark two is always this corner of the eye, 82 00:05:23,535 --> 00:05:25,650 landmark three, landmark four, and so on. 83 00:05:25,650 --> 00:05:29,475 So, the labels have to be consistent across different images. 84 00:05:29,475 --> 00:05:34,350 But if you can hire labelers or label yourself a big enough data set to do this, 85 00:05:34,350 --> 00:05:38,368 then a neural network can output all of these landmarks which is going to 86 00:05:38,368 --> 00:05:43,440 used to carry out other interesting effect such as with the pose of the person, 87 00:05:43,440 --> 00:05:47,935 maybe try to recognize someone's emotion from a picture, and so on. 88 00:05:47,935 --> 00:05:50,460 So that's it for landmark detection. 89 00:05:50,460 --> 00:05:53,280 Next, let's take these building blocks and use 90 00:05:53,280 --> 00:05:56,280 it to start building up towards object detection.