1 00:00:00,000 --> 00:00:04,291 [MUSIC] 2 00:00:04,291 --> 00:00:07,792 The first place where neural networks made a tremendous amount of difference 3 00:00:07,792 --> 00:00:11,140 is in an area called computer vision, so analyzing images and videos. 4 00:00:11,140 --> 00:00:15,200 So let's see a few examples of how deep learning, or 5 00:00:15,200 --> 00:00:18,160 these big neural networks, can be applied to computer vision. 6 00:00:18,160 --> 00:00:22,020 So to do that, it's good to understand what image features are. 7 00:00:23,710 --> 00:00:27,580 So in computer vision, image features are kind of like local detectors that get 8 00:00:27,580 --> 00:00:29,550 combined to make a prediction. 9 00:00:29,550 --> 00:00:31,520 So let's say we take this particular image. 10 00:00:31,520 --> 00:00:35,680 Suppose that I want to predict whether this is a face image or not a face image. 11 00:00:35,680 --> 00:00:41,490 I run some detectors, let's say a nose detector, an eye detector, 12 00:00:41,490 --> 00:00:45,070 another eye detector, a mouth detector, and 13 00:00:45,070 --> 00:00:48,448 if all of these fire, then, 14 00:00:48,448 --> 00:00:53,090 using a little neural network, you can say this is a face, and that's our prediction. 15 00:00:55,180 --> 00:00:59,200 Now, this is a simple example of how you can build a classifier for images, but 16 00:00:59,200 --> 00:01:02,920 in reality they don't explicitly have a nose detector or eye detector. 17 00:01:02,920 --> 00:01:06,440 What happens is these are called image features or 18 00:01:06,440 --> 00:01:09,910 interest points, and there are various names for this. 19 00:01:09,910 --> 00:01:13,280 But they really try to find local image segments, 20 00:01:13,280 --> 00:01:15,616 patches, that are really distinctive. 
21 00:01:15,616 --> 00:01:19,720 So then maybe they'll find the corner around the eye, 22 00:01:19,720 --> 00:01:24,090 maybe the corner around the nose, so if you have lots of these corner detectors, 23 00:01:24,090 --> 00:01:27,030 a face is composed of corners. 24 00:01:27,030 --> 00:01:32,587 Corner detectors fire at places around the eye, the mouth, and both eyes. 25 00:01:32,587 --> 00:01:35,124 And if enough of these fire in a particular pattern, 26 00:01:35,124 --> 00:01:36,842 you discover that you have a face. 27 00:01:36,842 --> 00:01:39,170 So this is how computer vision typically works. 28 00:01:39,170 --> 00:01:40,260 So this is how classification works. 29 00:01:40,260 --> 00:01:43,170 Of course, there are more general models and more complex ones, but 30 00:01:43,170 --> 00:01:45,240 this is kind of the basic idea. 31 00:01:45,240 --> 00:01:51,140 For years, these types of detectors of local features were built by hand. 32 00:01:51,140 --> 00:01:54,520 A very popular one was called SIFT features. 33 00:01:54,520 --> 00:01:57,844 And these really transformed computer vision because they were really quite 34 00:01:57,844 --> 00:01:59,590 applicable and quite cool. 35 00:01:59,590 --> 00:02:03,820 And then, there are many others that improve the accuracy. 36 00:02:03,820 --> 00:02:06,240 So other kinds of features that can be used. 37 00:02:07,370 --> 00:02:12,460 We talked about these hand-created image features like SIFT features, and 38 00:02:12,460 --> 00:02:16,480 so, let's talk about how they can be typically used for classification. 39 00:02:16,480 --> 00:02:19,749 What we do is, we run the SIFT detectors over the image and 40 00:02:19,749 --> 00:02:22,290 they fire in various places. 41 00:02:22,290 --> 00:02:25,550 So for example the corners of the eyes and the mouth. 
42 00:02:25,550 --> 00:02:31,130 And then what we do is, we create a vector that describes the image based 43 00:02:31,130 --> 00:02:35,570 on the firings, the locations where those SIFT features fired. 44 00:02:35,570 --> 00:02:41,554 So, you might have some firings in some locations, no firings in other locations, 45 00:02:41,554 --> 00:02:45,787 and this can be viewed similarly to the words in a document. 46 00:02:45,787 --> 00:02:48,640 So, does the word Messi appear? 47 00:02:48,640 --> 00:02:50,480 Does the word football appear? 48 00:02:50,480 --> 00:02:54,510 Similarly, does a corner appear in a particular place in the image? 49 00:02:54,510 --> 00:02:59,080 Now once we have that description of the image, we feed it to a classifier. 50 00:02:59,080 --> 00:03:00,240 So for example, 51 00:03:00,240 --> 00:03:03,240 a simple linear classifier like we talked about earlier in the quarter. 52 00:03:04,860 --> 00:03:07,320 It's not a quarter, we're teaching this online. 53 00:03:07,320 --> 00:03:08,660 It was earlier in the module. 54 00:03:08,660 --> 00:03:12,690 [LAUGH] So as we talked about earlier in 55 00:03:12,690 --> 00:03:17,750 the module, you can feed it to a simple linear classifier and some names for those 56 00:03:17,750 --> 00:03:21,420 are things like logistic regression, support vector machines and more. 57 00:03:21,420 --> 00:03:25,830 And from there, we get a detection as to whether this image is a face or not. 58 00:03:27,790 --> 00:03:29,420 Now that sounds pretty exciting and 59 00:03:29,420 --> 00:03:33,180 it had a really significant impact in computer vision. 60 00:03:33,180 --> 00:03:40,180 The challenge though, is that creating these hand-built image features was 61 00:03:40,180 --> 00:03:45,500 a really complicated process and required several PhD theses to be done well. 62 00:03:48,240 --> 00:03:52,480 Neural networks are going to discover and learn those features automatically. 
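The hand-built pipeline described above (detector firings turned into a vector, then fed to a simple linear classifier) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the lecture's actual code: the 4x4 grid, the helper names firings_to_vector, train_logistic_regression and predict_face, and the toy firing locations are all made up for the example, and the firing locations are assumed to come from some hand-built detector such as SIFT that is not implemented here.

```python
import numpy as np

GRID = 4  # assumption: describe each image by a 4x4 grid of cells


def firings_to_vector(firing_locations, image_size, grid=GRID):
    """Binary vector: entry k is 1 if any detector fired inside grid cell k.

    Analogous to "does this word appear in the document?" for text.
    """
    height, width = image_size
    vec = np.zeros(grid * grid)
    for row, col in firing_locations:
        cell_r = min(int(row * grid / height), grid - 1)
        cell_c = min(int(col * grid / width), grid - 1)
        vec[cell_r * grid + cell_c] = 1.0
    return vec


def train_logistic_regression(X, y, lr=0.5, steps=500):
    """Plain gradient descent on the logistic loss (a simple linear classifier)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(face)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b


def predict_face(vec, w, b):
    """True if the classifier scores the image as a face."""
    return 1.0 / (1.0 + np.exp(-(vec @ w + b))) > 0.5


# Made-up (row, col) firing locations in 100x100 images.
# "Faces" have firings near both eyes and the mouth; "non-faces" do not.
size = (100, 100)
faces = [[(30, 30), (30, 70), (75, 50)], [(32, 28), (31, 72), (78, 48)]]
non_faces = [[(10, 10), (90, 90)], [(5, 95), (60, 5)]]

X = np.array([firings_to_vector(f, size) for f in faces + non_faces])
y = np.array([1.0, 1.0, 0.0, 0.0])

w, b = train_logistic_regression(X, y)
print([bool(predict_face(x, w, b)) for x in X])
```

The design choice mirrors the bag-of-words analogy in the lecture: the vector records only whether something fired in each region, and all the decision-making is left to the linear classifier.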
63 00:03:52,480 --> 00:03:54,120 Let me give you an example of that. 64 00:03:54,120 --> 00:03:56,210 Suppose we take this input image and 65 00:03:56,210 --> 00:04:00,320 we run it through a three-layer neural network before making a prediction. 66 00:04:00,320 --> 00:04:04,952 Typically what happens is that you learn local feature detectors, 67 00:04:04,952 --> 00:04:08,380 they're like SIFT, but at different levels and different layers. 68 00:04:08,380 --> 00:04:13,230 And these detectors that you learn, they detect different things, 69 00:04:13,230 --> 00:04:16,110 different properties of the image at different levels. 70 00:04:16,110 --> 00:04:20,590 So, in the first layer, you might learn detectors that look kinda 71 00:04:20,590 --> 00:04:25,710 like these little patches, which really react to things like diagonal edges. 72 00:04:25,710 --> 00:04:29,830 So this first detector here is all about capturing diagonal edges. 73 00:04:29,830 --> 00:04:33,760 The center one is about capturing diagonal edges in the other direction. 74 00:04:33,760 --> 00:04:39,720 And the last one here is about capturing transitions in color from dark to green. 75 00:04:42,005 --> 00:04:46,415 Now, if we look at the next layer, you're combining these edge, 76 00:04:46,415 --> 00:04:50,715 diagonal edge [INAUDIBLE] into some kind of more complex detectors. 77 00:04:50,715 --> 00:04:55,620 So, for example, we discover these wiggly line and 78 00:04:55,620 --> 00:04:58,675 pattern detectors in that layer. 79 00:04:58,675 --> 00:05:02,710 You also discover these kinds of detectors that react to corners, 80 00:05:02,710 --> 00:05:05,531 that [INAUDIBLE] detect corners in the images. 81 00:05:05,531 --> 00:05:10,508 And at the final layers you come up with detectors that are even more complicated. 82 00:05:10,508 --> 00:05:15,030 So for a variety of images you might end up with 83 00:05:15,030 --> 00:05:18,960 things that react to torsos and faces. 
84 00:05:18,960 --> 00:05:24,831 Or maybe if you have a bigger data set, even to these images over here, 85 00:05:24,831 --> 00:05:28,448 where they fire on images of corals. 86 00:05:28,448 --> 00:05:33,146 So neural networks capture different types of image features at 87 00:05:33,146 --> 00:05:37,591 different layers, and they get learned automatically. 88 00:05:37,591 --> 00:05:41,819 [MUSIC]
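To make the first-layer "diagonal edge" detectors a bit more concrete, here is a minimal sketch of what such a detector computes. It rests on assumptions that are not from the lecture: the two 3x3 filters are set by hand purely for illustration (in a real network their values would be learned from data), the helper correlate2d_valid is a hypothetical name, and the 8x8 test image is synthetic.

```python
import numpy as np


def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Hand-set stand-ins for learned first-layer filters (illustrative values only).
diag_down = np.array([[2.0, -1.0, -1.0],
                      [-1.0, 2.0, -1.0],
                      [-1.0, -1.0, 2.0]])  # prefers lines running top-left to bottom-right
diag_up = np.fliplr(diag_down)             # prefers the other diagonal direction

# Synthetic image: a single bright line from top-left to bottom-right.
image = np.eye(8)

# ReLU keeps only positive responses, i.e. places where the detector "fires".
resp_down = np.maximum(correlate2d_valid(image, diag_down), 0.0)
resp_up = np.maximum(correlate2d_valid(image, diag_up), 0.0)

print("matching diagonal, strongest firing:", resp_down.max())  # 6.0
print("other diagonal, strongest firing:", resp_up.max())       # 2.0
```

The detector whose pattern matches the line fires three times more strongly than the one tuned to the opposite diagonal. Deeper layers would take maps like resp_down and resp_up as their input and combine them into detectors for corners and more complex patterns.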