1 00:00:00,000 --> 00:00:05,695 You've learned about Object Localization as well as Landmark Detection. 2 00:00:05,695 --> 00:00:09,470 Now, let's build up to an object detection algorithm. 3 00:00:09,470 --> 00:00:13,005 In this video, you'll learn how to use a ConvNet to perform 4 00:00:13,005 --> 00:00:18,150 object detection using something called the Sliding Windows Detection Algorithm. 5 00:00:18,150 --> 00:00:21,154 Let's say you want to build a car detection algorithm. 6 00:00:21,154 --> 00:00:22,315 Here's what you can do. 7 00:00:22,315 --> 00:00:24,734 You can first create a labeled training set, 8 00:00:24,734 --> 00:00:29,100 so x and y with closely cropped examples of cars. 9 00:00:29,100 --> 00:00:32,970 So, this image x is a positive example, there's a car, 10 00:00:32,970 --> 00:00:35,140 here's a car, here's a car, 11 00:00:35,140 --> 00:00:37,755 and then there's not a car, there's not a car. 12 00:00:37,755 --> 00:00:39,840 And for our purposes in this training set, 13 00:00:39,840 --> 00:00:43,733 you can start off with closely cropped images of the car. 14 00:00:43,733 --> 00:00:47,365 Meaning that x is pretty much only the car. 15 00:00:47,365 --> 00:00:49,650 So, you can take a picture and crop out and just 16 00:00:49,650 --> 00:00:52,340 cut out anything else that's not part of a car. 17 00:00:52,340 --> 00:00:57,450 So you end up with the car centered and taking up pretty much the entire image. 18 00:00:57,450 --> 00:01:01,090 Given this labeled training set, 19 00:01:01,090 --> 00:01:05,412 you can then train a ConvNet that inputs an image, 20 00:01:05,412 --> 00:01:07,977 like one of these closely cropped images. 21 00:01:07,977 --> 00:01:12,135 And then the job of the ConvNet is to output y, 22 00:01:12,135 --> 00:01:15,090 zero or one, is there a car or not. 23 00:01:15,090 --> 00:01:17,044 Once you've trained up this ConvNet, 24 00:01:17,044 --> 00:01:20,515 you can then use it in Sliding Windows Detection. 
25 00:01:20,515 --> 00:01:21,870 So the way you do that is, 26 00:01:21,870 --> 00:01:25,560 if you have a test image like this, what you do is you 27 00:01:25,560 --> 00:01:29,625 start by picking a certain window size, shown down there. 28 00:01:29,625 --> 00:01:35,070 And then you would input into this ConvNet a small rectangular region. 29 00:01:35,070 --> 00:01:38,670 So, take just this little red square, 30 00:01:38,670 --> 00:01:41,235 input that into the ConvNet, 31 00:01:41,235 --> 00:01:43,020 and have the ConvNet make a prediction. 32 00:01:43,020 --> 00:01:47,215 And presumably for that little region in the red square, 33 00:01:47,215 --> 00:01:50,640 it'll say, no, that little red square does not contain a car. 34 00:01:50,640 --> 00:01:52,310 In the Sliding Windows Detection Algorithm, 35 00:01:52,310 --> 00:01:56,900 what you do is you then pass as input 36 00:01:56,900 --> 00:02:00,000 a second image, now bounded by 37 00:02:00,000 --> 00:02:03,970 this red square shifted a little bit over, and feed that to the ConvNet. 38 00:02:03,970 --> 00:02:06,715 So, you're feeding just the region of the image 39 00:02:06,715 --> 00:02:10,665 in the red square to the ConvNet and running the ConvNet again. 40 00:02:10,665 --> 00:02:16,275 And then you do that with a third image and so on. 41 00:02:16,275 --> 00:02:23,415 And you keep going until you've slid the window across every position in the image. 42 00:02:23,415 --> 00:02:28,975 And I'm using a pretty large stride in this example just to make the animation go faster. 43 00:02:28,975 --> 00:02:34,700 But the idea is you basically go through every region of this size, 44 00:02:34,700 --> 00:02:38,460 and pass lots of little cropped images into 45 00:02:38,460 --> 00:02:45,125 the ConvNet and have it classify zero or one for each position at some stride. 
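The single-scale pass just described can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the lecture's own code: the image is a small grayscale grid held in nested lists, and `toy_classifier` is a hypothetical stand-in for the trained ConvNet (it just checks whether a patch is mostly bright). The helper names `crop`, `sliding_windows`, and `detect` are likewise made up for this sketch.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel values


def crop(image: Image, top: int, left: int, size: int) -> Image:
    """Return the size x size square whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def sliding_windows(height: int, width: int, size: int,
                    stride: int) -> List[Tuple[int, int]]:
    """List the (top, left) corner of every window that fits inside the image."""
    return [
        (top, left)
        for top in range(0, height - size + 1, stride)
        for left in range(0, width - size + 1, stride)
    ]


def detect(image: Image, classify: Callable[[Image], int],
           size: int, stride: int) -> List[Tuple[int, int, int]]:
    """Classify each window independently and keep the ones labeled 1."""
    height, width = len(image), len(image[0])
    hits = []
    for top, left in sliding_windows(height, width, size, stride):
        if classify(crop(image, top, left, size)) == 1:
            hits.append((top, left, size))
    return hits


def toy_classifier(patch: Image) -> int:
    """Hypothetical stand-in for the ConvNet: 1 if the patch is mostly bright."""
    total = sum(sum(row) for row in patch)
    return 1 if total / (len(patch) * len(patch[0])) > 0.5 else 0


# 6x6 toy image with a bright 2x2 "car" at rows 2-3, columns 2-3.
image = [[0.0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        image[r][c] = 1.0

detections = detect(image, toy_classifier, size=2, stride=2)
print(detections)  # [(2, 2, 2)]
```

Each window is cropped and classified on its own, which is exactly what makes the method easy to write down and, with a real ConvNet in place of the toy classifier, expensive to run.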
46 00:02:45,125 --> 00:02:47,085 Now, having done this once, 47 00:02:47,085 --> 00:02:54,230 running what is called the sliding window through the image, 48 00:02:54,230 --> 00:02:55,295 you then repeat it, 49 00:02:55,295 --> 00:02:57,710 but now use a larger window. 50 00:02:57,710 --> 00:03:02,191 So, now you take a slightly larger region and run that region. 51 00:03:02,191 --> 00:03:06,440 So, resize this region into whatever input size the ConvNet is expecting, 52 00:03:06,440 --> 00:03:10,235 and feed that to the ConvNet and have it output zero or one. 53 00:03:10,235 --> 00:03:15,305 And then slide the window over again using some stride and so on. 54 00:03:15,305 --> 00:03:20,500 And you run that throughout your entire image until you get to the end. 55 00:03:20,500 --> 00:03:26,283 And then you might do this a third time using even larger windows and so on. 56 00:03:26,283 --> 00:03:29,738 Right. And the hope is that if you do this, 57 00:03:29,738 --> 00:03:36,080 then so long as there's a car somewhere in the image, there will be a window where, 58 00:03:36,080 --> 00:03:40,200 for example, if you pass this window into the ConvNet, 59 00:03:40,200 --> 00:03:44,890 hopefully the ConvNet will output one for that input region. 60 00:03:44,890 --> 00:03:47,825 So then you detect that there is a car there. 61 00:03:47,825 --> 00:03:52,895 So this algorithm is called Sliding Windows Detection because you take these windows, 62 00:03:52,895 --> 00:03:58,745 these square boxes, and slide them across the entire image 63 00:03:58,745 --> 00:04:05,770 and classify every square region with some stride as containing a car or not. 64 00:04:05,770 --> 00:04:10,055 Now, there's a huge disadvantage of Sliding Windows Detection, 65 00:04:10,055 --> 00:04:12,704 which is the computational cost. 
66 00:04:12,704 --> 00:04:16,460 Because you're cropping out so many different square regions in 67 00:04:16,460 --> 00:04:21,370 the image and running each of them independently through a ConvNet. 68 00:04:21,370 --> 00:04:24,505 And if you use a very coarse stride, 69 00:04:24,505 --> 00:04:26,745 a very big stride, a very big step size, 70 00:04:26,745 --> 00:04:31,598 then that will reduce the number of windows you need to pass through the ConvNet, 71 00:04:31,598 --> 00:04:35,810 but that coarser granularity may hurt performance. 72 00:04:35,810 --> 00:04:39,630 Whereas if you use a very fine granularity or a very small stride, 73 00:04:39,630 --> 00:04:44,005 then the huge number of all these little regions you're 74 00:04:44,005 --> 00:04:48,995 passing through the ConvNet means that there is a very high computational cost. 75 00:04:48,995 --> 00:04:54,180 So, before the rise of Neural Networks, people used to use much simpler classifiers like 76 00:04:54,180 --> 00:04:56,910 a simple linear classifier over 77 00:04:56,910 --> 00:05:00,450 hand-engineered features in order to perform object detection. 78 00:05:00,450 --> 00:05:04,870 And in that era, because each classifier was relatively cheap to compute, 79 00:05:04,870 --> 00:05:06,480 it was just a linear function, 80 00:05:06,480 --> 00:05:08,980 Sliding Windows Detection ran okay. 81 00:05:08,980 --> 00:05:10,395 It was not a bad method, 82 00:05:10,395 --> 00:05:15,450 but with ConvNets, running a single classification task is now much 83 00:05:15,450 --> 00:05:21,125 more expensive, and sliding windows this way is infeasibly slow. 84 00:05:21,125 --> 00:05:26,305 And unless you use a very fine granularity or a very small stride, 85 00:05:26,305 --> 00:05:32,850 you end up not able to localize the objects that accurately within the image as well. 86 00:05:32,850 --> 00:05:38,575 Fortunately, however, this problem of computational cost has a pretty good solution. 
87 00:05:38,575 --> 00:05:41,845 In particular, the Sliding Windows Object Detector 88 00:05:41,845 --> 00:05:45,935 can be implemented convolutionally, which is much more efficient. 89 00:05:45,935 --> 00:05:48,310 Let's see in the next video how you can do that.
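The stride versus cost trade-off from this video can be made concrete by counting how many crops the classifier would have to process. The sketch below is a back-of-the-envelope illustration only: the 600 x 600 image size, the three window sizes, and the two strides are assumed values chosen for the arithmetic, not numbers from the lecture, and `count_windows` and `total_windows` are hypothetical helper names.

```python
from typing import Iterable


def count_windows(height: int, width: int, size: int, stride: int) -> int:
    """Number of size x size windows that fit in the image at the given stride."""
    if size > height or size > width:
        return 0
    rows = (height - size) // stride + 1
    cols = (width - size) // stride + 1
    return rows * cols


def total_windows(height: int, width: int,
                  sizes: Iterable[int], stride: int) -> int:
    """Total classifier evaluations over several window sizes (one pass per size)."""
    return sum(count_windows(height, width, s, stride) for s in sizes)


# Assumed example values: a 600x600 image scanned at three window sizes.
sizes = [100, 200, 300]
coarse = total_windows(600, 600, sizes, stride=50)
fine = total_windows(600, 600, sizes, stride=5)

print(coarse)  # 251
print(fine)    # 20483
```

Under these assumed numbers, shrinking the stride from 50 to 5 pixels raises the count from 251 to 20,483 independent forward passes, which is the cost the convolutional implementation in the next video is meant to avoid.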