1
00:00:00,000 --> 00:00:05,695
You've learned about Object Localization as well as Landmark Detection.

2
00:00:05,695 --> 00:00:09,470
Now, let's build up to other object detection algorithm.

3
00:00:09,470 --> 00:00:13,005
In this video, you'll learn how to use a cofinite to perform

4
00:00:13,005 --> 00:00:18,150
object detection using something called the Sliding Windows Detection Algorithm.

5
00:00:18,150 --> 00:00:21,154
Let's say you want to build a car detection algorithm.

6
00:00:21,154 --> 00:00:22,315
Here's what you can do.

7
00:00:22,315 --> 00:00:24,734
You can first create a label training set,

8
00:00:24,734 --> 00:00:29,100
so x and y with closely cropped examples of cars.

9
00:00:29,100 --> 00:00:32,970
So, this is image x has a positive example, there's a car,

10
00:00:32,970 --> 00:00:35,140
here's a car, here's a car,

11
00:00:35,140 --> 00:00:37,755
and then there's not a car, there's not a car.

12
00:00:37,755 --> 00:00:39,840
And for our purposes in this training set,

13
00:00:39,840 --> 00:00:43,733
you can start off with the one with the car closely cropped images.

14
00:00:43,733 --> 00:00:47,365
Meaning that x is pretty much only the car.

15
00:00:47,365 --> 00:00:49,650
So, you can take a picture and crop out and just

16
00:00:49,650 --> 00:00:52,340
cut out anything else that's not part of a car.

17
00:00:52,340 --> 00:00:57,450
So you end up with the car centered in pretty much the entire image.

18
00:00:57,450 --> 00:01:01,090
Given this label training set,

19
00:01:01,090 --> 00:01:05,412
you can then train a cofinite that inputs an image,

20
00:01:05,412 --> 00:01:07,977
like one of these closely cropped images.

21
00:01:07,977 --> 00:01:12,135
And then the job of the cofinite is to output y,

22
00:01:12,135 --> 00:01:15,090
zero or one, is there a car or not.

23
00:01:15,090 --> 00:01:17,044
Once you've trained up this cofinite,

24
00:01:17,044 --> 00:01:20,515
you can then use it in Sliding Windows Detection.

25
00:01:20,515 --> 00:01:21,870
So the way you do that is,

26
00:01:21,870 --> 00:01:25,560
if you have a test image like this what you do is you

27
00:01:25,560 --> 00:01:29,625
start by picking a certain window size, shown down there.

28
00:01:29,625 --> 00:01:35,070
And then you would input into this cofinite a small rectangular region.

29
00:01:35,070 --> 00:01:38,670
So, take just this below red square,

30
00:01:38,670 --> 00:01:41,235
input that into the cofinite,

31
00:01:41,235 --> 00:01:43,020
and have a cofinite make a prediction.

32
00:01:43,020 --> 00:01:47,215
And presumably for that little region in the red square,

33
00:01:47,215 --> 00:01:50,640
it'll say, no that little red square does not contain a car.

34
00:01:50,640 --> 00:01:52,310
In the Sliding Windows Detection Algorithm,

35
00:01:52,310 --> 00:01:56,900
what you do is you then pass as input

36
00:01:56,900 --> 00:02:00,000
a second image now bounded by

37
00:02:00,000 --> 00:02:03,970
this red square shifted a little bit over and feed that to the cofinite.

38
00:02:03,970 --> 00:02:06,715
So, you're feeding just the region of the image

39
00:02:06,715 --> 00:02:10,665
in the red squares of the cofinite and run the cofinite again.

40
00:02:10,665 --> 00:02:16,275
And then you do that with a third image and so on.

41
00:02:16,275 --> 00:02:23,415
And you keep going until you've slid the window across every position in the image.

42
00:02:23,415 --> 00:02:28,975
And I'm using a pretty large stride in this example just to make the animation go faster.

43
00:02:28,975 --> 00:02:34,700
But the idea is you basically go through every region of this size,

44
00:02:34,700 --> 00:02:38,460
and pass lots of little cropped images into

45
00:02:38,460 --> 00:02:45,125
the cofinite and have it classified zero or one for each position as some stride.

46
00:02:45,125 --> 00:02:47,085
Now, having done this once

47
00:02:47,085 --> 00:02:54,230
with running this was called the sliding window through the image.

48
00:02:54,230 --> 00:02:55,295
You then repeat it,

49
00:02:55,295 --> 00:02:57,710
but now use a larger window.

50
00:02:57,710 --> 00:03:02,191
So, now you take a slightly larger region and run that region.

51
00:03:02,191 --> 00:03:06,440
So, resize this region into whatever input size the cofinite is expecting,

52
00:03:06,440 --> 00:03:10,235
and feed that to the cofinite and have it output zero or one.

53
00:03:10,235 --> 00:03:15,305
And then slide the window over again using some stride and so on.

54
00:03:15,305 --> 00:03:20,500
And you run that throughout your entire image until you get to the end.

55
00:03:20,500 --> 00:03:26,283
And then you might do the third time using even larger windows and so on.

56
00:03:26,283 --> 00:03:29,738
Right. And the hope is that if you do this,

57
00:03:29,738 --> 00:03:36,080
then so long as there's a car somewhere in the image that there will be a window where,

58
00:03:36,080 --> 00:03:40,200
for example if you are passing in this window into the cofinite,

59
00:03:40,200 --> 00:03:44,890
hopefully the cofinite will have outputs one for that input region.

60
00:03:44,890 --> 00:03:47,825
So then you detect that there is a car there.

61
00:03:47,825 --> 00:03:52,895
So this algorithm is called Sliding Windows Detection because you take these windows,

62
00:03:52,895 --> 00:03:58,745
these square boxes, and slide them across the entire image

63
00:03:58,745 --> 00:04:05,770
and classify every square region with some stride as containing a car or not.

64
00:04:05,770 --> 00:04:10,055
Now there's a huge disadvantage of Sliding Windows Detection,

65
00:04:10,055 --> 00:04:12,704
which is the computational cost.

66
00:04:12,704 --> 00:04:16,460
Because you're cropping out so many different square regions in

67
00:04:16,460 --> 00:04:21,370
the image and running each of them independently through a cofinite.

68
00:04:21,370 --> 00:04:24,505
And if you use a very coarse stride,

69
00:04:24,505 --> 00:04:26,745
a very big stride, a very big step size,

70
00:04:26,745 --> 00:04:31,598
then that will reduce the number of windows you need to pass through the cofinite,

71
00:04:31,598 --> 00:04:35,810
but that courser granularity may hurt performance.

72
00:04:35,810 --> 00:04:39,630
Whereas if you use a very fine granularity or a very small stride,

73
00:04:39,630 --> 00:04:44,005
then the huge number of all these little regions you're

74
00:04:44,005 --> 00:04:48,995
passing through the cofinite means that means there is a very high computational cost.

75
00:04:48,995 --> 00:04:54,180
So, before the rise of Neural Networks people used to use much simpler classifiers like

76
00:04:54,180 --> 00:04:56,910
a simple linear classifier over hand

77
00:04:56,910 --> 00:05:00,450
engineer features in order to perform object detection.

78
00:05:00,450 --> 00:05:04,870
And in that era because each classifier was relatively cheap to compute,

79
00:05:04,870 --> 00:05:06,480
it was just a linear function,

80
00:05:06,480 --> 00:05:08,980
Sliding Windows Detection ran okay.

81
00:05:08,980 --> 00:05:10,395
It was not a bad method,

82
00:05:10,395 --> 00:05:15,450
but with cofinite now running a single classification task is much

83
00:05:15,450 --> 00:05:21,125
more expensive and sliding windows this way is infeasibily slow.

84
00:05:21,125 --> 00:05:26,305
And unless you use a very fine granularity or a very small stride,

85
00:05:26,305 --> 00:05:32,850
you end up not able to localize the objects that accurately within the image as well.

86
00:05:32,850 --> 00:05:38,575
Fortunately however, this problem of computational cost has a pretty good solution.

87
00:05:38,575 --> 00:05:41,845
In particular, the Sliding Windows Object Detector

88
00:05:41,845 --> 00:05:45,935
can be implemented convolutionally or much more efficiently.

89
00:05:45,935 --> 00:05:48,310
Let's see in the next video how you can do that.