In the last video, you learned about the sliding windows object detection algorithm using a convnet, but we saw that it was too slow. In this video, you'll learn how to implement that algorithm convolutionally. Let's see what this means.

To build up towards the convolutional implementation of sliding windows, let's first see how you can turn the fully connected layers in a neural network into convolutional layers. We'll do that first on this slide, and then on the next slide we'll use the ideas from this slide to show you the convolutional implementation.

So let's say that your object detection algorithm inputs 14 by 14 by 3 images. This is quite small, but just for illustrative purposes. Let's say it then uses 5 by 5 filters, and let's say it uses 16 of them to map the image from 14 by 14 by 3 to 10 by 10 by 16. It then does a 2 by 2 max pooling to reduce that to 5 by 5 by 16, then has a fully connected layer to connect to 400 units, then another fully connected layer, and finally outputs Y using a softmax unit. In order to make the change we'll need in a second, I'm going to change this picture a little bit: instead, I'm going to view Y as four numbers, corresponding to the class probabilities of the four classes that the softmax unit is classifying amongst. The four classes could be pedestrian, car, motorcycle, and background or something else.

Now, what I'd like to do is show how these layers can be turned into convolutional layers. The convnet is drawn the same as before for the first few layers. Now, one way of implementing this next layer, the fully connected layer, is to implement it as a 5 by 5 filter, and let's use 400 of these 5 by 5 filters. If you take a 5 by 5 by 16 volume and convolve it with a 5 by 5 filter, remember that a 5 by 5 filter is implemented as 5 by 5 by 16, because our convention is that the filter looks across all 16 channels. So this 16 and this 16 must match, and the output will be 1 by 1. And if you have 400 of these 5 by 5 by 16 filters, then the output dimension is going to be 1 by 1 by 400. So rather than viewing these 400 units as just a set of nodes, we're going to view them as a 1 by 1 by 400 volume.
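As a minimal sketch of that claim (PyTorch, random weights; the 5 by 5 by 16 input and 400 units are the sizes from the slide), you can copy the weights of a 400-unit fully connected layer into a convolution with 400 filters of shape 5 by 5 by 16 and check that the two layers compute the same values:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 5, 5)              # a 5 x 5 x 16 activation volume

fc = nn.Linear(16 * 5 * 5, 400)           # fully connected layer with 400 units
conv = nn.Conv2d(16, 400, kernel_size=5)  # 400 filters, each 5 x 5 x 16

# Copy the fully connected weights into the conv filters so both layers
# represent the same linear function of the 5 x 5 x 16 input.
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(400, 16, 5, 5))
    conv.bias.copy_(fc.bias)

out_fc = fc(x.flatten(1))                 # shape (1, 400)
out_conv = conv(x)                        # shape (1, 400, 1, 1)
print(torch.allclose(out_fc, out_conv.flatten(1), atol=1e-6))  # True
```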
Mathematically, this is the same as a fully connected layer, because each of these 400 nodes has a filter of dimension 5 by 5 by 16, so each of those 400 values is some arbitrary linear function of the 5 by 5 by 16 activations from the previous layer.

Next, to implement the following layer, we're going to use a 1 by 1 convolution. If you have 400 1 by 1 filters, then the next layer will again be 1 by 1 by 400, so that gives you this next fully connected layer. And then finally, we're going to have another 1 by 1 convolution, followed by a softmax activation, so as to give a 1 by 1 by 4 volume to take the place of the four numbers that the network was outputting. So this shows how you can take these fully connected layers and implement them using convolutional layers, so that these sets of units are instead implemented as 1 by 1 by 400 and 1 by 1 by 4 volumes.

After this conversion, let's see how you can get a convolutional implementation of sliding windows object detection. The presentation on this slide is based on the OverFeat paper, referenced at the bottom, by Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Robert Fergus and Yann LeCun.

Let's say that your sliding windows convnet inputs 14 by 14 by 3 images. Again, I'm just using small numbers like the 14 by 14 image on this slide mainly to make the numbers and illustrations simpler. As before, you have a neural network as follows that eventually outputs a 1 by 1 by 4 volume, which is the output of your softmax. To simplify the drawing here: the 14 by 14 by 3 input is technically a volume, as are the 10 by 10 by 16 and 5 by 5 by 16 layers, but to simplify the drawing for this slide I'm just going to draw the front face of each volume. So instead of drawing the 1 by 1 by 400 volume, I'm just going to draw the 1 by 1 face of all of these; I've dropped the third dimension from these drawings, just for this slide.

So let's say that your convnet inputs 14 by 14 images, or 14 by 14 by 3 images, and your test set image is 16 by 16 by 3. So now I've added that yellow stripe to the border of this image.
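Before walking through the 16 by 16 example, here is a rough sketch of the converted classifier as a single fully convolutional network (PyTorch, random weights; the layer sizes follow the slide, and the ReLU nonlinearities are my assumption since the video doesn't name the activation function):

```python
import torch
import torch.nn as nn

# The 14 x 14 x 3 classifier with every fully connected layer replaced by a
# convolution. Sizes follow the slide; weights are random.
fully_conv_net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),     # 14x14x3  -> 10x10x16
    nn.ReLU(),
    nn.MaxPool2d(2),                     # 10x10x16 -> 5x5x16
    nn.Conv2d(16, 400, kernel_size=5),   # 5x5x16   -> 1x1x400  (was FC, 400 units)
    nn.ReLU(),
    nn.Conv2d(400, 400, kernel_size=1),  # 1x1x400  -> 1x1x400  (was FC, 400 units)
    nn.ReLU(),
    nn.Conv2d(400, 4, kernel_size=1),    # 1x1x400  -> 1x1x4    (the 4 class scores)
    nn.Softmax(dim=1),                   # probabilities over the 4 classes
)

x = torch.randn(1, 3, 14, 14)            # one 14 x 14 x 3 input image
print(fully_conv_net(x).shape)           # torch.Size([1, 4, 1, 1])
```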
In the original sliding windows algorithm, you might input the blue region into a convnet and run it once to generate a classification, 0 or 1. Then, let's say you use a stride of two pixels: you slide the window to the right by two pixels to input this green rectangle into the convnet, run the whole convnet again, and get another label, 0 or 1. Then you might input this orange region into the convnet and run it one more time to get another label, and then do it a fourth and final time with this lower right purple square. So to run sliding windows on this 16 by 16 by 3 image, which is a pretty small image, you run this convnet four times in order to get four labels.

But it turns out a lot of the computation done by these four convnets is highly duplicative. What the convolutional implementation of sliding windows does is allow these four passes of the convnet to share a lot of computation. Specifically, here's what you can do. You can take the convnet and run it with the same parameters, the same 16 5 by 5 filters, on the 16 by 16 by 3 image. Now you get a 12 by 12 by 16 output volume. Then do the max pooling, same as before; now you have 6 by 6 by 16. Run that through your same 400 5 by 5 filters to get a 2 by 2 by 400 volume: so now, instead of a 1 by 1 by 400 volume, we have a 2 by 2 by 400 volume. Running it through the 1 by 1 convolution gives you another 2 by 2 by 400 volume instead of 1 by 1 by 400. Do that one more time, and now you're left with a 2 by 2 by 4 output volume instead of 1 by 1 by 4.

It turns out that this blue 1 by 1 by 4 subset gives you the result of running the convnet on the upper left hand 14 by 14 region of the image. The upper right 1 by 1 by 4 volume gives you the upper right result. The lower left gives you the result of running the convnet on the lower left 14 by 14 region, and the lower right 1 by 1 by 4 volume gives you the same result as running the convnet on the lower right 14 by 14 region.
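Continuing the sketch above, you can check this correspondence numerically: running the same fully_conv_net on a 16 by 16 by 3 test image gives a 2 by 2 by 4 output, and each 1 by 1 by 4 slice matches the result of running the network on the corresponding 14 by 14 window, offset by the stride of two:

```python
test_img = torch.randn(1, 3, 16, 16)           # the 16 x 16 x 3 test image
full_out = fully_conv_net(test_img)            # shape (1, 4, 2, 2)

# The upper right output position vs. the window slid 2 pixels to the right.
crop = test_img[:, :, 0:14, 2:16]              # the green 14 x 14 region
crop_out = fully_conv_net(crop)                # shape (1, 4, 1, 1)
print(torch.allclose(full_out[:, :, 0:1, 1:2], crop_out, atol=1e-5))  # True
```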
And if you step through all the steps of the calculation, let's look at the green example: if you had cropped out just this region and passed it through the convnet on top, then the first layer's activations would have been exactly this region, the next layer's activations after max pooling would have been exactly this region, and so on for the following layers. So what this convolutional implementation does is, instead of forcing you to run forward propagation on four subsets of the input image independently, it combines all four into one forward computation and shares a lot of the computation in the regions of the image that are common to all four of the 14 by 14 patches we saw here.

Now let's go through a bigger example. Let's say you now want to run sliding windows on a 28 by 28 by 3 image. It turns out that if you run forward propagation the same way, you end up with an 8 by 8 by 4 output. The upper left corner of that output corresponds to running sliding windows with a 14 by 14 region on the upper left region of the image. Then, using a stride of two, you shift the window over by one position, then another, and so on across the eight positions; that gives you the first row, and as you go down the image as well, that gives you all of the 8 by 8 by 4 outputs. Because of the max pooling of 2, this corresponds to running your neural network with a stride of two on the original image.

So just to recap: to implement sliding windows, previously what you would do is crop out a region, let's say 14 by 14, run that through your convnet, then do the same for the next region over, then the next 14 by 14 region, then the next one, then the next one, and so on, until hopefully one of them recognizes the car.
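The same check works for the bigger example, still continuing the sketch above: one forward pass over a 28 by 28 by 3 image yields an 8 by 8 grid of 4-way predictions, where grid position (i, j) matches the 14 by 14 window whose top left corner sits at (2i, 2j) in the image:

```python
big_img = torch.randn(1, 3, 28, 28)            # the 28 x 28 x 3 image
grid = fully_conv_net(big_img)                 # shape (1, 4, 8, 8)

i, j = 3, 5                                    # an arbitrary grid position
window = big_img[:, :, 2*i:2*i + 14, 2*j:2*j + 14]
print(torch.allclose(grid[:, :, i:i+1, j:j+1], fully_conv_net(window), atol=1e-5))  # True
```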
But now, instead of doing it sequentially, with the convolutional implementation that you saw in the previous slide, you can input the entire image, maybe all 28 by 28 by 3 of it, and convolutionally make all the predictions at the same time with one forward pass through this big convnet, hopefully having it recognize the position of the car. So that's how you implement sliding windows convolutionally, and it makes the whole thing much more efficient.

Now, this algorithm still has one weakness, which is that the position of the bounding boxes is not going to be too accurate. In the next video, let's see how you can fix that problem.