1
00:00:00,000 --> 00:00:01,380
In the last video,

2
00:00:01,380 --> 00:00:06,300
you learned how to use a convolutional implementation of sliding windows.

3
00:00:06,300 --> 00:00:08,780
That's more computationally efficient,

4
00:00:08,780 --> 00:00:14,130
but it still has a problem of not quite outputting the most accurate bounding boxes.

5
00:00:14,130 --> 00:00:16,710
In this video, let's see how you can get

6
00:00:16,710 --> 00:00:19,520
your bounding box predictions to be more accurate.

7
00:00:19,520 --> 00:00:21,495
With sliding windows, you take

8
00:00:21,495 --> 00:00:26,190
this three sets of locations and run the crossfire through it.

9
00:00:26,190 --> 00:00:27,825
And in this case,

10
00:00:27,825 --> 00:00:32,636
none of the boxes really match up perfectly with the position of the car.

11
00:00:32,636 --> 00:00:35,900
So, maybe that box is the best match.

12
00:00:35,900 --> 00:00:38,465
And also, it looks like in drawn through,

13
00:00:38,465 --> 00:00:41,585
the perfect bounding box isn't even quite square,

14
00:00:41,585 --> 00:00:48,750
it's actually has a slightly wider rectangle or slightly horizontal aspect ratio.

15
00:00:48,750 --> 00:00:53,975
So, is there a way to get this algorithm to outputs more accurate bounding boxes?

16
00:00:53,975 --> 00:00:59,275
A good way to get this output more accurate bounding boxes is with the YOLO algorithm.

17
00:00:59,275 --> 00:01:03,225
YOLO stands for, You Only Look Once.

18
00:01:03,225 --> 00:01:06,195
And is an algorithm due to Joseph Redmon,

19
00:01:06,195 --> 00:01:07,975
Santosh Divvala, Ross Girshick and Ali Farhadi.

20
00:01:07,975 --> 00:01:10,735
Here's what you do.

21
00:01:10,735 --> 00:01:14,230
Let's say you have an input image at 100 by 100,

22
00:01:14,230 --> 00:01:17,175
you're going to place down a grid on this image.

23
00:01:17,175 --> 00:01:19,425
And for the purposes of illustration,

24
00:01:19,425 --> 00:01:21,780
I'm going to use a 3 by 3 grid.

25
00:01:21,780 --> 00:01:23,640
Although in an actual implementation,

26
00:01:23,640 --> 00:01:24,840
you use a finer one,

27
00:01:24,840 --> 00:01:27,880
like maybe a 19 by 19 grid.

28
00:01:27,880 --> 00:01:30,660
And the basic idea is you're going to take

29
00:01:30,660 --> 00:01:36,345
the image classification and localization algorithm that you saw a few videos back,

30
00:01:36,345 --> 00:01:40,165
and apply it to each of the nine grids.

31
00:01:40,165 --> 00:01:47,910
And the basic idea is you're going to take the image classification and

32
00:01:47,910 --> 00:01:52,170
localization algorithm that you saw in the first video of

33
00:01:52,170 --> 00:01:58,440
this week and apply that to each of the nine grid cells of this image.

34
00:01:58,440 --> 00:01:59,851
So the more concrete,

35
00:01:59,851 --> 00:02:05,171
here's how you define the labels you use for training.

36
00:02:05,171 --> 00:02:08,070
So for each of the nine grid cells, you specify a label Y,

37
00:02:08,070 --> 00:02:14,561
where the label Y is this eight dimensional vector,

38
00:02:14,561 --> 00:02:16,470
same as you saw previously.

39
00:02:16,470 --> 00:02:19,610
Your first output PC 01 depending on whether or

40
00:02:19,610 --> 00:02:24,170
not there's an image in that grid cell and then BX,

41
00:02:24,170 --> 00:02:30,800
BY, BH, BW to specify the bounding box if there is an image,

42
00:02:30,800 --> 00:02:33,348
if there is an object associated with that grid cell.

43
00:02:33,348 --> 00:02:36,320
And then say, C1, C2, C3,

44
00:02:36,320 --> 00:02:40,935
if you try and recognize three classes not counting the background class.

45
00:02:40,935 --> 00:02:43,120
So you try to recognize pedestrian's class,

46
00:02:43,120 --> 00:02:45,110
motorcycles and the background class.

47
00:02:45,110 --> 00:02:47,720
Then C1 C2 C3 can be the pedestrian,

48
00:02:47,720 --> 00:02:50,570
car and motorcycle classes.

49
00:02:50,570 --> 00:02:52,505
So in this image,

50
00:02:52,505 --> 00:02:53,870
we have nine grid cells,

51
00:02:53,870 --> 00:02:59,105
so you have a vector like this for each of the grid cells.

52
00:02:59,105 --> 00:03:02,316
So let's start with the upper left grid cell,

53
00:03:02,316 --> 00:03:03,955
this one up here.

54
00:03:03,955 --> 00:03:06,115
For that one, there is no object.

55
00:03:06,115 --> 00:03:11,960
So, the label vector Y for the upper left grid cell would be zero,

56
00:03:11,960 --> 00:03:16,655
and then don't cares for the rest of these.

57
00:03:16,655 --> 00:03:20,707
The output label Y would be the same for this grid cell,

58
00:03:20,707 --> 00:03:24,640
and this grid cell, and all the grid cells with nothing,

59
00:03:24,640 --> 00:03:27,845
with no interesting object in them.

60
00:03:27,845 --> 00:03:32,435
Now, how about this grid cell?

61
00:03:32,435 --> 00:03:34,385
To give a bit more detail,

62
00:03:34,385 --> 00:03:36,355
this image has two objects.

63
00:03:36,355 --> 00:03:40,270
And what the YOLO algorithm does is it takes the midpoint of reach of

64
00:03:40,270 --> 00:03:45,690
the two objects and then assigns the object to the grid cell containing the midpoint.

65
00:03:45,690 --> 00:03:49,900
So the left car is assigned to this grid cell,

66
00:03:49,900 --> 00:03:51,445
and the car on the right,

67
00:03:51,445 --> 00:03:53,140
which is this midpoint,

68
00:03:53,140 --> 00:03:57,265
is assigned to this grid cell.

69
00:03:57,265 --> 00:04:02,510
And so even though the central grid cell has some parts of both cars,

70
00:04:02,510 --> 00:04:06,175
we'll pretend the central grid cell has no interesting object so that

71
00:04:06,175 --> 00:04:11,560
the central grid cell the class label Y also looks like this vector with no object,

72
00:04:11,560 --> 00:04:13,450
and so the first component PC,

73
00:04:13,450 --> 00:04:15,000
and then the rest are don't cares.

74
00:04:15,000 --> 00:04:17,091
Whereas for this cell,

75
00:04:17,091 --> 00:04:21,005
this cell that I have circled in green on the left,

76
00:04:21,005 --> 00:04:23,765
the target label Y would be as follows.

77
00:04:23,765 --> 00:04:25,085
There is an object,

78
00:04:25,085 --> 00:04:26,770
and then you write BX, BY, BH,

79
00:04:26,770 --> 00:04:32,830
BW, to specify the position of this bounding box.

80
00:04:32,830 --> 00:04:34,870
And then you have, let's see,

81
00:04:34,870 --> 00:04:38,680
if class one was a pedestrian, then that was zero.

82
00:04:38,680 --> 00:04:40,420
Class two is a car, that's one.

83
00:04:40,420 --> 00:04:43,960
Class three was a motorcycle, that's zero.

84
00:04:43,960 --> 00:04:45,900
And then similarly, for the grid cell on

85
00:04:45,900 --> 00:04:48,415
their right because that does have an object in it,

86
00:04:48,415 --> 00:04:52,690
it will also have some vector like

87
00:04:52,690 --> 00:04:58,325
this as the target label corresponding to the grid cell on the right.

88
00:04:58,325 --> 00:05:00,715
So, for each of these nine grid cells,

89
00:05:00,715 --> 00:05:04,815
you end up with a eight dimensional output vector.

90
00:05:04,815 --> 00:05:08,415
And because you have 3 by 3 grid cells,

91
00:05:08,415 --> 00:05:09,863
you have nine grid cells,

92
00:05:09,863 --> 00:05:17,435
the total volume of the output is going to be 3 by 3 by 8.

93
00:05:17,435 --> 00:05:24,760
So the target output is going to be 3 by 3 by 8 because you have 3 by 3 grid cells.

94
00:05:24,760 --> 00:05:28,045
And for each of the 3 by 3 grid cells,

95
00:05:28,045 --> 00:05:32,790
you have a eight dimensional Y vector.

96
00:05:32,790 --> 00:05:36,250
So the target output volume is 3 by 3 by 8.

97
00:05:36,250 --> 00:05:41,245
Where for example, this 1 by 1 by 8 volume in

98
00:05:41,245 --> 00:05:42,970
the upper left corresponds to

99
00:05:42,970 --> 00:05:47,770
the target output vector for the upper left of the nine grid cells.

100
00:05:47,770 --> 00:05:50,710
And so for each of the 3 by 3 positions,

101
00:05:50,710 --> 00:05:52,345
for each of these nine grid cells,

102
00:05:52,345 --> 00:05:58,180
does it correspond in eight dimensional target vector Y that you want to the output.

103
00:05:58,180 --> 00:05:59,610
Some of which could be don't cares,

104
00:05:59,610 --> 00:06:01,010
if there's no object there.

105
00:06:01,010 --> 00:06:03,360
And that's why the total target outputs,

106
00:06:03,360 --> 00:06:08,635
the output label for this image is now itself a 3 by 3 by 8 volume.

107
00:06:08,635 --> 00:06:11,245
So now, to train your neural network,

108
00:06:11,245 --> 00:06:17,475
the input is 100 by 100 by 3,

109
00:06:17,475 --> 00:06:19,015
that's the input image.

110
00:06:19,015 --> 00:06:22,795
And then you have a usual convnet with conv,

111
00:06:22,795 --> 00:06:27,690
layers of max pool layers, and so on.

112
00:06:27,690 --> 00:06:28,870
So that in the end,

113
00:06:28,870 --> 00:06:34,440
you have this, should choose the conv layers and the max pool layers,

114
00:06:34,440 --> 00:06:42,320
and so on, so that this eventually maps to a 3 by 3 by 8 output volume.

115
00:06:42,320 --> 00:06:46,470
And so what you do is you have an input X which is the input image like that,

116
00:06:46,470 --> 00:06:50,125
and you have these target labels Y which are 3 by 3 by 8,

117
00:06:50,125 --> 00:06:54,160
and you use map propagation to train the neural network to map

118
00:06:54,160 --> 00:06:58,565
from any input X to this type of output volume Y.

119
00:06:58,565 --> 00:07:01,360
So the advantage of this algorithm is that

120
00:07:01,360 --> 00:07:07,228
the neural network outputs precise bounding boxes as follows.

121
00:07:07,228 --> 00:07:08,320
So at test time,

122
00:07:08,320 --> 00:07:10,930
what you do is you feed an input image X

123
00:07:10,930 --> 00:07:14,255
and run forward prop until you get this output Y.

124
00:07:14,255 --> 00:07:16,735
And then for each of the nine outputs

125
00:07:16,735 --> 00:07:19,480
of each of the 3 by 3 positions in which of the output,

126
00:07:19,480 --> 00:07:22,810
you can then just read off 1 or 0.

127
00:07:22,810 --> 00:07:27,153
Is there an object associated with that one of the nine positions?

128
00:07:27,153 --> 00:07:29,590
And that there is an object, what object it is,

129
00:07:29,590 --> 00:07:36,065
and where is the bounding box for the object in that grid cell?

130
00:07:36,065 --> 00:07:39,118
And so long as you don't have more than one object in each grid cell,

131
00:07:39,118 --> 00:07:41,810
this algorithm should work okay.

132
00:07:41,810 --> 00:07:43,900
And the problem of having multiple objects within

133
00:07:43,900 --> 00:07:46,600
the grid cell is something we'll address later.

134
00:07:46,600 --> 00:07:51,985
Of use a relatively small 3 by 3 grid,

135
00:07:51,985 --> 00:07:54,470
in practice, you might use a much finer,

136
00:07:54,470 --> 00:07:56,160
grid maybe 19 by 19.

137
00:07:56,160 --> 00:07:58,900
So you end up with 19 by 19 by 8,

138
00:07:58,900 --> 00:08:02,315
and that also makes your grid much finer.

139
00:08:02,315 --> 00:08:07,180
It reduces the chance that there are multiple objects assigned to the same grid cell.

140
00:08:07,180 --> 00:08:08,800
And just as a reminder,

141
00:08:08,800 --> 00:08:11,590
the way you assign an object to grid cell as

142
00:08:11,590 --> 00:08:14,290
you look at the midpoint of an object and then

143
00:08:14,290 --> 00:08:19,930
you assign that object to whichever one grid cell contains the midpoint of the object.

144
00:08:19,930 --> 00:08:23,926
So each object, even if the objects spends multiple grid cells,

145
00:08:23,926 --> 00:08:27,410
that object is assigned only to one of the nine grid cells,

146
00:08:27,410 --> 00:08:29,018
or one of the 3 by 3,

147
00:08:29,018 --> 00:08:31,565
or one of the 19 by 19 grid cells.

148
00:08:31,565 --> 00:08:33,584
Algorithm of a 19 by 19 grid,

149
00:08:33,584 --> 00:08:36,715
the chance of an object of two midpoints of

150
00:08:36,715 --> 00:08:41,445
objects appearing in the same grid cell is just a bit smaller.

151
00:08:41,445 --> 00:08:44,043
So notice two things, first,

152
00:08:44,043 --> 00:08:46,930
this is a lot like the image classification and

153
00:08:46,930 --> 00:08:51,530
localization algorithm that we talked about in the first video of this week.

154
00:08:51,530 --> 00:08:55,380
And that it outputs the bounding balls coordinates explicitly.

155
00:08:55,380 --> 00:08:58,235
And so this allows in your network to output bounding

156
00:08:58,235 --> 00:09:02,440
boxes of any aspect ratio, as well as,

157
00:09:02,440 --> 00:09:05,690
output much more precise coordinates that aren't just

158
00:09:05,690 --> 00:09:10,530
dictated by the stripe size of your sliding windows classifier.

159
00:09:10,530 --> 00:09:12,220
And second, this is

160
00:09:12,220 --> 00:09:17,320
a convolutional implementation and you're not implementing this algorithm nine

161
00:09:17,320 --> 00:09:25,540
times on the 3 by 3 grid or if you're using a 19 by 19 grid.19 squared is 361.

162
00:09:25,540 --> 00:09:31,090
So, you're not running the same algorithm 361 times or 19 squared times.

163
00:09:31,090 --> 00:09:34,285
Instead, this is one single convolutional implantation,

164
00:09:34,285 --> 00:09:39,610
where you use one consonant with a lot of shared computation between

165
00:09:39,610 --> 00:09:46,780
all the computations needed for all of your 3 by 3 or all of your 19 by 19 grid cells.

166
00:09:46,780 --> 00:09:49,135
So, this is a pretty efficient algorithm.

167
00:09:49,135 --> 00:09:52,720
And in fact, one nice thing about the YOLO algorithm,

168
00:09:52,720 --> 00:09:57,445
which is constant popularity is because this is a convolutional implementation,

169
00:09:57,445 --> 00:09:58,930
it actually runs very fast.

170
00:09:58,930 --> 00:10:02,530
So this works even for real time object detection.

171
00:10:02,530 --> 00:10:03,915
Now, before wrapping up,

172
00:10:03,915 --> 00:10:06,610
there's one more detail I want to share with you,

173
00:10:06,610 --> 00:10:12,495
which is, how do you encode these bounding boxes bx, by, BH, BW?

174
00:10:12,495 --> 00:10:16,135
Let's discuss that on the next slide.

175
00:10:16,135 --> 00:10:18,610
So, given these two cars,

176
00:10:18,610 --> 00:10:21,465
remember, we have the 3 by 3 grid.

177
00:10:21,465 --> 00:10:25,120
Let's take the example of the car on the right.

178
00:10:25,120 --> 00:10:32,220
So, in this grid cell there is an object and so the target label y will be one,

179
00:10:32,220 --> 00:10:34,270
that was PC is equal to one.

180
00:10:34,270 --> 00:10:37,060
And then bx, by,

181
00:10:37,060 --> 00:10:40,970
BH, BW, and then 0 1 0.

182
00:10:40,970 --> 00:10:43,790
So, how do you specify the bounding box?

183
00:10:43,790 --> 00:10:48,310
In the YOLO algorithm, relative to this square,

184
00:10:48,310 --> 00:10:51,545
when I take the convention that the upper left point here is

185
00:10:51,545 --> 00:10:56,180
0 0 and this lower right point is 1 1.

186
00:10:56,180 --> 00:10:59,155
So to specify the position of that midpoint,

187
00:10:59,155 --> 00:11:02,715
that orange dot, bx might be,

188
00:11:02,715 --> 00:11:05,980
let's say x looks like is about 0.4.

189
00:11:05,980 --> 00:11:09,760
Maybe its about 0.4 of the way to their right.

190
00:11:09,760 --> 00:11:15,945
And then y, looks I guess maybe 0.3.

191
00:11:15,945 --> 00:11:19,380
And then the height of the bounding box is specified as

192
00:11:19,380 --> 00:11:24,090
a fraction of the overall width of this box.

193
00:11:24,090 --> 00:11:30,931
So, the width of this red box is maybe 90% of that blue line.

194
00:11:30,931 --> 00:11:35,030
And so BH is 0.9 and the height of

195
00:11:35,030 --> 00:11:42,075
this is maybe one half of the overall height of the grid cell.

196
00:11:42,075 --> 00:11:46,670
So in that case, BW would be, let's say 0.5.

197
00:11:46,670 --> 00:11:49,455
So, in other words, this bx, by, BH,

198
00:11:49,455 --> 00:11:53,690
BW as specified relative to the grid cell.

199
00:11:53,690 --> 00:11:55,505
And so bx and by,

200
00:11:55,505 --> 00:11:58,455
this has to be between 0 and 1, right?

201
00:11:58,455 --> 00:12:01,055
Because pretty much by definition that

202
00:12:01,055 --> 00:12:04,340
orange dot is within the bounds of that grid cell is assigned to.

203
00:12:04,340 --> 00:12:08,509
If it wasn't between 0 and 1 it was outside the square,

204
00:12:08,509 --> 00:12:11,680
then we'll have been assigned to a different grid cell.

205
00:12:11,680 --> 00:12:14,495
But these could be greater than one.

206
00:12:14,495 --> 00:12:18,785
In particular if you have a car where the bounding box was that,

207
00:12:18,785 --> 00:12:21,045
then the height and width of the bounding box,

208
00:12:21,045 --> 00:12:23,440
this could be greater than one.

209
00:12:23,440 --> 00:12:27,007
So, there are multiple ways of specifying the bounding boxes,

210
00:12:27,007 --> 00:12:30,710
but this would be one convention that's quite reasonable.

211
00:12:30,710 --> 00:12:33,710
Although, if you read the YOLO research papers,

212
00:12:33,710 --> 00:12:35,970
the YOLO research line there were

213
00:12:35,970 --> 00:12:39,040
other parameterizations that work even a little bit better,

214
00:12:39,040 --> 00:12:44,925
but I hope this gives one reasonable condition that should work okay.

215
00:12:44,925 --> 00:12:47,690
Although, there are some more complicated parameterizations

216
00:12:47,690 --> 00:12:51,980
involving sigmoid functions to make sure this is between 0 and 1.

217
00:12:51,980 --> 00:12:57,185
And using an explanation parameterization to make sure that these are non-negative,

218
00:12:57,185 --> 00:13:01,245
since 0.9, 0.5, this has to be greater or equal to zero.

219
00:13:01,245 --> 00:13:03,915
There are some other more advanced parameterizations

220
00:13:03,915 --> 00:13:05,457
that work things a little bit better,

221
00:13:05,457 --> 00:13:09,635
but the one you saw here should work okay.

222
00:13:09,635 --> 00:13:14,775
So, that's it for the YOLO or the You Only Look Once algorithm.

223
00:13:14,775 --> 00:13:17,115
And in the next few videos I'll show you

224
00:13:17,115 --> 00:13:21,470
a few other ideas that will help make this algorithm even better.

225
00:13:21,470 --> 00:13:23,170
In the meantime, if you want,

226
00:13:23,170 --> 00:13:24,445
you can take a look at

227
00:13:24,445 --> 00:13:29,667
YOLO paper reference at the bottom of these past couple slides I use.

228
00:13:29,667 --> 00:13:31,325
Although, just one warning,

229
00:13:31,325 --> 00:13:33,530
if you take a look at these papers which is

230
00:13:33,530 --> 00:13:37,425
the YOLO paper is one of the harder papers to read.

231
00:13:37,425 --> 00:13:40,265
I remember, when I was reading this paper for the first time,

232
00:13:40,265 --> 00:13:43,325
I had a really hard time figuring out what was going on.

233
00:13:43,325 --> 00:13:46,356
And I wound up asking a couple of my friends,

234
00:13:46,356 --> 00:13:48,950
very good researchers to help me figure it out,

235
00:13:48,950 --> 00:13:53,125
and even they had a hard time understanding some of the details of the paper.

236
00:13:53,125 --> 00:13:54,845
So, if you look at the paper,

237
00:13:54,845 --> 00:13:58,195
it's okay if you have a hard time figuring it out.

238
00:13:58,195 --> 00:14:00,885
I wish it was more uncommon,

239
00:14:00,885 --> 00:14:02,795
but it's not that uncommon, sadly,

240
00:14:02,795 --> 00:14:05,130
for even senior researchers,

241
00:14:05,130 --> 00:14:08,840
that review research papers and have a hard time figuring out the details.

242
00:14:08,840 --> 00:14:10,895
And have to look at open source code,

243
00:14:10,895 --> 00:14:12,005
or contact the authors,

244
00:14:12,005 --> 00:14:15,610
or something else to figure out the details of these outcomes.

245
00:14:15,610 --> 00:14:19,715
But don't let me stop you from taking a look at the paper yourself though if you wish,

246
00:14:19,715 --> 00:14:21,700
but this is one of the harder ones.

247
00:14:21,700 --> 00:14:25,975
So, that though, you now understand the basics of the YOLO algorithm.

248
00:14:25,975 --> 00:14:31,000
Let's go on to some additional pieces that will make this algorithm work even better.