1 00:00:01,100 --> 00:00:02,740 Hello and welcome back. 2 00:00:02,740 --> 00:00:05,368 This week you learn about object detection. 3 00:00:05,368 --> 00:00:08,980 This is one of the areas of computer vision that's just exploding and 4 00:00:08,980 --> 00:00:12,460 is working so much better than just a couple of years ago. 5 00:00:12,460 --> 00:00:18,430 In order to build up to object detection, you first learn about object localization. 6 00:00:18,430 --> 00:00:20,595 Let's start by defining what that means. 7 00:00:20,595 --> 00:00:25,760 You're already familiar with the image classification task where an algorithm 8 00:00:25,760 --> 00:00:30,500 looks at this picture and might be responsible for saying this is a car. 9 00:00:30,500 --> 00:00:31,920 So that was classification. 10 00:00:34,560 --> 00:00:38,964 The problem you learn to build a neural network to address in this video is 11 00:00:38,964 --> 00:00:41,550 classification with localization. 12 00:00:41,550 --> 00:00:45,659 Which means not only do you have to label this as, say, a car, but 13 00:00:45,659 --> 00:00:49,758 the algorithm is also responsible for putting a bounding box, 14 00:00:49,758 --> 00:00:55,090 or drawing a red rectangle, around the position of the car in the image. 15 00:00:55,090 --> 00:00:59,310 So that's called the classification with localization problem. 16 00:00:59,310 --> 00:01:03,790 Where the term localization refers to figuring out where in the picture 17 00:01:03,790 --> 00:01:05,760 is the car you've detected. 18 00:01:05,760 --> 00:01:09,530 Later this week, you then learn about the detection problem, 19 00:01:09,530 --> 00:01:13,590 where now there might be multiple objects in the picture and 20 00:01:13,590 --> 00:01:17,900 you have to detect them all and localize them all. 21 00:01:17,900 --> 00:01:21,820 And if you're doing this for an autonomous driving application, 22 00:01:21,820 --> 00:01:24,480 then you might need to detect not just other cars, 23 00:01:24,480 --> 00:01:29,310 but maybe also pedestrians and motorcycles and maybe even other objects. 24 00:01:29,310 --> 00:01:31,090 So you'll see that later this week. 25 00:01:31,090 --> 00:01:36,220 So in the terminology we'll use this week, the classification and 26 00:01:36,220 --> 00:01:42,130 the classification with localization problems usually have one object. 27 00:01:42,130 --> 00:01:45,930 Usually one big object in the middle of the image that you're trying to recognize, 28 00:01:45,930 --> 00:01:47,600 or recognize and localize. 29 00:01:47,600 --> 00:01:53,150 In contrast, in the detection problem there can be multiple objects. 30 00:01:53,150 --> 00:01:57,450 And in fact, maybe even multiple objects of different categories 31 00:01:57,450 --> 00:01:59,110 within a single image. 32 00:01:59,110 --> 00:02:03,195 So the ideas you've learned about for image classification will be useful for 33 00:02:03,195 --> 00:02:04,815 classification with localization. 34 00:02:04,815 --> 00:02:06,885 And the ideas you learn for 35 00:02:06,885 --> 00:02:10,795 localization will then turn out to be useful for detection. 36 00:02:10,795 --> 00:02:14,245 So let's start by talking about classification with localization. 37 00:02:15,255 --> 00:02:20,535 You're already familiar with the image classification problem, in which you might 38 00:02:20,535 --> 00:02:26,210 input a picture into a ConvNet with multiple layers, so that's our ConvNet.
39 00:02:26,210 --> 00:02:31,348 And this results in a vector of features that is fed 40 00:02:31,348 --> 00:02:38,170 to maybe a softmax unit that outputs the predicted class. 41 00:02:38,170 --> 00:02:41,070 So if you are building a self-driving car, 42 00:02:41,070 --> 00:02:44,370 maybe your object categories are the following. 43 00:02:44,370 --> 00:02:49,740 Where you might have a pedestrian, or a car, or a motorcycle, or a background. 44 00:02:49,740 --> 00:02:51,660 This means none of the above. 45 00:02:51,660 --> 00:02:53,160 So if there's no pedestrian, 46 00:02:53,160 --> 00:02:57,735 no car, no motorcycle, then the output might be background. 47 00:02:57,735 --> 00:03:03,755 So these are your classes, and you have a softmax with four possible outputs. 48 00:03:03,755 --> 00:03:07,775 So this is the standard classification pipeline. 49 00:03:07,775 --> 00:03:12,585 How about if you want to localize the car in the image as well? 50 00:03:12,585 --> 00:03:17,156 To do that, you can change your neural network to have 51 00:03:17,156 --> 00:03:21,940 a few more output units that output a bounding box. 52 00:03:21,940 --> 00:03:27,179 So, in particular, you can have the neural network output four 53 00:03:27,179 --> 00:03:32,236 more numbers, and I'm going to call them bx, by, bh, and bw. 54 00:03:32,236 --> 00:03:40,110 And these four numbers parameterize the bounding box of the detected object. 55 00:03:40,110 --> 00:03:44,820 So in these videos, I am going to use the notational convention that the upper 56 00:03:44,820 --> 00:03:49,610 left of the image I'm going to denote as the coordinate (0,0), 57 00:03:49,610 --> 00:03:52,660 and the lower right as (1,1). 58 00:03:52,660 --> 00:03:55,442 So, specifying the bounding box, 59 00:03:55,442 --> 00:04:00,060 the red rectangle, requires specifying the midpoint. 60 00:04:00,060 --> 00:04:03,057 So that's the point bx, 61 00:04:03,057 --> 00:04:08,193 by, as well as the height, that would be bh, 62 00:04:08,193 --> 00:04:14,300 as well as the width, bw, of this bounding box. 63 00:04:14,300 --> 00:04:19,868 So now if your training set contains not just the object class label, 64 00:04:19,868 --> 00:04:23,950 which the neural network is trying to predict up here, but 65 00:04:23,950 --> 00:04:26,740 also contains four additional numbers 66 00:04:26,740 --> 00:04:31,430 giving the bounding box, then you can use supervised learning to make your algorithm 67 00:04:31,430 --> 00:04:35,910 output not just a class label but also the four parameters 68 00:04:35,910 --> 00:04:39,830 that tell you where the bounding box of the object you detected is. 69 00:04:39,830 --> 00:04:42,415 So in this example the ideal bx might 70 00:04:42,415 --> 00:04:47,254 be about 0.5, because this is about halfway to the right of the image. 71 00:04:47,254 --> 00:04:55,863 by might be about 0.7, since it's maybe about 70% of the way down the image. 72 00:04:55,863 --> 00:05:00,620 bh might be about 0.3, because the height of this red rectangle is 73 00:05:00,620 --> 00:05:04,960 about 30% of the overall height of the image. 74 00:05:04,960 --> 00:05:10,170 And bw might be about 0.4, let's say, because the width 75 00:05:10,170 --> 00:05:14,510 of the red box is about 0.4 of the overall width of the entire image. 76 00:05:15,940 --> 00:05:20,905 So let's formalize this a bit more in terms of how we define the target label 77 00:05:20,905 --> 00:05:24,581 y for this as a supervised learning task.
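To make that parameterization concrete, here is a minimal Python sketch (not from the video) that converts a pixel-space box into the normalized (bx, by, bh, bw) convention just described, where (0,0) is the upper left of the image and (1,1) is the lower right. The function name and the pixel values are hypothetical, chosen so that they reproduce the rough numbers above.

    def to_normalized_box(x_left, y_top, box_width, box_height, img_width, img_height):
        # Midpoint of the box, as fractions of the image width and height.
        bx = (x_left + box_width / 2) / img_width
        by = (y_top + box_height / 2) / img_height
        # Height and width of the box, as fractions of the image dimensions.
        bh = box_height / img_height
        bw = box_width / img_width
        return bx, by, bh, bw

    # Hypothetical 720x720 image with a car box roughly matching the example above:
    print(to_normalized_box(x_left=216, y_top=396, box_width=288, box_height=216,
                            img_width=720, img_height=720))
    # -> (0.5, 0.7, 0.3, 0.4)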
78 00:05:24,581 --> 00:05:29,195 So just as a reminder, these are our four classes, and 79 00:05:29,195 --> 00:05:34,775 the neural network now outputs those four numbers as well as a class label, 80 00:05:36,035 --> 00:05:39,195 or maybe probabilities of the class labels. 81 00:05:40,530 --> 00:05:47,710 So, let's define the target label y as follows. 82 00:05:47,710 --> 00:05:53,480 It's going to be a vector where the first component, pc, is going to be: 83 00:05:53,480 --> 00:05:54,560 is there an object? 84 00:05:55,930 --> 00:06:02,140 So, if the object is class 1, 2 or 3, pc will be equal to 1. 85 00:06:02,140 --> 00:06:04,477 And if it's the background class, so 86 00:06:04,477 --> 00:06:09,018 if it's none of the objects you're trying to detect, then pc will be 0. 87 00:06:09,018 --> 00:06:11,973 And pc, you can think of that as standing for 88 00:06:11,973 --> 00:06:15,020 the probability that there's an object. 89 00:06:15,020 --> 00:06:19,320 The probability that one of the classes you're trying to detect is there. 90 00:06:19,320 --> 00:06:22,640 So something other than the background class. 91 00:06:22,640 --> 00:06:28,338 Next, if there is an object, then you want it to output bx, 92 00:06:28,338 --> 00:06:35,010 by, bh and bw, the bounding box for the object you detected. 93 00:06:35,010 --> 00:06:40,436 And finally, if there is an object, so if pc is equal to 1, 94 00:06:40,436 --> 00:06:44,054 you want it to also output c1, c2 and 95 00:06:44,054 --> 00:06:49,610 c3, which tell us: is it class 1, class 2 or class 3? 96 00:06:49,610 --> 00:06:53,030 So is it a pedestrian, a car or a motorcycle? 97 00:06:53,030 --> 00:06:56,340 And remember, in the problem we're addressing 98 00:06:56,340 --> 00:06:59,450 we assume that your image has only one object. 99 00:06:59,450 --> 00:07:03,040 So at most one of these objects appears in the picture, 100 00:07:03,040 --> 00:07:06,490 in this classification with localization problem. 101 00:07:06,490 --> 00:07:09,240 So let's go through a couple of examples. 102 00:07:09,240 --> 00:07:16,310 If this is a training set image, so if that is x, then y will be as follows: 103 00:07:16,310 --> 00:07:22,650 the first component pc will be equal to 1 because there is an object, then bx, 104 00:07:22,650 --> 00:07:27,870 by, bh and bw will specify the bounding box. 105 00:07:27,870 --> 00:07:32,260 So your labeled training set will need bounding boxes in the labels. 106 00:07:32,260 --> 00:07:35,570 And then finally this is a car, so it's class 2. 107 00:07:35,570 --> 00:07:38,680 So c1 will be 0 because it's not a pedestrian, 108 00:07:38,680 --> 00:07:44,640 c2 will be 1 because it is a car, and c3 will be 0 since it is not a motorcycle. 109 00:07:44,640 --> 00:07:50,630 So among c1, c2 and c3, at most one of them should be equal to 1. 110 00:07:50,630 --> 00:07:54,010 So that's if there's an object in the image. 111 00:07:54,010 --> 00:07:55,890 What if there's no object in the image? 112 00:07:55,890 --> 00:07:59,957 What if we have a training example where x is equal to that? 113 00:07:59,957 --> 00:08:03,807 In this case, pc would be equal to 0, and 114 00:08:03,807 --> 00:08:08,979 the rest of the elements will be don't cares, 115 00:08:08,979 --> 00:08:13,940 so I'm going to write question marks in all of them. 116 00:08:13,940 --> 00:08:18,318 So these are don't cares, because if there is no object in this image, 117 00:08:18,318 --> 00:08:23,074 then you don't care what bounding box the neural network outputs, as well as 118 00:08:23,074 --> 00:08:27,280 which of the three objects, c1, c2, c3, it thinks it is.
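To make the label format explicit, here is a small sketch (again not part of the lecture) of how the eight-component target vector y = [pc, bx, by, bh, bw, c1, c2, c3] might be assembled for the two examples just described. The helper name is made up, and using 0.0 as a placeholder for the don't-care entries is an arbitrary choice, since the loss will ignore those positions anyway.

    import numpy as np

    def make_label(class_id=None, box=None):
        # class_id: 1 = pedestrian, 2 = car, 3 = motorcycle, None = background.
        # box: (bx, by, bh, bw) in the normalized coordinates described earlier.
        y = np.zeros(8)            # [pc, bx, by, bh, bw, c1, c2, c3]
        if class_id is None:
            return y               # pc = 0; the remaining seven entries are don't cares
        y[0] = 1.0                 # pc = 1: there is an object
        y[1:5] = box               # bounding box
        y[4 + class_id] = 1.0      # one-hot class label: c1, c2 or c3
        return y

    # Training image containing a car (class 2):
    y_car = make_label(class_id=2, box=(0.5, 0.7, 0.3, 0.4))
    # -> [1, 0.5, 0.7, 0.3, 0.4, 0, 1, 0]

    # Training image with no object at all:
    y_background = make_label()
    # -> [0, 0, 0, 0, 0, 0, 0, 0]  (everything after pc is ignored by the loss)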
119 00:08:27,280 --> 00:08:33,870 So given a set of labeled training examples, this is how you will construct x, 120 00:08:33,870 --> 00:08:38,680 the input image, as well as y, the class label, both for 121 00:08:38,680 --> 00:08:42,880 images where there is an object and for images where there is no object. 122 00:08:42,880 --> 00:08:45,660 And a set of these will then define your training set. 123 00:08:47,100 --> 00:08:51,520 Finally, next let's describe the loss function 124 00:08:51,520 --> 00:08:53,930 you use to train the neural network. 125 00:08:53,930 --> 00:08:59,070 So the ground truth label was y, and the neural network outputs some yhat. 126 00:08:59,070 --> 00:09:01,010 What should the loss be? 127 00:09:01,010 --> 00:09:05,484 Well, if you're using squared error, 128 00:09:05,484 --> 00:09:10,105 then the loss can be (y1 hat - y1) 129 00:09:10,105 --> 00:09:15,026 squared + (y2 hat - y2) squared + 130 00:09:15,026 --> 00:09:19,810 ... + (y8 hat - y8) squared. 131 00:09:19,810 --> 00:09:23,970 Notice that y here has eight components. 132 00:09:23,970 --> 00:09:28,200 So the loss is the sum of the squares of the differences of the elements. 133 00:09:28,200 --> 00:09:33,650 And that's the loss if y1 = 1. 134 00:09:33,650 --> 00:09:36,690 So that's the case where there is an object. 135 00:09:36,690 --> 00:09:39,671 So y1 = pc. 136 00:09:39,671 --> 00:09:43,685 So if pc = 1, that is, if there is an object in the image, 137 00:09:43,685 --> 00:09:47,475 then the loss can be the sum of squares of all the different elements. 138 00:09:48,675 --> 00:09:53,418 The other case is if y1 = 0, 139 00:09:53,418 --> 00:09:57,790 so that's if pc = 0. 140 00:09:57,790 --> 00:10:04,930 In that case the loss can be just (y1 hat - y1) squared, 141 00:10:04,930 --> 00:10:11,170 because in that second case, all of the rest of the components are don't cares. 142 00:10:11,170 --> 00:10:16,068 And so all you care about is how accurately the neural 143 00:10:16,068 --> 00:10:19,390 network is outputting pc in that case. 144 00:10:19,390 --> 00:10:23,304 So just to recap, if y1 = 1, that's this case, 145 00:10:23,304 --> 00:10:28,343 then you can use squared error to penalize the squared deviation between 146 00:10:28,343 --> 00:10:33,402 the predicted and the actual output of all eight components. 147 00:10:33,402 --> 00:10:39,749 Whereas if y1 = 0, then the second through the eighth components are don't cares. 148 00:10:39,749 --> 00:10:44,698 So all you care about is how accurately your neural network 149 00:10:44,698 --> 00:10:48,880 is estimating y1, which is equal to pc.
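Here is one way that two-case squared-error loss could look in code, as a minimal sketch under the assumptions used above (eight-component label vectors with pc stored in the first position). It is meant to mirror the description, not to be the exact implementation used in the course.

    import numpy as np

    def localization_loss(y_hat, y):
        # y and y_hat are 8-component vectors: [pc, bx, by, bh, bw, c1, c2, c3].
        if y[0] == 1:
            # There is an object: squared error over all eight components.
            return np.sum((y_hat - y) ** 2)
        # No object: only the pc prediction matters; the rest are don't cares.
        return (y_hat[0] - y[0]) ** 2

    y_true = np.array([1, 0.5, 0.7, 0.3, 0.4, 0, 1, 0])    # the car example
    y_pred = np.array([0.9, 0.55, 0.68, 0.28, 0.43, 0.1, 0.8, 0.1])
    print(localization_loss(y_pred, y_true))    # all eight terms contribute

    y_none = np.zeros(8)                         # background example, pc = 0
    print(localization_loss(y_pred, y_none))     # only (0.9 - 0) squared contributes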
150 00:10:48,880 --> 00:10:53,554 Just as a side comment for those of you who want to know all the details, 151 00:10:53,554 --> 00:10:57,760 I've used squared error just to simplify the description here. 152 00:10:57,760 --> 00:11:02,840 In practice you could probably use a log-likelihood loss for 153 00:11:02,840 --> 00:11:06,438 c1, c2, c3, applied to the softmax output over 154 00:11:06,438 --> 00:11:10,118 those elements; usually you can use squared error or 155 00:11:10,118 --> 00:11:14,414 something like squared error for the bounding box coordinates, and 156 00:11:14,414 --> 00:11:19,200 for pc you could use something like the logistic regression loss. 157 00:11:19,200 --> 00:11:22,830 Although even if you use squared error, it'll probably work okay. 158 00:11:22,830 --> 00:11:27,030 So that's how you get a neural network to not just classify an object but 159 00:11:27,030 --> 00:11:29,140 also to localize it. 160 00:11:29,140 --> 00:11:33,270 The idea of having a neural network output a bunch of real numbers 161 00:11:33,270 --> 00:11:38,040 to tell you where things are in a picture turns out to be a very powerful idea. 162 00:11:38,040 --> 00:11:42,940 In the next video I want to share with you some other places where this idea of 163 00:11:42,940 --> 00:11:48,180 having a neural network output a set of real numbers, almost as a regression task, 164 00:11:48,180 --> 00:11:51,980 can be very powerful to use elsewhere in computer vision as well. 165 00:11:51,980 --> 00:11:53,360 So let's go on to the next video.