1
00:00:00,000 --> 00:00:03,990
One of the problems of Object Detection as you've learned about this so far,

2
00:00:03,990 --> 00:00:09,025
is that your algorithm may find multiple detections of the same objects.

3
00:00:09,025 --> 00:00:11,490
Rather than detecting an object just once,

4
00:00:11,490 --> 00:00:13,940
it might detect it multiple times.

5
00:00:13,940 --> 00:00:16,470
Non-max suppression is a way for you to make

6
00:00:16,470 --> 00:00:19,865
sure that your algorithm detects each object only once.

7
00:00:19,865 --> 00:00:21,220
Let's go through an example.

8
00:00:21,220 --> 00:00:23,610
Let's say you want to detect pedestrians,

9
00:00:23,610 --> 00:00:26,070
cars, and motorcycles in this image.

10
00:00:26,070 --> 00:00:29,040
You might place a grid over this,

11
00:00:29,040 --> 00:00:33,375
and this is a 19 by 19 grid.

12
00:00:33,375 --> 00:00:36,500
Now, while technically this car has just one midpoint,

13
00:00:36,500 --> 00:00:39,105
so it should be assigned just one grid cell.

14
00:00:39,105 --> 00:00:43,035
And the car on the left also has just one midpoint,

15
00:00:43,035 --> 00:00:48,745
so technically only one of those grid cells should predict that there is a car.

16
00:00:48,745 --> 00:00:50,355
In practice, you're running

17
00:00:50,355 --> 00:00:56,006
an object classification and localization algorithm for every one of these split cells.

18
00:00:56,006 --> 00:00:57,510
So it's quite possible that

19
00:00:57,510 --> 00:01:01,140
this split cell might think that the center of a car is in it,

20
00:01:01,140 --> 00:01:02,578
and so might this,

21
00:01:02,578 --> 00:01:05,130
and so might this, and for the car on the left as well.

22
00:01:05,130 --> 00:01:07,520
Maybe not only this box,

23
00:01:07,520 --> 00:01:09,957
if this is a test image you've seen before,

24
00:01:09,957 --> 00:01:14,070
not only that box might decide things that's on the car,

25
00:01:14,070 --> 00:01:16,665
maybe this box, and this box and maybe others as

26
00:01:16,665 --> 00:01:19,730
well will also think that they've found the car.

27
00:01:19,730 --> 00:01:24,865
Let's step through an example of how non-max suppression will work.

28
00:01:24,865 --> 00:01:26,550
So, because you're running

29
00:01:26,550 --> 00:01:31,710
the image classification and localization algorithm on every grid cell,

30
00:01:31,710 --> 00:01:34,635
on 361 grid cells,

31
00:01:34,635 --> 00:01:38,670
it's possible that many of them will raise their hand and say,

32
00:01:38,670 --> 00:01:43,845
"My Pc, my chance of thinking I have an object in it is large."

33
00:01:43,845 --> 00:01:47,120
Rather than just having two of the grid cells out of the

34
00:01:47,120 --> 00:01:51,330
19 squared or 361 think they have detected an object.

35
00:01:51,330 --> 00:01:52,905
So, when you run your algorithm,

36
00:01:52,905 --> 00:01:58,015
you might end up with multiple detections of each object.

37
00:01:58,015 --> 00:01:59,910
So, what non-max suppression does,

38
00:01:59,910 --> 00:02:02,645
is it cleans up these detections.

39
00:02:02,645 --> 00:02:06,660
So they end up with just one detection per car,

40
00:02:06,660 --> 00:02:09,910
rather than multiple detections per car.

41
00:02:09,910 --> 00:02:12,000
So concretely, what it does,

42
00:02:12,000 --> 00:02:16,645
is it first looks at the probabilities associated with each of these detections.

43
00:02:16,645 --> 00:02:18,630
Canada Pcs, although there are

44
00:02:18,630 --> 00:02:21,010
some details you'll learn about in this week's problem exercises,

45
00:02:21,010 --> 00:02:23,548
is actually Pc times C1,

46
00:02:23,548 --> 00:02:24,879
or C2, or C3.

47
00:02:24,879 --> 00:02:29,685
But for now, let's just say is Pc with the probability of a detection.

48
00:02:29,685 --> 00:02:32,190
And it first takes the largest one,

49
00:02:32,190 --> 00:02:35,070
which in this case is 0.9 and says,

50
00:02:35,070 --> 00:02:37,909
"That's my most confident detection,

51
00:02:37,909 --> 00:02:41,615
so let's highlight that and just say I found the car there."

52
00:02:41,615 --> 00:02:45,630
Having done that the non-max suppression part then looks at all of

53
00:02:45,630 --> 00:02:49,590
the remaining rectangles and all the ones with a high overlap,

54
00:02:49,590 --> 00:02:51,225
with a high IOU,

55
00:02:51,225 --> 00:02:54,625
with this one that you've just output will get suppressed.

56
00:02:54,625 --> 00:02:58,385
So those two rectangles with the 0.6 and the 0.7.

57
00:02:58,385 --> 00:03:02,048
Both of those overlap a lot with the light blue rectangle.

58
00:03:02,048 --> 00:03:03,555
So those, you are going to suppress

59
00:03:03,555 --> 00:03:07,105
and darken them to show that they are being suppressed.

60
00:03:07,105 --> 00:03:09,405
Next, you then go through the remaining rectangles

61
00:03:09,405 --> 00:03:11,760
and find the one with the highest probability,

62
00:03:11,760 --> 00:03:15,180
the highest Pc, which in this case is this one with 0.8.

63
00:03:15,180 --> 00:03:17,025
So let's commit to that and just say,

64
00:03:17,025 --> 00:03:18,480
"Oh, I've detected a car there."

65
00:03:18,480 --> 00:03:21,030
And then, the non-max suppression part is to

66
00:03:21,030 --> 00:03:25,785
then get rid of any other ones with a high IOU.

67
00:03:25,785 --> 00:03:30,315
So now, every rectangle has been either highlighted or darkened.

68
00:03:30,315 --> 00:03:33,295
And if you just get rid of the darkened rectangles,

69
00:03:33,295 --> 00:03:35,670
you are left with just the highlighted ones,

70
00:03:35,670 --> 00:03:39,325
and these are your two final predictions.

71
00:03:39,325 --> 00:03:41,445
So, this is non-max suppression.

72
00:03:41,445 --> 00:03:44,530
And non-max means that you're going to output

73
00:03:44,530 --> 00:03:48,215
your maximal probabilities classifications

74
00:03:48,215 --> 00:03:52,006
but suppress the close-by ones that are non-maximal.

75
00:03:52,006 --> 00:03:55,684
Hence the name, non-max suppression.

76
00:03:55,684 --> 00:03:58,185
Let's go through the details of the algorithm.

77
00:03:58,185 --> 00:04:00,590
First, on this 19 by 19 grid,

78
00:04:00,590 --> 00:04:07,925
you're going to get a 19 by 19 by eight output volume.

79
00:04:07,925 --> 00:04:09,945
Although, for this example,

80
00:04:09,945 --> 00:04:13,794
I'm going to simplify it to say that you only doing car detection.

81
00:04:13,794 --> 00:04:16,080
So, let me get rid of the C1, C2,

82
00:04:16,080 --> 00:04:18,480
C3, and pretend for this line,

83
00:04:18,480 --> 00:04:21,578
that each output for each of the 19 by 19,

84
00:04:21,578 --> 00:04:23,910
so for each of the 361,

85
00:04:23,910 --> 00:04:25,350
which is 19 squared,

86
00:04:25,350 --> 00:04:26,835
for each of the 361 positions,

87
00:04:26,835 --> 00:04:29,185
you get an output prediction of the following.

88
00:04:29,185 --> 00:04:31,443
Which is the chance there's an object,

89
00:04:31,443 --> 00:04:32,725
and then the bounding box.

90
00:04:32,725 --> 00:04:34,135
And if you have only one object,

91
00:04:34,135 --> 00:04:38,245
there's no C1, C2, C3 prediction.

92
00:04:38,245 --> 00:04:40,140
The details of what happens,

93
00:04:40,140 --> 00:04:41,875
you have multiple objects,

94
00:04:41,875 --> 00:04:43,870
I'll leave to the programming exercise,

95
00:04:43,870 --> 00:04:47,980
which you'll work on towards the end of this week.

96
00:04:47,980 --> 00:04:50,795
Now, to intimate non-max suppression,

97
00:04:50,795 --> 00:04:54,405
the first thing you can do is discard all the boxes,

98
00:04:54,405 --> 00:04:57,295
discard all the predictions of the bounding boxes with

99
00:04:57,295 --> 00:05:01,243
Pc less than or equal to some threshold, let's say 0.6.

100
00:05:01,243 --> 00:05:04,140
So we're going to say that unless you think there's at least a

101
00:05:04,140 --> 00:05:08,165
0.6 chance it is an object there, let's just get rid of it.

102
00:05:08,165 --> 00:05:13,590
This has caused all of the low probability output boxes.

103
00:05:13,590 --> 00:05:19,695
The way to think about this is for each of the 361 positions,

104
00:05:19,695 --> 00:05:23,625
you output a bounding box together

105
00:05:23,625 --> 00:05:28,140
with a probability of that bounding box being a good one.

106
00:05:28,140 --> 00:05:29,415
So we're just going to discard

107
00:05:29,415 --> 00:05:33,945
all the bounding boxes that were assigned a low probability.

108
00:05:33,945 --> 00:05:35,730
Next, while there are

109
00:05:35,730 --> 00:05:41,130
any remaining bounding boxes that you've not yet discarded or processed,

110
00:05:41,130 --> 00:05:45,910
you're going to repeatedly pick the box with the highest probability,

111
00:05:45,910 --> 00:05:47,835
with the highest Pc,

112
00:05:47,835 --> 00:05:50,380
and then output that as a prediction.

113
00:05:50,380 --> 00:05:54,720
So this is a process on a previous slide of taking one of the bounding boxes,

114
00:05:54,720 --> 00:05:56,725
and making it lighter in color.

115
00:05:56,725 --> 00:06:02,195
So you commit to outputting that as a prediction for that there is a car there.

116
00:06:02,195 --> 00:06:05,803
Next, you then discard any remaining box.

117
00:06:05,803 --> 00:06:08,905
Any box that you have not output as a prediction,

118
00:06:08,905 --> 00:06:10,955
and that was not previously discarded.

119
00:06:10,955 --> 00:06:14,490
So discard any remaining box with a high overlap,

120
00:06:14,490 --> 00:06:15,945
with a high IOU,

121
00:06:15,945 --> 00:06:20,305
with the box that you just output in the previous step.

122
00:06:20,305 --> 00:06:25,425
This second step in the while loop was when on the previous slide you would

123
00:06:25,425 --> 00:06:28,680
darken any remaining bounding box that had

124
00:06:28,680 --> 00:06:32,310
a high overlap with the bounding box that we just made lighter,

125
00:06:32,310 --> 00:06:34,115
that we just highlighted.

126
00:06:34,115 --> 00:06:36,835
And so, you keep doing this while there's

127
00:06:36,835 --> 00:06:40,000
still any remaining boxes that you've not yet processed,

128
00:06:40,000 --> 00:06:45,225
until you've taken each of the boxes and either output it as a prediction,

129
00:06:45,225 --> 00:06:48,990
or discarded it as having too high an overlap,

130
00:06:48,990 --> 00:06:50,580
or too high an IOU,

131
00:06:50,580 --> 00:06:53,550
with one of the boxes that you have just output as

132
00:06:53,550 --> 00:06:58,610
your predicted position for one of the detected objects.

133
00:07:00,000 --> 00:07:06,640
I've described the algorithm using just a single object on this slide.

134
00:07:06,640 --> 00:07:10,940
If you actually tried to detect three objects say pedestrians,

135
00:07:10,940 --> 00:07:16,215
cars, and motorcycles, then the output vector will have three additional components.

136
00:07:16,215 --> 00:07:18,860
And it turns out, the right thing to do is to

137
00:07:18,860 --> 00:07:22,745
independently carry out non-max suppression three times,

138
00:07:22,745 --> 00:07:26,635
one on each of the outputs classes.

139
00:07:26,635 --> 00:07:29,475
But the details of that, I'll leave to

140
00:07:29,475 --> 00:07:33,132
this week's program exercise where you get to implement that yourself,

141
00:07:33,132 --> 00:07:38,912
where you get to implement non-max suppression yourself on multiple object classes.

142
00:07:38,912 --> 00:07:41,210
So that's it for non-max suppression,

143
00:07:41,210 --> 00:07:45,090
and if you implement the Object Detection algorithm we've described,

144
00:07:45,090 --> 00:07:48,175
you actually get pretty decent results.

145
00:07:48,175 --> 00:07:51,876
But before wrapping up our discussion of the YOLO algorithm,

146
00:07:51,876 --> 00:07:54,810
there's just one last idea I want to share with you,

147
00:07:54,810 --> 00:07:57,295
which makes the algorithm work much better,

148
00:07:57,295 --> 00:08:00,235
which is the idea of using anchor boxes.

149
00:08:00,235 --> 00:08:02,000
Let's go on to the next video.