1 00:00:00,000 --> 00:00:03,990 One of the problems of Object Detection as you've learned about it so far, 2 00:00:03,990 --> 00:00:09,025 is that your algorithm may find multiple detections of the same object. 3 00:00:09,025 --> 00:00:11,490 Rather than detecting an object just once, 4 00:00:11,490 --> 00:00:13,940 it might detect it multiple times. 5 00:00:13,940 --> 00:00:16,470 Non-max suppression is a way for you to make 6 00:00:16,470 --> 00:00:19,865 sure that your algorithm detects each object only once. 7 00:00:19,865 --> 00:00:21,220 Let's go through an example. 8 00:00:21,220 --> 00:00:23,610 Let's say you want to detect pedestrians, 9 00:00:23,610 --> 00:00:26,070 cars, and motorcycles in this image. 10 00:00:26,070 --> 00:00:29,040 You might place a grid over this, 11 00:00:29,040 --> 00:00:33,375 and this is a 19 by 19 grid. 12 00:00:33,375 --> 00:00:36,500 Now, technically this car has just one midpoint, 13 00:00:36,500 --> 00:00:39,105 so it should be assigned just one grid cell. 14 00:00:39,105 --> 00:00:43,035 And the car on the left also has just one midpoint, 15 00:00:43,035 --> 00:00:48,745 so technically only one of those grid cells should predict that there is a car. 16 00:00:48,745 --> 00:00:50,355 But in practice, you're running 17 00:00:50,355 --> 00:00:56,006 an object classification and localization algorithm for every one of these grid cells. 18 00:00:56,006 --> 00:00:57,510 So it's quite possible that 19 00:00:57,510 --> 00:01:01,140 this grid cell might think that the center of a car is in it, 20 00:01:01,140 --> 00:01:02,578 and so might this, 21 00:01:02,578 --> 00:01:05,130 and so might this, and for the car on the left as well.
22 00:01:05,130 --> 00:01:07,520 Maybe not only this box, 23 00:01:07,520 --> 00:01:09,957 if this is a test image you've not seen before, 24 00:01:09,957 --> 00:01:14,070 not only that box might decide there's a car in it, 25 00:01:14,070 --> 00:01:16,665 maybe this box, and this box and maybe others as 26 00:01:16,665 --> 00:01:19,730 well will also think that they've found the car. 27 00:01:19,730 --> 00:01:24,865 Let's step through an example of how non-max suppression will work. 28 00:01:24,865 --> 00:01:26,550 So, because you're running 29 00:01:26,550 --> 00:01:31,710 the image classification and localization algorithm on every grid cell, 30 00:01:31,710 --> 00:01:34,635 on 361 grid cells, 31 00:01:34,635 --> 00:01:38,670 it's possible that many of them will raise their hand and say, 32 00:01:38,670 --> 00:01:43,845 "My Pc, my estimated chance of having an object in it, is large." 33 00:01:43,845 --> 00:01:47,120 Rather than just having two of the grid cells out of the 34 00:01:47,120 --> 00:01:51,330 19 squared or 361 think they have detected an object. 35 00:01:51,330 --> 00:01:52,905 So, when you run your algorithm, 36 00:01:52,905 --> 00:01:58,015 you might end up with multiple detections of each object. 37 00:01:58,015 --> 00:01:59,910 So, what non-max suppression does, 38 00:01:59,910 --> 00:02:02,645 is it cleans up these detections. 39 00:02:02,645 --> 00:02:06,660 So you end up with just one detection per car, 40 00:02:06,660 --> 00:02:09,910 rather than multiple detections per car. 41 00:02:09,910 --> 00:02:12,000 So concretely, what it does, 42 00:02:12,000 --> 00:02:16,645 is it first looks at the probabilities associated with each of these detections. 43 00:02:16,645 --> 00:02:18,630 Now the Pc, although there are 44 00:02:18,630 --> 00:02:21,010 some details you'll learn about in this week's programming exercises, 45 00:02:21,010 --> 00:02:23,548 is actually Pc times C1, 46 00:02:23,548 --> 00:02:24,879 or C2, or C3.
47 00:02:24,879 --> 00:02:29,685 But for now, let's just say that Pc is the probability of a detection. 48 00:02:29,685 --> 00:02:32,190 And it first takes the largest one, 49 00:02:32,190 --> 00:02:35,070 which in this case is 0.9 and says, 50 00:02:35,070 --> 00:02:37,909 "That's my most confident detection, 51 00:02:37,909 --> 00:02:41,615 so let's highlight that and just say I found the car there." 52 00:02:41,615 --> 00:02:45,630 Having done that, the non-max suppression part then looks at all of 53 00:02:45,630 --> 00:02:49,590 the remaining rectangles, and all the ones with a high overlap, 54 00:02:49,590 --> 00:02:51,225 with a high IOU, 55 00:02:51,225 --> 00:02:54,625 with this one that you've just output will get suppressed. 56 00:02:54,625 --> 00:02:58,385 So those two rectangles with the 0.6 and the 0.7, 57 00:02:58,385 --> 00:03:02,048 both of those overlap a lot with the light blue rectangle. 58 00:03:02,048 --> 00:03:03,555 So those, you are going to suppress 59 00:03:03,555 --> 00:03:07,105 and darken them to show that they are being suppressed. 60 00:03:07,105 --> 00:03:09,405 Next, you then go through the remaining rectangles 61 00:03:09,405 --> 00:03:11,760 and find the one with the highest probability, 62 00:03:11,760 --> 00:03:15,180 the highest Pc, which in this case is this one with 0.8. 63 00:03:15,180 --> 00:03:17,025 So let's commit to that and just say, 64 00:03:17,025 --> 00:03:18,480 "Oh, I've detected a car there." 65 00:03:18,480 --> 00:03:21,030 And then, the non-max suppression part is to 66 00:03:21,030 --> 00:03:25,785 then get rid of any other ones with a high IOU. 67 00:03:25,785 --> 00:03:30,315 So now, every rectangle has been either highlighted or darkened. 68 00:03:30,315 --> 00:03:33,295 And if you just get rid of the darkened rectangles, 69 00:03:33,295 --> 00:03:35,670 you are left with just the highlighted ones, 70 00:03:35,670 --> 00:03:39,325 and these are your two final predictions.
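The walkthrough above relies on IOU (intersection over union) to decide which rectangles overlap "a lot". As a minimal sketch, assuming boxes are given as corner coordinates (x1, y1, x2, y2), which is a representation chosen here for illustration and not stated in the lecture, the overlap measure could be computed like this. The function name `iou` is hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Assumption for this sketch: each box is (x1, y1, x2, y2)
    with x1 < x2 and y1 < y2.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


# Two 2x2 boxes shifted by one unit share half of each box's area.
half_overlap = iou((0, 0, 2, 2), (1, 0, 3, 2))
no_overlap = iou((0, 0, 1, 1), (2, 2, 3, 3))
```

Identical boxes give an IOU of 1, disjoint boxes give 0, and rectangles that all sit on the same car would typically land somewhere in between.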
71 00:03:39,325 --> 00:03:41,445 So, this is non-max suppression. 72 00:03:41,445 --> 00:03:44,530 And non-max means that you're going to output 73 00:03:44,530 --> 00:03:48,215 your maximal probability classifications 74 00:03:48,215 --> 00:03:52,006 but suppress the close-by ones that are non-maximal. 75 00:03:52,006 --> 00:03:55,684 Hence the name, non-max suppression. 76 00:03:55,684 --> 00:03:58,185 Let's go through the details of the algorithm. 77 00:03:58,185 --> 00:04:00,590 First, on this 19 by 19 grid, 78 00:04:00,590 --> 00:04:07,925 you're going to get a 19 by 19 by eight output volume. 79 00:04:07,925 --> 00:04:09,945 Although, for this example, 80 00:04:09,945 --> 00:04:13,794 I'm going to simplify it to say that you're only doing car detection. 81 00:04:13,794 --> 00:04:16,080 So, let me get rid of the C1, C2, 82 00:04:16,080 --> 00:04:18,480 C3, and pretend for this slide, 83 00:04:18,480 --> 00:04:21,578 that each output for each of the 19 by 19, 84 00:04:21,578 --> 00:04:23,910 so for each of the 361, 85 00:04:23,910 --> 00:04:25,350 which is 19 squared, 86 00:04:25,350 --> 00:04:26,835 for each of the 361 positions, 87 00:04:26,835 --> 00:04:29,185 you get an output prediction of the following. 88 00:04:29,185 --> 00:04:31,443 Which is the chance there's an object, 89 00:04:31,443 --> 00:04:32,725 and then the bounding box. 90 00:04:32,725 --> 00:04:34,135 And if you have only one object class, 91 00:04:34,135 --> 00:04:38,245 there's no C1, C2, C3 prediction. 92 00:04:38,245 --> 00:04:40,140 The details of what happens 93 00:04:40,140 --> 00:04:41,875 when you have multiple object classes, 94 00:04:41,875 --> 00:04:43,870 I'll leave to the programming exercise, 95 00:04:43,870 --> 00:04:47,980 which you'll work on towards the end of this week.
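To make the simplified output concrete, here is a small sketch of what "one prediction per grid position" could look like in code. It assumes each cell holds five numbers in the order [pc, bx, by, bh, bw]; that ordering, the helper name `flatten_grid`, and the dictionary keys are assumptions made for illustration, since the lecture only says each position outputs the chance of an object plus a bounding box.

```python
GRID = 19  # the lecture's 19 by 19 grid


def flatten_grid(volume):
    """Turn a GRID x GRID x 5 nested list into a flat list of predictions.

    Assumed (hypothetical) per-cell layout: [pc, bx, by, bh, bw].
    """
    predictions = []
    for row in range(GRID):
        for col in range(GRID):
            pc, bx, by, bh, bw = volume[row][col]
            predictions.append(
                {"cell": (row, col), "pc": pc, "box": (bx, by, bh, bw)}
            )
    return predictions


# Dummy volume: every cell reports "no object" except one confident cell.
volume = [[[0.0, 0.5, 0.5, 0.1, 0.1] for _ in range(GRID)] for _ in range(GRID)]
volume[9][12][0] = 0.9

predictions = flatten_grid(volume)
num_positions = len(predictions)  # 19 * 19 positions
```

With three classes, each cell would carry three more numbers (C1, C2, C3), which is where the 19 by 19 by eight volume comes from.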
96 00:04:47,980 --> 00:04:50,795 Now, to implement non-max suppression, 97 00:04:50,795 --> 00:04:54,405 the first thing you can do is discard all the boxes, 98 00:04:54,405 --> 00:04:57,295 discard all the predictions of the bounding boxes with 99 00:04:57,295 --> 00:05:01,243 Pc less than or equal to some threshold, let's say 0.6. 100 00:05:01,243 --> 00:05:04,140 So we're going to say that unless you think there's at least a 101 00:05:04,140 --> 00:05:08,165 0.6 chance there is an object there, let's just get rid of it. 102 00:05:08,165 --> 00:05:13,590 This discards all of the low-probability output boxes. 103 00:05:13,590 --> 00:05:19,695 The way to think about this is for each of the 361 positions, 104 00:05:19,695 --> 00:05:23,625 you output a bounding box together 105 00:05:23,625 --> 00:05:28,140 with a probability of that bounding box being a good one. 106 00:05:28,140 --> 00:05:29,415 So we're just going to discard 107 00:05:29,415 --> 00:05:33,945 all the bounding boxes that were assigned a low probability. 108 00:05:33,945 --> 00:05:35,730 Next, while there are 109 00:05:35,730 --> 00:05:41,130 any remaining bounding boxes that you've not yet discarded or processed, 110 00:05:41,130 --> 00:05:45,910 you're going to repeatedly pick the box with the highest probability, 111 00:05:45,910 --> 00:05:47,835 with the highest Pc, 112 00:05:47,835 --> 00:05:50,380 and then output that as a prediction. 113 00:05:50,380 --> 00:05:54,720 So this is the process on the previous slide of taking one of the bounding boxes, 114 00:05:54,720 --> 00:05:56,725 and making it lighter in color. 115 00:05:56,725 --> 00:06:02,195 So you commit to outputting that as a prediction that there is a car there. 116 00:06:02,195 --> 00:06:05,803 Next, you then discard any remaining box. 117 00:06:05,803 --> 00:06:08,905 Any box that you have not output as a prediction, 118 00:06:08,905 --> 00:06:10,955 and that was not previously discarded.
119 00:06:10,955 --> 00:06:14,490 So discard any remaining box with a high overlap, 120 00:06:14,490 --> 00:06:15,945 with a high IOU, 121 00:06:15,945 --> 00:06:20,305 with the box that you just output in the previous step. 122 00:06:20,305 --> 00:06:25,425 This second step in the while loop was when on the previous slide you would 123 00:06:25,425 --> 00:06:28,680 darken any remaining bounding box that had 124 00:06:28,680 --> 00:06:32,310 a high overlap with the bounding box that we just made lighter, 125 00:06:32,310 --> 00:06:34,115 that we just highlighted. 126 00:06:34,115 --> 00:06:36,835 And so, you keep doing this while there are 127 00:06:36,835 --> 00:06:40,000 still any remaining boxes that you've not yet processed, 128 00:06:40,000 --> 00:06:45,225 until you've taken each of the boxes and either output it as a prediction, 129 00:06:45,225 --> 00:06:48,990 or discarded it as having too high an overlap, 130 00:06:48,990 --> 00:06:50,580 or too high an IOU, 131 00:06:50,580 --> 00:06:53,550 with one of the boxes that you have just output as 132 00:06:53,550 --> 00:06:58,610 your predicted position for one of the detected objects. 133 00:07:00,000 --> 00:07:06,640 I've described the algorithm using just a single object class on this slide. 134 00:07:06,640 --> 00:07:10,940 If you actually tried to detect three object classes, say pedestrians, 135 00:07:10,940 --> 00:07:16,215 cars, and motorcycles, then the output vector will have three additional components. 136 00:07:16,215 --> 00:07:18,860 And it turns out, the right thing to do is to 137 00:07:18,860 --> 00:07:22,745 independently carry out non-max suppression three times, 138 00:07:22,745 --> 00:07:26,635 once for each of the output classes.
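The steps just described (threshold on Pc, then repeatedly output the highest-Pc box and discard the boxes that overlap it) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the course's reference implementation: the function names, the (x1, y1, x2, y2) corner format, and the IOU threshold of 0.5 are assumptions, while the 0.6 score threshold is the value used in the lecture. The box coordinates in the example are made up.

```python
def iou(a, b):
    """Intersection over union; boxes are assumed to be (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, score_threshold=0.6, iou_threshold=0.5):
    """detections: list of (pc, box) pairs for ONE object class.

    iou_threshold=0.5 is an assumed value for "high overlap".
    """
    # Step 1: discard every box with Pc <= score_threshold.
    remaining = [d for d in detections if d[0] > score_threshold]
    kept = []
    # Step 2: while boxes remain, output the highest-Pc box as a prediction,
    # then discard the remaining boxes that overlap it too much.
    while remaining:
        best = max(remaining, key=lambda d: d[0])
        kept.append(best)
        remaining = [
            d for d in remaining
            if d is not best and iou(d[1], best[1]) < iou_threshold
        ]
    return kept


def nms_per_class(detections_by_class, **kwargs):
    """Run non-max suppression independently for each class (hypothetical helper)."""
    return {
        cls: non_max_suppression(dets, **kwargs)
        for cls, dets in detections_by_class.items()
    }


# Scores follow the lecture's example; coordinates are invented for the sketch.
car_detections = [
    (0.9, (100, 100, 200, 180)),  # right car, most confident
    (0.7, (105, 105, 205, 185)),  # right car, overlaps the 0.9 box
    (0.6, (95, 98, 195, 178)),    # right car, removed by the Pc threshold
    (0.8, (10, 110, 90, 170)),    # left car, most confident
    (0.7, (14, 112, 94, 172)),    # left car, overlaps the 0.8 box
]
final = non_max_suppression(car_detections)
final_scores = [pc for pc, _ in final]
```

In this example the two boxes that survive are the 0.9 and 0.8 detections, one per car. Running `nms_per_class` on a dictionary such as `{"car": [...], "pedestrian": [...]}` applies the same procedure separately to each class, so a pedestrian box never suppresses a car box.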
139 00:07:26,635 --> 00:07:29,475 But the details of that, I'll leave to 140 00:07:29,475 --> 00:07:33,132 this week's programming exercise where you get to implement that yourself, 141 00:07:33,132 --> 00:07:38,912 where you get to implement non-max suppression yourself on multiple object classes. 142 00:07:38,912 --> 00:07:41,210 So that's it for non-max suppression, 143 00:07:41,210 --> 00:07:45,090 and if you implement the Object Detection algorithm we've described, 144 00:07:45,090 --> 00:07:48,175 you actually get pretty decent results. 145 00:07:48,175 --> 00:07:51,876 But before wrapping up our discussion of the YOLO algorithm, 146 00:07:51,876 --> 00:07:54,810 there's just one last idea I want to share with you, 147 00:07:54,810 --> 00:07:57,295 which makes the algorithm work much better, 148 00:07:57,295 --> 00:08:00,235 which is the idea of using anchor boxes. 149 00:08:00,235 --> 00:08:02,000 Let's go on to the next video.