1
00:00:00,000 --> 00:00:03,235
If you look at the object detection literature,

2
00:00:03,235 --> 00:00:06,790
there's a set of ideas called region proposals

3
00:00:06,790 --> 00:00:10,995
that's been very influential in computer vision as well.

4
00:00:10,995 --> 00:00:14,460
I wanted to make this video optional because I tend to use

5
00:00:14,460 --> 00:00:19,275
the region proposal set of algorithms a bit less often, but nonetheless,

6
00:00:19,275 --> 00:00:22,170
it has been an influential body of work

7
00:00:22,170 --> 00:00:25,675
and an idea that you might come across in your own work.

8
00:00:25,675 --> 00:00:29,640
Let's take a look. So if you recall the sliding windows idea,

9
00:00:29,640 --> 00:00:33,225
you would take a trained classifier and run it

10
00:00:33,225 --> 00:00:37,695
across all of these different windows and run the detector to see if there's a car,

11
00:00:37,695 --> 00:00:40,190
pedestrian, or maybe a motorcycle.

12
00:00:40,190 --> 00:00:42,515
Now, you could run the algorithm convolutionally,

13
00:00:42,515 --> 00:00:45,570
but one downside of the algorithm is that it just

14
00:00:45,570 --> 00:00:49,410
classifies a lot of regions where there's clearly no object.

15
00:00:49,410 --> 00:00:52,569
So this rectangle down here is pretty much blank.

16
00:00:52,569 --> 00:00:55,658
There's clearly nothing interesting there to classify,

17
00:00:55,658 --> 00:00:58,610
and maybe it was also running it on this rectangle,

18
00:00:58,610 --> 00:01:01,365
which looks like there's nothing that interesting there.

19
00:01:01,365 --> 00:01:04,275
So what Ross Girshick, Jeff Donahue, Trevor Darrell,

20
00:01:04,275 --> 00:01:06,548
and Jitendra Malik proposed in the paper,

21
00:01:06,548 --> 00:01:07,905
as cited at the bottom of the slide,

22
00:01:07,905 --> 00:01:10,470
is an algorithm called R-CNN,

23
00:01:10,470 --> 00:01:15,915
which stands for Regions with convolutional networks or regions with CNNs.

24
00:01:15,915 --> 00:01:18,330
And what that does is it tries to pick

25
00:01:18,330 --> 00:01:22,925
just a few regions on which it makes sense to run your ConvNet classifier.

26
00:01:22,925 --> 00:01:27,505
So rather than running your sliding windows on every single window,

27
00:01:27,505 --> 00:01:30,330
you instead select just a few windows and

28
00:01:30,330 --> 00:01:33,570
run your ConvNet classifier on just those few windows.

29
00:01:33,570 --> 00:01:35,205
The way that they perform

30
00:01:35,205 --> 00:01:40,425
the region proposals is to run an algorithm called a segmentation algorithm,

31
00:01:40,425 --> 00:01:42,915
that results in this output on the right,

32
00:01:42,915 --> 00:01:46,170
in order to figure out what could be objects.

33
00:01:46,170 --> 00:01:50,306
So, for example, the segmentation algorithm finds a blob over here.

34
00:01:50,306 --> 00:01:53,625
And so you might pick that bounding box and say,

35
00:01:53,625 --> 00:01:55,680
"Let's run a classifier on that blob."

36
00:01:55,680 --> 00:01:58,730
It looks like this little green thing finds a blob there,

37
00:01:58,730 --> 00:02:00,960
so you might also run the classifier on

38
00:02:00,960 --> 00:02:04,650
that rectangle to see if there's something interesting there.

39
00:02:04,650 --> 00:02:06,000
And in this case,

40
00:02:06,000 --> 00:02:08,830
this blue blob, if you run a classifier on that,

41
00:02:08,830 --> 00:02:10,793
hopefully you'll find the pedestrian,

42
00:02:10,793 --> 00:02:13,575
and if you run it on this light cyan blob,

43
00:02:13,575 --> 00:02:16,120
maybe you'll find a car, maybe not.

44
00:02:16,120 --> 00:02:17,535
I'm not sure. So the details of this,

45
00:02:17,535 --> 00:02:20,080
this is called a segmentation algorithm,

46
00:02:20,080 --> 00:02:25,410
and what you do is you find maybe 2000 blobs and place bounding

47
00:02:25,410 --> 00:02:31,544
boxes around about 2000 blobs and run your classifier on just those 2000 blobs,

48
00:02:31,544 --> 00:02:34,380
and this can be a much smaller number of positions

49
00:02:34,380 --> 00:02:37,529
on which to run your ConvNet classifier,

50
00:02:37,529 --> 00:02:40,935
than if you had to run it at every single position throughout the image.

51
00:02:40,935 --> 00:02:44,172
And this is especially the case if you are running your ConvNet

52
00:02:44,172 --> 00:02:48,055
not just on square-shaped regions but running them on

53
00:02:48,055 --> 00:02:51,870
tall skinny regions to try to find pedestrians or running them on

54
00:02:51,870 --> 00:02:57,915
wide fat regions to try to find cars, and running them at multiple scales as well.

55
00:02:57,915 --> 00:03:02,170
So that's the R-CNN, or the regions with CNNs,

56
00:03:02,170 --> 00:03:04,380
or regions with CNN features, idea.

57
00:03:04,380 --> 00:03:08,305
Now, it turns out the R-CNN algorithm is still quite slow.

58
00:03:08,305 --> 00:03:13,320
So there's been a line of work to explore how to speed up this algorithm.

59
00:03:13,320 --> 00:03:16,920
So the basic R-CNN algorithm would propose regions using

60
00:03:16,920 --> 00:03:20,933
some algorithm and then classify the proposed regions one at a time.

61
00:03:20,933 --> 00:03:22,380
And for each of the regions,

62
00:03:22,380 --> 00:03:23,844
they will output the label.

63
00:03:23,844 --> 00:03:25,960
So is there a car? Is there a pedestrian?

64
00:03:25,960 --> 00:03:27,580
Is there a motorcycle there?

65
00:03:27,580 --> 00:03:30,090
And then it also outputs a bounding box,

66
00:03:30,090 --> 00:03:36,510
so you can get an accurate bounding box if indeed there is an object in that region.

67
00:03:36,510 --> 00:03:37,645
So just to be clear,

68
00:03:37,645 --> 00:03:42,075
the R-CNN algorithm doesn't just trust the bounding box it was given.

69
00:03:42,075 --> 00:03:44,540
It also outputs a bounding box,

70
00:03:44,540 --> 00:03:46,620
b_x, b_y, b_h, b_w,

71
00:03:46,620 --> 00:03:51,045
in order to get a more accurate bounding box than whatever happened

72
00:03:51,045 --> 00:03:56,070
to surround the blob that the image segmentation algorithm gave it.

73
00:03:56,070 --> 00:03:58,705
So it can get pretty accurate bounding boxes.

74
00:03:58,705 --> 00:04:03,425
Now, one downside of the R-CNN algorithm is that it is actually quite slow.

75
00:04:03,425 --> 00:04:04,470
So over the years,

76
00:04:04,470 --> 00:04:08,295
there have been a few improvements to the R-CNN algorithm.

77
00:04:08,295 --> 00:04:12,180
Ross Girshick proposed the Fast R-CNN algorithm,

78
00:04:12,180 --> 00:04:15,150
and it's basically the R-CNN algorithm but with

79
00:04:15,150 --> 00:04:18,290
a convolutional implementation of sliding windows.

80
00:04:18,290 --> 00:04:23,745
So the original implementation would actually classify the regions one at a time.

81
00:04:23,745 --> 00:04:28,955
So Fast R-CNN uses a convolutional implementation of sliding windows,

82
00:04:28,955 --> 00:04:35,550
and this is roughly similar to the idea you saw in the fourth video of this week.

83
00:04:35,550 --> 00:04:39,850
And that speeds up R-CNN quite a bit.

84
00:04:40,390 --> 00:04:46,680
It turns out that one of the problems of the Fast R-CNN algorithm is that

85
00:04:46,680 --> 00:04:53,270
the clustering step to propose the regions is still quite slow, and so a different group,

86
00:04:53,270 --> 00:04:56,025
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun,

87
00:04:56,025 --> 00:04:59,043
proposed the Faster R-CNN algorithm,

88
00:04:59,043 --> 00:05:02,520
which uses a convolutional neural network instead of one of

89
00:05:02,520 --> 00:05:07,550
the more traditional segmentation algorithms to propose the regions,

90
00:05:07,550 --> 00:05:12,487
and that wound up running quite a bit faster than the Fast R-CNN algorithm.

91
00:05:12,487 --> 00:05:15,810
Although I think that, for the Faster R-CNN algorithm,

92
00:05:15,810 --> 00:05:21,730
most implementations are usually still quite a bit slower than the YOLO algorithm.

93
00:05:21,730 --> 00:05:27,090
So the idea of region proposals has been quite influential in computer vision,

94
00:05:27,090 --> 00:05:32,995
and I wanted you to know about these ideas because you'll see others still using these ideas;

95
00:05:32,995 --> 00:05:35,595
for myself, and this is my personal opinion,

96
00:05:35,595 --> 00:05:38,893
not the opinion of the computer vision research community as a whole,

97
00:05:38,893 --> 00:05:44,100
I think that region proposals are an interesting idea, but that not having two steps,

98
00:05:44,100 --> 00:05:45,630
first propose regions and then classify them,

99
00:05:45,630 --> 00:05:49,800
being able to do everything more or less at the same time,

100
00:05:49,800 --> 00:05:53,085
similar to the YOLO, or You Only Look Once, algorithm,

101
00:05:53,085 --> 00:05:56,885
that seems to me like a more promising direction for the long term.

102
00:05:56,885 --> 00:05:58,995
But that's my personal opinion and not necessarily

103
00:05:58,995 --> 00:06:01,865
the opinion of the whole computer vision research community.

104
00:06:01,865 --> 00:06:04,868
So feel free to take that with a grain of salt,

105
00:06:04,868 --> 00:06:07,550
but I think that the R-CNN idea,

106
00:06:07,550 --> 00:06:10,438
you might come across others using it.

107
00:06:10,438 --> 00:06:14,460
So it was worth learning as well so you can understand others' algorithms better.

108
00:06:14,460 --> 00:06:21,565
So we've now finished up our material for this week on object detection.

109
00:06:21,565 --> 00:06:25,133
I hope you enjoy working on this week's programming exercise,

110
00:06:25,133 --> 00:06:27,000
and I look forward to seeing you next week.