In this video, we will take a quick look at other computer vision problems that successfully utilize convolutional networks.

So far, we have examined the image classification task, which takes an image as input and produces a class label as output. In this video, we will review two more tasks. The first one is semantic segmentation, where you have an image as input and, as output, you need to give a class label for each pixel of that image: for example, which pixels correspond to water, which to a duck, and which to grass. Another example is image classification plus localization. In this task, you not only need to say which object you see in the image, but also where you see it. For that, you need to define a bounding box, that is, a rectangular region that contains your object.

Let's start with semantic segmentation. For this task, we need to classify each pixel of our image. So what do we do when we have an image as input? We stack convolutional layers, right? And we keep the same width and height as our input image, because we will need to classify each pixel.

What do we do next? Usually we would add pooling layers, but in this particular task that is problematic: pooling effectively downsamples the image, so our classification would not be crisp, it would be pixelated. And we don't want that. So let's maintain the width and height of our feature maps as we stack more and more convolutional layers.

The final layer will be a different one. It will have the number of output channels equal to the number of classes that we need for our segmentation; for example, each depth slice will be responsible for classifying, let's say, water, duck, or grass. What we do in the end is, for every pixel (remember, we maintain the width and height), take all the values encoded in the depth of that final volume, the one shown in orange, and apply a softmax function over those values in the output channels.

This is a rather naive approach: we stack convolutional layers and add a per-pixel softmax. We go deep, but we don't add pooling.
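To make this concrete, here is a minimal sketch of the naive architecture in PyTorch. The framework choice and all names (NaiveSegNet, n_classes, width) are ours, not from the lecture; in practice you would train on the raw logits with nn.CrossEntropyLoss rather than applying the softmax yourself.

```python
import torch
import torch.nn as nn

class NaiveSegNet(nn.Module):
    """Naive fully convolutional segmentation: no pooling, so the
    spatial size is preserved by 'same' padding all the way through."""
    def __init__(self, n_classes=3, width=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, width, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Final layer: one output channel per class (e.g. water, duck, grass).
        self.classifier = nn.Conv2d(width, n_classes, kernel_size=1)

    def forward(self, x):
        logits = self.classifier(self.features(x))  # (N, n_classes, H, W)
        # Per-pixel class distribution: softmax over the channel dimension.
        return torch.softmax(logits, dim=1)

net = NaiveSegNet()
probs = net(torch.randn(1, 3, 64, 64))
print(probs.shape)  # torch.Size([1, 3, 64, 64]): one distribution per pixel
```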
But going deep without any pooling is computationally expensive. So let's add pooling, which acts like downsampling; we will paint it with a pink color. We have an image as input, then the first convolutional layer, followed by a pooling layer. After the pooling layer, we reduce the width and height of our volume and increase the depth. Then we have one more convolutional layer and one more pooling layer, and then we stack one more convolutional layer.

But wait a second, we need to classify each pixel, and right now the width and height of our volume are significantly reduced. We need to do unpooling somehow. For that task, we will use a special layer which will do upsampling, and we color it with a green color. After each upsampling layer, we use convolutional layers so that we can learn a transformation back to the original pixels. We add one more upsampling layer and one more convolutional layer, and that is how we get our semantic segmentation of the input pixels.

How do we do that unpooling? The easiest way is to fill in nearest-neighbor values: we take a 2x2 input and replace each cell with a 2x2 patch of the same value. This way, we get a pixelated output, and it is not crisp. It's not the best way to go.

Another technique is called max unpooling. Let's look at our architecture: we have corresponding pairs of downsampling and upsampling layers, which do the same thing but in reverse order. Let's use that correspondence. What if we remember which element was the maximum during pooling and fill that position during unpooling? Let's look at an example. We have a 4x4 input, and we apply 2x2 max pooling with stride 2, remembering which neurons gave us the maximum activations. Then comes the rest of the network, and at some point we have to do unpooling: we have to produce a 4x4 output out of a 2x2 input. We do that not by filling in nearest-neighbor values; instead, we put the values into the locations where we had the maximum activations during the corresponding pooling, and fill the remaining positions with zeros. This way, we get a crisper image, and it actually works better.
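Both unpooling schemes are available in PyTorch; here is a small sketch (the tensor values are our own toy example, not from the lecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([[[[1., 2., 6., 3.],
                    [3., 5., 2., 1.],
                    [1., 2., 2., 1.],
                    [7., 3., 4., 8.]]]])  # one image, one channel, 4x4

# 2x2 max pooling with stride 2; return_indices remembers where each max was.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
pooled, indices = pool(x)                 # pooled: [[5., 6.], [7., 8.]]
# ... the rest of the network would process `pooled` here ...

# Nearest-neighbor unpooling: every cell becomes a 2x2 patch of the same value.
nearest = F.interpolate(pooled, scale_factor=2, mode="nearest")

# Max unpooling: each value goes back to where its maximum came from,
# and all other positions are filled with zeros.
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
crisp = unpool(pooled, indices)
print(nearest.squeeze())  # blocky, pixelated 4x4
print(crisp.squeeze())    # 4x4 with the maxima restored in place
```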
The previous approaches are not data-driven. Imagine that your objects are round rather than square; nearest-neighbor unpooling and max unpooling are not aware of that, but we could actually use that information to do better unpooling.

Remember that we can replace a max pooling layer with a convolutional layer that has a bigger stride. What if we apply convolutions to do the unpooling as well? Let's see how it might work. We have a 2x2 input, and we somehow need to produce a 4x4 output. Let's use a 3x3 convolutional filter for that. How does it work? We take the convolutional filter, multiply it by the value in the red input cell, and add those values to the output. Then we move to the next pixel in the input, and in the output we move with stride 2, so that we double the resolution we had in the input. We take the kernel weights again, multiply them by the value in the blue input cell, and add the result to the output as well. But what do we do with the values where our placed filters intersect? We simply take the sum of those values, and it still works. This operation is commonly called a transposed convolution, and because the kernel weights are learned, the upsampling becomes data-driven.
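Here is a small sketch of that operation using PyTorch's nn.ConvTranspose2d. We fix the kernel to all ones so the overlap sums are easy to see, and the padding settings are our choice so that a 2x2 input gives exactly a 4x4 output, which is one common convention:

```python
import torch
import torch.nn as nn

# 3x3 kernel, stride 2 in the output; padding/output_padding chosen so
# that a 2x2 input produces exactly a 4x4 output.
tconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2,
                           padding=1, output_padding=1, bias=False)
with torch.no_grad():
    tconv.weight.fill_(1.0)  # all-ones kernel makes the sums visible

x = torch.tensor([[[[1., 2.],
                    [3., 4.]]]])
y = tconv(x)
print(y.squeeze())
# tensor([[ 1.,  3.,  2.,  2.],
#         [ 4., 10.,  6.,  6.],
#         [ 3.,  7.,  4.,  4.],
#         [ 3.,  7.,  4.,  4.]])
# Where the shifted copies of the kernel overlap, the contributions are
# summed: the 10 in the middle is 1 + 2 + 3 + 4. In a real network these
# weights are learned, so the model learns its own upsampling.
```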
Now let's move on to the object classification and localization task. For this, we need to find a bounding box to localize our object. Let's parameterize the bounding box with four numbers: x, y, w, and h. Here, x and y are the coordinates of the upper-left corner of the box, and w and h are its width and height. We can use regression to predict those four parameters. Let's see how it might work.

We have a classification network, which looks like a bunch of convolutional layers followed by a multilayer perceptron, and we train it using a cross-entropy loss. But do we need a second network to do the bounding box regression? Actually, no: we can reuse those convolutional layers for the new task and train a new fully connected head that predicts the bounding box parameters, using a mean squared error loss for it.

But how do we train such a network when we have two different losses? We simply take the sum of those losses, L = L_cross-entropy + L_MSE, and that gives us the final loss through which we propagate the gradients during backpropagation.

In this video, we took a sneak peek at other computer vision problems that successfully utilize convolutional neural networks. This video concludes our introduction to neural networks for images.