1 00:00:00,000 --> 00:00:03,566 [MUSIC] 2 00:00:03,566 --> 00:00:08,120 In this video, you will learn about one more useful layer of neurons. 3 00:00:08,120 --> 00:00:12,450 And at the end, we will build our first fully working neural network for images. 4 00:00:13,940 --> 00:00:17,640 But first, let's look at how we deal with color images. 5 00:00:17,640 --> 00:00:22,560 When an image has color, that means it has three input channels. 6 00:00:22,560 --> 00:00:27,590 And that makes it not a matrix but a tensor, which is a multidimensional array, 7 00:00:27,590 --> 00:00:31,480 where W is the image width, H is the image height, and 8 00:00:31,480 --> 00:00:36,230 Cin is the number of input channels, for example the 3 RGB channels. 9 00:00:37,710 --> 00:00:40,970 It looks like this, but how do we apply convolutions? 10 00:00:40,970 --> 00:00:48,237 The convolutional kernel becomes a tensor as well, of size Wk by Hk by Cin. 11 00:00:48,237 --> 00:00:52,930 And what we do is extract volumetric patches from the image, 12 00:00:52,930 --> 00:00:56,330 take a dot product with this kernel, and get an output in the 13 00:00:56,330 --> 00:00:58,540 feature map, which is denoted as a red square. 14 00:00:59,680 --> 00:01:02,020 If we move that volumetric patch, 15 00:01:02,020 --> 00:01:05,900 we get a different output in a different location of our feature map. 16 00:01:06,990 --> 00:01:10,790 You see, we have a volumetric image as an input, and 17 00:01:10,790 --> 00:01:13,800 we have a feature map as an output. 18 00:01:13,800 --> 00:01:18,560 And actually, it looks like we've lost some depth, and 19 00:01:18,560 --> 00:01:22,440 we need more filters, because one filter will not solve our problem. 20 00:01:24,340 --> 00:01:31,840 And that means that we need to train Cout kernels of size Wk by Hk by Cin. 21 00:01:31,840 --> 00:01:35,389 Having a stride of 1 and enough zero padding, 22 00:01:35,389 --> 00:01:39,050 we can have W by H by Cout output neurons. 23 00:01:39,050 --> 00:01:44,620 So actually, we've taken a volume, 24 00:01:44,620 --> 00:01:48,740 which was an image, and we translated it into another volume. 25 00:01:48,740 --> 00:01:53,440 Every depth slice of that output volume corresponds to one feature map, 26 00:01:53,440 --> 00:01:54,850 to one convolutional kernel. 27 00:01:57,140 --> 00:02:04,940 Using (Wk by Hk by Cin + 1) per kernel, where the +1 is the bias term, multiplied by Cout, 28 00:02:04,940 --> 00:02:09,240 that's how many parameters we have used to turn one volume into another. 29 00:02:11,610 --> 00:02:15,240 But it turns out that one convolutional layer is not enough. 30 00:02:15,240 --> 00:02:17,880 Let's say the neurons of the first convolutional layer 31 00:02:17,880 --> 00:02:20,420 look at patches of the image of size 3 by 3. 32 00:02:21,470 --> 00:02:24,770 But what if an object of interest is bigger than that? 33 00:02:24,770 --> 00:02:29,260 Then it looks like we need a second convolutional layer on top of the first. 34 00:02:29,260 --> 00:02:30,720 That's how it looks. 35 00:02:30,720 --> 00:02:36,072 The first 3 by 3 convolutional layer will have a local receptive field of 3 by 3. 36 00:02:36,072 --> 00:02:41,950 You can see a green neuron that uses a 3 by 3 local receptive field.
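(A quick aside, not from the lecture itself: the shape bookkeeping and the (Wk by Hk by Cin + 1) times Cout parameter count above can be checked with a minimal PyTorch sketch. The concrete sizes here, Cin = 3, Cout = 16, and a 3 by 3 kernel, are illustrative assumptions.)

```python
import torch
import torch.nn as nn

# One convolutional layer turning a W x H x Cin volume into a W x H x Cout volume
# (stride 1, with zero padding chosen to keep the width and height unchanged).
Cin, Cout, Wk = 3, 16, 3
conv = nn.Conv2d(Cin, Cout, kernel_size=Wk, stride=1, padding=Wk // 2)

x = torch.randn(1, Cin, 32, 32)   # a made-up 32x32 RGB image (batch of 1)
print(conv(x).shape)              # torch.Size([1, 16, 32, 32])

# Parameter count: (Wk * Hk * Cin + 1) * Cout, the +1 being the bias term.
n_params = sum(p.numel() for p in conv.parameters())
print(n_params, (Wk * Wk * Cin + 1) * Cout)   # 448 448
```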
37 00:02:41,950 --> 00:02:46,730 But if we take the second convolutional layer on top of the first, then 38 00:02:46,730 --> 00:02:50,840 the neurons of the second convolutional layer will actually have a receptive field 39 00:02:50,840 --> 00:02:54,830 of 5 by 5, because of the underlying neurons and their receptive fields. 40 00:02:56,030 --> 00:03:00,090 Let's look at what happens if we stack N convolutional layers. 41 00:03:00,090 --> 00:03:03,320 For simplicity, let's look at one-dimensional inputs, 42 00:03:03,320 --> 00:03:04,970 which are the white circles. 43 00:03:04,970 --> 00:03:09,870 Our first 1 by 3 convolutional layer will have a receptive field of 1 by 3. 44 00:03:09,870 --> 00:03:13,430 When we take a second convolutional layer of the same size, 45 00:03:13,430 --> 00:03:16,000 then we have a receptive field of 1 by 5. 46 00:03:16,000 --> 00:03:21,230 If we continue, then after the fourth layer we have a receptive field of 1 by 9. 47 00:03:21,230 --> 00:03:24,310 Can you derive a formula from this? 48 00:03:24,310 --> 00:03:30,090 Of course: if we stack N convolutional layers with the same kernel size 3 by 3, 49 00:03:30,090 --> 00:03:34,920 the receptive field on the Nth layer will be (2N + 1) by (2N + 1). 50 00:03:34,920 --> 00:03:35,930 What does it mean? 51 00:03:35,930 --> 00:03:39,500 It looks like we need to stack a lot of convolutional layers to be 52 00:03:39,500 --> 00:03:44,460 able to identify objects as big as the input image. For an image of, let's say, 300 by 300, 53 00:03:44,460 --> 00:03:48,850 we will need 150 convolutional layers. 54 00:03:48,850 --> 00:03:51,430 We need to grow the receptive field faster. 55 00:03:51,430 --> 00:03:54,310 We can increase the stride in our convolutional layer 56 00:03:54,310 --> 00:03:56,680 to reduce the output dimensions. 57 00:03:56,680 --> 00:04:00,750 Let's see how it works for a 2 by 2 convolution with stride 2. 58 00:04:00,750 --> 00:04:06,870 We're effectively splitting the image into non-overlapping patches, colored pink, 59 00:04:06,870 --> 00:04:08,940 red, yellow, and blue. 60 00:04:08,940 --> 00:04:15,422 If we use the same backslash kernel that we reviewed in the previous video, 61 00:04:15,422 --> 00:04:19,506 then we will have the result of 7, 9, 4 and 6. 62 00:04:19,506 --> 00:04:21,340 That's how our convolution works. 63 00:04:22,960 --> 00:04:28,262 If we add a second convolutional layer of the same size, 2 by 2, 64 00:04:28,262 --> 00:04:33,864 then those layers will effectively double their receptive field, 65 00:04:33,864 --> 00:04:36,581 because we use a stride of 2. 66 00:04:36,581 --> 00:04:39,530 But how do we maintain translation invariance? 67 00:04:39,530 --> 00:04:43,540 Remember this slide from the previous video, where we had a slash that traveled across 68 00:04:43,540 --> 00:04:48,380 our image, but we were still able to detect that it was a slash. 69 00:04:49,870 --> 00:04:53,830 That was because the maximum of our activations was 2 in the first case and 70 00:04:53,830 --> 00:04:55,639 2 in the second case; it didn't change. 71 00:04:56,670 --> 00:04:58,570 Actually, we will use this idea and 72 00:04:58,570 --> 00:05:02,420 introduce a new layer that is called a pooling layer. 73 00:05:02,420 --> 00:05:06,250 This layer works like a convolutional layer, but it doesn't have a kernel. 74 00:05:06,250 --> 00:05:11,001 Instead, it calculates the maximum or average of its inputs. 75 00:05:11,001 --> 00:05:12,547 Let's look at an example.
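(Another aside, before the pooling example: the receptive-field growth just derived can be verified with a few lines of plain Python. The helper below is my own sketch of the standard recurrence, not code from the lecture.)

```python
def receptive_field(num_layers, kernel=3, stride=1):
    """Receptive field (along one axis) of a stack of identical conv layers.
    With stride 1 this reproduces the 2N + 1 rule from the lecture."""
    rf, jump = 1, 1                 # field of one input pixel; step between neighbor outputs
    for _ in range(num_layers):
        rf += (kernel - 1) * jump   # each layer extends the field by (kernel - 1) steps
        jump *= stride              # striding makes every later step cover more pixels
    return rf

print(receptive_field(4))                      # 9   -> the 1 by 9 field above
print(receptive_field(150))                    # 301 -> enough for a 300 by 300 image
print(receptive_field(2, kernel=2, stride=2))  # 4   -> stride 2 doubles the field per layer
```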
76 00:05:12,547 --> 00:05:16,153 We have a 200 by 200 by 64 input volume; 77 00:05:16,153 --> 00:05:20,040 let's take a single depth slice from that volume. 78 00:05:21,070 --> 00:05:25,730 Let's apply 2 by 2 max pooling with stride 2. 79 00:05:25,730 --> 00:05:26,990 How does max pooling work? 80 00:05:26,990 --> 00:05:31,740 We take the red patch and take the maximum value from there, and that is our output. 81 00:05:31,740 --> 00:05:33,650 In this case, it is 6. 82 00:05:33,650 --> 00:05:37,188 Then we take the next patch, which is the green one, and 83 00:05:37,188 --> 00:05:40,503 take the maximum value from that patch and get 8. 84 00:05:40,503 --> 00:05:42,820 And that's how max pooling works. 85 00:05:42,820 --> 00:05:47,510 If you look at the feature map, it actually means that we downsample our 86 00:05:47,510 --> 00:05:52,750 image. We lose some details, but it stays pretty much the same. 87 00:05:53,830 --> 00:05:59,050 And notice one more thing: when we apply pooling, we do it depth-wise. 88 00:05:59,050 --> 00:06:03,680 That means that we don't change the number of output channels, 89 00:06:03,680 --> 00:06:06,180 we only change the spatial dimensions. 90 00:06:06,180 --> 00:06:12,894 So the volume of 200 by 200 by 64 becomes a volume of 100 by 100 by 64. 91 00:06:13,950 --> 00:06:17,520 But how does backpropagation work for a max pooling layer? 92 00:06:17,520 --> 00:06:22,190 Strictly speaking, maximum is not a differentiable function, but 93 00:06:22,190 --> 00:06:25,990 we will apply some heuristics here and make it work. 94 00:06:27,710 --> 00:06:34,444 Let's look at the patch that the max pooling layer uses for taking the maximum. 95 00:06:34,444 --> 00:06:38,456 Let's take one neuron which does not have the maximum activation. 96 00:06:38,456 --> 00:06:42,103 Let's say it is denoted by the yellow color here. 97 00:06:42,103 --> 00:06:46,476 If we change its value a little bit, it will not change the maximum 98 00:06:46,476 --> 00:06:52,090 over this patch; the maximum will stay the same, which in this case is 8. 99 00:06:52,090 --> 00:06:56,921 That means that there is no gradient with respect to non-maximum patch neurons, 100 00:06:56,921 --> 00:07:00,430 since changing them slightly doesn't affect the output. 101 00:07:01,650 --> 00:07:04,640 But what happens if we change the neuron that 102 00:07:04,640 --> 00:07:07,370 provides the maximum value in the max pooling layer? 103 00:07:08,420 --> 00:07:12,250 If we change it, then the maximum will change as well, and 104 00:07:12,250 --> 00:07:14,310 it will change linearly. 105 00:07:14,310 --> 00:07:19,240 That means that for the maximum patch neuron, we have a gradient of 1. 106 00:07:19,240 --> 00:07:24,430 Let's put it all together into a simple convolutional neural network 107 00:07:24,430 --> 00:07:27,490 that was developed in 1998 by Yann LeCun for 108 00:07:27,490 --> 00:07:31,090 handwritten digit recognition on the MNIST dataset. 109 00:07:31,090 --> 00:07:37,030 This dataset contains 10 classes of handwritten digits, from 0 to 9. 110 00:07:38,380 --> 00:07:39,760 So how does it work? 111 00:07:39,760 --> 00:07:44,690 We take our input, which is a grayscale image of size 32 by 32. 112 00:07:44,690 --> 00:07:48,540 We apply our first convolutional layer, 113 00:07:48,540 --> 00:07:53,390 with 5 by 5 convolutions, and we learn six different kernels here. 114 00:07:54,730 --> 00:07:57,280 Then, we apply a pooling layer so 115 00:07:57,280 --> 00:08:02,250 that we lose some details and gain some translation invariance.
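(One more aside: both the max pooling operation and the gradient routing just described can be checked in PyTorch. The input values below are made up, chosen only so that the first two pooled outputs are the 6 and the 8 from the example.)

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]], requires_grad=True)

# 2x2 max pooling with stride 2 on a single depth slice
# (the [None, None] adds the batch and channel dimensions PyTorch expects).
pooled = F.max_pool2d(x[None, None], kernel_size=2, stride=2)
print(pooled.squeeze())   # tensor([[6., 8.], [3., 4.]])

# Backpropagate a gradient of 1 through every pooled output:
pooled.sum().backward()
print(x.grad)             # 1 at each patch's maximum, 0 for all other neurons
```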
116 00:08:02,250 --> 00:08:07,600 The pooling layer effectively halves the resolution of the image, and 117 00:08:07,600 --> 00:08:13,050 it becomes 14 by 14 by 6; the number of output channels doesn't change. 118 00:08:13,050 --> 00:08:18,110 Then let's add one more convolutional layer, which is the yellow one, and 119 00:08:18,110 --> 00:08:23,010 let's use the same kernel size, which is now 5 by 5 by 6. 120 00:08:23,010 --> 00:08:25,850 And let's learn 16 of these kernels. 121 00:08:27,540 --> 00:08:28,530 What do we do next? 122 00:08:28,530 --> 00:08:31,260 Then we apply one more pooling layer, right? 123 00:08:31,260 --> 00:08:34,019 And we have a 5 by 5 by 16 volume. 124 00:08:35,140 --> 00:08:39,180 We could go on and on, but at some point we have to stop. 125 00:08:39,180 --> 00:08:44,330 And then we will have to use some classifier that will use those features 126 00:08:44,330 --> 00:08:48,270 and output the probabilities of the digits. 127 00:08:48,270 --> 00:08:53,300 And for that purpose, we will use a bunch of fully connected layers: 128 00:08:53,300 --> 00:08:57,290 a fully connected layer of 120 neurons, then 84, and then 129 00:08:57,290 --> 00:09:01,650 10 neurons with a softmax function applied to the output. 130 00:09:03,680 --> 00:09:08,090 So what can we see from this diagram? 131 00:09:08,090 --> 00:09:12,140 It is known that neurons of deep convolutional layers learn 132 00:09:12,140 --> 00:09:17,920 complex representations that can be used as features for classification with an MLP. 133 00:09:17,920 --> 00:09:23,015 The first, convolutional/pooling part is actually an automatic feature 134 00:09:23,015 --> 00:09:28,380 extractor: it extracts features that are useful for classification with an MLP. 135 00:09:29,620 --> 00:09:32,280 Let's take the task of human face recognition. 136 00:09:33,380 --> 00:09:39,716 If you use a convolutional neural network for that task, you can see that different 137 00:09:39,716 --> 00:09:45,876 convolutional layers actually fire when they see different patches of the image. 138 00:09:45,876 --> 00:09:50,700 The first convolutional layer provides huge activations when it 139 00:09:50,700 --> 00:09:54,060 sees edges with different angles. 140 00:09:54,060 --> 00:09:59,820 The second convolutional layer uses those edges with different 141 00:09:59,820 --> 00:10:05,480 directions to learn some more complex things, like a human nose or a human eye. 142 00:10:06,750 --> 00:10:11,490 The third convolutional layer actually uses the representations 143 00:10:11,490 --> 00:10:14,460 that the second convolutional layer has learned. 144 00:10:14,460 --> 00:10:18,654 And using the concepts of an eye, a nose, or a mouth, 145 00:10:18,654 --> 00:10:24,382 it can put them all together and learn the representation of a human face. 146 00:10:27,779 --> 00:10:29,740 What have we done so far? 147 00:10:29,740 --> 00:10:33,850 We have used convolutional, pooling, and fully connected layers 148 00:10:33,850 --> 00:10:37,330 to build our first network for handwritten digit recognition. 149 00:10:38,410 --> 00:10:41,888 In the next video, we will overview tips and 150 00:10:41,888 --> 00:10:47,257 tricks that are utilized in modern neural network architectures. 151 00:10:47,257 --> 00:10:57,257 [MUSIC]
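(Finally, a sketch of the whole LeNet-style network described above, as a PyTorch model. The lecture doesn't name the activation function or the pooling type at each stage, so the Tanh activations and max pooling here are assumptions; the original LeNet-5 used tanh-like nonlinearities and average-like subsampling.)

```python
import torch
import torch.nn as nn

# 32x32 grayscale digit -> class scores for the 10 digits.
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # 32x32x1 -> 28x28x6 (six 5x5 kernels)
    nn.Tanh(),
    nn.MaxPool2d(2, stride=2),        # -> 14x14x6, channels unchanged
    nn.Conv2d(6, 16, kernel_size=5),  # -> 10x10x16 (sixteen 5x5x6 kernels)
    nn.Tanh(),
    nn.MaxPool2d(2, stride=2),        # -> 5x5x16
    nn.Flatten(),                     # -> 400 features for the MLP classifier
    nn.Linear(16 * 5 * 5, 120),       # fully connected: 120 neurons
    nn.Tanh(),
    nn.Linear(120, 84),               # fully connected: 84 neurons
    nn.Tanh(),
    nn.Linear(84, 10),                # 10 output neurons, one per digit
)

x = torch.randn(1, 1, 32, 32)             # a made-up grayscale digit
probs = torch.softmax(lenet(x), dim=1)    # softmax on the output, as in the lecture
print(probs.shape)                        # torch.Size([1, 10])
```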