[MUSIC] In this video, you will learn about one more useful layer of neurons, and at the end we will build our first fully working neural network for images. But first, let's look at how we deal with color images. When an image has color, it has three input channels, which makes it not a matrix but a tensor: a multidimensional array of size W by H by Cin, where W is the image width, H is the image height, and Cin is the number of input channels, for example 3 for the RGB channels. It looks like this, but how do we apply convolutions? The convolutional kernel becomes a tensor as well, of size Wk by Hk by Cin. What we do is extract volumetric patches from the image, take a dot product with this kernel, and get one output value in the feature map, denoted by the red square. If we move that volumetric patch, we get a different output at a different location of our feature map.

You see, we have a volumetric image as input and a flat feature map as output. It looks like we've lost some depth, and one filter will not solve our problem, so we need more filters. That means we need to train Cout kernels of size Wk by Hk by Cin. With a stride of 1 and enough zero padding, we get W by H by Cout output neurons. So we've taken a volume, which was the image, and translated it into another volume. Every depth slice of that output volume is a feature map produced by one convolutional kernel. Each kernel has Wk times Hk times Cin weights plus 1 for the bias term, and we learn Cout of them, so turning one volume into another takes (Wk * Hk * Cin + 1) * Cout parameters.

But it turns out that one convolutional layer is not enough. Let's say neurons of the first convolutional layer look at patches of the image of size 3 by 3. But what if an object of interest is bigger than that? Then it looks like we need a second convolutional layer on top of the first. That's how it looks. The first 3 by 3 convolutional layer has a local receptive field of 3 by 3; you can see a green neuron that uses a 3 by 3 local receptive field. But if we add a second convolutional layer on top of the first, then the neurons of the second layer actually have a receptive field of 5 by 5, because each of their 3 by 3 inputs already looks at its own 3 by 3 patch of the image.

Let's look at what happens if we stack N convolutional layers. For simplicity, let's look at one-dimensional inputs, which are the white circles. Our first 1 by 3 convolutional layer has a receptive field of 1 by 3. When we add a second convolutional layer of the same size, we get a receptive field of 1 by 5, and if we continue to a fourth layer, we have a receptive field of 1 by 9. Can you derive a formula from this? Of course: if we stack N convolutional layers with the same kernel size 3 by 3, the receptive field at the Nth layer is (2N + 1) by (2N + 1). What does that mean? It means we would need to stack a lot of convolutional layers to see objects as big as the input image: for a 300 by 300 input, we would need 150 convolutional layers. We need to grow the receptive field faster.

We can increase the stride in our convolutional layer to reduce the output dimensions. Let's see how it works for a 2 by 2 convolution with stride 2. We're effectively splitting the image into non-overlapping patches, shown in pink, red, yellow, and blue. If we use the same back-slash kernel that we reviewed in the previous video, we get the results 7, 9, 4, and 6. That's how our strided convolution works.
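To make the volumetric convolution and the parameter count concrete, here is a minimal numpy sketch. It is my illustration, not code from the course; the shapes follow the notation above, and a stride argument is included so the non-overlapping 2 by 2, stride 2 case fits the same function.

```python
import numpy as np

# A minimal sketch of convolving a W x H x Cin volume with Cout kernels of
# size Wk x Hk x Cin. Every output value is the dot product of a volumetric
# patch with one kernel, so each kernel produces one depth slice (feature map)
# of the output volume. No zero padding here, so the maps shrink slightly.

def conv2d(image, kernels, biases, stride=1):
    W, H, C_in = image.shape
    C_out, Wk, Hk, _ = kernels.shape
    out_w = (W - Wk) // stride + 1
    out_h = (H - Hk) // stride + 1
    out = np.zeros((out_w, out_h, C_out))
    for c in range(C_out):                        # one feature map per kernel
        for i in range(out_w):
            for j in range(out_h):
                patch = image[i * stride:i * stride + Wk,
                              j * stride:j * stride + Hk, :]  # volumetric patch
                out[i, j, c] = np.sum(patch * kernels[c]) + biases[c]
    return out

image = np.random.rand(32, 32, 3)                 # an RGB image, Cin = 3
kernels = np.random.rand(6, 5, 5, 3)              # Cout = 6 kernels of 5 x 5 x 3
print(conv2d(image, kernels, np.zeros(6)).shape)  # (28, 28, 6): volume in, volume out
# Parameters: (Wk * Hk * Cin + 1) * Cout = (5 * 5 * 3 + 1) * 6 = 456
# With a 2 x 2 kernel and stride=2, the patches would not overlap, as on the slide.
```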
If we add a second convolutional layer of the same size, 2 by 2, then these layers effectively double the receptive field with every layer, because we use a stride of 2. But how do we maintain translation invariance? Remember this slide from the previous video, where a slash traveled across our image, but we were still able to detect that it was a slash: the maximum of our activations was 2 in the first case and 2 in the second case; it didn't change. We will use this idea and introduce a new layer called a pooling layer. This layer works like a convolutional layer, but it doesn't have a kernel; instead, it calculates the maximum or average of its inputs.

Let's look at an example. We have a 200 by 200 by 64 input volume; let's take a single depth slice from that volume and apply 2 by 2 max pooling with stride 2. How does max pooling work? We take the red patch and output the maximum value from it, in this case 6. Then we take the next patch, the green one, take its maximum value, and get 8. That's how max pooling works. If you look at the feature map, it means we downsample our image: we lose some details, but it stays roughly the same. And notice one more thing: when we apply pooling, we do it depth-wise. That means we don't change the number of output channels, we only change the spatial dimensions. So the volume of 200 by 200 by 64 becomes a volume of 100 by 100 by 64.

But how does backpropagation work for a max pooling layer? Strictly speaking, the maximum is not a differentiable function, but we will apply a simple heuristic here and make it work. Let's look at a patch that the max pooling layer takes the maximum over, and take one neuron that does not provide the maximum activation, denoted by the yellow color here. If we change its value a little bit, the maximum over this patch will not change; it stays the same, in this case 8. That means there is no gradient with respect to non-maximum patch neurons, since changing them slightly doesn't affect the output. But what happens if we change the neuron that provides the maximum value in the max pooling layer? Then the maximum changes as well, and it changes linearly. That means that for the maximum patch neuron, we have a gradient of 1; the first code sketch below makes this rule concrete.

Let's put it all together into a simple convolutional neural network that was developed in 1998 by Yann LeCun for handwritten digit recognition on the MNIST dataset. This dataset contains 10 classes of handwritten digits, from 0 to 9. So how does it work? We take our input, which is a grayscale image of size 32 by 32, and apply our first convolutional layer with 5 by 5 convolutions, learning six different kernels here. Then we apply a pooling layer, so that we lose some details and gain some translation invariance. The pooling layer effectively halves the resolution of the image, so it becomes 14 by 14 by 6; the number of output channels doesn't change. Then let's add one more convolutional layer, the yellow one, with the same kernel size, which is now 5 by 5 by 6, and let's learn 16 of these kernels. What do we do next? We apply one more pooling layer, right? And we get a 5 by 5 by 16 volume. We could go on and on, but at some point we have to stop and use some classifier that takes those features and outputs the probabilities of the digits.
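Here is the promised sketch: a small numpy illustration (again mine, not course code) of 2 by 2 max pooling with stride 2 on one depth slice, together with the backward-pass heuristic just described, in which the incoming gradient is routed entirely to the neuron that held the maximum and all other neurons in the patch get zero.

```python
import numpy as np

# Forward pass: group the slice into non-overlapping 2x2 patches and keep
# the maximum of each patch, halving both spatial dimensions.
def max_pool_2x2(x):
    W, H = x.shape
    return x.reshape(W // 2, 2, H // 2, 2).max(axis=(1, 3))

# Backward pass heuristic from the video: only the neuron that provided the
# maximum receives the gradient (with weight 1); the others get zero, since
# nudging them does not change the patch maximum. Ties would share it here.
def max_pool_2x2_backward(x, grad_out):
    grad_in = np.zeros_like(x)
    for i in range(0, x.shape[0], 2):
        for j in range(0, x.shape[1], 2):
            patch = x[i:i + 2, j:j + 2]
            mask = (patch == patch.max())
            grad_in[i:i + 2, j:j + 2] = mask * grad_out[i // 2, j // 2]
    return grad_in

x = np.array([[1., 6., 2., 8.],
              [3., 4., 1., 0.],
              [5., 2., 7., 1.],
              [0., 1., 3., 2.]])
print(max_pool_2x2(x))                            # [[6. 8.]  [5. 7.]]
print(max_pool_2x2_backward(x, np.ones((2, 2))))  # 1s land on 6, 8, 5, 7
```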
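And to see the whole convolutional part in one place, here is a sketch of the feature extractor we just assembled. I write it in PyTorch purely for brevity; the framework and the tanh activation between layers are my assumptions, not something the video specifies.

```python
import torch
import torch.nn as nn

# The LeNet-style feature extractor described above: a 32x32 grayscale input,
# 5x5 convolutions learning 6 and then 16 kernels, each followed by 2x2
# pooling with stride 2. Max pooling stands in for the generic pooling layer.
extractor = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # 32x32x1  -> 28x28x6
    nn.Tanh(),                        # activation choice is an assumption
    nn.MaxPool2d(kernel_size=2),      # 28x28x6  -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5),  # 14x14x6  -> 10x10x16
    nn.Tanh(),
    nn.MaxPool2d(kernel_size=2),      # 10x10x16 -> 5x5x16
)

x = torch.randn(1, 1, 32, 32)         # a batch with one grayscale image
print(extractor(x).shape)             # torch.Size([1, 16, 5, 5])
```

Flattening the final volume gives 5 * 5 * 16 = 400 features for the classifier that comes next.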
And for that purpose, we will use a stack of fully connected layers: a fully connected layer of 120 neurons, then 84, and finally 10 neurons with a softmax function applied to the output. So what can we see from this diagram? It is known that neurons of deep convolutional layers learn complex representations that can be used as features for classification with an MLP. The first, convolutional/pooling part is actually an automatic feature extractor: it extracts features that are useful for classification with the MLP.

Let's take the task of human face recognition. If you use a convolutional neural network for that task, you can see that different convolutional layers actually fire when they see different patches of the image. The first convolutional layer produces huge activations when it sees edges at different angles. The second convolutional layer uses those edges with different directions to learn more complex things, like a human nose or a human eye. The third convolutional layer then uses the representations that the second layer has learned: having the concepts of an eye, a nose, or a mouth, it can put them all together and learn the representation of a human face.

What have we done so far? We have used convolutional, pooling, and fully connected layers to build our first network for handwritten digit recognition. In the next video, we will overview tips and tricks that are utilized in modern neural network architectures. [MUSIC]