[MUSIC] In this video, you will learn about one more useful layer of neurons, and at the end we will build our first fully working neural network for images. But first, let's look at how we deal with color images. When an image has color, it has three input channels, which makes it not a matrix but a tensor: a multidimensional array of size W by H by Cin, where W is the image width, H is the image height, and Cin is the number of input channels, for example 3 for the RGB channels. It looks like this, but how do we apply convolutions? The convolutional kernel becomes a tensor as well, of size Wk by Hk by Cin. What we do is extract volumetric patches from the image, take a dot product with this kernel, and get one output value in the feature map, denoted by the red square. If we move that volumetric patch, we get a different output at a different location of our feature map.

You see, we have a volumetric image as input and a flat feature map as output. It looks like we've lost some depth, and one filter will not solve our problem, so we need more filters. That means we need to train Cout kernels of size Wk by Hk by Cin. With a stride of 1 and enough zero padding, we get W by H by Cout output neurons. So we've taken a volume, which was the image, and translated it into another volume. Every depth slice of that output volume is a feature map produced by one convolutional kernel. Each kernel has Wk times Hk times Cin weights plus 1 for the bias term, and we learn Cout of them, so turning one volume into another takes (Wk * Hk * Cin + 1) * Cout parameters.

But it turns out that one convolutional layer is not enough. Let's say neurons of the first convolutional layer look at patches of the image of size 3 by 3. But what if an object of interest is bigger than that? Then it looks like we need a second convolutional layer on top of the first. That's how it looks. The first 3 by 3 convolutional layer has a local receptive field of 3 by 3; you can see a green neuron that uses a 3 by 3 local receptive field. But if we add a second convolutional layer on top of the first, then the neurons of the second layer actually have a receptive field of 5 by 5, because each of their 3 by 3 inputs already looks at its own 3 by 3 patch of the image.

Let's look at what happens if we stack N convolutional layers. For simplicity, let's look at one-dimensional inputs, which are the white circles. Our first 1 by 3 convolutional layer has a receptive field of 1 by 3. When we add a second convolutional layer of the same size, we get a receptive field of 1 by 5, and if we continue to a fourth layer, we have a receptive field of 1 by 9. Can you derive a formula from this? Of course: if we stack N convolutional layers with the same kernel size 3 by 3, the receptive field at the Nth layer is (2N + 1) by (2N + 1). What does that mean? It means we would need to stack a lot of convolutional layers to see objects as big as the input image: for a 300 by 300 input, we would need 150 convolutional layers. We need to grow the receptive field faster.

We can increase the stride in our convolutional layer to reduce the output dimensions. Let's see how it works for a 2 by 2 convolution with stride 2. We're effectively splitting the image into non-overlapping patches, shown in pink, red, yellow, and blue. If we use the same back-slash kernel that we reviewed in the previous video, we get the results 7, 9, 4, and 6. That's how our strided convolution works.
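To make the volumetric convolution and the parameter count concrete, here is a minimal numpy sketch. It is my illustration, not code from the course; the shapes follow the notation above, and a stride argument is included so the non-overlapping 2 by 2, stride 2 case fits the same function.

```python
import numpy as np

# A minimal sketch of convolving a W x H x Cin volume with Cout kernels of
# size Wk x Hk x Cin. Every output value is the dot product of a volumetric
# patch with one kernel, so each kernel produces one depth slice (feature map)
# of the output volume. No zero padding here, so the maps shrink slightly.

def conv2d(image, kernels, biases, stride=1):
    W, H, C_in = image.shape
    C_out, Wk, Hk, _ = kernels.shape
    out_w = (W - Wk) // stride + 1
    out_h = (H - Hk) // stride + 1
    out = np.zeros((out_w, out_h, C_out))
    for c in range(C_out):                        # one feature map per kernel
        for i in range(out_w):
            for j in range(out_h):
                patch = image[i * stride:i * stride + Wk,
                              j * stride:j * stride + Hk, :]  # volumetric patch
                out[i, j, c] = np.sum(patch * kernels[c]) + biases[c]
    return out

image = np.random.rand(32, 32, 3)                 # an RGB image, Cin = 3
kernels = np.random.rand(6, 5, 5, 3)              # Cout = 6 kernels of 5 x 5 x 3
print(conv2d(image, kernels, np.zeros(6)).shape)  # (28, 28, 6): volume in, volume out
# Parameters: (Wk * Hk * Cin + 1) * Cout = (5 * 5 * 3 + 1) * 6 = 456
# With a 2 x 2 kernel and stride=2, the patches would not overlap, as on the slide.
```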
If we add a second convolutional layer of the same size, 2 by 2, then these layers effectively double the receptive field with every layer, because we use a stride of 2. But how do we maintain translation invariance? Remember this slide from the previous video, where a slash traveled across our image, but we were still able to detect that it was a slash: the maximum of our activations was 2 in the first case and 2 in the second case; it didn't change. We will use this idea and introduce a new layer called a pooling layer. This layer works like a convolutional layer, but it doesn't have a kernel; instead, it calculates the maximum or average of its inputs.

Let's look at an example. We have a 200 by 200 by 64 input volume; let's take a single depth slice from that volume and apply 2 by 2 max pooling with stride 2. How does max pooling work? We take the red patch and output the maximum value from it, in this case 6. Then we take the next patch, the green one, take its maximum value, and get 8. That's how max pooling works. If you look at the feature map, it means we downsample our image: we lose some details, but it stays roughly the same. And notice one more thing: when we apply pooling, we do it depth-wise. That means we don't change the number of output channels, we only change the spatial dimensions. So the volume of 200 by 200 by 64 becomes a volume of 100 by 100 by 64.

But how does backpropagation work for a max pooling layer? Strictly speaking, the maximum is not a differentiable function, but we will apply a simple heuristic here and make it work. Let's look at a patch that the max pooling layer takes the maximum over, and take one neuron that does not provide the maximum activation, denoted by the yellow color here. If we change its value a little bit, the maximum over this patch will not change; it stays the same, in this case 8. That means there is no gradient with respect to non-maximum patch neurons, since changing them slightly doesn't affect the output. But what happens if we change the neuron that provides the maximum value in the max pooling layer? Then the maximum changes as well, and it changes linearly. That means that for the maximum patch neuron, we have a gradient of 1; the first code sketch below makes this rule concrete.

Let's put it all together into a simple convolutional neural network that was developed in 1998 by Yann LeCun for handwritten digit recognition on the MNIST dataset. This dataset contains 10 classes of handwritten digits, from 0 to 9. So how does it work? We take our input, which is a grayscale image of size 32 by 32, and apply our first convolutional layer with 5 by 5 convolutions, learning six different kernels here. Then we apply a pooling layer, so that we lose some details and gain some translation invariance. The pooling layer effectively halves the resolution of the image, so it becomes 14 by 14 by 6; the number of output channels doesn't change. Then let's add one more convolutional layer, the yellow one, with the same kernel size, which is now 5 by 5 by 6, and let's learn 16 of these kernels. What do we do next? We apply one more pooling layer, right? And we get a 5 by 5 by 16 volume. We could go on and on, but at some point we have to stop and use some classifier that takes those features and outputs the probabilities of the digits.
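Here is the promised sketch: a small numpy illustration (again mine, not course code) of 2 by 2 max pooling with stride 2 on one depth slice, together with the backward-pass heuristic just described, in which the incoming gradient is routed entirely to the neuron that held the maximum and all other neurons in the patch get zero.

```python
import numpy as np

# Forward pass: group the slice into non-overlapping 2x2 patches and keep
# the maximum of each patch, halving both spatial dimensions.
def max_pool_2x2(x):
    W, H = x.shape
    return x.reshape(W // 2, 2, H // 2, 2).max(axis=(1, 3))

# Backward pass heuristic from the video: only the neuron that provided the
# maximum receives the gradient (with weight 1); the others get zero, since
# nudging them does not change the patch maximum. Ties would share it here.
def max_pool_2x2_backward(x, grad_out):
    grad_in = np.zeros_like(x)
    for i in range(0, x.shape[0], 2):
        for j in range(0, x.shape[1], 2):
            patch = x[i:i + 2, j:j + 2]
            mask = (patch == patch.max())
            grad_in[i:i + 2, j:j + 2] = mask * grad_out[i // 2, j // 2]
    return grad_in

x = np.array([[1., 6., 2., 8.],
              [3., 4., 1., 0.],
              [5., 2., 7., 1.],
              [0., 1., 3., 2.]])
print(max_pool_2x2(x))                            # [[6. 8.]  [5. 7.]]
print(max_pool_2x2_backward(x, np.ones((2, 2))))  # 1s land on 6, 8, 5, 7
```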
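And to see the whole convolutional part in one place, here is a sketch of the feature extractor we just assembled. I write it in PyTorch purely for brevity; the framework and the tanh activation between layers are my assumptions, not something the video specifies.

```python
import torch
import torch.nn as nn

# The LeNet-style feature extractor described above: a 32x32 grayscale input,
# 5x5 convolutions learning 6 and then 16 kernels, each followed by 2x2
# pooling with stride 2. Max pooling stands in for the generic pooling layer.
extractor = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # 32x32x1  -> 28x28x6
    nn.Tanh(),                        # activation choice is an assumption
    nn.MaxPool2d(kernel_size=2),      # 28x28x6  -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5),  # 14x14x6  -> 10x10x16
    nn.Tanh(),
    nn.MaxPool2d(kernel_size=2),      # 10x10x16 -> 5x5x16
)

x = torch.randn(1, 1, 32, 32)         # a batch with one grayscale image
print(extractor(x).shape)             # torch.Size([1, 16, 5, 5])
```

Flattening the final volume gives 5 * 5 * 16 = 400 features for the classifier that comes next.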
And for that purpose, we will use a stack of fully connected layers: a fully connected layer of 120 neurons, then 84, and finally 10 neurons with a softmax function applied to the output. So what can we see from this diagram? It is known that neurons of deep convolutional layers learn complex representations that can be used as features for classification with an MLP. The first, convolutional/pooling part is actually an automatic feature extractor: it extracts features that are useful for classification with the MLP.

Let's take the task of human face recognition. If you use a convolutional neural network for that task, you can see that different convolutional layers actually fire when they see different patches of the image. The first convolutional layer produces huge activations when it sees edges at different angles. The second convolutional layer uses those edges with different directions to learn more complex things, like a human nose or a human eye. The third convolutional layer then uses the representations that the second layer has learned: having the concepts of an eye, a nose, or a mouth, it can put them all together and learn the representation of a human face.

What have we done so far? We have used convolutional, pooling, and fully connected layers to build our first network for handwritten digit recognition. In the next video, we will overview tips and tricks that are utilized in modern neural network architectures. [MUSIC]