1 00:00:00,000 --> 00:00:03,566 [MUSIC] 2 00:00:03,566 --> 00:00:08,120 In this video, you will learn about one more useful layer of neurons. 3 00:00:08,120 --> 00:00:12,450 And at the end, we will build our first fully working neural network for images. 4 00:00:13,940 --> 00:00:17,640 But first, let's look at how we deal with color images. 5 00:00:17,640 --> 00:00:22,560 When an image has color, that means it has three input channels. 6 00:00:22,560 --> 00:00:27,590 And that makes it not a matrix but a tensor, which is a multidimensional array, 7 00:00:27,590 --> 00:00:31,480 where W is the image width, H is the image height, and 8 00:00:31,480 --> 00:00:36,230 Cin is the number of input channels, for example the 3 RGB channels. 9 00:00:37,710 --> 00:00:40,970 It looks like this, but how do we apply convolutions? 10 00:00:40,970 --> 00:00:48,237 The convolutional kernel becomes a tensor as well, of size Wk by Hk by Cin. 11 00:00:48,237 --> 00:00:52,930 And what we do is extract volumetric patches from the image, 12 00:00:52,930 --> 00:00:56,330 take a dot product with this kernel, and get an output in the 13 00:00:56,330 --> 00:00:58,540 feature map, which is denoted as a red square. 14 00:00:59,680 --> 00:01:02,020 If we move that volumetric patch, 15 00:01:02,020 --> 00:01:05,900 we get a different output in a different location of our feature map. 16 00:01:06,990 --> 00:01:10,790 You see, we have a volumetric image as an input, and 17 00:01:10,790 --> 00:01:13,800 we have a feature map as an output. 18 00:01:13,800 --> 00:01:18,560 And actually, it looks like we've lost some depth, and 19 00:01:18,560 --> 00:01:22,440 we need more filters, because one filter will not solve our problem. 20 00:01:24,340 --> 00:01:31,840 And that means that we need to train Cout kernels of size Wk by Hk by Cin. 21 00:01:31,840 --> 00:01:35,389 Having a stride of 1 and enough zero padding, 22 00:01:35,389 --> 00:01:39,050 we can have W by H by Cout output neurons. 23 00:01:39,050 --> 00:01:44,620 So actually, we've taken a volume, 24 00:01:44,620 --> 00:01:48,740 which was an image, and we translated it into another volume. 25 00:01:48,740 --> 00:01:53,440 Every depth slice of that output volume corresponds to one feature map, 26 00:01:53,440 --> 00:01:54,850 to one convolutional kernel. 27 00:01:57,140 --> 00:02:04,940 Using (Wk by Hk by Cin + 1) per kernel, where the +1 is the bias term, multiplied by Cout, 28 00:02:04,940 --> 00:02:09,240 that's how many parameters we have used to turn one volume into another. 29 00:02:11,610 --> 00:02:15,240 But it turns out that one convolutional layer is not enough. 30 00:02:15,240 --> 00:02:17,880 Let's say the neurons of the first convolutional layer 31 00:02:17,880 --> 00:02:20,420 look at patches of the image of size 3 by 3. 32 00:02:21,470 --> 00:02:24,770 But what if an object of interest is bigger than that? 33 00:02:24,770 --> 00:02:29,260 Then it looks like we need a second convolutional layer on top of the first. 34 00:02:29,260 --> 00:02:30,720 That's how it looks. 35 00:02:30,720 --> 00:02:36,072 The first 3 by 3 convolutional layer will have a local receptive field of 3 by 3. 36 00:02:36,072 --> 00:02:41,950 You can see a green neuron that uses a 3 by 3 local receptive field.
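(A quick aside, not from the lecture itself: the shape bookkeeping and the (Wk by Hk by Cin + 1) times Cout parameter count above can be checked with a minimal PyTorch sketch. The concrete sizes here, Cin = 3, Cout = 16, and a 3 by 3 kernel, are illustrative assumptions.)

```python
import torch
import torch.nn as nn

# One convolutional layer turning a W x H x Cin volume into a W x H x Cout volume
# (stride 1, with zero padding chosen to keep the width and height unchanged).
Cin, Cout, Wk = 3, 16, 3
conv = nn.Conv2d(Cin, Cout, kernel_size=Wk, stride=1, padding=Wk // 2)

x = torch.randn(1, Cin, 32, 32)   # a made-up 32x32 RGB image (batch of 1)
print(conv(x).shape)              # torch.Size([1, 16, 32, 32])

# Parameter count: (Wk * Hk * Cin + 1) * Cout, the +1 being the bias term.
n_params = sum(p.numel() for p in conv.parameters())
print(n_params, (Wk * Wk * Cin + 1) * Cout)   # 448 448
```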
37 00:02:41,950 --> 00:02:46,730 But if we take the second convolutional layer on top of the first, then 38 00:02:46,730 --> 00:02:50,840 the neurons of the second convolutional layer will actually have a receptive field 39 00:02:50,840 --> 00:02:54,830 of 5 by 5, because of the underlying neurons and their receptive fields. 40 00:02:56,030 --> 00:03:00,090 Let's look at what happens if we stack N convolutional layers. 41 00:03:00,090 --> 00:03:03,320 For simplicity, let's look at one-dimensional inputs, 42 00:03:03,320 --> 00:03:04,970 which are the white circles. 43 00:03:04,970 --> 00:03:09,870 Our first 1 by 3 convolutional layer will have a receptive field of 1 by 3. 44 00:03:09,870 --> 00:03:13,430 When we take a second convolutional layer of the same size, 45 00:03:13,430 --> 00:03:16,000 then we have a receptive field of 1 by 5. 46 00:03:16,000 --> 00:03:21,230 If we continue, then after the fourth layer we have a receptive field of 1 by 9. 47 00:03:21,230 --> 00:03:24,310 Can you derive a formula from this? 48 00:03:24,310 --> 00:03:30,090 Of course: if we stack N convolutional layers with the same kernel size 3 by 3, 49 00:03:30,090 --> 00:03:34,920 the receptive field on the Nth layer will be (2N + 1) by (2N + 1). 50 00:03:34,920 --> 00:03:35,930 What does it mean? 51 00:03:35,930 --> 00:03:39,500 It looks like we need to stack a lot of convolutional layers to be 52 00:03:39,500 --> 00:03:44,460 able to identify objects as big as the input image. For an image of, let's say, 300 by 300, 53 00:03:44,460 --> 00:03:48,850 we will need 150 convolutional layers. 54 00:03:48,850 --> 00:03:51,430 We need to grow the receptive field faster. 55 00:03:51,430 --> 00:03:54,310 We can increase the stride in our convolutional layer 56 00:03:54,310 --> 00:03:56,680 to reduce the output dimensions. 57 00:03:56,680 --> 00:04:00,750 Let's see how it works for a 2 by 2 convolution with stride 2. 58 00:04:00,750 --> 00:04:06,870 We're effectively splitting the image into non-overlapping patches, colored pink, 59 00:04:06,870 --> 00:04:08,940 red, yellow, and blue. 60 00:04:08,940 --> 00:04:15,422 If we use the same backslash kernel that we reviewed in the previous video, 61 00:04:15,422 --> 00:04:19,506 then we will have the result of 7, 9, 4 and 6. 62 00:04:19,506 --> 00:04:21,340 That's how our convolution works. 63 00:04:22,960 --> 00:04:28,262 If we add a second convolutional layer of the same size, 2 by 2, 64 00:04:28,262 --> 00:04:33,864 then those layers will effectively double their receptive field, 65 00:04:33,864 --> 00:04:36,581 because we use a stride of 2. 66 00:04:36,581 --> 00:04:39,530 But how do we maintain translation invariance? 67 00:04:39,530 --> 00:04:43,540 Remember this slide from the previous video, where we had a slash that traveled across 68 00:04:43,540 --> 00:04:48,380 our image, but we were still able to detect that it was a slash. 69 00:04:49,870 --> 00:04:53,830 That was because the maximum of our activations was 2 in the first case and 70 00:04:53,830 --> 00:04:55,639 2 in the second case; it didn't change. 71 00:04:56,670 --> 00:04:58,570 Actually, we will use this idea and 72 00:04:58,570 --> 00:05:02,420 introduce a new layer that is called a pooling layer. 73 00:05:02,420 --> 00:05:06,250 This layer works like a convolutional layer, but it doesn't have a kernel. 74 00:05:06,250 --> 00:05:11,001 Instead, it calculates the maximum or average of its inputs. 75 00:05:11,001 --> 00:05:12,547 Let's look at an example.
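(Another aside, before the pooling example: the receptive-field growth just derived can be verified with a few lines of plain Python. The helper below is my own sketch of the standard recurrence, not code from the lecture.)

```python
def receptive_field(num_layers, kernel=3, stride=1):
    """Receptive field (along one axis) of a stack of identical conv layers.
    With stride 1 this reproduces the 2N + 1 rule from the lecture."""
    rf, jump = 1, 1                 # field of one input pixel; step between neighbor outputs
    for _ in range(num_layers):
        rf += (kernel - 1) * jump   # each layer extends the field by (kernel - 1) steps
        jump *= stride              # striding makes every later step cover more pixels
    return rf

print(receptive_field(4))                      # 9   -> the 1 by 9 field above
print(receptive_field(150))                    # 301 -> enough for a 300 by 300 image
print(receptive_field(2, kernel=2, stride=2))  # 4   -> stride 2 doubles the field per layer
```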
76 00:05:12,547 --> 00:05:16,153 We have a 200 by 200 by 64 input volume; 77 00:05:16,153 --> 00:05:20,040 let's take a single depth slice from that volume. 78 00:05:21,070 --> 00:05:25,730 Let's apply 2 by 2 max pooling with stride 2. 79 00:05:25,730 --> 00:05:26,990 How does max pooling work? 80 00:05:26,990 --> 00:05:31,740 We take the red patch and take the maximum value from there, and that is our output. 81 00:05:31,740 --> 00:05:33,650 In this case, it is 6. 82 00:05:33,650 --> 00:05:37,188 Then we take the next patch, which is the green one, and 83 00:05:37,188 --> 00:05:40,503 take the maximum value from that patch and get 8. 84 00:05:40,503 --> 00:05:42,820 And that's how max pooling works. 85 00:05:42,820 --> 00:05:47,510 If you look at the feature map, it actually means that we downsample our 86 00:05:47,510 --> 00:05:52,750 image. We lose some details, but it stays pretty much the same. 87 00:05:53,830 --> 00:05:59,050 And notice one more thing: when we apply pooling, we do it depth-wise. 88 00:05:59,050 --> 00:06:03,680 That means that we don't change the number of output channels, 89 00:06:03,680 --> 00:06:06,180 we only change the spatial dimensions. 90 00:06:06,180 --> 00:06:12,894 So the volume of 200 by 200 by 64 becomes a volume of 100 by 100 by 64. 91 00:06:13,950 --> 00:06:17,520 But how does backpropagation work for a max pooling layer? 92 00:06:17,520 --> 00:06:22,190 Strictly speaking, maximum is not a differentiable function, but 93 00:06:22,190 --> 00:06:25,990 we will apply some heuristics here and make it work. 94 00:06:27,710 --> 00:06:34,444 Let's look at the patch that the max pooling layer uses for taking the maximum. 95 00:06:34,444 --> 00:06:38,456 Let's take one neuron which does not have the maximum activation. 96 00:06:38,456 --> 00:06:42,103 Let's say it is denoted by the yellow color here. 97 00:06:42,103 --> 00:06:46,476 If we change its value a little bit, it will not change the maximum 98 00:06:46,476 --> 00:06:52,090 over this patch; the maximum will stay the same, which in this case is 8. 99 00:06:52,090 --> 00:06:56,921 That means that there is no gradient with respect to non-maximum patch neurons, 100 00:06:56,921 --> 00:07:00,430 since changing them slightly doesn't affect the output. 101 00:07:01,650 --> 00:07:04,640 But what happens if we change the neuron that 102 00:07:04,640 --> 00:07:07,370 provides the maximum value in the max pooling layer? 103 00:07:08,420 --> 00:07:12,250 If we change it, then the maximum will change as well, and 104 00:07:12,250 --> 00:07:14,310 it will change linearly. 105 00:07:14,310 --> 00:07:19,240 That means that for the maximum patch neuron, we have a gradient of 1. 106 00:07:19,240 --> 00:07:24,430 Let's put it all together into a simple convolutional neural network 107 00:07:24,430 --> 00:07:27,490 that was developed in 1998 by Yann LeCun for 108 00:07:27,490 --> 00:07:31,090 handwritten digit recognition on the MNIST dataset. 109 00:07:31,090 --> 00:07:37,030 This dataset contains 10 classes of handwritten digits, from 0 to 9. 110 00:07:38,380 --> 00:07:39,760 So how does it work? 111 00:07:39,760 --> 00:07:44,690 We take our input, which is a grayscale image of size 32 by 32. 112 00:07:44,690 --> 00:07:48,540 We apply our first convolutional layer, 113 00:07:48,540 --> 00:07:53,390 with 5 by 5 convolutions, and we learn six different kernels here. 114 00:07:54,730 --> 00:07:57,280 Then, we apply a pooling layer so 115 00:07:57,280 --> 00:08:02,250 that we lose some details and gain some translation invariance.
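(One more aside: both the max pooling operation and the gradient routing just described can be checked in PyTorch. The input values below are made up, chosen only so that the first two pooled outputs are the 6 and the 8 from the example.)

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]], requires_grad=True)

# 2x2 max pooling with stride 2 on a single depth slice
# (the [None, None] adds the batch and channel dimensions PyTorch expects).
pooled = F.max_pool2d(x[None, None], kernel_size=2, stride=2)
print(pooled.squeeze())   # tensor([[6., 8.], [3., 4.]])

# Backpropagate a gradient of 1 through every pooled output:
pooled.sum().backward()
print(x.grad)             # 1 at each patch's maximum, 0 for all other neurons
```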
116 00:08:02,250 --> 00:08:07,600 The pooling layer effectively halves the resolution of the image, and 117 00:08:07,600 --> 00:08:13,050 it becomes 14 by 14 by 6; the number of output channels doesn't change. 118 00:08:13,050 --> 00:08:18,110 Then let's add one more convolutional layer, which is the yellow one, and 119 00:08:18,110 --> 00:08:23,010 let's use the same kernel size, which is now 5 by 5 by 6. 120 00:08:23,010 --> 00:08:25,850 And let's learn 16 of these kernels. 121 00:08:27,540 --> 00:08:28,530 What do we do next? 122 00:08:28,530 --> 00:08:31,260 Then we apply one more pooling layer, right? 123 00:08:31,260 --> 00:08:34,019 And we have a 5 by 5 by 16 volume. 124 00:08:35,140 --> 00:08:39,180 We could go on and on, but at some point we have to stop. 125 00:08:39,180 --> 00:08:44,330 And then we will have to use some classifier that will use those features 126 00:08:44,330 --> 00:08:48,270 and output the probabilities of the digits. 127 00:08:48,270 --> 00:08:53,300 And for that purpose, we will use a bunch of fully connected layers: 128 00:08:53,300 --> 00:08:57,290 a fully connected layer of 120 neurons, then 84, and then 129 00:08:57,290 --> 00:09:01,650 10 neurons with a softmax function applied to the output. 130 00:09:03,680 --> 00:09:08,090 So what can we see from this diagram? 131 00:09:08,090 --> 00:09:12,140 It is known that neurons of deep convolutional layers learn 132 00:09:12,140 --> 00:09:17,920 complex representations that can be used as features for classification with an MLP. 133 00:09:17,920 --> 00:09:23,015 The first, convolutional/pooling part is actually an automatic feature 134 00:09:23,015 --> 00:09:28,380 extractor: it extracts features that are useful for classification with an MLP. 135 00:09:29,620 --> 00:09:32,280 Let's take the task of human face recognition. 136 00:09:33,380 --> 00:09:39,716 If you use a convolutional neural network for that task, you can see that different 137 00:09:39,716 --> 00:09:45,876 convolutional layers actually fire when they see different patches of the image. 138 00:09:45,876 --> 00:09:50,700 The first convolutional layer provides huge activations when it 139 00:09:50,700 --> 00:09:54,060 sees edges with different angles. 140 00:09:54,060 --> 00:09:59,820 The second convolutional layer uses those edges with different 141 00:09:59,820 --> 00:10:05,480 directions to learn some more complex things, like a human nose or a human eye. 142 00:10:06,750 --> 00:10:11,490 The third convolutional layer actually uses the representations 143 00:10:11,490 --> 00:10:14,460 that the second convolutional layer has learned. 144 00:10:14,460 --> 00:10:18,654 And using the concepts of an eye, a nose, or a mouth, 145 00:10:18,654 --> 00:10:24,382 it can put them all together and learn the representation of a human face. 146 00:10:27,779 --> 00:10:29,740 What have we done so far? 147 00:10:29,740 --> 00:10:33,850 We have used convolutional, pooling, and fully connected layers 148 00:10:33,850 --> 00:10:37,330 to build our first network for handwritten digit recognition. 149 00:10:38,410 --> 00:10:41,888 In the next video, we will overview tips and 150 00:10:41,888 --> 00:10:47,257 tricks that are utilized in modern neural network architectures. 151 00:10:47,257 --> 00:10:57,257 [MUSIC]
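(Finally, a sketch of the whole LeNet-style network described above, as a PyTorch model. The lecture doesn't name the activation function or the pooling type at each stage, so the Tanh activations and max pooling here are assumptions; the original LeNet-5 used tanh-like nonlinearities and average-like subsampling.)

```python
import torch
import torch.nn as nn

# 32x32 grayscale digit -> class scores for the 10 digits.
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # 32x32x1 -> 28x28x6 (six 5x5 kernels)
    nn.Tanh(),
    nn.MaxPool2d(2, stride=2),        # -> 14x14x6, channels unchanged
    nn.Conv2d(6, 16, kernel_size=5),  # -> 10x10x16 (sixteen 5x5x6 kernels)
    nn.Tanh(),
    nn.MaxPool2d(2, stride=2),        # -> 5x5x16
    nn.Flatten(),                     # -> 400 features for the MLP classifier
    nn.Linear(16 * 5 * 5, 120),       # fully connected: 120 neurons
    nn.Tanh(),
    nn.Linear(120, 84),               # fully connected: 84 neurons
    nn.Tanh(),
    nn.Linear(84, 10),                # 10 output neurons, one per digit
)

x = torch.randn(1, 1, 32, 32)             # a made-up grayscale digit
probs = torch.softmax(lenet(x), dim=1)    # softmax on the output, as in the lecture
print(probs.shape)                        # torch.Size([1, 10])
```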