In this video, we will overview modern architectures of neural networks.

Let's look at the ImageNet classification dataset. It has 1,000 classes and over 1 million labeled photos, and the human top-5 error rate on this dataset is roughly 5%. Why is it not zero? If you look at examples from this dataset, you can see that the classes are really difficult. For example, there is a quail and a partridge in the upper right corner, and to me they look exactly the same; I don't know how a computer can distinguish between them.

The first breakthrough happened in 2012, when a deep convolutional neural network (AlexNet) was applied to the ImageNet dataset for the first time. It significantly reduced the top-5 error from 26% to 15%. It uses 11x11, 5x5, and 3x3 convolutions, max pooling, dropout, data augmentation, ReLU activations, and SGD with momentum: all the tricks you know from the previous video. It has 60 million parameters and trains on 2 GPUs for 6 days.

The next breakthrough came in 2015 with the VGG architecture. It is very similar to AlexNet, because it uses convolutional layers followed by pooling layers, just like the LeNet architecture going back to 1998, but it has many more filters. It reduced the ImageNet top-5 error to 8% for a single model. The training of this architecture is similar to AlexNet, but it uses additional multi-scale cropping as data augmentation. It has 138 million parameters and trains on 4 GPUs for 2 to 3 weeks.

Then, in 2015, the Inception architecture came to the world. It is not similar to AlexNet: it uses the Inception block that was introduced in GoogLeNet, also known as Inception V1. The ImageNet top-5 error was reduced to 5.6% for a single model. You can see that this is a really complex and deep model. It uses batch normalization, image distortions as augmentation, and RMSProp for gradient descent. It has only 25 million parameters, but it trains on 8 GPUs for 2 weeks.

You can see that this deep architecture is made of Inception blocks, one of which is highlighted in the blue circle. We will look in detail at how that block works, but first we have to look at 1x1 convolutions. Such convolutions capture interactions of the input channels in one pixel of the feature map. They can reduce the number of channels without hurting the quality of the model, because different channels can correlate. They effectively work like dimensionality reduction with an added ReLU activation, and usually the number of output channels is less than the number of input channels.
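To make this concrete, here is a minimal sketch in PyTorch of a 1x1 convolution used for channel reduction. The specific sizes (256 input channels reduced to 64 on a 28x28 feature map) are illustrative assumptions, not values from the lecture.

```python
import torch
import torch.nn as nn

# A 1x1 convolution mixes information across channels at every spatial
# position independently; with fewer output channels than input channels,
# it acts as a learned dimensionality reduction followed by a ReLU.
reduce_channels = nn.Sequential(
    nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1),  # illustrative sizes
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 256, 28, 28)   # one 28x28 feature map with 256 channels
y = reduce_channels(x)
print(y.shape)                    # torch.Size([1, 64, 28, 28]): spatial size is unchanged
```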
All operations inside an Inception block use stride 1 and enough padding to keep the same spatial dimensions (W x H) of the feature map. Four different feature maps are concatenated on depth at the end, so it looks like a layered cake: we stack all of those feature maps, shown in different colors, along the depth dimension. Inside the block, we use 1x1 convolutions to reduce the number of filters, we use 5x5 and 3x3 convolutions and a pooling layer, and we also add the input to the output through a plain 1x1 convolution.

Why does it work better? In a simple neural network architecture, you have a fixed kernel size in a convolutional layer. But when you use different scales of that sliding window, say 5x5, 3x3, and 1x1, you can use all of those features at the same time and learn better representations.

Let's replace the 5x5 convolutions. They are currently the most expensive part of our Inception block. Let's replace them with two layers of 3x3 convolutions, which, as we already know, have an effective receptive field of 5x5. You can see in the image that we replaced the 5x5 convolution with two 3x3 convolutions, shown in blue.

Another technique known in computer vision is filter decomposition. It is known that a Gaussian blur filter can be decomposed into two one-dimensional filters: you first blur the source horizontally, then you blur the result vertically, and you get an output identical to applying the 2D Gaussian blur to the input. Let's use the same idea in our Inception block. The 3x3 convolutions are now the most expensive part, so let's replace each 3x3 layer with a 1x3 layer followed by a 3x1 layer. What we actually do is decompose that 3x3 convolution into a series of one-dimensional convolutions. You can see that the green, blue, and purple 3x3 convolutional layers are each replaced by two layers of one-dimensional convolutions. This is the final state of our Inception block, and this block is used in the Inception V3 architecture.
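As an illustration of these ideas, here is a minimal sketch of an Inception-style block in PyTorch: parallel branches with stride 1 and matching padding, 1x1 channel reduction, two stacked 3x3 convolutions standing in for a 5x5, a 3x3 factorized into 1x3 followed by 3x1, and a pooling branch, all concatenated on depth. The branch layout and channel counts are simplifying assumptions and do not reproduce the exact Inception V3 block.

```python
import torch
import torch.nn as nn

class InceptionStyleBlock(nn.Module):
    """Illustrative Inception-style block: every branch keeps the spatial
    size (stride 1, 'same'-style padding), and the four resulting feature
    maps are concatenated along the channel (depth) dimension."""

    def __init__(self, in_channels=192, mid_channels=64, out_per_branch=64):
        super().__init__()
        # Branch 1: a plain 1x1 convolution that passes the input to the output.
        self.branch_1x1 = nn.Conv2d(in_channels, out_per_branch, kernel_size=1)
        # Branch 2: 1x1 reduction, then a 3x3 factorized into 1x3 followed by 3x1.
        self.branch_factorized = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=(1, 3), padding=(0, 1)),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_per_branch, kernel_size=(3, 1), padding=(1, 0)),
        )
        # Branch 3: 1x1 reduction, then two 3x3 convolutions (effective receptive field 5x5).
        self.branch_double_3x3 = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_per_branch, kernel_size=3, padding=1),
        )
        # Branch 4: pooling followed by a 1x1 convolution.
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, out_per_branch, kernel_size=1),
        )

    def forward(self, x):
        branches = [self.branch_1x1(x), self.branch_factorized(x),
                    self.branch_double_3x3(x), self.branch_pool(x)]
        return torch.cat(branches, dim=1)  # concatenate the feature maps on depth

x = torch.randn(1, 192, 28, 28)
print(InceptionStyleBlock()(x).shape)  # torch.Size([1, 256, 28, 28])
```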
Another architecture that appeared in 2015 is ResNet. It introduces residual connections and reduces the top-5 ImageNet error down to 4.5% for a single model and 3.5% for an ensemble. It has 152 layers; it has a few 7x7 convolutional layers that are expensive, but the rest are 3x3. It uses batch normalization, max pooling, and average pooling. It has 60 million parameters and trains on 8 GPUs for 2 to 3 weeks.

What is the residual connection in this architecture? What we actually do is create the output channels by adding a small delta, which is modeled as F(x), to the original input channels. That F(x) is represented as a weight layer, followed by a ReLU activation, and one more weight layer (see the code sketch at the end of this transcript). This way we can stack thousands of layers and the gradients do not vanish, thanks to that residual connection: we always add a small number to the input channels, which provides better gradient flow during backpropagation.

To summarize, you can see that by stacking more convolutional and pooling layers, you can reduce the error, as in AlexNet or VGG. But you cannot do that forever; you need to utilize new kinds of layers, like Inception blocks or residual connections. You have probably noticed that one needs a lot of time to train a neural network. In the following video, we will discuss the principle known as transfer learning, which will help us reduce the training time for a new task.
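Here is a minimal sketch in PyTorch of the residual connection described above: the output is the input plus a small delta F(x), where F is a weight layer, a ReLU, and one more weight layer. Using two 3x3 convolutions as the weight layers and 64 channels is an illustrative assumption, not the exact ResNet block from the lecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Illustrative residual block computing x + F(x), where F(x) is
    weight layer -> ReLU -> weight layer (two 3x3 convolutions here)."""

    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        delta = self.conv2(F.relu(self.conv1(x)))  # the small delta F(x)
        return x + delta                           # residual (skip) connection

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock()(x).shape)  # torch.Size([1, 64, 56, 56]); the skip path lets gradients flow
```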