[MUSIC] In this video, we will overview modern architectures of neural networks. Let's look at the ImageNet classification dataset. It has 1,000 classes and over 1 million labeled photos, and the human top-5 error rate on this dataset is roughly 5%. Why is it not zero? If you look at examples from this dataset, you can see that some classes are really hard to tell apart. For example, there is a quail or a partridge in the upper right corner, and to me they look exactly the same; I don't know how a computer can distinguish between them.

The first breakthrough happened in 2012, when a deep convolutional neural network, AlexNet, was applied to the ImageNet dataset for the first time. It significantly reduced the top-5 error from 26% to 15%. It uses 11x11, 5x5, and 3x3 convolutions, max pooling, dropout, data augmentation, ReLU activations, and SGD with momentum: all the tricks you know from the previous video. It has 60 million parameters and trains on 2 GPUs for 6 days.

The next breakthrough came in 2015 with the VGG architecture. It is very similar to AlexNet in that it uses convolutional layers followed by pooling layers, just like the LeNet architecture going back to 1998, but it has many more filters. It reduced the ImageNet top-5 error to 8% for a single model. The training of this architecture is similar to AlexNet, but it adds multi-scale cropping as data augmentation. It has 138 million parameters and trains on 4 GPUs for 2 to 3 weeks.

Then in 2015, the Inception architecture came into the world. It is not similar to AlexNet: it is built from the Inception block that was introduced in GoogLeNet, also known as Inception V1. The ImageNet top-5 error was reduced to 5.6% for a single model. You can see that this is a really complex and deep model. It uses batch normalization, image distortions as augmentation, and RMSProp for gradient descent. It has only 25 million parameters, but it trains on 8 GPUs for 2 weeks.

You can see that this deep architecture is made of Inception blocks, highlighted in the blue circle on the slide. We will look in detail at how that block works, but first we have to look at 1x1 convolutions. Such convolutions capture interactions of input channels in one pixel of the feature map. They can reduce the number of channels without hurting the quality of the model, because different channels can correlate. In effect, a 1x1 convolution works like dimensionality reduction with an added ReLU activation, and usually the number of output channels is smaller than the number of input channels.

All operations inside an Inception block use stride 1 and enough padding so that the output keeps the same spatial dimensions, W x H, as the input feature map. Four different feature maps are concatenated on depth at the end, so it looks like a layered cake: we stack all of those feature maps, shown in different colors on the slide, along the depth dimension. Inside the block, we use 1x1 convolutions to reduce the number of filters before the 5x5 and 3x3 convolutions and after the pooling layer, and we also pass the input through a plain 1x1 convolution as a separate branch that goes directly to the output.

Why does this work better? In a simple neural network architecture, you have a fixed kernel size in each convolutional layer. But when you use different scales of the sliding window, say 5x5, 3x3, and 1x1, you can use all of those features at the same time and learn better representations.
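To make this concrete, here is a minimal PyTorch sketch of an Inception-style block with four parallel branches, where 1x1 convolutions reduce the number of channels before the larger convolutions and all branch outputs are concatenated on depth. The channel counts here are illustrative assumptions, not the exact numbers used in GoogLeNet.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Sketch of an Inception-style block; channel counts are illustrative."""
    def __init__(self, in_ch):
        super().__init__()
        # 1x1 branch: captures cross-channel interactions at a single pixel
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, 64, 1), nn.ReLU())
        # 1x1 reduction followed by a 3x3 convolution (padding keeps W x H)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, 64, 1), nn.ReLU(),
            nn.Conv2d(64, 96, 3, padding=1), nn.ReLU())
        # 1x1 reduction followed by a 5x5 convolution
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, 32, 1), nn.ReLU(),
            nn.Conv2d(32, 48, 5, padding=2), nn.ReLU())
        # pooling branch with a 1x1 projection
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, 1), nn.ReLU())

    def forward(self, x):
        # every branch preserves W x H, so the outputs can be stacked
        # along the channel ("depth") dimension like a layered cake
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

# usage example (shapes only): 64 + 96 + 48 + 32 = 240 output channels
block = InceptionBlock(in_ch=192)
y = block(torch.randn(1, 192, 28, 28))  # -> shape (1, 240, 28, 28)
```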
The 5x5 convolutions are currently the most expensive part of our Inception block. Let's replace them with two layers of 3x3 convolutions, which, as we already know, have an effective receptive field of 5x5. You can see in the image that the 5x5 convolution is replaced by two 3x3 convolutions, shown in blue.

Another technique known in computer vision is filter decomposition. It is known that a Gaussian blur filter can be decomposed into two one-dimensional filters: you first blur the source horizontally, then blur the result vertically, and the output is identical to applying the two-dimensional Gaussian blur to the input. Let's use the same idea in our Inception block. The 3x3 convolutions are now the most expensive part, so let's replace each 3x3 layer with a 1x3 layer followed by a 3x1 layer. What we actually do is decompose each 3x3 convolution into a series of one-dimensional convolutions. You can see that the green, blue, and purple 3x3 convolutional layers are each replaced by two layers of one-dimensional convolutions. This is the final state of our Inception block, and this block is used in the Inception V3 architecture.

Another architecture that appeared in 2015 is ResNet. It introduces residual connections and reduces the top-5 ImageNet error down to 4.5% for a single model and 3.5% for an ensemble. It has 152 layers; it has a few expensive 7x7 convolutional layers, but the rest are 3x3. It uses batch normalization and max and average pooling. It has 60 million parameters and trains on 8 GPUs for 2 to 3 weeks.

What is the residual connection in this architecture? What we actually do is produce the output channels by adding a small delta, modeled as F(x), to the original input channels. That F(x) is represented as a weight layer, followed by a ReLU activation, and one more weight layer. This way we can stack thousands of layers, and gradients do not vanish thanks to the residual connection: we always add a small correction to the input channels, which provides better gradient flow during backpropagation. A short code sketch of such a block is shown below.

To summarize, you can see that by stacking more convolutional and pooling layers, you can reduce the error, as in AlexNet or VGG. But you cannot do that forever: you need to utilize new kinds of layers, like Inception blocks or residual connections. You have probably noticed that one needs a lot of time to train a neural network. In the following video, we will discuss a principle known as transfer learning that will help us reduce the training time for a new task. [SOUND] [MUSIC]
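To make the residual connection concrete, here is a minimal PyTorch sketch of a basic residual block, assuming the input and output have the same number of channels so a plain identity shortcut can be used. The two 3x3 convolutions with batch normalization follow the general "weight layer, ReLU, weight layer" description above; the exact configuration in ResNet-152 differs, since it uses bottleneck blocks.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of a residual block computing output = ReLU(x + F(x))."""
    def __init__(self, channels):
        super().__init__()
        # F(x): weight layer -> ReLU -> weight layer, with batch normalization
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        # the identity shortcut adds the input back to F(x), so gradients
        # can flow straight through even in very deep stacks of blocks
        return self.relu(x + self.f(x))

# usage example: the block preserves the input shape
block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))  # -> shape (1, 64, 56, 56)
```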