[MUSIC] In this video, we will overview modern architectures of neural networks. Let's look at the ImageNet classification dataset. It has 1,000 classes and over 1 million labeled photos, and the human top-5 error rate on this dataset is roughly 5%. Why is it not zero? If you look at examples from this dataset, you can see that some classes are really hard to tell apart. For example, there is a quail or a partridge in the upper right corner, and to me they look exactly the same; I don't know how a computer can distinguish between them.

The first breakthrough happened in 2012, when a deep convolutional neural network, AlexNet, was applied to the ImageNet dataset for the first time. It significantly reduced the top-5 error from 26% to 15%. It uses 11x11, 5x5, and 3x3 convolutions, max pooling, dropout, data augmentation, ReLU activations, and SGD with momentum: all the tricks you know from the previous video. It has 60 million parameters and trains on 2 GPUs for 6 days.

The next breakthrough came in 2015 with the VGG architecture. It is very similar to AlexNet in that it uses convolutional layers followed by pooling layers, just like the LeNet architecture going back to 1998, but it has many more filters. It reduced the ImageNet top-5 error to 8% for a single model. The training of this architecture is similar to AlexNet, but it adds multi-scale cropping as data augmentation. It has 138 million parameters and trains on 4 GPUs for 2 to 3 weeks.

Then in 2015, the Inception architecture came into the world. It is not similar to AlexNet: it is built from the Inception block that was introduced in GoogLeNet, also known as Inception V1. The ImageNet top-5 error was reduced to 5.6% for a single model. You can see that this is a really complex and deep model. It uses batch normalization, image distortions as augmentation, and RMSProp for gradient descent. It has only 25 million parameters, but it trains on 8 GPUs for 2 weeks.

You can see that this deep architecture is made of Inception blocks, highlighted in the blue circle on the slide. We will look in detail at how that block works, but first we have to look at 1x1 convolutions. Such convolutions capture interactions of input channels in one pixel of the feature map. They can reduce the number of channels without hurting the quality of the model, because different channels can correlate. In effect, a 1x1 convolution works like dimensionality reduction with an added ReLU activation, and usually the number of output channels is smaller than the number of input channels.

All operations inside an Inception block use stride 1 and enough padding so that the output keeps the same spatial dimensions, W x H, as the input feature map. Four different feature maps are concatenated on depth at the end, so it looks like a layered cake: we stack all of those feature maps, shown in different colors on the slide, along the depth dimension. Inside the block, we use 1x1 convolutions to reduce the number of filters before the 5x5 and 3x3 convolutions and after the pooling layer, and we also pass the input through a plain 1x1 convolution as a separate branch that goes directly to the output.

Why does this work better? In a simple neural network architecture, you have a fixed kernel size in each convolutional layer. But when you use different scales of the sliding window, say 5x5, 3x3, and 1x1, you can use all of those features at the same time and learn better representations.
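To make this concrete, here is a minimal PyTorch sketch of an Inception-style block with four parallel branches, where 1x1 convolutions reduce the number of channels before the larger convolutions and all branch outputs are concatenated on depth. The channel counts here are illustrative assumptions, not the exact numbers used in GoogLeNet.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Sketch of an Inception-style block; channel counts are illustrative."""
    def __init__(self, in_ch):
        super().__init__()
        # 1x1 branch: captures cross-channel interactions at a single pixel
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, 64, 1), nn.ReLU())
        # 1x1 reduction followed by a 3x3 convolution (padding keeps W x H)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, 64, 1), nn.ReLU(),
            nn.Conv2d(64, 96, 3, padding=1), nn.ReLU())
        # 1x1 reduction followed by a 5x5 convolution
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, 32, 1), nn.ReLU(),
            nn.Conv2d(32, 48, 5, padding=2), nn.ReLU())
        # pooling branch with a 1x1 projection
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, 1), nn.ReLU())

    def forward(self, x):
        # every branch preserves W x H, so the outputs can be stacked
        # along the channel ("depth") dimension like a layered cake
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

# usage example (shapes only): 64 + 96 + 48 + 32 = 240 output channels
block = InceptionBlock(in_ch=192)
y = block(torch.randn(1, 192, 28, 28))  # -> shape (1, 240, 28, 28)
```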
The 5x5 convolutions are currently the most expensive part of our Inception block. Let's replace them with two layers of 3x3 convolutions, which, as we already know, have an effective receptive field of 5x5. You can see in the image that the 5x5 convolution is replaced by two 3x3 convolutions, shown in blue.

Another technique known in computer vision is filter decomposition. It is known that a Gaussian blur filter can be decomposed into two one-dimensional filters: you first blur the source horizontally, then blur the result vertically, and the output is identical to applying the two-dimensional Gaussian blur to the input. Let's use the same idea in our Inception block. The 3x3 convolutions are now the most expensive part, so let's replace each 3x3 layer with a 1x3 layer followed by a 3x1 layer. What we actually do is decompose each 3x3 convolution into a series of one-dimensional convolutions. You can see that the green, blue, and purple 3x3 convolutional layers are each replaced by two layers of one-dimensional convolutions. This is the final state of our Inception block, and this block is used in the Inception V3 architecture.

Another architecture that appeared in 2015 is ResNet. It introduces residual connections and reduces the top-5 ImageNet error down to 4.5% for a single model and 3.5% for an ensemble. It has 152 layers; it has a few expensive 7x7 convolutional layers, but the rest are 3x3. It uses batch normalization and max and average pooling. It has 60 million parameters and trains on 8 GPUs for 2 to 3 weeks.

What is the residual connection in this architecture? What we actually do is produce the output channels by adding a small delta, modeled as F(x), to the original input channels. That F(x) is represented as a weight layer, followed by a ReLU activation, and one more weight layer. This way we can stack thousands of layers, and gradients do not vanish thanks to the residual connection: we always add a small correction to the input channels, which provides better gradient flow during backpropagation. A short code sketch of such a block is shown below.

To summarize, you can see that by stacking more convolutional and pooling layers, you can reduce the error, as in AlexNet or VGG. But you cannot do that forever: you need to utilize new kinds of layers, like Inception blocks or residual connections. You have probably noticed that one needs a lot of time to train a neural network. In the following video, we will discuss a principle known as transfer learning that will help us reduce the training time for a new task. [SOUND] [MUSIC]
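To make the residual connection concrete, here is a minimal PyTorch sketch of a basic residual block, assuming the input and output have the same number of channels so a plain identity shortcut can be used. The two 3x3 convolutions with batch normalization follow the general "weight layer, ReLU, weight layer" description above; the exact configuration in ResNet-152 differs, since it uses bottleneck blocks.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of a residual block computing output = ReLU(x + F(x))."""
    def __init__(self, channels):
        super().__init__()
        # F(x): weight layer -> ReLU -> weight layer, with batch normalization
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        # the identity shortcut adds the input back to F(x), so gradients
        # can flow straight through even in very deep stacks of blocks
        return self.relu(x + self.f(x))

# usage example: the block preserves the input shape
block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))  # -> shape (1, 64, 56, 56)
```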