In this video, we will overview modern architectures of neural networks.

Let's look at the ImageNet classification dataset. It has 1,000 classes and over 1 million labeled photos, and the human top-5 error rate on this dataset is roughly 5%. Why is it not zero? If you look at examples from this dataset, you can see that the classes are really difficult. For example, there is a quail and a partridge in the upper right corner, and to me they look exactly the same; I don't know how a computer can distinguish between them.

The first breakthrough happened in 2012, when a deep convolutional neural network (AlexNet) was applied to the ImageNet dataset for the first time. It significantly reduced the top-5 error from 26% to 15%. It uses 11x11, 5x5, and 3x3 convolutions, max pooling, dropout, data augmentation, ReLU activations, and SGD with momentum: all the tricks you know from the previous video. It has 60 million parameters and trains on 2 GPUs for 6 days.

The next breakthrough came in 2015 with the VGG architecture. It is very similar to AlexNet, because it uses convolutional layers followed by pooling layers, just like the LeNet architecture going back to 1998, but it has many more filters. It reduced the ImageNet top-5 error to 8% for a single model. The training of this architecture is similar to AlexNet, but it uses additional multi-scale cropping as data augmentation. It has 138 million parameters and trains on 4 GPUs for 2 to 3 weeks.

Then, in 2015, the Inception architecture came to the world. It is not similar to AlexNet: it uses the Inception block that was introduced in GoogLeNet, also known as Inception V1. The ImageNet top-5 error was reduced to 5.6% for a single model. You can see that this is a really complex and deep model. It uses batch normalization, image distortions as augmentation, and RMSProp for gradient descent. It has only 25 million parameters, but it trains on 8 GPUs for 2 weeks.

You can see that this deep architecture is made of Inception blocks, one of which is highlighted in the blue circle. We will look in detail at how that block works, but first we have to look at 1x1 convolutions. Such convolutions capture interactions of the input channels in one pixel of the feature map. They can reduce the number of channels without hurting the quality of the model, because different channels can correlate. They effectively work like dimensionality reduction with an added ReLU activation, and usually the number of output channels is less than the number of input channels.
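To make this concrete, here is a minimal sketch in PyTorch of a 1x1 convolution used for channel reduction. The specific sizes (256 input channels reduced to 64 on a 28x28 feature map) are illustrative assumptions, not values from the lecture.

```python
import torch
import torch.nn as nn

# A 1x1 convolution mixes information across channels at every spatial
# position independently; with fewer output channels than input channels,
# it acts as a learned dimensionality reduction followed by a ReLU.
reduce_channels = nn.Sequential(
    nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1),  # illustrative sizes
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 256, 28, 28)   # one 28x28 feature map with 256 channels
y = reduce_channels(x)
print(y.shape)                    # torch.Size([1, 64, 28, 28]): spatial size is unchanged
```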
All operations inside an Inception block use stride 1 and enough padding to keep the same spatial dimensions (W x H) of the feature map. Four different feature maps are concatenated on depth at the end, so it looks like a layered cake: we stack all of those feature maps, shown in different colors, along the depth dimension. Inside the block, we use 1x1 convolutions to reduce the number of filters, we use 5x5 and 3x3 convolutions and a pooling layer, and we also add the input to the output through a plain 1x1 convolution.

Why does it work better? In a simple neural network architecture, you have a fixed kernel size in a convolutional layer. But when you use different scales of that sliding window, say 5x5, 3x3, and 1x1, you can use all of those features at the same time and learn better representations.

Let's replace the 5x5 convolutions. They are currently the most expensive part of our Inception block. Let's replace them with two layers of 3x3 convolutions, which, as we already know, have an effective receptive field of 5x5. You can see in the image that we replaced the 5x5 convolution with two 3x3 convolutions, shown in blue.

Another technique known in computer vision is filter decomposition. It is known that a Gaussian blur filter can be decomposed into two one-dimensional filters: you first blur the source horizontally, then you blur the result vertically, and you get an output identical to applying the 2D Gaussian blur to the input. Let's use the same idea in our Inception block. The 3x3 convolutions are now the most expensive part, so let's replace each 3x3 layer with a 1x3 layer followed by a 3x1 layer. What we actually do is decompose that 3x3 convolution into a series of one-dimensional convolutions. You can see that the green, blue, and purple 3x3 convolutional layers are each replaced by two layers of one-dimensional convolutions. This is the final state of our Inception block, and this block is used in the Inception V3 architecture.
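As an illustration of these ideas, here is a minimal sketch of an Inception-style block in PyTorch: parallel branches with stride 1 and matching padding, 1x1 channel reduction, two stacked 3x3 convolutions standing in for a 5x5, a 3x3 factorized into 1x3 followed by 3x1, and a pooling branch, all concatenated on depth. The branch layout and channel counts are simplifying assumptions and do not reproduce the exact Inception V3 block.

```python
import torch
import torch.nn as nn

class InceptionStyleBlock(nn.Module):
    """Illustrative Inception-style block: every branch keeps the spatial
    size (stride 1, 'same'-style padding), and the four resulting feature
    maps are concatenated along the channel (depth) dimension."""

    def __init__(self, in_channels=192, mid_channels=64, out_per_branch=64):
        super().__init__()
        # Branch 1: a plain 1x1 convolution that passes the input to the output.
        self.branch_1x1 = nn.Conv2d(in_channels, out_per_branch, kernel_size=1)
        # Branch 2: 1x1 reduction, then a 3x3 factorized into 1x3 followed by 3x1.
        self.branch_factorized = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=(1, 3), padding=(0, 1)),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_per_branch, kernel_size=(3, 1), padding=(1, 0)),
        )
        # Branch 3: 1x1 reduction, then two 3x3 convolutions (effective receptive field 5x5).
        self.branch_double_3x3 = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_per_branch, kernel_size=3, padding=1),
        )
        # Branch 4: pooling followed by a 1x1 convolution.
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, out_per_branch, kernel_size=1),
        )

    def forward(self, x):
        branches = [self.branch_1x1(x), self.branch_factorized(x),
                    self.branch_double_3x3(x), self.branch_pool(x)]
        return torch.cat(branches, dim=1)  # concatenate the feature maps on depth

x = torch.randn(1, 192, 28, 28)
print(InceptionStyleBlock()(x).shape)  # torch.Size([1, 256, 28, 28])
```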
Another architecture that appeared in 2015 is ResNet. It introduces residual connections and reduces the top-5 ImageNet error down to 4.5% for a single model and 3.5% for an ensemble. It has 152 layers; it has a few 7x7 convolutional layers that are expensive, but the rest are 3x3. It uses batch normalization, max pooling, and average pooling. It has 60 million parameters and trains on 8 GPUs for 2 to 3 weeks.

What is the residual connection in this architecture? What we actually do is create the output channels by adding a small delta, which is modeled as F(x), to the original input channels. That F(x) is represented as a weight layer, followed by a ReLU activation, and one more weight layer (see the code sketch at the end of this transcript). This way we can stack thousands of layers and the gradients do not vanish, thanks to that residual connection: we always add a small number to the input channels, which provides better gradient flow during backpropagation.

To summarize, you can see that by stacking more convolutional and pooling layers, you can reduce the error, as in AlexNet or VGG. But you cannot do that forever; you need to utilize new kinds of layers, like Inception blocks or residual connections. You have probably noticed that one needs a lot of time to train a neural network. In the following video, we will discuss the principle known as transfer learning, which will help us reduce the training time for a new task.
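Here is a minimal sketch in PyTorch of the residual connection described above: the output is the input plus a small delta F(x), where F is a weight layer, a ReLU, and one more weight layer. Using two 3x3 convolutions as the weight layers and 64 channels is an illustrative assumption, not the exact ResNet block from the lecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Illustrative residual block computing x + F(x), where F(x) is
    weight layer -> ReLU -> weight layer (two 3x3 convolutions here)."""

    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        delta = self.conv2(F.relu(self.conv1(x)))  # the small delta F(x)
        return x + delta                           # residual (skip) connection

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock()(x).shape)  # torch.Size([1, 64, 56, 56]); the skip path lets gradients flow
```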