When designing a layer for a ConvNet, you might have to pick: do you want a 1 by 1 filter, or 3 by 3, or 5 by 5, or do you want a pooling layer? What the inception network says is, why not do them all? This makes the network architecture more complicated, but it also works remarkably well. Let's see how this works.

Let's say, for the sake of example, that you have input a 28 by 28 by 192 dimensional volume. What the inception network, or an inception layer, says is: instead of choosing what filter size you want in a conv layer, or even whether you want a convolutional layer or a pooling layer, let's do them all. So you could use a 1 by 1 convolution, and that will output a 28 by 28 by something volume, let's say 28 by 28 by 64, and you just have a volume there. But maybe you also want to try a 3 by 3, and that might output 28 by 28 by 128. Then what you do is just stack up this second volume next to the first volume. To make the dimensions match up, let's make this a same convolution, so the output is still 28 by 28, the same as the input in terms of height and width, but 28 by 28 by, in this example, 128. And maybe you might say, well, I want to hedge my bets; maybe a 5 by 5 filter works better, so let's do that too and have it output 28 by 28 by 32, again using a same convolution to keep the dimensions the same. And maybe you don't want a convolutional layer at all: let's apply pooling, which has some other output, and let's stack that up as well. Here the pooling outputs 28 by 28 by 32.

Now, in order to make all the dimensions match, you actually need to use padding for the max pooling. This is an unusual form of pooling, because if the input is 28 by 28 in height and width and you want the output to match everything else at 28 by 28 as well, then you need to use same padding and a stride of one for the pooling. This detail might seem a bit funny to you now, but let's keep going; we'll make it all work later. But with an inception module like this, you can input some volume and output another volume.
In this case, if you add up all these numbers, 32 plus 32 plus 128 plus 64, that's equal to 256. So you will have one inception module that inputs 28 by 28 by 192 and outputs 28 by 28 by 256. This is the heart of the inception network, which is due to Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke and Andrew Rabinovich. The basic idea is that instead of needing to pick one of these filter sizes or pooling and committing to that, you can do them all, just concatenate all the outputs, and let the network learn whatever parameters it wants to use, whatever combination of these filter sizes it wants.

Now, it turns out that there is a problem with the inception layer as we've described it here, which is computational cost. On the next slide, let's figure out the computational cost of the 5 by 5 filter that produces this block over here. Focusing on just the 5 by 5 part from the previous slide, we had as input a 28 by 28 by 192 block, and you implement a 5 by 5 same convolution with 32 filters to output 28 by 28 by 32. On the previous slide I had drawn this as a thin purple slice; I'm just going to draw it as a more normal-looking blue block here. So let's look at the computational cost of outputting this 28 by 28 by 32 volume. You have 32 filters, because the output has 32 channels, and each filter is going to be 5 by 5 by 192. The output size is 28 by 28 by 32, so you need to compute 28 by 28 by 32 numbers, and for each of them you need to do 5 by 5 by 192 multiplications. So the total number of multiplies you need is the number of multiplies it takes to compute each output value times the number of output values you need to compute. If you multiply all of these numbers together, this is equal to about 120 million. And while you can do 120 million multiplies on a modern computer, this is still a pretty expensive operation.
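To make the branch-and-concatenate idea concrete, here is a minimal sketch of the naive inception block just described, assuming PyTorch; the class name and channel counts (64, 128, 32, 32 on a 28 by 28 by 192 input) follow the example above, and the 1 by 1 convolution inside the pooling branch is an assumption added so the pooled output has 32 channels rather than 192, consistent with the 256-channel total. The 5 by 5 branch in this sketch is exactly the one whose cost works out to roughly 120 million multiplies.

```python
import torch
import torch.nn as nn

class NaiveInceptionBlock(nn.Module):
    """Run 1x1, 3x3, and 5x5 convolutions plus max pooling in parallel
    and concatenate the results along the channel dimension."""

    def __init__(self, in_channels=192):
        super().__init__()
        # "Same" convolutions: padding keeps height and width at 28 x 28.
        self.conv1x1 = nn.Conv2d(in_channels, 64, kernel_size=1)
        self.conv3x3 = nn.Conv2d(in_channels, 128, kernel_size=3, padding=1)
        self.conv5x5 = nn.Conv2d(in_channels, 32, kernel_size=5, padding=2)
        # Pooling uses stride 1 and padding so its output is also 28 x 28.
        # A plain max pool would keep all 192 channels, so a 1x1 convolution
        # (an assumption, not spelled out in this lecture) projects it to 32.
        self.pool_branch = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, 32, kernel_size=1),
        )

    def forward(self, x):
        branches = [self.conv1x1(x), self.conv3x3(x),
                    self.conv5x5(x), self.pool_branch(x)]
        return torch.cat(branches, dim=1)  # 64 + 128 + 32 + 32 = 256 channels


x = torch.randn(1, 192, 28, 28)
print(NaiveInceptionBlock()(x).shape)  # torch.Size([1, 256, 28, 28])
```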
On the next slide you'll see how, using the idea of 1 by 1 convolutions, which you learned about in the previous video, you can reduce the computational cost by about a factor of 10, going from about 120 million multiplies to about one tenth of that. So please remember the number 120 million so you can compare it with what you see on the next slide.

Here is an alternative architecture for inputting 28 by 28 by 192 and outputting 28 by 28 by 32, which is the following. You take the input volume, use a 1 by 1 convolution to reduce it to 16 channels instead of 192 channels, and then, on this much smaller volume, run your 5 by 5 convolution to give you the final output. Notice the input and output dimensions are still the same: you input 28 by 28 by 192 and output 28 by 28 by 32, the same as on the previous slide. But what we've done is take the huge volume we had on the left and shrink it to a much smaller intermediate volume, which has only 16 channels instead of 192. Sometimes this is called a bottleneck layer, I guess because a bottleneck is usually the smallest part of something: if you have a glass bottle, the neck, where the cork goes, is the smallest part of the bottle. In the same way, the bottleneck layer is the smallest part of this network; we shrink the representation before increasing the size again.

Now let's look at the computational cost involved. To apply this 1 by 1 convolution, we have 16 filters, and each filter is of dimension 1 by 1 by 192; this 192 matches the input's 192 channels. So the cost of computing this 28 by 28 by 16 volume is the number of outputs, 28 by 28 by 16, times the number of multiplications you need for each of them, 1 times 1 times 192. If you multiply this out, it's about 2.4 million. That's the cost of this first convolutional layer.
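Before tallying the cost of the second layer, here is what this bottleneck version of the 5 by 5 branch looks like as code, a minimal sketch again assuming PyTorch (the variable name is illustrative, not from the lecture): a 1 by 1 convolution down to 16 channels followed by the 5 by 5 same convolution up to 32.

```python
import torch
import torch.nn as nn

# Bottleneck version of the 5x5 branch: a 1x1 convolution shrinks the volume
# from 192 channels down to 16, then a 5x5 "same" convolution expands it to 32.
bottleneck_5x5 = nn.Sequential(
    nn.Conv2d(192, 16, kernel_size=1),            # 28 x 28 x 192 -> 28 x 28 x 16
    nn.Conv2d(16, 32, kernel_size=5, padding=2),  # 28 x 28 x 16  -> 28 x 28 x 32
)

x = torch.randn(1, 192, 28, 28)
print(bottleneck_5x5(x).shape)  # torch.Size([1, 32, 28, 28]), same as the naive 5x5 branch
```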
How about the second one? The cost of this second convolutional layer is the number of outputs, 28 by 28 by 32, and for each of those outputs you apply a 5 by 5 by 16 dimensional filter, so it's 28 by 28 by 32 times 5 by 5 by 16. Multiply that out and it equals about 10 million. So the total number of multiplications you need to do is the sum of those, which is about 12.4 million. If you compare this with what we had on the previous slide, you've reduced the computational cost from about 120 million multiplies down to about one tenth of that, about 12.4 million multiplications. The number of additions you need to do is very similar to the number of multiplications, which is why I'm just counting multiplications.

So to summarize: if you are building a layer of a neural network and you don't want to have to decide whether you want a 1 by 1, or 3 by 3, or 5 by 5, or pooling layer, the inception module lets you say, let's do them all and concatenate the results. We then ran into the problem of computational cost, and what you saw here was how, using a 1 by 1 convolution, you can create a bottleneck layer and thereby reduce the computational cost significantly. Now you might be wondering, does shrinking down the representation size so dramatically hurt the performance of your neural network? It turns out that so long as you implement the bottleneck layer within reason, you can shrink down the representation size significantly without seeming to hurt performance, while saving a lot of computation. So these are the key ideas of the inception module. Let's put them together, and in the next video I'll show you what the full inception network looks like.
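For reference, here is the arithmetic from this example as a quick check in plain Python; it reproduces the roughly 120 million and 12.4 million multiply counts from the lecture.

```python
# Naive 5x5 branch: one multiply per filter weight per output value.
naive = (28 * 28 * 32) * (5 * 5 * 192)        # 120,422,400  (~120 million)

# Bottleneck version: cost of the 1x1 layer plus cost of the 5x5 layer.
cost_1x1 = (28 * 28 * 16) * (1 * 1 * 192)     # 2,408,448    (~2.4 million)
cost_5x5 = (28 * 28 * 32) * (5 * 5 * 16)      # 10,035,200   (~10 million)
bottleneck = cost_1x1 + cost_5x5              # 12,443,648   (~12.4 million)

print(naive / bottleneck)                     # ~9.7, roughly a factor of 10 savings
```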