In the previous video, you've already seen all the basic building blocks of the Inception network. In this video, let's see how you can put these building blocks together to build your own Inception network.

So the Inception module takes as input the activation or the output from some previous layer. Let's say, for the sake of argument, this is 28 by 28 by 192, same as in our previous video. The example we worked through in depth was the one by one followed by the five by five layer. So maybe the one by one has 16 channels, and then the five by five outputs 28 by 28 by, let's say, 32 channels. And this was the example we worked through on the last slide of the previous video.

Then, to save computation on your three by three convolution, you can also do the same here, and the three by three outputs 28 by 28 by 128. And then maybe you want to consider a one by one convolution as well. There's no need to do a one by one conv followed by another one by one conv, so there's just one step here, and let's say this outputs 28 by 28 by 64.

And then finally is the pooling layer. Here we're going to do something funny. In order to concatenate all of these outputs at the end, we're going to use same padding for the max pooling, so that the output height and width are still 28 by 28 and we can concatenate it with these other outputs. But notice that if you do max pooling, even with same padding, a three by three filter, and a stride of one, the output here will be 28 by 28 by 192. It will have the same number of channels and the same depth as the input that we had here. So this seems like it has a lot of channels. What we're going to do is add one more one by one conv layer, to do what we saw in the one by one convolution video and shrink the number of channels, so as to get this down to 28 by 28 by, let's say, 32. And the way you do that is to use 32 filters of dimension one by one by 192.
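To make the shapes concrete, here's a minimal sketch of that pooling branch: a same-padding max pool followed by the one by one projection down to 32 channels. The use of PyTorch and the variable names are my own choices; the channel numbers (192 in, 32 out) come from the example above.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 192, 28, 28)  # activation from the previous layer: 28 x 28 x 192

# 3x3 max pooling with stride 1 and padding 1 ("same" padding):
# height and width stay 28 x 28, and the channel count stays at 192.
pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

# 32 filters of dimension 1 x 1 x 192 shrink the channels from 192 down to 32.
project = nn.Conv2d(in_channels=192, out_channels=32, kernel_size=1)

out = project(pool(x))
print(out.shape)  # torch.Size([1, 32, 28, 28])  ->  28 x 28 x 32
```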
So that's why the output dimension has the number of channels shrunk down to 32, so that you don't end up with the pooling layer taking up all the channels in the final output. And then finally, you take all of these blocks and you do channel concatenation: you just concatenate across the channels, 64 plus 128 plus 32 plus 32, and if you add that up, it gives you a 28 by 28 by 256 dimensional output. Channel concat is just concatenating the blocks that we saw in the previous video. So this is one Inception module, and what the Inception network does is, more or less, put a lot of these modules together.

Here's a picture of the Inception network, taken from the paper by Szegedy et al., and you'll notice a lot of repeated blocks in it. Maybe this picture looks really complicated, but if you look at one of the blocks, that block is basically the Inception module that you saw on the previous slide. And subject to a few details I won't discuss, this is another Inception block, and this is another Inception block. There are some extra max pooling layers here to change the dimension of the height and width. Then there's another Inception block, and then there's another max pool here to change the height and width, but basically it's another Inception block. So the Inception network is just a lot of these blocks that you've learned about, repeated at different positions of the network. And so if you understand the Inception block from the previous slide, then you understand the Inception network.

It turns out that there is one last detail to the Inception network if you read the original research paper, which is that there are these additional side branches that I just added. So what do they do? Well, the last few layers of the network are a fully connected layer followed by a softmax layer to try to make a prediction. What these side branches do is take some hidden layer and try to use that to make a prediction. So this is actually a softmax output, and so is that. And this other side branch, again, takes a hidden layer, passes it through a few layers, a few fully connected layers, and then has a softmax try to predict the output label.
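Before moving on, here's a minimal sketch that pulls the four branches and the channel concatenation together into one Inception module, matching the numbers in the example above (so the output is 28 by 28 by 64 + 128 + 32 + 32 = 256). PyTorch, the class and variable names, and the three by three branch's bottleneck width (96 below, which the video doesn't specify) are my own assumptions, and I've added ReLU nonlinearities after the convolutions, which the lecture doesn't dwell on.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InceptionModule(nn.Module):
    """A simplified Inception module matching the lecture example:
    input 28 x 28 x 192 -> output 28 x 28 x (64 + 128 + 32 + 32) = 256 after concatenation."""
    def __init__(self, in_channels=192):
        super().__init__()
        # Branch 1: a single 1x1 convolution, 64 output channels.
        self.branch1 = nn.Conv2d(in_channels, 64, kernel_size=1)
        # Branch 2: 1x1 bottleneck (width assumed), then 3x3 conv with same padding, 128 channels.
        self.branch2_reduce = nn.Conv2d(in_channels, 96, kernel_size=1)
        self.branch2 = nn.Conv2d(96, 128, kernel_size=3, padding=1)
        # Branch 3: 1x1 bottleneck down to 16 channels, then 5x5 conv with same padding, 32 channels.
        self.branch3_reduce = nn.Conv2d(in_channels, 16, kernel_size=1)
        self.branch3 = nn.Conv2d(16, 32, kernel_size=5, padding=2)
        # Branch 4: same-padding 3x3 max pool, then a 1x1 projection down to 32 channels.
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.branch4 = nn.Conv2d(in_channels, 32, kernel_size=1)

    def forward(self, x):
        b1 = F.relu(self.branch1(x))
        b2 = F.relu(self.branch2(F.relu(self.branch2_reduce(x))))
        b3 = F.relu(self.branch3(F.relu(self.branch3_reduce(x))))
        b4 = F.relu(self.branch4(self.pool(x)))
        # Channel concatenation: stack the four outputs along the channel dimension.
        return torch.cat([b1, b2, b3, b4], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionModule()(x).shape)  # torch.Size([1, 256, 28, 28])
```

Stacking blocks like this one, with occasional max pooling layers in between to reduce the height and width, is essentially what the full network described above does.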
You should think of these side branches as maybe just another detail of the Inception network, but what they do is help ensure that the features computed, even in the hidden units, even at those intermediate layers, are not too bad for predicting the output class of an image. And this appears to have a regularizing effect on the Inception network and helps prevent the network from overfitting.

Oh, and by the way, this particular Inception network was developed by authors at Google, who called it GoogLeNet, spelled like that, to pay homage to the LeNet network that you learned about in an earlier video as well. So I think it's actually really nice that the deep learning community is so collaborative, and that there's such strong, healthy respect for each other's work.

Finally, here's one fun fact: where does the name Inception network come from? The Inception paper actually cites the "we need to go deeper" meme, and the URL is an actual reference in the Inception paper which links to that image. If you've seen the movie Inception, maybe this meme will make sense to you. The authors actually cite this meme as motivation for needing to build deeper neural networks, and that's how they came up with the Inception architecture. So I guess it's not often that research papers get to cite internet memes in their citations, but in this case, I guess it worked out quite well.

So to summarize: if you understand the Inception module, then you understand the Inception network, which is largely the Inception module repeated a bunch of times throughout the network. Since the development of the original Inception module, the authors and others have built on it and come up with other versions as well. So there are research papers on newer versions of the Inception algorithm, and you sometimes see people use some of these later versions in their work, like Inception V2, Inception V3, and Inception V4. There's also an Inception version that's combined with the ResNet idea of having skip connections, and that sometimes works even better.
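Going back to the side branches for a moment, here's a rough sketch of what one of those auxiliary classifiers might look like: an intermediate activation is pooled, reduced with a one by one convolution, and passed through a couple of fully connected layers into a softmax. The specific sizes below (the 4 by 4 pooled map, 128 channels, 1024 hidden units, 0.7 dropout) are assumptions loosely following the GoogLeNet paper, not something stated in the video.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxiliaryClassifier(nn.Module):
    """A side branch that takes an intermediate activation and predicts the class label.
    Layer sizes here are rough assumptions, loosely following the GoogLeNet paper."""
    def __init__(self, in_channels, num_classes=1000):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(4)                      # pool the feature map down to 4 x 4
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=1)   # 1x1 conv to reduce channels
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)                  # first fully connected layer
        self.dropout = nn.Dropout(p=0.7)
        self.fc2 = nn.Linear(1024, num_classes)                  # class scores, fed to a softmax loss

    def forward(self, x):
        x = F.relu(self.conv(self.pool(x)))
        x = torch.flatten(x, start_dim=1)
        x = self.dropout(F.relu(self.fc1(x)))
        return self.fc2(x)  # apply softmax / cross-entropy on these logits during training
```

In the original paper, the losses from these side outputs are added to the main loss with a small weight during training, and the side branches are discarded at test time.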
All of these Inception variations, though, are built on the basic idea that you learned about in this and the previous video: coming up with the Inception module and then stacking up a bunch of them together. And with these videos, you should be able to read and understand, I think, the Inception paper, as well as maybe some of the papers describing the later variations.

So that's it. You've gone through quite a lot of specialized neural network architectures. In the next video, I want to start showing you some more practical advice on how to actually use these algorithms to build your own computer vision system. Let's go on to the next video.