1 00:00:00,190 --> 00:00:01,320 In a previous video, 2 00:00:01,320 --> 00:00:06,420 you've already seen all the basic building blocks of the Inception network. 3 00:00:06,420 --> 00:00:10,680 In this video, let's see how you can put these building blocks together 4 00:00:10,680 --> 00:00:12,780 to build your own Inception network. 5 00:00:13,910 --> 00:00:17,970 So the Inception module takes as input the activation or 6 00:00:17,970 --> 00:00:20,260 the output from some previous layer. 7 00:00:20,260 --> 00:00:25,860 So let's say for the sake of argument this is 28 by 28 by 192, 8 00:00:25,860 --> 00:00:28,030 same as in our previous video. 9 00:00:28,030 --> 00:00:35,130 The example we worked through in depth was the 1 by 1 followed by the 5 by 5. 10 00:00:35,130 --> 00:00:41,120 There, so maybe the 1 by 1 has 16 channels, and 11 00:00:41,120 --> 00:00:48,170 then the 5 by 5 will output 28 by 28 by, let's say, 32 channels. 12 00:00:49,700 --> 00:00:53,790 And this is the example we worked through on the last slide of the previous video. 13 00:00:54,900 --> 00:01:02,260 Then to save computation on your 3 by 3 convolution, you can also do the same here. 14 00:01:02,260 --> 00:01:08,310 And then the 3 by 3 outputs 28 by 28 by 128. 15 00:01:09,390 --> 00:01:14,100 And then maybe you want to consider a 1 by 1 convolution as well. 16 00:01:14,100 --> 00:01:18,481 There's no need to do a 1 by 1 conv followed by another 1 by 1 conv, so 17 00:01:18,481 --> 00:01:23,409 there's just one step here, and let's say this outputs 28 by 28 by 64. 18 00:01:23,409 --> 00:01:31,900 And then finally there is the pooling layer. 19 00:01:34,000 --> 00:01:35,730 So here I'm going to do something funny. 20 00:01:35,730 --> 00:01:40,154 In order to really concatenate all of these outputs at the end, 21 00:01:40,154 --> 00:01:44,073 we are going to use same padding for the pooling, 22 00:01:44,073 --> 00:01:47,480 so that the output height and width is still 28 by 28 23 00:01:47,480 --> 00:01:53,300 and we can concatenate it with these other outputs. 24 00:01:53,300 --> 00:01:57,842 But notice that if you do max pooling, even with same padding, 25 00:01:57,842 --> 00:01:59,917 with a 3 by 3 filter and a stride of 1, 26 00:01:59,917 --> 00:02:07,016 the output here will be 28 by 28 by 192. 27 00:02:07,016 --> 00:02:10,790 It will have the same number of channels, 28 00:02:10,790 --> 00:02:15,570 the same depth, as the input that we had here. 29 00:02:15,570 --> 00:02:19,252 So this seems like it has a lot of channels. 30 00:02:19,252 --> 00:02:23,752 So what we're going to do is actually add one more 1 by 1 conv 31 00:02:23,752 --> 00:02:28,607 layer, like what we saw in the 1 by 1 convolution video, 32 00:02:28,607 --> 00:02:31,541 to shrink the number of channels. 33 00:02:31,541 --> 00:02:37,730 So it gets us down to 28 by 28 by, let's say, 32. 34 00:02:37,730 --> 00:02:44,770 And the way you do that is to use 32 filters 35 00:02:44,770 --> 00:02:49,660 of dimension 1 by 1 by 192. 36 00:02:49,660 --> 00:02:54,110 So that's why the output has its number of channels shrunk down to 32, 37 00:02:54,110 --> 00:02:58,460 and we don't end up with the pooling layer 38 00:02:58,460 --> 00:03:00,920 taking up all the channels in the final output. 39 00:03:02,310 --> 00:03:08,121 And finally you take all of these blocks and you do channel concatenation.
40 00:03:08,121 --> 00:03:12,865 You just concatenate across this 64 plus 128 plus 41 00:03:12,865 --> 00:03:18,165 32 plus 32, and if you add it up this gives you 42 00:03:18,165 --> 00:03:24,055 a 28 by 28 by 256 dimensional output. 43 00:03:24,055 --> 00:03:30,879 Channel concat is just concatenating these blocks, as we saw in the previous video. 44 00:03:33,918 --> 00:03:39,598 So this is one Inception module, and what the Inception network does 45 00:03:39,598 --> 00:03:44,057 is, more or less, put a lot of these modules together. 46 00:03:45,624 --> 00:03:50,671 Here's a picture of the Inception network, taken from the paper by Szegedy et al. 47 00:03:53,615 --> 00:03:56,570 And you notice a lot of repeated blocks in this. 48 00:03:56,570 --> 00:03:58,580 Maybe this picture looks really complicated. 49 00:03:58,580 --> 00:04:03,563 But if you look at one of the blocks there, that block is basically 50 00:04:03,563 --> 00:04:07,920 the Inception module that you saw on the previous slide. 51 00:04:10,046 --> 00:04:17,085 And subject to little details I won't discuss, this is another Inception block. 52 00:04:17,085 --> 00:04:19,460 This is another Inception block. 53 00:04:19,460 --> 00:04:23,200 There are some extra max pooling layers here to change the dimension 54 00:04:24,250 --> 00:04:25,800 of the height and width. 55 00:04:25,800 --> 00:04:28,140 But that's another Inception block. 56 00:04:28,140 --> 00:04:31,020 And then there's another max pool here to change the height and width, but 57 00:04:31,020 --> 00:04:33,280 basically that's another Inception block. 58 00:04:33,280 --> 00:04:37,370 So the Inception network is just a lot of these blocks that you've learned about, 59 00:04:37,370 --> 00:04:40,930 repeated at different positions of the network. 60 00:04:40,930 --> 00:04:44,509 So if you understand the Inception block from the previous slide, 61 00:04:44,509 --> 00:04:46,914 then you understand the Inception network. 62 00:04:49,518 --> 00:04:53,403 It turns out that there's one last detail to the Inception network, if you read 63 00:04:53,403 --> 00:04:55,430 the optional research paper, 64 00:04:55,430 --> 00:04:59,311 which is that there are these additional side branches that I just added. 65 00:05:01,835 --> 00:05:03,440 So what do they do? 66 00:05:03,440 --> 00:05:07,800 Well, the last few layers of the network are a fully connected layer 67 00:05:07,800 --> 00:05:11,360 followed by a softmax layer to try to make a prediction. 68 00:05:11,360 --> 00:05:15,430 What these side branches do is take some hidden layer and 69 00:05:15,430 --> 00:05:17,760 try to use that to make a prediction. 70 00:05:17,760 --> 00:05:22,140 So this is actually a softmax output, and so is that. 71 00:05:22,140 --> 00:05:23,952 And this other side branch, 72 00:05:23,952 --> 00:05:29,173 again, takes a hidden layer and passes it through a few layers, like a few fully connected layers, 73 00:05:29,173 --> 00:05:33,243 and then has a softmax try to predict what the output label is. 74 00:05:35,545 --> 00:05:38,798 And you should think of this as maybe just another detail of how the Inception 75 00:05:38,798 --> 00:05:40,000 network works. 76 00:05:40,000 --> 00:05:44,460 But what it does is it helps to ensure that the features computed, 77 00:05:44,460 --> 00:05:48,060 even in the hidden units, even at intermediate layers, 78 00:05:48,060 --> 00:05:52,490 are not too bad for predicting the output class of an image.
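As a rough sketch, one of these side branches might look something like the PyTorch snippet below. PyTorch itself and the specific layer sizes (the 4 by 4 pooled size, the 128-channel 1 by 1 conv, the 1024-unit fully connected layer) are my own illustrative assumptions, not the authors' exact implementation: the branch takes an intermediate activation, pools it, shrinks the channels, runs it through a fully connected layer, and produces its own class scores for a softmax prediction.

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """A side-branch head attached to an intermediate activation.
    Layer sizes here are illustrative assumptions, not the paper's exact values."""
    def __init__(self, in_channels, num_classes=1000):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(4)                      # reduce spatial size to 4x4
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=1)   # 1x1 conv to shrink channels
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)                  # fully connected layer
        self.fc2 = nn.Linear(1024, num_classes)                  # class scores (softmax applied in the loss)

    def forward(self, x):
        x = torch.relu(self.conv(self.pool(x)))
        x = torch.relu(self.fc1(torch.flatten(x, 1)))
        return self.fc2(x)

# Hypothetical usage: attach the head to a 14x14x512 intermediate activation.
aux = AuxClassifier(in_channels=512)
logits = aux(torch.randn(1, 512, 14, 14))    # shape (1, 1000)
```

During training, this auxiliary prediction gets its own cross-entropy loss, which is typically down-weighted and added to the loss of the main softmax output at the end of the network.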
79 00:05:52,490 --> 00:05:56,875 And this appears to have a regularizing effect on the Inception network and 80 00:05:56,875 --> 00:05:59,667 helps prevent this network from overfitting. 81 00:06:03,048 --> 00:06:07,907 And by the way, this particular Inception network 82 00:06:07,907 --> 00:06:11,770 was developed by authors at Google, 83 00:06:11,770 --> 00:06:18,850 who called it GoogLeNet, spelled like that, to pay homage to the LeNet network 84 00:06:18,850 --> 00:06:21,350 that you learned about in an earlier video as well. 85 00:06:23,460 --> 00:06:29,086 So I think it's actually really nice that the deep learning community is so 86 00:06:29,086 --> 00:06:30,436 collaborative, 87 00:06:30,436 --> 00:06:32,593 and that there's such strong, healthy respect for 88 00:06:32,593 --> 00:06:35,865 each other's work in the deep learning community. 89 00:06:35,865 --> 00:06:37,895 Finally, here's one fun fact. 90 00:06:37,895 --> 00:06:40,425 Where does the name Inception network come from? 91 00:06:41,585 --> 00:06:47,375 The Inception paper actually cites this meme, "We need to go deeper." 92 00:06:47,375 --> 00:06:52,315 And this URL is an actual reference in the Inception paper, 93 00:06:52,315 --> 00:06:54,320 which links to this image. 94 00:06:54,320 --> 00:06:57,610 And if you've seen the movie titled Inception, 95 00:06:57,610 --> 00:07:00,070 maybe this meme will make sense to you. 96 00:07:00,070 --> 00:07:05,380 But the authors actually cite this meme as motivation for 97 00:07:05,380 --> 00:07:09,040 needing to build deeper neural networks. 98 00:07:09,040 --> 00:07:12,890 And that's how they came up with the Inception architecture. 99 00:07:12,890 --> 00:07:17,830 So I guess it's not often that research papers get to cite Internet memes 100 00:07:17,830 --> 00:07:19,030 in their citations. 101 00:07:19,030 --> 00:07:22,165 But in this case, I guess it worked out quite well. 102 00:07:23,285 --> 00:07:27,015 So to summarize, if you understand the Inception module, 103 00:07:27,015 --> 00:07:29,765 then you understand the Inception network, 104 00:07:29,765 --> 00:07:33,865 which is largely the Inception module repeated a bunch of times 105 00:07:33,865 --> 00:07:35,435 throughout the network. 106 00:07:35,435 --> 00:07:40,025 Since the development of the original Inception module, the authors and 107 00:07:40,025 --> 00:07:43,760 others have built on it and come up with other versions as well. 108 00:07:43,760 --> 00:07:49,090 So there are research papers on newer versions of the Inception algorithm, 109 00:07:49,090 --> 00:07:53,380 and you sometimes see people use some of these later versions as well 110 00:07:53,380 --> 00:07:57,040 in their work, like Inception v2, Inception v3, Inception v4. 111 00:07:57,040 --> 00:07:59,360 There's also a version of Inception 112 00:07:59,360 --> 00:08:02,800 combined with the ResNet idea of having skip connections, 113 00:08:02,800 --> 00:08:05,740 and that sometimes works even better. 114 00:08:05,740 --> 00:08:10,600 But all of these variations are built on the basic idea that you learned about 115 00:08:10,600 --> 00:08:14,710 in the previous video of coming up with the Inception module and 116 00:08:14,710 --> 00:08:17,510 then stacking up a bunch of them together.
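To make that concrete, here is a minimal sketch of one such Inception module in PyTorch. This is my own illustration, not the authors' original implementation; it uses the channel counts from the example earlier in this video (a 28 by 28 by 192 input, branch outputs of 64, 128, 32, and 32 channels, concatenated into 28 by 28 by 256), and the 96-channel bottleneck in front of the 3 by 3 convolution is an assumed value that the video does not specify.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """One Inception module; ReLU nonlinearities are omitted for brevity."""
    def __init__(self, in_channels=192):
        super().__init__()
        # Branch 1: a single 1x1 convolution -> 28x28x64.
        self.branch1x1 = nn.Conv2d(in_channels, 64, kernel_size=1)
        # Branch 2: 1x1 bottleneck (96 channels is an assumed value) followed
        # by a 3x3 convolution with same padding -> 28x28x128.
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(in_channels, 96, kernel_size=1),
            nn.Conv2d(96, 128, kernel_size=3, padding=1),
        )
        # Branch 3: 1x1 bottleneck down to 16 channels, then a 5x5
        # convolution with same padding -> 28x28x32.
        self.branch5x5 = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
        )
        # Branch 4: 3x3 max pooling with stride 1 and same padding keeps
        # 28x28x192; the 1x1 convolution then shrinks it to 32 channels.
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, 32, kernel_size=1),
        )

    def forward(self, x):
        # Channel concatenation: 64 + 128 + 32 + 32 = 256 channels.
        return torch.cat([self.branch1x1(x), self.branch3x3(x),
                          self.branch5x5(x), self.branch_pool(x)], dim=1)

x = torch.randn(1, 192, 28, 28)      # activation from some previous layer
print(InceptionModule()(x).shape)    # torch.Size([1, 256, 28, 28])
```

Stacking several of these modules, with occasional max pooling layers in between to reduce the height and width, gives roughly the overall network shown in the picture above.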
117 00:08:17,510 --> 00:08:20,530 And with these videos you should be able to read and 118 00:08:20,530 --> 00:08:23,940 understand, I think, the Inception paper, 119 00:08:23,940 --> 00:08:28,790 as well as maybe some of the papers describing the later variations as well. 120 00:08:28,790 --> 00:08:30,020 So that's it, 121 00:08:30,020 --> 00:08:34,820 you've gone through quite a lot of specialized neural network architectures. 122 00:08:34,820 --> 00:08:39,650 In the next video, I want to start showing you some more practical advice on how you 123 00:08:39,650 --> 00:08:43,830 actually use these algorithms to build your own computer vision system. 124 00:08:43,830 --> 00:08:45,090 Let's go on to the next video.