In this video, you'll learn about some of the classic neural network architectures, starting with LeNet-5, and then AlexNet, and then VGGNet. Let's take a look.

Here is the LeNet-5 architecture. You start off with an image which is, say, 32 by 32 by 1. The goal of LeNet-5 was to recognize handwritten digits, so maybe an image of a digit like that. And LeNet-5 was trained on grayscale images, which is why it's 32 by 32 by 1. This neural network architecture is actually quite similar to the last example you saw last week. In the first step, you use a set of six 5 by 5 filters with a stride of one. Because you use six filters, you end up with 28 by 28 by 6 over there. And with a stride of one and no padding, the image dimensions reduce from 32 by 32 down to 28 by 28.

Then the LeNet neural network applies pooling. Back then, when this paper was written, people used average pooling much more. If you're building a modern variant, you'd probably use max pooling instead. But in this example, you average pool, and with a filter width of two and a stride of two, you wind up reducing the dimensions, the height and width, by a factor of two, so we now end up with a 14 by 14 by 6 volume. I guess the height and width of these volumes aren't entirely drawn to scale; technically, if I were drawing these volumes to scale, the height and width would shrink by a factor of two.

Next, you apply another convolutional layer. This time you use a set of 16 filters that are 5 by 5, so you end up with 16 channels in the next volume. And back when this paper was written in 1998, people didn't really use padding, or they always used valid convolutions, which is why every time you apply a convolutional layer, the height and width shrink. So that's why, here, you go from 14 by 14 down to 10 by 10. Then another pooling layer reduces the height and width by a factor of two, so you end up with 5 by 5 over here. And if you multiply all these numbers, 5 by 5 by 16, this multiplies up to 400; that's 25 times 16, which is 400.
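As a quick check on those dimensions, here is a small Python sketch (an illustration, not something from the original paper) of the output-size formula from the earlier videos, floor((n + 2p - f)/s) + 1:

def conv_output_size(n, f, stride=1, pad=0):
    # floor((n + 2*pad - f) / stride) + 1, the formula from the earlier videos
    return (n + 2 * pad - f) // stride + 1

print(conv_output_size(32, 5))            # 28: first conv layer, six 5x5 filters
print(conv_output_size(28, 2, stride=2))  # 14: 2x2 average pool with stride 2
print(conv_output_size(14, 5))            # 10: second conv layer, sixteen 5x5 filters
print(conv_output_size(10, 2, stride=2))  # 5:  second pool, so 5 * 5 * 16 = 400 values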
And the next layer is then a fully connected layer that fully connects each of these 400 nodes with every one of 120 neurons, so there's a fully connected layer. Sometimes, you would draw out explicitly a layer with 400 nodes; I'm skipping that here. There's a fully connected layer, and then another fully connected layer. And then the final step is that it takes these 84 features, essentially, and uses them for one final output. I guess you could draw one more node here to make a prediction for ŷ. And ŷ took on 10 possible values corresponding to recognizing each of the digits from 0 to 9. A modern version of this neural network would use a softmax layer with a 10-way classification output, although back then, LeNet-5 actually used a different classifier at the output layer, one that's not really used today.

So this neural network was small by modern standards; it had about 60,000 parameters. Today, you often see neural networks with anywhere from 10 million to 100 million parameters, and it's not unusual to see networks that are literally about a thousand times bigger than this network. But one thing you do see is that as you go deeper in the network, so as you go from left to right, the height and width tend to go down. So you went from 32 by 32, to 28, to 14, to 10, to 5, whereas the number of channels increases: it goes from 1 to 6 to 16 as you go deeper into the layers of the network.

One other pattern you see in this neural network that's still often repeated today is that you might have one or more conv layers followed by a pooling layer, then one or sometimes more than one conv layer followed by a pooling layer, then some fully connected layers, and then the output. So this type of arrangement of layers is quite common.
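Putting the pieces so far together, here is the whole LeNet-5 architecture as a minimal Keras sketch (assuming tf.keras; the ReLU activations and softmax output are modern substitutions, not what the original 1998 network used):

import tensorflow as tf
from tensorflow.keras import layers

# Modernized LeNet-5 sketch: ReLU and softmax stand in for the original
# sigmoid/tanh nonlinearities and the original output classifier.
model = tf.keras.Sequential([
    layers.Conv2D(6, 5, activation='relu', input_shape=(32, 32, 1)),  # -> 28x28x6
    layers.AveragePooling2D(2, strides=2),                            # -> 14x14x6
    layers.Conv2D(16, 5, activation='relu'),                          # -> 10x10x16
    layers.AveragePooling2D(2, strides=2),                            # -> 5x5x16
    layers.Flatten(),                                                 # -> 400
    layers.Dense(120, activation='relu'),
    layers.Dense(84, activation='relu'),
    layers.Dense(10, activation='softmax'),   # 10-way digit classification
])
print(model.count_params())  # about 61,700, i.e. roughly 60,000 parameters

A modern variant would also typically swap the AveragePooling2D layers for MaxPooling2D, as mentioned above.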
Now finally, this is maybe only for those of you that want to try reading the paper. There are a couple of other things that were different. For the rest of this slide, I'm going to make a few more advanced comments, only for those of you that want to try to read this classic paper. Everything I'm going to write in red you can safely skip on the slide; it's maybe an interesting historical footnote that is okay if you don't follow fully.

So it turns out that if you read the original paper, back then people used sigmoid and tanh nonlinearities; people weren't using ReLU nonlinearities back then. So if you look at the paper, you see sigmoid and tanh referred to. And there are also some ways this network was wired that are funny by modern standards. For example, you've seen how if you have an n_H by n_W by n_C volume with n_C channels, then you use an f by f by n_C dimensional filter, where every filter looks at every one of these channels. But back then, computers were much slower. And so, to save on computation as well as on parameters, the original LeNet-5 had some crazily complicated scheme where different filters would look at different channels of the input block. The paper talks about those details, but a more modern implementation wouldn't have that type of complexity these days.

And then one last thing that was done back then, but isn't really done now, is that the original LeNet-5 had a nonlinearity after pooling, and I think it actually used a sigmoid nonlinearity after the pooling layer. So if you do read this paper, and this is a harder one to read than the ones we'll go over in the next few videos, the next one might be an easier one to start with. Most of the ideas on this slide appear in sections two and three of the paper, and later sections of the paper talk about some other ideas, such as something called the graph transformer network, which isn't widely used today. So if you do try to read this paper, I recommend focusing really on section two, which talks about this architecture, and maybe take a quick look at section three, which has a bunch of experiments and results, which is pretty interesting.
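As a side note on that channel-splitting trick: a loose modern analogue is the grouped convolution, where each filter looks at only a subset of the input channels, which saves parameters and computation. Here is a hedged sketch, assuming a recent version of tf.keras where Conv2D supports the groups argument:

import tensorflow as tf

def conv_param_count(groups):
    # 16 filters of 5x5 on a 14x14x6 input; groups > 1 splits the channels
    # so each filter only sees 6 / groups of them.
    inputs = tf.keras.Input(shape=(14, 14, 6))
    conv = tf.keras.layers.Conv2D(16, 5, groups=groups)
    conv(inputs)  # build the layer so its weights exist
    return conv.count_params()

print(conv_param_count(1))  # 2416 = 16 * (5*5*6) + 16: every filter sees all channels
print(conv_param_count(2))  # 1216 = 16 * (5*5*3) + 16: each filter sees half the channels

This isn't exactly what LeNet-5 did, but it shows how restricting which channels each filter sees cuts the parameter count.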
The second example of a neural network I want to show you is AlexNet, named after Alex Krizhevsky, who was the first author of the paper describing this work. The other authors were Ilya Sutskever and Geoffrey Hinton. The AlexNet input starts with 227 by 227 by 3 images. And if you read the paper, the paper refers to 224 by 224 by 3 images, but if you look at the numbers, I think they make sense only if it's actually 227 by 227.

Then the first layer applies a set of 96 filters that are 11 by 11, with a stride of four. And because it uses a large stride of four, the dimensions shrink to 55 by 55, so roughly going down by a factor of four because of the large stride. Then it applies max pooling with a 3 by 3 filter, so f equals three, and a stride of two. This reduces the volume to 27 by 27 by 96, and then it performs a 5 by 5 same convolution, same padding, so you end up with 27 by 27 by 256. Max pooling again reduces the height and width to 13. Then another same convolution, so same padding; it's 13 by 13 by, now, 384 filters. Then a 3 by 3 same convolution again gives you that, then another 3 by 3 same convolution gives you that, and then max pooling brings it down to 6 by 6 by 256. If you multiply all these numbers, 6 times 6 times 256, that's 9,216. So we're going to unroll this into 9,216 nodes. And then finally, it has a few fully connected layers, and it uses a softmax to output which one of 1,000 classes the object could be.

So this neural network actually had a lot of similarities to LeNet, but it was much bigger. Whereas the LeNet-5 from the previous slide had about 60,000 parameters, this AlexNet had about 60 million parameters. And the fact that they could take pretty similar basic building blocks, have a lot more hidden units, and train on a lot more data (they trained on the ImageNet dataset) allowed it to have just remarkable performance. Another aspect of this architecture that made it much better than LeNet was using the ReLU activation function.
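For reference, here is a hedged, single-stream Keras sketch of this architecture (assuming tf.keras); it leaves out the original two-GPU split and the Local Response Normalization discussed next, so the exact parameter count differs slightly from the paper's roughly 60 million:

import tensorflow as tf
from tensorflow.keras import layers

# Single-stream AlexNet-style sketch (no two-GPU split, no LRN).
model = tf.keras.Sequential([
    layers.Conv2D(96, 11, strides=4, activation='relu',
                  input_shape=(227, 227, 3)),                  # -> 55x55x96
    layers.MaxPooling2D(3, strides=2),                         # -> 27x27x96
    layers.Conv2D(256, 5, padding='same', activation='relu'),  # -> 27x27x256
    layers.MaxPooling2D(3, strides=2),                         # -> 13x13x256
    layers.Conv2D(384, 3, padding='same', activation='relu'),  # -> 13x13x384
    layers.Conv2D(384, 3, padding='same', activation='relu'),  # -> 13x13x384
    layers.Conv2D(256, 3, padding='same', activation='relu'),  # -> 13x13x256
    layers.MaxPooling2D(3, strides=2),                         # -> 6x6x256
    layers.Flatten(),                                          # -> 9216
    layers.Dense(4096, activation='relu'),
    layers.Dense(4096, activation='relu'),
    layers.Dense(1000, activation='softmax'),  # one of 1,000 classes
])
print(model.count_params())  # roughly 62 million parameters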
And then again, if you read the paper, there are some more advanced details that you don't really need to worry about if you don't read the paper. One is that, when this paper was written, GPUs were still a little bit slower, so it had a complicated way of training on two GPUs. The basic idea was that a lot of these layers were actually split across two different GPUs, and there was a thoughtful way for when the two GPUs would communicate with each other.

The original AlexNet architecture also had another type of layer called Local Response Normalization. This type of layer isn't really used much, which is why I didn't talk about it. But the basic idea of Local Response Normalization is, if you look at one of these blocks, one of these volumes that we have on top, let's say for the sake of argument this one, 13 by 13 by 256, what Local Response Normalization (LRN) does is look at one position, so one position in the height and width, and look down across all the channels: look at all 256 numbers and normalize them. And the motivation for this Local Response Normalization was that for each position in this 13 by 13 image, maybe you don't want too many neurons with a very high activation. But subsequently, many researchers have found that this doesn't help that much, so this is one of those ideas I'm drawing in red because it's less important for you to understand. And in practice, local response normalization isn't really used in the networks we train today.

So if you are interested in the history of deep learning, I think even before AlexNet, deep learning was starting to gain traction in speech recognition and a few other areas, but it was really this paper that convinced a lot of the computer vision community to take a serious look at deep learning, and convinced them that deep learning really works in computer vision. It then grew to have a huge impact not just in computer vision but beyond computer vision as well. And if you want to try reading some of these papers yourself, and you really don't have to for this course, this one is one of the easier ones to read, so it might be a good one to take a look at.
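As a brief aside on that Local Response Normalization idea, here is a rough sketch of channel-wise normalization at each position; this is just one simple interpretation, not the exact formula or constants from the AlexNet paper, which normalizes over a window of neighboring channels (TensorFlow's tf.nn.local_response_normalization implements that version):

import tensorflow as tf

# At each (height, width) position, rescale the vector of 256 channel
# activations, here by its root-mean-square, as one simple interpretation.
x = tf.random.normal([1, 13, 13, 256])      # e.g. the 13x13x256 volume
rms = tf.sqrt(tf.reduce_mean(tf.square(x), axis=-1, keepdims=True) + 1e-5)
normalized = x / rms
print(normalized.shape)  # (1, 13, 13, 256): same shape, rescaled along channels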
So whereas AlexNet had a relatively complicated architecture, there are just a lot of hyperparameters, right? You have all these numbers that Alex Krizhevsky and his co-authors had to come up with. Let me show you a third and final example in this video, called the VGG or VGG-16 network. A remarkable thing about the VGG-16 net is that they said, instead of having so many hyperparameters, let's use a much simpler network where you focus on just having conv layers that are just 3 by 3 filters with a stride of one, always using same padding, and make all your max pooling layers 2 by 2 with a stride of two. And so, one very nice thing about the VGG network was that it really simplified these neural network architectures.

So, let's go through the architecture. You start off with an image, and then the first two layers are convolutions, which are therefore these 3 by 3 filters. The first two layers use 64 filters. You end up with 224 by 224, because you're using same convolutions, and with 64 channels. Because VGG-16 is a relatively deep network, I'm not going to draw all the volumes here. What this little picture denotes is what we would previously have drawn as this 224 by 224 by 3 volume, then a convolution that results in a 224 by 224 by 64 volume drawn as a deeper volume, and then another layer that results in 224 by 224 by 64. So this CONV 64 times two represents that you're doing two conv layers with 64 filters. And as I mentioned earlier, the filters are always 3 by 3 with a stride of one, and they are always same convolutions. So rather than drawing all these volumes, I'm just going to use text to represent this network. Next, it uses a pooling layer, and the pooling layer will reduce... I think it goes from 224 by 224 down to what? Right, it goes to 112 by 112 by 64.
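That repeated pattern, a stack of 3 by 3, stride-one, same-padded conv layers followed by a 2 by 2 max pool with a stride of two, looks like this as a minimal Keras sketch (assuming tf.keras):

import tensorflow as tf
from tensorflow.keras import layers

# One VGG-style block: two same-padded 3x3 convs keep 224x224,
# then a 2x2 max pool with stride 2 halves the height and width.
block = tf.keras.Sequential([
    layers.Conv2D(64, 3, padding='same', activation='relu',
                  input_shape=(224, 224, 3)),                 # -> 224x224x64
    layers.Conv2D(64, 3, padding='same', activation='relu'),  # -> 224x224x64
    layers.MaxPooling2D(2, strides=2),                        # -> 112x112x64
])
print(block.output_shape)  # (None, 112, 112, 64)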
And then it has a couple more conv layers. This means it has 128 filters, and because these are same convolutions, let's see what the new dimension is. Right, it will be 112 by 112 by 128, and then a pooling layer, so you can figure out the new dimension after that. And now, three conv layers with 256 filters, then a pooling layer, and then a few more conv layers, a pooling layer, more conv layers, a pooling layer. And then it takes this final 7 by 7 by 512 volume and feeds it into a fully connected layer, fully connected with 4,096 units, and then a softmax output over one of a thousand classes.

By the way, the 16 in VGG-16 refers to the fact that this has 16 layers that have weights. And this is a pretty large network; it has a total of about 138 million parameters. That's pretty large even by modern standards. But the simplicity of the VGG-16 architecture made it quite appealing. You can tell this architecture is really quite uniform: there are a few conv layers followed by a pooling layer, which reduces the height and width. So the pooling layers reduce the height and width; you have a few of them here. But then also, if you look at the number of filters in the conv layers, here you have 64 filters, and then you double to 128, double to 256, double to 512. And then I guess the authors thought 512 was big enough and didn't double again after that. But this roughly doubling through every stack of conv layers was another simple principle used to design the architecture of this network. And so I think the relative uniformity of this architecture made it quite attractive to researchers. The main downside was that it was a pretty large network in terms of the number of parameters you had to train.

And if you read the literature, you sometimes see people talk about VGG-19, which is an even bigger version of this network; you can see the details in the paper cited at the bottom, by Karen Simonyan and Andrew Zisserman. But because VGG-16 does almost as well as VGG-19, a lot of people will use VGG-16. The thing I liked most about this was that it made this pattern very clear: as you go deeper, the height and width go down, by a factor of two each time through the pooling layers, whereas the number of channels increases, roughly by a factor of two every time you have a new set of conv layers. So by making the rate at which the height and width go down and the number of channels goes up very systematic, I thought this paper was very attractive from that perspective.
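As a hedged end-to-end sketch of how those pieces fit together (assuming tf.keras, and including the two 4,096-unit fully connected layers of the published VGG-16 configuration):

import tensorflow as tf
from tensorflow.keras import layers

def vgg_block(model, filters, n_convs):
    # n_convs same-padded 3x3, stride-1 conv layers, then a 2x2 max pool, stride 2
    for _ in range(n_convs):
        model.add(layers.Conv2D(filters, 3, padding='same', activation='relu'))
    model.add(layers.MaxPooling2D(2, strides=2))

model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(224, 224, 3)))
for filters, n_convs in [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]:
    vgg_block(model, filters, n_convs)   # 224 -> 112 -> 56 -> 28 -> 14 -> 7
model.add(layers.Flatten())              # 7 * 7 * 512 = 25088
model.add(layers.Dense(4096, activation='relu'))
model.add(layers.Dense(4096, activation='relu'))
model.add(layers.Dense(1000, activation='softmax'))
print(model.count_params())              # about 138 million parameters

The channel doubling from block to block (64, 128, 256, 512) and the halving of height and width by each pooling layer are exactly the systematic pattern described above.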
So that's it for the three classic architectures. If you want, you should really now read some of these papers. I recommend starting with the AlexNet paper, followed by the VGG net paper; the LeNet paper is a bit harder to read, but it is a good classic once you've gone over the others. But next, let's go beyond these classic networks and look at some even more advanced, even more powerful neural network architectures. Let's go on to the next video.