So, why do ResNets work so well? Let's go through one example that illustrates why ResNets work so well, at least in the sense that you can make them deeper and deeper without really hurting your ability to at least get them to do well on the training set. And hopefully, as you've understood from the third course in this sequence, doing well on the training set is usually a prerequisite to doing well on your hold-out or your dev or your test sets. So, being able to at least train a ResNet to do well on the training set is a good first step toward that.

Let's look at an example. What we saw in the last video was that if you make a network deeper, it can hurt your ability to train the network to do well on the training set. And that's why sometimes you don't want a network that is too deep. But this is not true, or at least is much less true, when you train a ResNet.

So let's go through an example. Let's say you have X feeding into some big neural network that outputs some activation a[l]. Let's say for this example that you are going to modify the neural network to make it a little bit deeper. So, use the same big NN, which outputs a[l], and we're going to add a couple of extra layers to this network: let's add one layer there and another layer there, and have them output a[l+2]. Only let's make this a ResNet block, a residual block, with that extra shortcut. And for the sake of our argument, let's say throughout this network we're using the ReLU activation function, so all the activations are going to be greater than or equal to zero, with the possible exception of the input X, because the ReLU activation outputs numbers that are either zero or positive.

Now, let's look at what a[l+2] will be. To copy the expression from the previous video, a[l+2] will be ReLU applied to z[l+2] plus a[l], where this addition of a[l] comes from the shortcut, from the skip connection that we just added. And if we expand this out, z[l+2] is equal to W[l+2] times a[l+1] plus b[l+2], so a[l+2] = g(W[l+2] a[l+1] + b[l+2] + a[l]).

Now notice something: if you are using L2 regularization, or weight decay, that will tend to shrink the value of W[l+2].
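To make that forward computation concrete, here is a minimal NumPy sketch of a residual block; the function name, variable names, and layer sizes are illustrative assumptions, not something from the lecture:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block_forward(a_l, W1, b1, W2, b2):
    """Two stacked layers with a skip connection:
    a[l+1] = g(W[l+1] a[l] + b[l+1])
    a[l+2] = g(W[l+2] a[l+1] + b[l+2] + a[l])   <- shortcut adds a[l]
    """
    a_l1 = relu(W1 @ a_l + b1)
    z_l2 = W2 @ a_l1 + b2
    return relu(z_l2 + a_l)  # add a[l] before the final nonlinearity

# Tiny usage example with made-up sizes
n = 4
a_l = np.random.rand(n)                        # pretend this came out of the big NN
W1, b1 = 0.01 * np.random.randn(n, n), np.zeros(n)
W2, b2 = 0.01 * np.random.randn(n, n), np.zeros(n)
print(residual_block_forward(a_l, W1, b1, W2, b2))
```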
If you are applying weight decay to b, that will also shrink it, although I guess in practice sometimes you do and sometimes you don't apply weight decay to b; but W is really the key term to pay attention to here. And if W[l+2] is equal to zero, and let's say for the sake of argument that b[l+2] is also equal to zero, then those terms go away because they're equal to zero, and you're left with g(a[l]), which is just equal to a[l], because we assumed we're using the ReLU activation function. All of the activations are non-negative, so g(a[l]) is ReLU applied to a non-negative quantity, and you just get back a[l].

So, what this shows is that the identity function is easy for a residual block to learn. It's easy to get a[l+2] equal to a[l] because of this skip connection. And what that means is that adding these two layers to your neural network doesn't really hurt your network's ability to do as well as the simpler network without the two extra layers, because it's quite easy for it to learn the identity function and just copy a[l] to a[l+2], despite the addition of these two layers. And this is why adding this residual block somewhere in the middle or at the end of this big neural network doesn't hurt performance.

But of course our goal is not just to not hurt performance, it's to help performance, and so you can imagine that if all of these hidden units actually learn something useful, then maybe you can do even better than learning the identity function. What goes wrong in very deep plain networks, very deep networks without these residual or skip connections, is that when you make the network deeper and deeper, it's actually very difficult for it to choose parameters that learn even the identity function, which is why a lot of layers end up making your result worse rather than better. And I think the main reason the residual network works is that it's so easy for these extra layers to learn the identity function that you're kind of guaranteed they don't hurt performance, and then a lot of the time you maybe get lucky and they even help performance. At least it's easier to go from a decent baseline of not hurting performance, and gradient descent can only improve the solution from there.
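Here is a quick NumPy check of that argument: if weight decay drove W[l+2] and b[l+2] all the way to zero, the block's output would be exactly a[l]. The sizes and names below are made up for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# a[l] is non-negative because it came out of a ReLU layer
a_l = relu(np.random.randn(5))

# Suppose weight decay has shrunk the second layer's parameters to zero
W2 = np.zeros((5, 5))
b2 = np.zeros(5)

z_l2 = W2 @ a_l + b2            # z[l+2] = 0
a_l2 = relu(z_l2 + a_l)         # g(0 + a[l]) = a[l], since a[l] >= 0

assert np.allclose(a_l2, a_l)   # the block computes the identity function
```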
So, one more detail in the residual network that's worth discussing: in this addition, we're assuming that z[l+2] and a[l] have the same dimension. And so what you see in ResNets is a lot of use of "same" convolutions, so that the dimension of the block's output is equal to the dimension of its input. That's what lets you actually do this shortcut connection: because the same convolution preserves dimensions, it's easy to carry out the shortcut and then this addition of two equal-dimension vectors.

In case the input and output have different dimensions, for example if a[l] is 128-dimensional and z[l+2], and therefore a[l+2], is 256-dimensional, what you would do is add an extra matrix, call it Ws; Ws in this example would be a 256 by 128 dimensional matrix. So then Ws times a[l] becomes 256-dimensional, and this addition is now between two 256-dimensional vectors. There are a few things you could do with Ws: it could be a matrix of parameters you learn, or it could be a fixed matrix that just implements zero padding, taking a[l] and zero-padding it to be 256-dimensional. Either of those versions, I guess, could work.

So finally, let's take a look at ResNets on images. These are images I got from the paper by He et al. This is an example of a plain network, in which you input an image and then have a number of conv layers until eventually you have a softmax output at the end. To turn this into a ResNet, you add those extra skip connections. And I'll just mention a few details: there are a lot of three by three convolutions here, and most of these are three by three "same" convolutions, and that's why you're adding equal-dimension feature vectors. So rather than fully connected layers, these are actually convolutional layers, but because they're same convolutions, the dimensions are preserved, and so the z[l+2] plus a[l] addition makes sense.

And similar to what you've seen in a lot of ConvNets before, you have a bunch of convolutional layers, and then occasionally there are pooling layers, or pooling-like layers, as well.
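To make the dimension adjustment concrete, here is a short NumPy sketch of the two Ws options; the shapes match the 128-to-256 example above, and the variable names are just illustrative:

```python
import numpy as np

a_l = np.random.rand(128)            # a[l]: 128-dimensional
z_l2 = np.random.randn(256)          # z[l+2]: 256-dimensional

# Option 1: Ws is a 256 x 128 matrix of learned parameters
Ws_learned = 0.01 * np.random.randn(256, 128)
shortcut = Ws_learned @ a_l          # now 256-dimensional

# Option 2: Ws is a fixed matrix that just zero-pads a[l] to 256 dimensions
Ws_pad = np.vstack([np.eye(128), np.zeros((128, 128))])
shortcut = Ws_pad @ a_l              # a[l] followed by 128 zeros

# Either way, the addition is now between two 256-dimensional vectors
a_l2 = np.maximum(0, z_l2 + shortcut)
```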
And whenever one of those pooling layers occurs, you need to make an adjustment to the dimension, which we saw on the previous slide: you can do that with the matrix Ws. And then, as is common in these networks, you have a pooling layer, and at the end you now have a fully connected layer that makes a prediction using a softmax.

So, that's it for ResNets. Next, there's a very interesting idea behind using neural networks with one by one filters, one by one convolutions. So, why would you use a one by one convolution? Let's take a look at the next video.
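Before moving on, here is how the image-ResNet details above might fit together in code: a minimal sketch of a residual conv block with three by three "same" convolutions, and a one by one convolution playing the role of Ws on the shortcut when the dimensions change. The use of PyTorch and batch normalization here is my assumption (they follow the He et al. paper's design, not anything in this video), and the class name and sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two 3x3 'same' convolutions plus a shortcut connection.
    When the stride or channel count changes, a 1x1 convolution
    plays the role of Ws and adjusts the shortcut's dimensions."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Ws is only needed when z[l+2] and a[l] would differ in shape
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))  # add the shortcut, then ReLU

# Usage: halve the spatial size while doubling the channels
block = ResidualBlock(64, 128, stride=2)
x = torch.randn(1, 64, 56, 56)
print(block(x).shape)  # torch.Size([1, 128, 28, 28])
```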