Very, very deep neural networks are difficult to train because of vanishing and exploding gradient types of problems. In this video, you'll learn about skip connections, which allow you to take the activation from one layer and suddenly feed it to another layer even much deeper in the neural network. And using that, you'll build ResNets, which enable you to train very, very deep networks, sometimes even networks of over 100 layers. Let's take a look.

ResNets are built out of something called a residual block; let's first describe what that is. Here are two layers of a neural network where you start off with some activations in layer a[l], then go to a[l+1], and then the activation two layers later is a[l+2]. So let's go through the steps in this computation. You have a[l], and the first thing you do is apply this linear operator to it, which is governed by this equation: you go from a[l] to compute z[l+1] by multiplying by the weight matrix and adding the bias vector. After that, you apply the ReLU non-linearity to get a[l+1], and that's governed by this equation, where a[l+1] is g(z[l+1]). Then in the next layer, you apply the linear step again, governed by that equation, which is quite similar to the equation we saw on the left. And then finally, you apply another ReLU operation, which is governed by that equation, where g here is the ReLU non-linearity. And this gives you a[l+2].

So in other words, for information from a[l] to flow to a[l+2], it needs to go through all of these steps, which I'm going to call the main path of this set of layers. In a residual network, we're going to make a change to this. We're going to take a[l] and fast forward it, copy it, much further into the neural network, and then add a[l] there, before applying the non-linearity, the ReLU non-linearity. I'm going to call this the shortcut. So rather than needing to follow the main path, the information from a[l] can now follow a shortcut to go much deeper into the neural network.
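For reference, written out in the notation used above, the main-path computation just described is:

z[l+1] = W[l+1] a[l] + b[l+1],   a[l+1] = g(z[l+1])
z[l+2] = W[l+2] a[l+1] + b[l+2],  a[l+2] = g(z[l+2])

where g is the ReLU non-linearity. It is the last of these equations that the shortcut is about to change.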
And what that means is that this last equation goes away, and we instead have that the output a[l+2] is the ReLU non-linearity g applied to z[l+2] as before, but now plus a[l]. So the addition of this a[l] here is what makes this a residual block. And in pictures, you can also modify the picture on top by drawing this shortcut going to here. We're going to draw it as going into this second layer, because the shortcut is actually added before the ReLU non-linearity. Each of these nodes here applies a linear function and a ReLU, so a[l] is being injected after the linear part but before the ReLU part. And sometimes, instead of the term shortcut, you'll also hear the term skip connection, and that refers to a[l] just skipping over a layer, or kind of skipping over almost two layers, in order to pass information deeper into the neural network.

So what the inventors of ResNet, that would be Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, found was that using residual blocks allows you to train much deeper neural networks. And the way you build a ResNet is by taking many of these residual blocks, blocks like these, and stacking them together to form a deep network.

So let's look at this network. This is not a residual network; this is called a plain network, which is the terminology of the ResNet paper. To turn this into a ResNet, what you do is add all those skip connections, all those shortcut connections, like so. So every two layers ends up with the additional change we saw on the previous slide, turning each pair of layers into a residual block. So this picture shows five residual blocks stacked together, and this is a residual network.
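As a concrete illustration (not from the lecture, and with function and variable names of my own choosing), here is a minimal NumPy sketch of the forward pass through a fully connected residual block and a small stack of such blocks. It assumes z[l+2] and a[l] have matching dimensions so they can be added directly:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """Forward pass of one residual block: a[l] -> a[l+2]."""
    z1 = W1 @ a_l + b1     # linear step, giving z[l+1]
    a1 = relu(z1)          # ReLU, giving a[l+1]
    z2 = W2 @ a1 + b2      # linear step, giving z[l+2]
    # Skip connection: add a[l] before the second ReLU,
    # so a[l+2] = g(z[l+2] + a[l]) instead of g(z[l+2]).
    return relu(z2 + a_l)

def resnet_forward(a0, blocks):
    """Stack residual blocks; `blocks` is a list of (W1, b1, W2, b2) tuples."""
    a = a0
    for W1, b1, W2, b2 in blocks:
        a = residual_block(a, W1, b1, W2, b2)
    return a
```

With five parameter tuples in `blocks`, this corresponds to the five stacked residual blocks in the picture just described.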
And it turns out that if you use your standard optimization algorithm, such as gradient descent or one of the fancier optimization algorithms, to train a plain network, so without all the extra residual connections, without all the extra shortcuts or skip connections I just drew in, then empirically you find that as you increase the number of layers, the training error will tend to decrease for a while, but then it will tend to go back up. In theory, as you make a neural network deeper, it should only do better and better on the training set; so in theory, having a deeper network should only help. But in practice, or in reality, having a plain network, so no ResNet, having a plain network that is very deep means that your optimization algorithm just has a much harder time training, and so, in reality, your training error gets worse if you pick a network that's too deep. But what happens with ResNets is that even as the number of layers gets deeper, you can have the performance of the training error keep on going down, even if we train a network with over a hundred layers. And now some people are experimenting with networks of over a thousand layers, although I don't see that used much in practice yet. But by taking these activations, be it x or these intermediate activations, and allowing them to go much deeper in the neural network, this really helps with the vanishing and exploding gradient problems and allows you to train much deeper neural networks without really an appreciable loss in performance. Maybe at some point this will plateau, this will flatten out, and it doesn't help that much to go deeper and deeper, but ResNets really are effective at helping to train very deep networks.

So you've now gotten an overview of how ResNets work. And in fact, in this week's programming exercise, you get to implement these ideas and see them work for yourself. But next, I want to share with you better intuition, or even more intuition, about why ResNets work so well. Let's go on to the next video.