One of the problems of training neural networks, especially very deep neural networks, is vanishing and exploding gradients. What that means is that when you're training a very deep network, your derivatives or your slopes can sometimes get either very, very big or very, very small, maybe even exponentially small, and this makes training difficult. In this video you'll see what this problem of exploding and vanishing gradients really means, as well as how careful choices of the random weight initialization can significantly reduce this problem.

Suppose you're training a very deep neural network like this one. To save space on the slide, I've drawn it as if you have only two hidden units per layer, but it could be more as well. This neural network will have parameters W[1], W[2], W[3], and so on up to W[L]. For the sake of simplicity, let's say we're using a linear activation function, g(z) = z, and let's ignore b, so b[l] = 0. In that case you can show that the output ŷ will be W[L] times W[L-1] times W[L-2], and so on down to W[3] W[2] W[1] times x. If you want to check my math: W[1] times x is going to be z[1], because b is equal to zero, so z[1] = W[1] x + b[1] = W[1] x. Then a[1] = g(z[1]), but because we use a linear activation function, this is just equal to z[1]. So this first term, W[1] x, is equal to a[1]. By the same reasoning, W[2] W[1] x is equal to a[2], because that's g(z[2]) = g(W[2] a[1]), and you can plug a[1] = W[1] x in there. So that term is equal to a[2], the next one is equal to a[3], and so on, until the product of all these matrices gives you ŷ, not y.
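To make that derivation concrete, here is a minimal NumPy sketch (my own illustration, not part of the lecture; the layer count, layer width, and random seed are arbitrary assumptions) that runs the layer-by-layer forward pass of a deep linear network with b = 0 and checks that it matches the single collapsed product W[L]···W[1] x.

```python
import numpy as np

np.random.seed(0)
L = 5                      # number of layers (arbitrary for this illustration)
n = 2                      # two units per layer, as drawn on the slide
x = np.random.randn(n, 1)  # a single input example

# Random weight matrices W[1], ..., W[L]; all biases are zero.
Ws = [np.random.randn(n, n) for _ in range(L)]

# Forward pass with the linear activation g(z) = z.
a = x
for W in Ws:
    a = W @ a              # z[l] = W[l] a[l-1];  a[l] = g(z[l]) = z[l]

# Collapse the layers into one matrix product W[L] ... W[2] W[1].
P = np.eye(n)
for W in Ws:
    P = W @ P

print(np.allclose(a, P @ x))   # True: y_hat = W[L]...W[1] x
```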
Now, let's say that each of your weight matrices W[l] is just a little bit larger than the identity, say 1.5 times the identity, so the matrix [[1.5, 0], [0, 1.5]]. Technically, the last one has different dimensions, so let's say this holds just for the rest of the weight matrices. Then ŷ will be, ignoring that last matrix with the different dimensions, W[L] times this [[1.5, 0], [0, 1.5]] matrix to the power of L minus 1, times x, because we assume each of these matrices is equal to 1.5 times the identity. So ŷ will be essentially 1.5 to the power of L minus 1 times x, and if L is large, for a very deep neural network, ŷ will be very large. In fact, it grows exponentially, like 1.5 to the power of the number of layers. So if you have a very deep neural network, the value of ŷ will explode.

Conversely, if we replace the 1.5 with 0.5, so something less than 1, then this becomes 0.5 to the power of L minus 1 times x, again ignoring W[L]. If each of your matrices is a bit less than the identity, then, say x1 and x2 were both 1, the activations will be one half, one half, then one fourth, one fourth, then one eighth, one eighth, and so on, until they become 1 over 2 to the L. So the activation values will decrease exponentially as a function of the depth, the number of layers L of the network. In a very deep network, the activations end up decreasing exponentially.

So the intuition I hope you take away from this is that if the weights W are all just a little bit bigger than one, or just a little bit bigger than the identity matrix, then with a very deep network the activations can explode. And if W is just a little bit less than the identity, say with 0.9 on the diagonal instead, then with a very deep network the activations will decrease exponentially. And even though I went through this argument in terms of activations increasing or decreasing exponentially as a function of L, a similar argument can be used to show that the derivatives, or the gradients you compute, will also increase or decrease exponentially as a function of the number of layers. With some of the modern neural networks, L is around 150.
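Here is a small NumPy sketch of that scaling argument (again my own illustration, not from the lecture; the depths tried are arbitrary): with every weight matrix set to 1.5 times the identity the activations grow like 1.5 to the L, and with 0.5 times the identity they shrink like 0.5 to the L.

```python
import numpy as np

def deep_linear_forward(scale, L, x):
    """Forward pass through L layers with W[l] = scale * I and b[l] = 0."""
    W = scale * np.eye(len(x))
    a = x
    for _ in range(L):
        a = W @ a          # linear activation: a[l] = W[l] a[l-1]
    return a

x = np.array([1.0, 1.0])   # x1 = x2 = 1, as in the lecture's example
for L in (5, 20, 50):
    big = deep_linear_forward(1.5, L, x)    # weights a bit above the identity
    small = deep_linear_forward(0.5, L, x)  # weights a bit below the identity
    print(L, big[0], small[0])
# As L grows, the first value explodes like 1.5**L
# while the second vanishes like 0.5**L.
```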
Microsoft recently got great results with a 152-layer neural network. But with such a deep neural network, if your activations or gradients increase or decrease exponentially as a function of L, then these values can get really big or really small, and this makes training difficult. Especially if your gradients are exponentially small as a function of L, gradient descent will take tiny little steps, and it will take a long time for gradient descent to learn anything; the short numerical sketch after this summary illustrates that effect on the gradients. To summarize, you've seen how deep networks suffer from the problems of vanishing or exploding gradients. In fact, for a long time this problem was a huge barrier to training deep neural networks. It turns out there's a partial solution that doesn't completely solve the problem but helps a lot, which is a careful choice of how you initialize the weights. To see that, let's go to the next video.
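As mentioned above, here is a minimal hand-rolled backprop sketch (my own illustration; the loss J = sum(ŷ) and the depths tried are arbitrary assumptions) showing that the gradient with respect to the first layer's weights also scales like the weight scale raised to roughly the number of layers, which is why gradient descent ends up taking either huge or tiny steps.

```python
import numpy as np

def grad_wrt_first_layer(scale, L, x):
    """Gradient of J = sum(y_hat) w.r.t. W[1] in a deep linear network
    with W[l] = scale * I and b[l] = 0 (hand-rolled backprop)."""
    n = len(x)
    W = scale * np.eye(n)
    a = [x]
    for _ in range(L):
        a.append(W @ a[-1])            # forward pass, linear activations
    dJ_da = np.ones(n)                 # dJ/da[L] for J = sum(a[L])
    for _ in range(L - 1):             # backprop through layers L down to 2
        dJ_da = W.T @ dJ_da
    return np.outer(dJ_da, a[0])       # dJ/dW[1] = dJ/da[1] . a[0]^T

x = np.array([1.0, 1.0])
for L in (5, 20, 50):
    print(L,
          np.abs(grad_wrt_first_layer(1.5, L, x)).max(),   # explodes ~1.5**(L-1)
          np.abs(grad_wrt_first_layer(0.5, L, x)).max())   # vanishes ~0.5**(L-1)
```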