[MUSIC]

In this video, we'll discuss how to combat exploding and vanishing gradients in practice. Let's start with the exploding gradient problem. In the previous video, we learned that this problem occurs when the gradient norms become very large, possibly even NaN, which makes training of the recurrent neural network unstable.

The resulting instability is actually very easy to detect. On the slide, you can see the learning curve of some neural network. The training iterations are on the x axis, and the loss on the training data is on the y axis. This particular neural network doesn't suffer from the exploding gradient problem, so the loss decreases with the number of iterations and the training is stable. But sometimes you can see spikes in the learning curve, like this one; this is a result of the exploding gradient problem. The gradient explodes, and you make a long step in the parameter space. As a result, you may end up with a model with quite random weights and a high training loss. In the worst case, the gradient may even become NaN (not a number), and you may end up with NaN values in the weights of the neural network.

The most common way to combat this problem is gradient clipping. This technique is very simple, but still very effective. If the network suffers from the exploding gradient problem, we look at the norm of the gradient of the loss with respect to all the parameters of the network, and whenever this norm exceeds a chosen threshold, we simply rescale the gradient so that its norm equals the threshold. By doing this, we don't change the direction of the gradient, we only change its length.

Actually, we can clip not the norm of the whole gradient vector, but just the norm of the part which causes the problem. Do you remember which part that is? Yes, it is the Jacobian matrix of the hidden units at one timestep with respect to the hidden units at the previous timestep. So it is enough to clip just the norm of this Jacobian at each timestep.

Okay, and how do we choose the threshold for gradient clipping? We can choose it manually: start with a large threshold and decrease it until the network no longer suffers from the exploding gradient problem. Alternatively, we can look at the gradient norms over a sufficient number of training iterations and choose the threshold so that we clip only unusually large values.
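To make the clipping step concrete, here is a minimal sketch of norm-based gradient clipping, assuming PyTorch. The layer sizes, the dummy data, and the threshold of 5.0 are illustrative assumptions, not values from the lecture.

```python
import torch
import torch.nn as nn

# Illustrative model and data, not from the lecture.
model = nn.RNN(input_size=32, hidden_size=64, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x = torch.randn(8, 20, 32)       # (batch, time, features) -- dummy input
target = torch.randn(8, 20, 64)  # dummy target with the hidden-state size

optimizer.zero_grad()
output, _ = model(x)
loss = loss_fn(output, target)
loss.backward()

# If the norm of the full gradient vector exceeds the threshold, rescale it;
# this changes only the length of the gradient, not its direction.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```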
There is another interesting technique which can help us to overcome the exploding gradient problem. It is not designed specifically for this purpose, but it can still be helpful. As you remember, when we do backpropagation through time, we need to make a forward pass through the entire sequence to compute the loss, and then a backward pass through the entire sequence to compute the gradient. If our training sequences are quite long, this is a very computationally expensive procedure, and additionally, we may run into the exploding gradient problem.

So let's run forward and backward passes through chunks of the sequence instead of the whole sequence. We first make forward and backward passes through the first chunk and store the last hidden state. Then we go to the next chunk and start the forward pass from the hidden state we stored. We make forward and backward passes through the second chunk, store the last hidden state, go to the next chunk, and so on, until we reach the end of the sequence. So we carry hidden states forward in time indefinitely, but only backpropagate through some smaller number of steps.

This algorithm is called truncated backpropagation through time, and it is much faster than the usual backpropagation through time. It also doesn't suffer from the exploding gradient problem as much, since we don't take into account the contributions to the gradient from faraway steps. But of course, these advantages do not come without a price. Dependencies that are longer than the chunk size don't affect the training, so it's much more difficult to learn long-range dependencies with this algorithm. A short code sketch of this chunked training loop is shown below.

Now let's speak about the vanishing gradient problem. From the previous video, we know that this problem occurs when the contributions to the gradient from faraway steps become small, which makes learning long-range dependencies very difficult. This problem is more complicated than the exploding gradient problem: it is difficult to detect, and there is no single simple way to overcome it. Let's start with detection. In the case of the exploding gradient problem, we had a clear indication that it occurred: we saw spikes in the learning curve and in the curve of the gradient norm.
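Here is the promised sketch of truncated backpropagation through time, again assuming PyTorch; the chunk length of 20, the layer sizes, and the dummy data are illustrative assumptions. The key detail is detaching the carried hidden state, so gradients do not flow past the chunk boundary.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not from the lecture.
model = nn.RNN(input_size=32, hidden_size=64, batch_first=True)
head = nn.Linear(64, 1)
optimizer = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-3)

x = torch.randn(4, 200, 32)   # one long sequence batch: (batch, time, features)
y = torch.randn(4, 200, 1)    # dummy per-step targets
chunk_len = 20

hidden = None
for start in range(0, x.size(1), chunk_len):
    x_chunk = x[:, start:start + chunk_len]
    y_chunk = y[:, start:start + chunk_len]

    optimizer.zero_grad()
    output, hidden = model(x_chunk, hidden)      # forward pass through one chunk
    loss = ((head(output) - y_chunk) ** 2).mean()
    loss.backward()                              # backpropagate only within this chunk
    optimizer.step()

    # Carry the hidden state forward in time, but cut the computation graph
    # so the next backward pass stops at the chunk boundary.
    hidden = hidden.detach()
```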
But in the case of the vanishing gradient problem, the learning curve and the norm of the gradient look quite okay. From the learning curve, you may only see that the loss of the network is not that good, and it's not clear whether this is due to the vanishing gradient problem or because the task itself is difficult. And if you look, for example, at the gradient of the loss with respect to faraway hidden units, you may see that the norm of this gradient is small, but that could simply be because there are no long-range dependencies in the data. So we can only be sure that there is a vanishing gradient problem in the network when we overcome it and see that the network works better.

There are a lot of different techniques to deal with vanishing gradients. The most common approach is to use specially designed recurrent architectures, such as the long short-term memory, or LSTM, and the gated recurrent unit, or GRU. These architectures are very important, so we'll speak about them in a separate video later this week; for now, we will briefly discuss some other ideas.

As you already know, the Jacobian matrix, which may cause the problem of vanishing and exploding gradients, depends on the choice of the activation function and on the values of the recurrent weights W. So if we want to overcome the vanishing gradient problem, we should try the rectified linear unit activation function, which is much more resistant to this problem. And what about the recurrent weight matrix W? From linear algebra, you may remember that an orthogonal matrix is a square matrix such that its transpose is equal to its inverse. Orthogonal matrices have many interesting properties, but the most important one for us is that all the eigenvalues of an orthogonal matrix have absolute value 1. This means that no matter how many times we perform repeated matrix multiplication, the resulting matrix doesn't explode or vanish. Therefore, if we initialize the recurrent weight matrix W with an orthogonal matrix, the recurrent-weight part of the Jacobian doesn't cause the vanishing gradient problem, at least in the first iterations of training. In this case, the network has a chance to find long-range dependencies in the data, at least in those first iterations.
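As an illustration, here is a minimal sketch of these two ideas in PyTorch: a ReLU activation for the recurrent cell and an orthogonal initialization of the recurrent weight matrix. The layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# ReLU nonlinearity instead of the default tanh.
rnn = nn.RNN(input_size=32, hidden_size=64, nonlinearity='relu', batch_first=True)

# Initialize the recurrent (hidden-to-hidden) weight matrix with an orthogonal
# matrix, so its eigenvalues have absolute value 1 at the start of training.
for name, param in rnn.named_parameters():
    if name.startswith('weight_hh'):
        nn.init.orthogonal_(param)

# Quick check: W^T W should be numerically close to the identity matrix.
W = rnn.weight_hh_l0.detach()
print(torch.allclose(W.T @ W, torch.eye(64), atol=1e-5))
```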
There are some approaches that utilize the properties of orthogonal matrices not just for a proper initialization, but also for the parameterization of the weights throughout the whole training process, but these methods are out of the scope of this course.

The last idea that we will discuss in this video is the idea of using skip connections. In a recurrent neural network, we can't carry the contributions to the gradient through a lot of timesteps, because at each step we need to multiply them by the Jacobian matrix, and as a result, they vanish. So let's add shortcuts between hidden states that are separated by more than one timestep. These shortcuts are usual connections, with their own parameter matrices. By using them, we create much shorter paths between faraway timesteps in the network. When we backpropagate the gradients along these shorter paths, they vanish more slowly, and we can learn longer dependencies with such a network.

The concept of skip connections is not exclusive to recurrent neural networks. You already saw a similar idea in one of the architectures in the computer vision part of this course. Which one? Yes, I spoke about the residual network, which contains identity shortcut connections in each layer. We can, of course, also use identity shortcuts in recurrent neural networks, as well as non-identity shortcuts in feedforward architectures for computer vision tasks.

Okay, let's summarize what we've learned in this video. The exploding gradient problem is very easy to detect, whereas it's not clear how to detect the vanishing gradient problem. Gradient clipping is a simple method to combat exploding gradients, and truncated backpropagation through time can also help with this, in addition to accelerating the training. To overcome the vanishing gradient problem, we can use several methods, including a careful choice of the activation function, proper initialization of the recurrent weights, and modification of the network with additional skip connections.

In the next video, we will discuss more advanced architectures, LSTM and GRU, which are the most popular recurrent architectures nowadays.

[MUSIC]