In this video, we will discuss two main problems that arise in training recurrent neural networks: the problems of exploding and vanishing gradients.

As you already know, we can use the backpropagation algorithm to train a recurrent neural network, but in this case we backpropagate gradients not only through layers but also through time. As an example, in the previous video we derived the expression for the gradient of the loss L with respect to the recurrent weight matrix W. And now we know that to compute the gradient of the loss at time step t with respect to W, we should sum up the contributions from all the previous time steps to this gradient.

Now let's look at the expression for the gradient more closely. As you can see, there is a product of Jacobian matrices in each term of the sum. And if we look at the particular term which corresponds to the contribution from some step k to the gradient at time step t, we see that the more steps there are between the time moments k and t, the more elements there are in this product. So the values of the Jacobian matrices have a very strong influence, especially on the contributions from faraway steps.

Let's suppose for a moment that we have only one hidden unit in our network, so h is a scalar. Then all the elements in the expression for the gradient are also scalars: the gradient itself is a scalar, all the Jacobians are scalars, and so on. In this case it is clear that if all the Jacobians, now Jacobian scalars, are less than one in absolute value, then their product goes to zero exponentially fast as the number of elements in the product tends to infinity. And on the contrary, if all the Jacobian scalars are more than one in absolute value, then the product goes to infinity exponentially fast.

As a result, in the first case the contributions from faraway steps go to zero and the gradient contains only the information about nearby steps. This makes it difficult to learn long-range dependencies with a simple recurrent neural network. This problem is usually called the vanishing gradient problem, because a lot of elements in the gradient simply vanish and don't affect the training.
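To make the scalar case concrete, here is a minimal NumPy sketch, added for this write-up rather than taken from the lecture. In the usual notation, the term of the gradient sum contributed by step k contains the product of the Jacobians dh_{k+1}/dh_k, ..., dh_t/dh_{t-1}, so it has t - k factors; the sketch below simply multiplies a chain of identical scalar Jacobians and prints how the product behaves as the chain gets longer.

```python
import numpy as np

# Scalar case: each Jacobian dh_i/dh_(i-1) is just a number.
# A value below 1 in absolute value makes the product vanish,
# a value above 1 makes it explode.
for jac in (0.9, 1.1):
    products = np.cumprod(np.full(100, jac))   # running products of 1, 2, ..., 100 factors
    print(f"|dh_i/dh_(i-1)| = {jac}: "
          f"10 factors -> {products[9]:.3e}, "
          f"100 factors -> {products[99]:.3e}")
```

With 0.9 the product is already below 10^-4 after a hundred steps, while with 1.1 it exceeds 10^4: exactly the exponential behaviour described above.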
In the second case, the contributions from prior steps grow exponentially fast, so the gradient itself grows too. If an input sequence is long enough, the gradient may even become not-a-number in practice. This problem is called the exploding gradient problem, and it makes the training very unstable. The issue is that if the gradient is a very large number, then we make a long step in the direction of this gradient in the parameter space. Since we optimize a very complex multimodal function and we use stochastic methods, we may end up at a very poor point after such a step.

OK, we have discussed the simplified case; now let's return to real life. A recurrent neural network usually contains not just one hidden unit but a whole vector of them. Consequently, the Jacobian matrices are really matrices, not scalars. We can apply the same reasoning here, but instead of the absolute value we need to use the spectral matrix norm, which is equal to the largest singular value of the matrix. If all the Jacobian matrices in the product have norms which are less than one, then the gradient vanishes. And if all the Jacobian matrices have norms which are higher than one, then the gradient explodes.

Now we know that the values of the Jacobian matrices are crucial in training a recurrent neural network, so let's see what values they have in practice. As you remember, the hidden units at time step t can be computed by applying some nonlinear function f to a linear combination of the inputs at this step and the hidden units at the previous time step; let us denote this linear combination as the preactivation at time step t. To compute the Jacobian matrix of the hidden units at time step t with respect to the hidden units at the previous time step, we can use the chain rule: we first compute the Jacobian of h_t with respect to its preactivation and then compute the Jacobian matrix of this preactivation with respect to the previous hidden units. Since f is an element-wise nonlinearity, the first Jacobian here is a diagonal matrix with the derivatives of f on the diagonal. And how to compute the second Jacobian? This is a question for you.
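Before answering, here is a small NumPy sanity check of the two facts stated so far: that the spectral norm used above is the largest singular value, and that for an element-wise f the Jacobian of h_t with respect to its preactivation is diagonal with the derivatives of f on the diagonal. It is a sketch added for this write-up; the preactivation vector and the choice of tanh as f are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
a_t = rng.normal(size=4)                 # a made-up preactivation vector at step t

# h_t = f(a_t) with an element-wise f (tanh here), so dh_t/da_t is a
# diagonal matrix with the derivatives f'(a_t) on the diagonal.
analytic = np.diag(1.0 - np.tanh(a_t) ** 2)

# Finite-difference check of that Jacobian, column by column.
eps = 1e-6
numeric = np.stack(
    [(np.tanh(a_t + eps * e) - np.tanh(a_t - eps * e)) / (2 * eps) for e in np.eye(4)],
    axis=1,
)
print("max |analytic - numeric|:", np.abs(analytic - numeric).max())

# The spectral norm of a matrix equals its largest singular value; it plays
# the role that the absolute value played in the scalar case.
print("spectral norm:", np.linalg.norm(analytic, ord=2))
print("largest singular value:", np.linalg.svd(analytic, compute_uv=False)[0])
```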
Since the preactivation is a linear combination of some elements, the second Jacobian consists of the weights of the hidden units in this linear combination. So it is equal to the weight matrix W.

Now let's look at both parts of the Jacobian matrix of the hidden units h_t with respect to the hidden units h_{t-1} separately. The first part depends on the type of nonlinearity we use. The usual choices of nonlinearity for neural networks are the sigmoid, hyperbolic tangent, and rectified linear unit functions. As you can see on the left part of the slide, all these nonlinearities are very flat in a large part of the input space: the sigmoid and hyperbolic tangent are almost constant for both small and large inputs, and the rectified linear unit is equal to zero for all negative inputs. The derivatives of these nonlinearities are very close to zero in the regions where they are flat. So, as you can see on the right part of the slide, the derivatives of the sigmoid and hyperbolic tangent are less than one almost everywhere, and this may very likely cause the vanishing gradient problem. The situation with the rectified linear unit is much better: at least for positive inputs its derivative is equal to one. But the gradients may still vanish because of the zero derivative in the negative part of the input space.

Now let's look at the second part of the Jacobian of h_t. The weight matrix W is a parameter of the model, so its norm may be either large or small. A small norm could aggravate the vanishing gradient problem, and a large norm could cause the exploding gradient problem, especially in combination with the rectified linear unit nonlinearity (a toy numerical illustration of this combined effect is given after the summary below).

OK, let's summarize what we have learned in this video. Recurrent neural networks have a sequential nature, so they are very deep in time. Therefore the vanishing and exploding gradient problems may arise during their training. Actually, these problems are not exclusive to recurrent neural networks; they also occur in deep feedforward networks. Vanishing gradients make the learning of long-range dependencies very difficult, and exploding gradients make the learning process unstable and may even crash it.
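As promised, here is a toy NumPy sketch, added for this write-up, that ties the two parts together. It runs a small tanh recurrent network with no inputs, accumulates the product of Jacobians diag(f'(a_t)) W over fifty steps, and prints the spectral norm of the result for a few weight scales. The hidden size, the scales, and the omission of inputs are arbitrary simplifications; the point is only that a recurrent matrix with a small norm tends to drive the product toward zero, while one with a large norm tends to blow it up.

```python
import numpy as np

def jacobian_product_norm(weight_scale, n_steps=50, n_hidden=16, seed=0):
    """Spectral norm of dh_T/dh_0 for a toy tanh RNN with no inputs.

    All sizes and weights are made-up placeholders, purely for illustration.
    """
    rng = np.random.default_rng(seed)
    # Recurrent weights whose typical gain is roughly `weight_scale`.
    W = rng.normal(scale=weight_scale / np.sqrt(n_hidden), size=(n_hidden, n_hidden))
    h = rng.normal(size=n_hidden)
    product = np.eye(n_hidden)
    for _ in range(n_steps):
        a = W @ h                                   # preactivation (inputs dropped)
        # dh_t/dh_{t-1} = diag(tanh'(a_t)) @ W, accumulated into dh_T/dh_0.
        product = np.diag(1.0 - np.tanh(a) ** 2) @ W @ product
        h = np.tanh(a)
    return np.linalg.norm(product, ord=2)

# Small recurrent weights push the product toward zero (vanishing gradients),
# large ones blow it up (exploding gradients).
for scale in (0.5, 1.0, 3.0):
    print(f"weight scale {scale}: ||dh_50/dh_0|| = {jacobian_product_norm(scale):.3e}")
```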
In the next videos, we will discuss different methods that can help us to overcome these issues.