In this video, we will discuss two main problems that arise in training recurrent neural networks: the problems of exploding and vanishing gradients.

As you already know, we can use the backpropagation algorithm to train a recurrent neural network, but in this case we backpropagate gradients not only through layers but also through time. As an example, in the previous video we derived the expression for the gradient of the loss L with respect to the recurrent weight matrix W. And now we know that to compute the gradient of the loss at time step t with respect to W, we should sum up the contributions from all the previous time steps to this gradient.

Now let's look at the expression for the gradient more closely. As you can see, there is a product of Jacobian matrices in each term of the sum. And if we look at the particular term which corresponds to the contribution from some step k to the gradient at time step t, we see that the more steps there are between the time moments k and t, the more elements there are in this product. So the values of the Jacobian matrices have a very strong influence, especially on the contributions from faraway steps.

Let's suppose for a moment that we have only one hidden unit in our network, so h is a scalar. Then all the elements in the expression for the gradient are also scalars: the gradient itself is a scalar, all the Jacobians are scalars, and so on. In this case it is clear that if all the Jacobians, now Jacobian scalars, are less than one in absolute value, then their product goes to zero exponentially fast as the number of elements in the product tends to infinity. And on the contrary, if all the Jacobian scalars are more than one in absolute value, then the product goes to infinity exponentially fast.

As a result, in the first case the contributions from faraway steps go to zero and the gradient contains only the information about nearby steps. This makes it difficult to learn long-range dependencies with a simple recurrent neural network. This problem is usually called the vanishing gradient problem, because a lot of elements in the gradient simply vanish and don't affect the training.
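To make the scalar case concrete, here is a minimal NumPy sketch, added for this write-up rather than taken from the lecture. In the usual notation, the term of the gradient sum contributed by step k contains the product of the Jacobians dh_{k+1}/dh_k, ..., dh_t/dh_{t-1}, so it has t - k factors; the sketch below simply multiplies a chain of identical scalar Jacobians and prints how the product behaves as the chain gets longer.

```python
import numpy as np

# Scalar case: each Jacobian dh_i/dh_(i-1) is just a number.
# A value below 1 in absolute value makes the product vanish,
# a value above 1 makes it explode.
for jac in (0.9, 1.1):
    products = np.cumprod(np.full(100, jac))   # running products of 1, 2, ..., 100 factors
    print(f"|dh_i/dh_(i-1)| = {jac}: "
          f"10 factors -> {products[9]:.3e}, "
          f"100 factors -> {products[99]:.3e}")
```

With 0.9 the product is already below 10^-4 after a hundred steps, while with 1.1 it exceeds 10^4: exactly the exponential behaviour described above.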
In the second case, the contributions from prior steps grow exponentially fast, so the gradient itself grows too. If an input sequence is long enough, the gradient may even become not-a-number in practice. This problem is called the exploding gradient problem, and it makes the training very unstable. The issue is that if the gradient is a very large number, then we make a long step in the direction of this gradient in the parameter space. Since we optimize a very complex multimodal function and we use stochastic methods, we may end up at a very poor point after such a step.

OK, we have discussed the simplified case; now let's return to real life. A recurrent neural network usually contains not just one hidden unit but a whole vector of them. Consequently, the Jacobian matrices are really matrices, not scalars. We can apply the same reasoning here, but instead of the absolute value we need to use the spectral matrix norm, which is equal to the largest singular value of the matrix. If all the Jacobian matrices in the product have norms which are less than one, then the gradient vanishes. And if all the Jacobian matrices have norms which are higher than one, then the gradient explodes.

Now we know that the values of the Jacobian matrices are crucial in training a recurrent neural network, so let's see what values they have in practice. As you remember, the hidden units at time step t can be computed by applying some nonlinear function f to a linear combination of the inputs at this step and the hidden units at the previous time step; let us denote this linear combination as the preactivation at time step t. To compute the Jacobian matrix of the hidden units at time step t with respect to the hidden units at the previous time step, we can use the chain rule: we first compute the Jacobian of h_t with respect to its preactivation and then compute the Jacobian matrix of this preactivation with respect to the previous hidden units. Since f is an element-wise nonlinearity, the first Jacobian here is a diagonal matrix with the derivatives of f on the diagonal. And how to compute the second Jacobian? This is a question for you.
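Before answering, here is a small NumPy sanity check of the two facts stated so far: that the spectral norm used above is the largest singular value, and that for an element-wise f the Jacobian of h_t with respect to its preactivation is diagonal with the derivatives of f on the diagonal. It is a sketch added for this write-up; the preactivation vector and the choice of tanh as f are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
a_t = rng.normal(size=4)                 # a made-up preactivation vector at step t

# h_t = f(a_t) with an element-wise f (tanh here), so dh_t/da_t is a
# diagonal matrix with the derivatives f'(a_t) on the diagonal.
analytic = np.diag(1.0 - np.tanh(a_t) ** 2)

# Finite-difference check of that Jacobian, column by column.
eps = 1e-6
numeric = np.stack(
    [(np.tanh(a_t + eps * e) - np.tanh(a_t - eps * e)) / (2 * eps) for e in np.eye(4)],
    axis=1,
)
print("max |analytic - numeric|:", np.abs(analytic - numeric).max())

# The spectral norm of a matrix equals its largest singular value; it plays
# the role that the absolute value played in the scalar case.
print("spectral norm:", np.linalg.norm(analytic, ord=2))
print("largest singular value:", np.linalg.svd(analytic, compute_uv=False)[0])
```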
Since the preactivation is a linear combination of some elements, the second Jacobian consists of the weights of the hidden units in this linear combination. So it is equal to the weight matrix W.

Now let's look at both parts of the Jacobian matrix of the hidden units h_t with respect to the hidden units h_{t-1} separately. The first part depends on the type of nonlinearity we use. The usual choices of nonlinearity for neural networks are the sigmoid, hyperbolic tangent, and rectified linear unit functions. As you can see on the left part of the slide, all these nonlinearities are very flat in a large part of the input space: the sigmoid and hyperbolic tangent are almost constant for both small and large inputs, and the rectified linear unit is equal to zero for all negative inputs. The derivatives of these nonlinearities are very close to zero in the regions where they are flat. So, as you can see on the right part of the slide, the derivatives of the sigmoid and hyperbolic tangent are less than one almost everywhere, and this may very likely cause the vanishing gradient problem. The situation with the rectified linear unit is much better: at least for positive inputs its derivative is equal to one. But the gradients may still vanish because of the zero derivative in the negative part of the input space.

Now let's look at the second part of the Jacobian of h_t. The weight matrix W is a parameter of the model, so its norm may be either large or small. A small norm could aggravate the vanishing gradient problem, and a large norm could cause the exploding gradient problem, especially in combination with the rectified linear unit nonlinearity (a toy numerical illustration of this combined effect is given after the summary below).

OK, let's summarize what we have learned in this video. Recurrent neural networks have a sequential nature, so they are very deep in time. Therefore the vanishing and exploding gradient problems may arise during their training. Actually, these problems are not exclusive to recurrent neural networks; they also occur in deep feedforward networks. Vanishing gradients make the learning of long-range dependencies very difficult, and exploding gradients make the learning process unstable and may even crash it.
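As promised, here is a toy NumPy sketch, added for this write-up, that ties the two parts together. It runs a small tanh recurrent network with no inputs, accumulates the product of Jacobians diag(f'(a_t)) W over fifty steps, and prints the spectral norm of the result for a few weight scales. The hidden size, the scales, and the omission of inputs are arbitrary simplifications; the point is only that a recurrent matrix with a small norm tends to drive the product toward zero, while one with a large norm tends to blow it up.

```python
import numpy as np

def jacobian_product_norm(weight_scale, n_steps=50, n_hidden=16, seed=0):
    """Spectral norm of dh_T/dh_0 for a toy tanh RNN with no inputs.

    All sizes and weights are made-up placeholders, purely for illustration.
    """
    rng = np.random.default_rng(seed)
    # Recurrent weights whose typical gain is roughly `weight_scale`.
    W = rng.normal(scale=weight_scale / np.sqrt(n_hidden), size=(n_hidden, n_hidden))
    h = rng.normal(size=n_hidden)
    product = np.eye(n_hidden)
    for _ in range(n_steps):
        a = W @ h                                   # preactivation (inputs dropped)
        # dh_t/dh_{t-1} = diag(tanh'(a_t)) @ W, accumulated into dh_T/dh_0.
        product = np.diag(1.0 - np.tanh(a) ** 2) @ W @ product
        h = np.tanh(a)
    return np.linalg.norm(product, ord=2)

# Small recurrent weights push the product toward zero (vanishing gradients),
# large ones blow it up (exploding gradients).
for scale in (0.5, 1.0, 3.0):
    print(f"weight scale {scale}: ||dh_50/dh_0|| = {jacobian_product_norm(scale):.3e}")
```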
In the next videos, we will discuss different methods that can help us to overcome these issues.