[MUSIC]

In this video, we'll discuss how to combat exploding and vanishing gradients in practice. Let's start with the exploding gradient problem. In the previous video, we learned that this problem occurs when the gradient norms become very large, possibly even NaN, which makes training of the recurrent neural network unstable.

The resulting instability is actually very easy to detect. On the slide, you can see the learning curve of some neural network. The training iterations are on the x axis, and the loss on the training data is on the y axis. This particular neural network doesn't suffer from the exploding gradient problem, so the loss decreases with the number of iterations and the training is stable. But sometimes you can see spikes in the learning curve, like this one; this is a result of the exploding gradient problem. The gradient explodes, and you make a long step in the parameter space. As a result, you may end up with a model with quite random weights and a high training loss. In the worst case, the gradient may even become NaN (not a number), and you may end up with NaN values in the weights of the neural network.

The most common way to combat this problem is gradient clipping. This technique is very simple, but still very effective. If the network suffers from the exploding gradient problem, we look at the norm of the gradient of the loss with respect to all the parameters of the network, and whenever this norm exceeds a chosen threshold, we simply rescale the gradient so that its norm equals the threshold. By doing this, we don't change the direction of the gradient, we only change its length.

Actually, we can clip not the norm of the whole gradient vector, but just the norm of the part which causes the problem. Do you remember which part that is? Yes, it is the Jacobian matrix of the hidden units at one timestep with respect to the hidden units at the previous timestep. So it is enough to clip just the norm of this Jacobian at each timestep.

Okay, and how do we choose the threshold for gradient clipping? We can choose it manually: start with a large threshold and decrease it until the network no longer suffers from the exploding gradient problem. Alternatively, we can look at the gradient norms over a sufficient number of training iterations and choose the threshold so that we clip only unusually large values.
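To make the clipping step concrete, here is a minimal sketch of norm-based gradient clipping, assuming PyTorch. The layer sizes, the dummy data, and the threshold of 5.0 are illustrative assumptions, not values from the lecture.

```python
import torch
import torch.nn as nn

# Illustrative model and data, not from the lecture.
model = nn.RNN(input_size=32, hidden_size=64, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x = torch.randn(8, 20, 32)       # (batch, time, features) -- dummy input
target = torch.randn(8, 20, 64)  # dummy target with the hidden-state size

optimizer.zero_grad()
output, _ = model(x)
loss = loss_fn(output, target)
loss.backward()

# If the norm of the full gradient vector exceeds the threshold, rescale it;
# this changes only the length of the gradient, not its direction.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```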
There is another interesting technique which can help us to overcome the exploding gradient problem. It is not designed specifically for this purpose, but it can still be helpful. As you remember, when we do backpropagation through time, we need to make a forward pass through the entire sequence to compute the loss, and then a backward pass through the entire sequence to compute the gradient. If our training sequences are quite long, this is a very computationally expensive procedure, and additionally, we may run into the exploding gradient problem.

So let's run forward and backward passes through chunks of the sequence instead of the whole sequence. We first make forward and backward passes through the first chunk and store the last hidden state. Then we go to the next chunk and start the forward pass from the hidden state we stored. We make forward and backward passes through the second chunk, store the last hidden state, go to the next chunk, and so on, until we reach the end of the sequence. So we carry hidden states forward in time indefinitely, but only backpropagate through some smaller number of steps.

This algorithm is called truncated backpropagation through time, and it is much faster than the usual backpropagation through time. It also doesn't suffer from the exploding gradient problem as much, since we don't take into account the contributions to the gradient from faraway steps. But of course, these advantages do not come without a price. Dependencies that are longer than the chunk size don't affect the training, so it's much more difficult to learn long-range dependencies with this algorithm. A short code sketch of this chunked training loop is shown below.

Now let's speak about the vanishing gradient problem. From the previous video, we know that this problem occurs when the contributions to the gradient from faraway steps become small, which makes learning long-range dependencies very difficult. This problem is more complicated than the exploding gradient problem: it is difficult to detect, and there is no single simple way to overcome it. Let's start with detection. In the case of the exploding gradient problem, we had a clear indication that it occurred: we saw spikes in the learning curve and in the curve of the gradient norm.
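Here is the promised sketch of truncated backpropagation through time, again assuming PyTorch; the chunk length of 20, the layer sizes, and the dummy data are illustrative assumptions. The key detail is detaching the carried hidden state, so gradients do not flow past the chunk boundary.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not from the lecture.
model = nn.RNN(input_size=32, hidden_size=64, batch_first=True)
head = nn.Linear(64, 1)
optimizer = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-3)

x = torch.randn(4, 200, 32)   # one long sequence batch: (batch, time, features)
y = torch.randn(4, 200, 1)    # dummy per-step targets
chunk_len = 20

hidden = None
for start in range(0, x.size(1), chunk_len):
    x_chunk = x[:, start:start + chunk_len]
    y_chunk = y[:, start:start + chunk_len]

    optimizer.zero_grad()
    output, hidden = model(x_chunk, hidden)      # forward pass through one chunk
    loss = ((head(output) - y_chunk) ** 2).mean()
    loss.backward()                              # backpropagate only within this chunk
    optimizer.step()

    # Carry the hidden state forward in time, but cut the computation graph
    # so the next backward pass stops at the chunk boundary.
    hidden = hidden.detach()
```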
But in the case of the vanishing gradient problem, the learning curve and the norm of the gradient look quite okay. From the learning curve, you may only see that the loss of the network is not that good, and it's not clear whether this is due to the vanishing gradient problem or because the task itself is difficult. And if you look, for example, at the gradient of the loss with respect to faraway hidden units, you may see that the norm of this gradient is small, but that could simply be because there are no long-range dependencies in the data. So we can only be sure that there is a vanishing gradient problem in the network when we overcome it and see that the network works better.

There are a lot of different techniques to deal with vanishing gradients. The most common approach is to use specially designed recurrent architectures, such as the long short-term memory, or LSTM, and the gated recurrent unit, or GRU. These architectures are very important, so we'll speak about them in a separate video later this week; for now, we will briefly discuss some other ideas.

As you already know, the Jacobian matrix, which may cause the problem of vanishing and exploding gradients, depends on the choice of the activation function and on the values of the recurrent weights W. So if we want to overcome the vanishing gradient problem, we should try the rectified linear unit activation function, which is much more resistant to this problem. And what about the recurrent weight matrix W? From linear algebra, you may remember that an orthogonal matrix is a square matrix such that its transpose is equal to its inverse. Orthogonal matrices have many interesting properties, but the most important one for us is that all the eigenvalues of an orthogonal matrix have absolute value 1. This means that no matter how many times we perform repeated matrix multiplication, the resulting matrix doesn't explode or vanish. Therefore, if we initialize the recurrent weight matrix W with an orthogonal matrix, the recurrent-weight part of the Jacobian doesn't cause the vanishing gradient problem, at least in the first iterations of training. In this case, the network has a chance to find long-range dependencies in the data, at least in those first iterations.
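As an illustration, here is a minimal sketch of these two ideas in PyTorch: a ReLU activation for the recurrent cell and an orthogonal initialization of the recurrent weight matrix. The layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# ReLU nonlinearity instead of the default tanh.
rnn = nn.RNN(input_size=32, hidden_size=64, nonlinearity='relu', batch_first=True)

# Initialize the recurrent (hidden-to-hidden) weight matrix with an orthogonal
# matrix, so its eigenvalues have absolute value 1 at the start of training.
for name, param in rnn.named_parameters():
    if name.startswith('weight_hh'):
        nn.init.orthogonal_(param)

# Quick check: W^T W should be numerically close to the identity matrix.
W = rnn.weight_hh_l0.detach()
print(torch.allclose(W.T @ W, torch.eye(64), atol=1e-5))
```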
There are some approaches that utilize the properties of orthogonal matrices not just for a proper initialization, but also for the parameterization of the weights throughout the whole training process, but these methods are out of the scope of this course.

The last idea that we will discuss in this video is the idea of using skip connections. In a recurrent neural network, we can't carry the contributions to the gradient through a lot of timesteps, because at each step we need to multiply them by the Jacobian matrix, and as a result, they vanish. So let's add shortcuts between hidden states that are separated by more than one timestep. These shortcuts are usual connections, with their own parameter matrices. By using them, we create much shorter paths between faraway timesteps in the network. When we backpropagate the gradients along these shorter paths, they vanish more slowly, and we can learn longer dependencies with such a network.

The concept of skip connections is not exclusive to recurrent neural networks. You already saw a similar idea in one of the architectures in the computer vision part of this course. Which one? Yes, I spoke about the residual network, which contains identity shortcut connections in each layer. We can, of course, also use identity shortcuts in recurrent neural networks, as well as non-identity shortcuts in feedforward architectures for computer vision tasks.

Okay, let's summarize what we've learned in this video. The exploding gradient problem is very easy to detect, whereas it's not clear how to detect the vanishing gradient problem. Gradient clipping is a simple method to combat exploding gradients, and truncated backpropagation through time can also help with this, in addition to accelerating the training. To overcome the vanishing gradient problem, we can use several methods, including a careful choice of the activation function, proper initialization of the recurrent weights, and modification of the network with additional skip connections.

In the next video, we will discuss more advanced architectures, LSTM and GRU, which are the most popular recurrent architectures nowadays.

[MUSIC]