In this video, I'm going to talk about the exploding and vanishing gradients problem, which is what makes it difficult to train recurrent neural networks. For many years, researchers in neural networks thought they would never be able to train these networks to model dependencies over long time periods. But at the end of this video, I'll describe four different ways in which that can now be done.

To understand why it's so difficult to train recurrent neural networks, we have to understand a very important difference between the forward and backward passes in a recurrent neural net. In the forward pass, we use squashing functions, like the logistic, to prevent the activity vectors from exploding. So if you look at the picture on the right, each neuron is using a logistic unit, shown by that blue curve, and it can't output any value greater than one or less than zero. That stops explosions.

The backward pass, however, is completely linear. Most people find this very surprising. If you double the error derivatives at the final layer of this net, all the error derivatives will double when you backpropagate. So if you look at the red dots that I've put on the blue curves, suppose those are the activity levels of the neurons on the forward pass. When you backpropagate, you're using the gradients of the blue curves at those red dots, so the red lines are meant to show the tangents to the blue curves at the red dots. And once you've finished the forward pass, the slope of each tangent is fixed. You now backpropagate, and the backpropagation is like going forwards through a linear system in which the slope of the non-linearity has been fixed. Of course, each time you backpropagate the slopes will be different, because they were determined by that particular forward pass. But during the backpropagation it's a linear system, and so it suffers from a problem of linear systems: when you iterate them, they tend to either explode or die. So when we backpropagate through many layers, if the weights are small, the gradients will shrink and become exponentially small; that means that when you backpropagate through time, the gradients many steps earlier than where the targets arrive will be tiny. Similarly, if the weights are big, the gradients will explode, and that means that when you backpropagate through time, the gradients will get huge and wipe out all your knowledge.
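As a minimal illustrative sketch of that point (a toy recurrent net with tanh as the squashing function and random weights; the sizes and weight scales are chosen just for illustration): once the forward pass has fixed the slopes of the non-linearity, the backward pass is only repeated multiplication by the transposed weight matrix and those fixed slopes, so the gradient norm shrinks or grows roughly geometrically depending on the size of the recurrent weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_norm_through_time(weight_scale, T=100, n=50):
    """Run a toy recurrent net forward for T steps, then backpropagate a
    fixed error vector through time. The backward pass only multiplies by
    W.T and by slopes that were fixed on the forward pass, i.e. it is linear."""
    W = weight_scale * rng.standard_normal((n, n)) / np.sqrt(n)
    h = 0.1 * rng.standard_normal(n)

    slopes = []                        # non-linearity slopes, fixed by the forward pass
    for _ in range(T):
        h = np.tanh(W @ h)
        slopes.append(1.0 - h**2)      # derivative of tanh at this activity level

    delta = np.ones(n)                 # error derivative injected at the final step
    for s in reversed(slopes):         # backward pass: linear in delta
        delta = W.T @ (s * delta)
    return np.linalg.norm(delta)

print(gradient_norm_through_time(0.5))   # shrinks towards zero: vanishing gradients
print(gradient_norm_through_time(3.0))   # grows by many orders of magnitude: exploding gradients
```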
In a feed-forward neural net, unless it's very deep, these problems aren't nearly as bad, because we typically only have a few hidden layers. But as soon as we have a recurrent neural network trained on a long sequence, for example 100 time steps, then if the gradients are growing as we backpropagate, we'll get whatever that growth rate is to the power of 100, and if they're dying, we'll get whatever that decay rate is to the power of 100. So they'll either explode or vanish.

We might be able to initialize the weights very carefully to avoid this, and more recent work shows that careful initialization of the weights does indeed make things work much better. But even with good initial weights, it's hard to detect the dependency of the current output on an input from many time steps ago, so it's hard to make the output depend on things that happened a long time ago. RNNs have difficulty dealing with long-range dependencies.

Here's an example of exploding and dying gradients for a system that's trying to learn attractors. Suppose we try to train a recurrent neural network so that whatever state we start it in, it ends up in one of these two attractor states. So we're going to learn a blue basin of attraction and a pink basin of attraction, and if we start anywhere within the blue basin of attraction, we will end up at the same point. What that means is that small differences in our initial state make no difference to where we end up, so the derivative of the final state with respect to changes in the initial state is zero. That's vanishing gradients: when we backpropagate through the dynamics of this system, we will discover there's no gradient with respect to where we start. The same goes for the pink basin of attraction. If, however, we start very close to the boundary between the two attractors, then a tiny difference in where we start, one that puts us on the other side of the watershed, makes a huge difference to where we end up. That's the exploding gradient problem. So whenever you're trying to use a recurrent neural network to learn attractors like this, you're bound to get vanishing or exploding gradients (there's a small numerical sketch of this just below).

It turns out there are at least four effective ways to learn a recurrent neural network. The first is a method called Long Short-Term Memory, and I'll talk about that more in this lecture.
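Before going through those four methods, here is the small sketch of the attractor example above, using a toy one-dimensional map rather than a trained network: x_{t+1} = tanh(2 x_t) has two attracting states near +0.96 and -0.96 with the basin boundary at x = 0, and the sensitivity of the final state to the initial state is essentially zero inside a basin but enormous across the watershed.

```python
import numpy as np

def final_state(x0, T=50):
    """Iterate the toy dynamics x_{t+1} = tanh(2 x_t), which has attracting
    states near +0.96 and -0.96, with the basin boundary at x = 0."""
    x = x0
    for _ in range(T):
        x = np.tanh(2.0 * x)
    return x

eps = 1e-6

# Deep inside a basin: a small change in the start makes no difference to the end,
# so the sensitivity of the final state to the initial state is essentially zero.
print((final_state(0.5 + eps) - final_state(0.5)) / eps)     # ~ 0    (vanishing)

# Straddling the watershed: a tiny change in the start flips which attractor we
# reach, so the same sensitivity is enormous.
print((final_state(+eps) - final_state(-eps)) / (2 * eps))   # ~ 1e6  (exploding)
```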
The idea of Long Short-Term Memory is that we actually change the architecture of the neural network to make it good at remembering things.

The second method is to use a much better optimizer that can deal with very small gradients. I'll talk about that in the next lecture. The real problem in optimization is to detect small gradients that have an even smaller curvature. Hessian-free optimization, tailored to recurrent neural networks, is good at doing that.

The third method really kind of evades the problem. What we do is carefully initialize the input-to-hidden weights, very carefully initialize the hidden-to-hidden weights, and also the feedback weights from the outputs to the hidden units. The idea of this careful initialization is to make sure that the hidden state has a huge reservoir of weakly coupled oscillators, so that if you hit it with an input sequence, it will reverberate for a long time, and those reverberations are remembering what happened in the input sequence. You then try to couple those reverberations to the output you want, and so the only thing that learns in an Echo State Network is the connections between the hidden units and the outputs. If the output units are linear, that's very easy to train. So this hasn't really learned the recurrent connections: it uses a fixed, random, but carefully chosen, recurrent part, and then just learns the hidden-to-output connections.

And the final method is to use momentum, but to use it with the kind of initialization that was being used for Echo State Networks, and that makes them work even better. So it was very clever to figure out how to initialize these recurrent networks so that they have interesting dynamics, but they work even better if you now modify those dynamics slightly in the direction that helps with the task at hand.
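As a minimal sketch of the Echo State Network idea just described (the reservoir size, spectral-radius scaling, toy task, and ridge readout are illustrative choices, not from the lecture): the input and recurrent weights are fixed and random, the recurrent weights are scaled so the reservoir reverberates without blowing up, and the only thing that is learned is a linear readout from the hidden state to the output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: predict the next value of a noisy sine wave from the current value.
u = np.sin(0.2 * np.arange(2000)) + 0.05 * rng.standard_normal(2000)
inputs, targets = u[:-1], u[1:]

# Fixed, randomly chosen reservoir: input and recurrent weights are never trained.
n_res = 300
W_in  = 0.5 * rng.standard_normal(n_res)
W_res = rng.standard_normal((n_res, n_res))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # scale spectral radius to 0.9

# Run the reservoir forward and collect its states (the "reverberations").
states = np.zeros((len(inputs), n_res))
x = np.zeros(n_res)
for t, u_t in enumerate(inputs):
    x = np.tanh(W_res @ x + W_in * u_t)
    states[t] = x

# The only learned part: a linear readout from reservoir states to the targets,
# fitted by ridge regression (an easy linear least-squares problem).
ridge = 1e-6
W_out = np.linalg.solve(states.T @ states + ridge * np.eye(n_res), states.T @ targets)

pred = states @ W_out
print("mean squared error:", np.mean((pred - targets) ** 2))
```

Because the readout is linear in the reservoir state, fitting it is a simple least-squares problem, which is why, as the lecture says, it is very easy to train.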