In this video, we're going to look at the momentum method for improving the learning speed when doing gradient descent in a neural network. The momentum method can be applied to full-batch learning, but it also works for mini-batch learning. It's very widely used, and probably the commonest recipe for learning big neural nets is to use stochastic gradient descent with mini-batches, combined with momentum.

I'm going to start with the intuition behind the momentum method. We think of a ball on the error surface, where the location of the ball in the horizontal plane represents the current weight vector. The ball starts off stationary, and so initially it will follow the direction of steepest descent: it will follow the gradient. But as soon as it's got some velocity, it will no longer go in the same direction as the gradient; its momentum will make it keep going in the previous direction. Obviously we want it eventually to get to a low point on the surface, so we want it to lose energy. So we need to introduce a bit of viscosity; that is, we make its velocity die off gently on each update.

What the momentum method does is damp oscillations in directions of high curvature. So if you look at the red starting point, and then look at the green point we get to after two steps, they have gradients that are pretty much equal and opposite. As a result, the gradient across the ravine has cancelled out, but the gradient along the ravine has not cancelled out. Along the ravine we're going to keep building up speed, and so after the momentum method has settled down, it'll tend to go along the bottom of the ravine, accumulating velocity as it goes. If you're lucky, that'll make you go a whole lot faster than if you just used steepest descent.

The equations of the momentum method are fairly simple. We say that the velocity vector at time t is just the velocity vector at time t minus one, attenuated a bit; time here counts the updates of the weights, so it's the velocity vector we got after mini-batch t minus one. We multiply by some number like 0.9, which is really viscosity, or at least related to viscosity, but which, unfortunately, is called momentum; so we now call alpha the momentum. We then add in the effect of the current gradient, which is to make us go downhill by some learning rate times the gradient we have at time t. That gives our new velocity at time t, and we then make our weight change at time t equal to that velocity. The velocity can actually be expressed in terms of previous weight changes, as shown on the slide, and I'll leave it to you to follow the math.
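To make those equations concrete, here is a minimal sketch of one update in Python. This is my paraphrase of the equations on the slide, not code from the lecture; the function name and the default constants are assumptions for illustration.

```python
import numpy as np

def momentum_step(w, v, g, alpha=0.9, lr=0.01):
    """One update of the standard momentum method:
    v(t) = alpha * v(t-1) - lr * g(t);  delta w(t) = v(t)."""
    v = alpha * v - lr * g    # attenuate the old velocity, add the current gradient
    w = w + v                 # the weight change equals the new velocity
    return w, v
```

The `alpha` here is the number like 0.9 that plays the role of viscosity, and `lr` is the learning rate.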
The behavior of the momentum method is very intuitive. On an error surface that's just a tilted plane, the ball will reach some terminal velocity, at which the gain in velocity that comes from the gradient is balanced by the multiplicative attenuation of the velocity due to the momentum term, which is really viscosity. If that momentum term is close to one, the ball will be going down much faster than a simple gradient descent method would. The terminal velocity, the velocity you get at time infinity, is the gradient times the learning rate, multiplied by a factor of one over one minus alpha. So if alpha is 0.99, you'll go 100 times as fast as you would with the learning rate alone.

You have to be careful in setting the momentum. At the very beginning of learning, if you make the initial random weights quite big, there may be very large gradients: you have a bunch of weights that are completely no good for the task you're doing, and it may be very obvious how to change those weights to make things a lot better. You don't want a big momentum then, because you're going to quickly change the weights to make things better, and then you're going to start on the hard problem of finding just the right relative values of different weights, so that you have sensible feature detectors. So it pays, at the beginning of learning, to have a small momentum. It's probably better to use 0.5 than zero, because 0.5 will average out some of the sloshing around in obvious ravines. Once the large gradients have disappeared and you've reached the normal phase of learning, where you're stuck in a ravine and need to go along the bottom of it without sloshing to and fro sideways, you can smoothly raise the momentum to its final value. You could raise it in one step, but that might start an oscillation.

You might ask: why didn't we just use a bigger learning rate? What you'll discover is that using a small learning rate and a big momentum allows you to get away with an overall learning rate that's much bigger than you could have had using a learning rate alone, with no momentum. If you use a big learning rate by itself, you'll get big divergent oscillations across the ravine.
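As a quick, assumed numerical check of both of those claims (the one-over-one-minus-alpha terminal velocity on a plane, and the divergent oscillations from a big learning rate alone), here is a toy sketch; the quadratic ravine and all the constants are made up for illustration:

```python
import numpy as np

# 1. Terminal velocity on a tilted plane (constant gradient g): the velocity
#    settles where the gradient's gain balances the attenuation by alpha.
g, v = 1.0, 0.0
for _ in range(1000):
    v = 0.99 * v - 0.01 * g
print(v, -0.01 * g / (1 - 0.99))     # both close to -1.0: 100x the plain step

# 2. Toy quadratic ravine: steep across (w1), shallow along (w2).
curv = np.array([100.0, 1.0])
grad = lambda w: curv * w

# A big learning rate by itself: divergent oscillation across the ravine.
w = np.array([1.0, -50.0])
for _ in range(200):
    w = w - 0.025 * grad(w)
print("big rate alone:", w)          # the w1 component has blown up

# A small rate plus momentum has the same overall effective rate,
# 0.0025 / (1 - 0.9) = 0.025, but the cross-ravine sloshing is damped.
w, v = np.array([1.0, -50.0]), np.zeros(2)
for _ in range(200):
    v = 0.9 * v - 0.0025 * grad(w)
    w = w + v
print("small rate + momentum:", w)   # heading down along the ravine instead
```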
Very recently, Ilya Sutskever has discovered that there's a better type of momentum. The standard momentum method works by first computing the gradient at the current location, combining that with its stored memory of previous gradients, which is in the velocity of the ball, and then taking a big jump in the direction of the current gradient combined with the previous gradients: its accumulated gradient direction.

Ilya Sutskever found that in many cases it works better to use a form of momentum suggested by Nesterov, who was trying to optimize convex functions. There, we first make a big jump in the direction of the previously accumulated gradient, and then we measure the gradient where we end up and make a correction. The two methods are very, very similar, and you need a picture to really understand the difference. One way of thinking about it: in the standard momentum method, you add in the current gradient and then you gamble on the big jump; in the Nesterov method, you use your previously accumulated gradient to make the big jump first, and then you correct yourself at the place you've got to.

So here's the picture for making the jump first and then making the correction. Here is a step in the direction of the accumulated gradient; it depends on the gradients we accumulated over previous iterations. We take that step. We then measure the gradient where we end up, and go downhill in the direction of that gradient, like that. We then combine that little correction step with the big jump we made, to get our new accumulated gradient. We take that accumulated gradient, attenuate it by some number like 0.9 or 0.99, and take our next big jump in the direction of the attenuated accumulated gradient, like that. Then again, at the place where we end up, we measure the gradient and go downhill; that corrects any errors we made, and gives us our new accumulated gradient.
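Here is a minimal sketch of that jump-then-correct procedure in Python. It's my reading of the description above rather than code from the lecture, and the function name and defaults are assumptions:

```python
import numpy as np

def nesterov_step(w, v, grad_fn, alpha=0.9, lr=0.01):
    """Nesterov-style momentum: jump first in the direction of the
    previously accumulated (and attenuated) gradient, then measure the
    gradient where we end up and fold that correction into the velocity."""
    lookahead = w + alpha * v    # the big jump
    g = grad_fn(lookahead)       # gradient measured at the place we got to
    v = alpha * v - lr * g       # combine the jump with the little correction
    return w + v, v              # note: w + v == lookahead - lr * g
```

Under these assumptions, the only change from the standard method in the earlier sketch is where the gradient is evaluated: at the lookahead point `w + alpha * v` rather than at `w`.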
Now, if you compare the Nesterov procedure with the standard momentum method: the standard method starts with the accumulated gradient, like that initial brown vector, but then it measures the gradient where it currently is and adds that to the brown vector, so that it makes a jump like the big blue vector, which is just the brown vector plus the current gradient. It turns out that if you're going to gamble, it's much better to gamble and then make a correction than to make a correction and then gamble.
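For completeness, here is a small, assumed side-by-side run of the two variants on the same toy ravine used earlier; the point is just that the only difference in the code is where the gradient is measured:

```python
import numpy as np

curv = np.array([100.0, 1.0])        # same made-up ravine as before
grad = lambda w: curv * w

w_std, v_std = np.array([1.0, -50.0]), np.zeros(2)
w_nes, v_nes = np.array([1.0, -50.0]), np.zeros(2)
for _ in range(200):
    # Standard momentum: gradient at the current location, then the jump.
    v_std = 0.9 * v_std - 0.0025 * grad(w_std)
    w_std = w_std + v_std
    # Nesterov momentum: the jump first, then the gradient as a correction.
    v_nes = 0.9 * v_nes - 0.0025 * grad(w_nes + 0.9 * v_nes)
    w_nes = w_nes + v_nes

print("standard:", w_std)
print("nesterov:", w_nes)            # both damp the cross-ravine sloshing
```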