In this video we're going to look at the
momentum method for improving the learning
speed when doing grading descent into
neural network.
The momentum method can be applied to full
batch learning, but it also works for mini
batch learning.
It's very widely used.
And probably the commonest recipe for
learning big neural nets is to use
stochastic grade and descent with mini
batches combined with momentum.
I'm going to start with the intuition
behind the momentum method.
So, we think of a ball on the area
surface, where the location of the ball in
the horizontal plane represents the
current weight vector.
The ball starts off stationary and so
initially it will follow the direction of
steepest descent.
It will follow the gradient.
But as soon as it's got some velocity
it'll no longer go in the same direction
as the gradient.
Its momentum will make it keep going in
the previous direction.
Obviously we wanted eventually to get to a
low point on the surface, so we wanted to
lose energy.
So we need to introduce a bit of
viscosity.
That is, we make its velocity die off
gently on each update.
What the momentum method does, is it damps
oscillations in directions of high
curvature.
So if you look at the red starting point,
and then look at the green point we get to
after two steps, they have gradients that
are pretty much equal and opposite.
As a result, the gradient across the
ravine has cancelled out.
But the gradient along the ravine has not
cancelled out.
Along the ravine, we're going to keep
building up speed, and so, after the
momentum method has settled down, it'll
tend to go along the bottom of the ravine,
accumulating velocity as it goes, and if
you're lucky, that'll make you go a whole
lot faster, than if you just judge
steepest descent.
The equations of the momentum method are
fairly simple.
We say that the velocity vector at time t,
is just the velocity vector at time t
minus one, time here is the updates of the
weights.
So it's the velocity vector that we got
after mini batch t minus one, attenuated a
bit.
So we multiply by some number like
point.9.
Which is really viscosity, or it's related
to viscosity.
But unfortunately, I called it momentum.
So we now call alpha momentum.
And then we add in the effect of the
current gradient,
Which is to make us go downhill by some
learning rate times the gradient that we
have at time t And that'll be our new
velocity at time t We then make our weight
change at time t equal to velocity.
That velocity can actually be expressed in
terms of previous weight changes as it's
shown on the slide share.
Then I will leave it to you to follow the
math.
The behavior of the momentum method is
very intuitive.
On an air surface that's just a plane, the
ball will reach some terminal velocity of
which the gaining velocity that comes from
the gradient is balanced by the
multiplicative attenuation of velocity due
to the momentum term,
Which is really viscosity.
If that momentum term is close to one,
then it'll be going down much faster than
a simple gradient descent method would.
So the terminal velocity, the velocity you
get at time infinity is the gradient times
the learning weight, multiplied by this
factor of one over one minus alpha.
So if alpha is 0.99, you'll go 100 times
as fast as you would with the learning
rate alone.
You have to be careful in setting
momentum.
At the very beginning of learning, if you
make the initial random weights quite big,
there may be very large gradients.
You have a bunch of weights that's
completely no good for the task you're
doing.
And it may be very obvious how to change
these weights to make things a lot better.
You don't want a big momentum.
Because you're going to quickly change
them to make things better.
And then you're going to start on the hard
problem of finding out how to get just the
right relative values of different
weights.
So you have sensible feature detectors.
So it pays at the beginning of learning to
have a small momentum.
It is probably better to have 0.5 than
zero, because 0.5 will average out some
sloshes and obvious ravines.
Once the large gradients have disappeared,
and you've reached the sort of normal
phase of learning, where you're stuck in a
ravine.
And you need to go along the bottom of
this ravine without sloshing to and fro
sideways.
You can smoothly raise the momentum to its
final value.
Or you could raise it in one step, but
that might start an oscillation.
You might think that, why didn't we just
use a bigger learning rate.
But what you'll discover is that, using a
small learning rate and a big momentum
allows you to get away with an overall
learning rate that's much bigger than you
could have had if you used learning rate
alone with no momentum.
If you use a big learning rate by itself,
you'll get big divergent oscillations]
across the ravine.
Very recently Ilya Sutskever has
discovered that there's a better type of
momentum.
The standard momentum method works by
first computing the gradient at the
current location.
It combines that with its stored memory of
previous gradients, which is in the
velocity of the ball.
And then it takes a big jump in the
direction of the current gradient combined
with previous gradients.
So that's its accumulated gradient
direction.
Ilya Sutskever has found that it works
better in many cases to use a form of
momentum suggested by Nesterov who was
trying to optimize convex functions, where
we first make a big jump in the direction
of the previous accumulating gradient, and
then we measure the gradient where we end
up and make a correction.
It's very, very similar, and you need a
picture to really understand the
difference.
One way of thinking about what's going on
is in the standard momentum method, you
add in the current gradient and then you
gamble on this big jump.
In the Nesterov method, you use your
previously accumulated gradient, you make
the big jump and then you correct yourself
at the place you've got to.
So here's the picture, when we first make
the jump and then make a correction.
Here is a stamp in the direction of the
accumulated gradient.
So this depends on the gradient that we've
accumulated on, in our previous iteration.
We take that step.
We then make it the gradient, and go
downhill in the direction of the gradient.
Like that.
We then combine that little correction
stat with the big jump we made to get our
new accumulated gradient.
We then take that accumulated gradient, we
attenuate it by some number, like nine.
Or 99. multiply it by that number, and we
now take our next big jump in the
direction of that accumulated gradient,
like that.
Then again, at the place where we end up,
we measure the gradient and we go
downhill.
That correct any errors you made, and we
our new accumulated gradient.
Now if you compare that with the standard
momentum method, the standard momentum
method starts with a accumulating
gradient, like that initial brand vector,
but then it measures the gradient where it
is, so it measures the gradient at its
current location, and it adds that to the
brown vector, so that it makes a jump like
this big blue vector.
That is just the brown vector plus the
current gradient.
It turns out, if you're going to gamble,
it's much better to gamble and then make a
correction, than to make a correction and
then gamble.