In this video, I'm going to give a brief overview of the Hessian-free optimizer that can be used to train recurrent neural networks very effectively. This is a very complicated optimizer and I don't expect you to get all the details of it from this video. I just want you to have a general feel for how it works. Then, in the next video, we'll see how well it does on an interesting problem.

When we're training the weights of a neural network, we're trying to get as far down the error surface as possible. So one question is: if we choose a given direction to go in, how much reduction in the error can we achieve by going just the right distance in that direction? That is, how much does the error decrease before it starts rising again? Here we'll assume that the curvature is constant, so the error surface really is quadratic. We'll also assume that the magnitude of the gradient decreases as we move down the gradient, which amounts to assuming that the error surface is concave upward, like a bowl.

The maximum reduction we can get in the error by going in a particular direction depends on the ratio of the gradient to the curvature. So we want to move in directions that have a good ratio: even if the gradient is quite small, we want the curvature to be even smaller. Here's an example of a direction we could move in, where the vertical axis corresponds to the error, the horizontal axis corresponds to the weights in the direction we're moving, and the blue arrow corresponds to the reduction we get if we start at that red point. And here's a surface that has a gentler gradient, but because it has a better ratio of gradient to curvature, we get a bigger reduction in the error by the time we get to the minimum. The question is: how can we find directions like that second one, directions in which, even though the gradient may be small, the curvature is even smaller?

So let's start with Newton's method. Newton's method addresses the basic problem with steepest descent, which is that the gradient isn't the direction you want to go in. If the error surface is quadratic with circular cross-sections, the gradient is a good direction to go: it points straight at the minimum. So the idea of Newton's method is to apply a linear transformation that turns ellipses into circles. If we apply that transformation to the gradient vector, it will be as if we were going downhill on a circular error surface. To do this, we take the curvature matrix, H, sometimes called the Hessian, which is a function of the weights we have, compute its inverse, multiply the gradient dE/dw by that inverse, and then go some distance, epsilon, in that direction. If it's a truly quadratic surface and we choose epsilon correctly, which is quite easy to do, we'll arrive at the minimum of the surface in a single step.

Of course, that single step involves something complicated: inverting the Hessian matrix. The problem is that even if we only have a million weights in our neural network, the curvature matrix will have a trillion terms, and it's completely infeasible to invert it. So what do curvature matrices look like? For each pair of weights, wi and wj, they tell you how the gradient in one direction changes as you move in another direction. In other words, as I change weight i, how does the gradient of the error with respect to weight j change? That's what a typical off-diagonal term tells you.
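To make the Newton step concrete, here is a minimal NumPy sketch on a toy two-weight quadratic error surface. The curvature matrix H, the starting weights, and the step size are made-up illustrative numbers, not anything from a real network; with a million weights you could never form or invert H like this, which is exactly the problem discussed next.

```python
import numpy as np

# Toy quadratic error surface E(w) = 0.5 * w^T H w, so the gradient is dE/dw = H w.
H = np.array([[4.0, 1.0],
              [1.0, 0.5]])                 # elliptical curvature with a "twist" (off-diagonal term)
w = np.array([2.0, -3.0])                  # current weights

grad = H @ w                               # dE/dw at the current weights
newton_direction = -np.linalg.solve(H, grad)   # -H^{-1} dE/dw, solved without forming H^{-1} explicitly

epsilon = 1.0                              # on a truly quadratic surface, epsilon = 1 lands on the minimum
w_new = w + epsilon * newton_direction
print(w_new)                               # approximately [0, 0]: the minimum, reached in a single step
```

Even this tiny sketch avoids computing the inverse explicitly by solving a linear system instead, but with a million weights even storing the trillion-term curvature matrix would be out of the question.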
The terms on the diagonal tell you how the gradient of the error in the direction of a weight changes as you change that weight. So the off-diagonal terms in a curvature matrix correspond to twists in the error surface. A twist means that when you travel in one direction, the gradient in another direction changes. If we have a nice circular bowl, all those off-diagonal terms are zero: as we travel in one direction, the gradient in other directions doesn't change.

What goes wrong with steepest descent when you have an elliptical error surface is that as we travel in one direction, the gradient in another direction changes. So if I update one of the weights at the same time as I'm updating all the other weights, all those other updates will cause a change in the gradient for the first weight, and that means that when I update it, I may actually make things worse. The gradient may have actually reversed sign due to all the changes in the other weights. And so, as we get more and more weights, we need to be more and more cautious about changing each one of them, because the simultaneous changes in all the other weights can change the gradient for that weight. The curvature matrix determines the size of those interactions.

So we have to deal with the curvature; we can't just ignore it. And we'd like to deal with it without actually inverting a huge matrix, because the matrix has too many terms in a big neural net. One thing we can do is to just look at the leading diagonal of the curvature matrix and make our step size depend on that leading diagonal. That helps a bit: it lets us use different step sizes for different weights. But the diagonal terms are only a tiny fraction of the interactions, so we're ignoring most of the terms in the curvature matrix when we do that. In fact, we're ignoring nearly all of them.

Another thing we could do is approximate the curvature matrix with a matrix of much lower rank that captures its main aspects. That's what is done in Hessian-free methods, in L-BFGS, and in many other methods that try to do an approximate second-order minimization of the error. In the Hessian-free method, we make an approximation to the curvature matrix and then we assume that the approximation is correct: we assume we know what the curvature is and that the error surface really is quadratic. Then, starting from wherever we are now, we minimize the error using an efficient technique called conjugate gradient. Once we've done that, once we've got close to a minimum of this approximation to the curvature, we make another approximation to the curvature matrix and use conjugate gradient to minimize again.

It's also important in recurrent neural networks to add a penalty for changing any of the hidden activities too much. That will prevent us, for example, from changing a weight early on that causes huge effects later in the sequence. We don't want effects that are too big, and if we look at the changes in the hidden activities we can prevent that by penalizing those changes. If we put a quadratic penalty on those changes, we can combine it with the rest of the Hessian-free method.

The last thing I need to explain is conjugate gradient, and I'm just going to explain it briefly. Conjugate gradient is a very clever method that, instead of trying to go straight to the minimum as in Newton's method, minimizes in one direction at a time.
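Before getting into conjugate gradient, here is a minimal sketch of the diagonal trick mentioned above. The helper names grad(w) and hessian_diagonal(w), and the lr and damping values, are hypothetical placeholders for illustration; a real implementation would estimate the diagonal terms rather than compute the full Hessian.

```python
import numpy as np

def diagonal_scaled_step(w, grad, hessian_diagonal, lr=1.0, damping=1e-4):
    """One update that uses only the leading diagonal of the curvature matrix.

    grad(w) and hessian_diagonal(w) are hypothetical helpers that return
    dE/dw and the H_ii terms for the current weights w.
    """
    g = grad(w)
    h = hessian_diagonal(w)
    # Weights with big curvature get small steps; weights with small curvature get big steps.
    return w - lr * g / (np.abs(h) + damping)
```

This gives every weight its own effective step size, but as noted above it ignores all the off-diagonal terms, which is nearly all of the curvature matrix.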
So it starts off by taking the direction of steepest descent and goes to the minimum in that direction. That might involve re-evaluating the gradient and re-evaluating the error a few times to find the minimum in that direction. Once it's done that, it finds another direction and goes to the minimum in that second direction. The clever thing about the technique is that it chooses the second direction in such a way that it doesn't mess up the minimization it already did in the first direction. That's called a conjugate direction. Conjugate means that as you go in the new direction, you don't change the gradients in the previous directions. It's a funny idea. It's like the idea of a twist in an error surface: a twist means that when you go in one direction, you change the gradient in another direction. A conjugate direction is one you can go in that, in a sense, doesn't have a twist: you go in that direction and the gradient in the first direction doesn't change.

So here is a picture of an ellipse, and the red line is the major axis of the ellipse. We start off by doing one step of steepest descent, all the way to the minimum in that direction. If you think about it a bit, you can see that the minimum won't actually lie on the red line. On the red line, the gradient at right angles to the red line is zero, because it's the bottom of the ravine, but the direction we're going in isn't at right angles to the red line. We could make a little more progress by taking a small step at right angles to the red line and then a small step along the red line; since the red line slopes down towards the middle of the ellipse, that would gain us something. So when we minimize in the first direction, we stop slightly before crossing the bottom of the ellipse. When we reach the point that's the minimum, there's an interesting property of all the points that lie on the green line: on that green line, the gradient in the direction of the black arrow is zero. So we can go anywhere along that green line and we won't destroy the fact that we are at a minimum in the direction of the black arrow.

If we keep doing that for many directions in a high-dimensional error surface, we'll eventually be at a minimum in many different directions, and if we are at a minimum in as many different directions as there are dimensions in the space, we'll be at the global minimum. So we take the first step of steepest descent, we then figure out the direction of that green line (I'm not going to explain how we do that), and then we do a search along the green line to find how far we should go in order to minimize the error along it. We take our second step, like this, and now, in this two-dimensional space, we're at the minimum: we're at a minimum in the direction of the second step while still being at a minimum in the direction of the first step, so that must be the global minimum.

What conjugate gradient achieves is that it gets to the global minimum of an N-dimensional quadratic surface in only N steps. It's very efficient. It does that because it manages to get the gradient to be zero in N different directions. They're not orthogonal directions, but they are independent of one another, and that's sufficient to be at the global minimum.
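Here is a minimal sketch of conjugate gradient for a quadratic of the form E(x) = 0.5 x^T A x - b^T x, with A symmetric and positive definite. This is the textbook linear-CG recurrence, written out to illustrate the idea above rather than the exact code inside any particular Hessian-free implementation; on an N-dimensional quadratic it reaches the minimum in at most N iterations.

```python
import numpy as np

def conjugate_gradient(A, b, x0, n_steps):
    """Minimize E(x) = 0.5 * x^T A x - b^T x for symmetric positive definite A."""
    x = x0.copy()
    r = b - A @ x                        # residual = negative gradient of E at x
    d = r.copy()                         # first direction: steepest descent
    for _ in range(n_steps):
        if r @ r < 1e-12:                # already (numerically) at the minimum
            break
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)       # exact minimizer along the current direction
        x = x + alpha * d
        r_new = r - alpha * Ad
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d             # new direction, conjugate to all previous ones
        r = r_new
    return x

# On a 2-dimensional quadratic, two steps are enough to hit the minimum exactly.
A = np.array([[4.0, 1.0], [1.0, 0.5]])
b = np.array([1.0, -2.0])
x = conjugate_gradient(A, b, np.zeros(2), n_steps=2)
print(np.allclose(A @ x, b))             # True: the gradient A x - b is zero, so this is the minimum
```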
More importantly, in many fewer than N steps on a typical quadratic surface, it will have reduced the error to very close to its minimum value, and that's why we use it. We're not going to do the full N steps; that would be as expensive as inverting the whole matrix. We're going to do many fewer than N steps and get quite close to the minimum.

You can apply conjugate gradient directly to a non-quadratic error surface, like the error surface for a multilayer non-linear neural net, and it usually works quite well. It's essentially a batch method, but you can apply it to large mini-batches: you do many steps of conjugate gradient on the same large mini-batch and then move on to the next large mini-batch. That's called non-linear conjugate gradient. The Hessian-free optimizer uses conjugate gradient for minimization on a genuinely quadratic surface, and that's what conjugate gradient is best at; it works much better for that than for a non-linear surface. The genuinely quadratic surface that HF uses it on is the quadratic approximation to the true surface made by the Hessian-free method. So it makes that approximation, uses conjugate gradient to get close to a minimum of that first approximation, and then it makes a new approximation to the curvature and does it again.
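Finally, here is a heavily simplified sketch of one outer step of a Hessian-free style optimizer, in the spirit of what's described above. Everything named here is an assumption for illustration: grad(w) and curvature_vector_product(w, v) are hypothetical helpers (the second returns a product like H v with the curvature approximation, without ever forming the matrix), and real implementations add damping, the penalty on hidden-activity changes, clever initialization, and stopping criteria that are omitted here.

```python
import numpy as np

def hessian_free_step(w, grad, curvature_vector_product, n_cg_steps=50):
    """One outer iteration: build the local quadratic approximation at w and
    approximately minimize it with matrix-free conjugate gradient."""
    g = grad(w)                               # dE/dw at the current weights (hypothetical helper)
    d = np.zeros_like(w)                      # proposed weight change
    r = -g                                    # residual of H d = -g, with d = 0
    p = r.copy()
    for _ in range(n_cg_steps):               # many fewer steps than the number of weights
        if r @ r < 1e-12:                     # quadratic approximation already minimized
            break
        Hp = curvature_vector_product(w, p)   # only products with the curvature are needed
        alpha = (r @ r) / (p @ Hp)
        d = d + alpha * p
        r_new = r - alpha * Hp
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return w + d    # then re-approximate the curvature at the new weights and repeat
```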