In this video, we're going to look at
stochastic gradient descent learning for a
neural network,
particularly the mini-batch version, which
is probably the most widely used learning
algorithm for large neural networks.
We've seen this before, but let's start
with a reminder about what the error
surface looks like for a linear neuron.
The error surface means a surface that
lies in a space where the horizontal axes
correspond to the weights of the neural
net.
And the vertical axis corresponds to the
error it makes.
For a linear neuron with a squared error,
that surface always forms a quadratic
bowl.
The vertical cross sections are parabolas,
and the horizontal cross sections are
ellipses.
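To see why the surface is a quadratic bowl, it may help to write the squared error out. The notation here is assumed for this sketch and does not come from the lecture: inputs x^n, targets t^n, weight vector w, and the shorthand A, b, and c.

```latex
E(\mathbf{w}) = \tfrac{1}{2}\sum_{n}\bigl(t^{n} - \mathbf{w}^{\top}\mathbf{x}^{n}\bigr)^{2}
             = \tfrac{1}{2}\,\mathbf{w}^{\top} A\,\mathbf{w} - \mathbf{b}^{\top}\mathbf{w} + c,
\qquad
A = \sum_{n}\mathbf{x}^{n}{\mathbf{x}^{n}}^{\top},\quad
\mathbf{b} = \sum_{n} t^{n}\mathbf{x}^{n},\quad
c = \tfrac{1}{2}\sum_{n}\bigl(t^{n}\bigr)^{2}.
```

The error is a quadratic function of the weights. When A is positive definite, every vertical slice through the surface is an upward-opening parabola and the contours of constant error are ellipses.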
For multilayer nonlinear nets the error
surface is much more complicated,
but as long as the weights aren't too big
it's a smooth error surface, and locally
it's well approximated by a fraction of a
quadratic bowl.
It might not be the bottom of the bowl but
there's a piece of quadratic bowl that
will fit the local error surface very
well.
If we look at the convergence speed when we
do full-batch learning, when the error
surface is a quadratic bowl,
the obvious thing to do is go downhill,
because this will reduce the error.
But the problem is, that the direction of
steepest descent does not point to the
place we want to go to.
As you can see in the ellipse, the
direction of steepest descent is almost at
right angles to the direction we want to go
in.
You've got a gradient that's very big
across the ellipse, which is the direction
in which we only want to travel a small
distance, and the gradient's very small
along the ellipse, and that's the
direction in which we want to travel a large
distance.
It's precisely the wrong way around.
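A small numerical sketch can make this concrete. It assumes a hypothetical two-weight bowl E = 0.5 * (a * w1^2 + b * w2^2) with its minimum at the origin. The curvatures a = 100 and b = 1 and the starting point are illustrative choices, not values from the lecture.

```python
import math

def steepest_descent_angle(w, a=100.0, b=1.0):
    """Angle in degrees between the steepest-descent direction and the
    straight line to the minimum (the origin) of
    E = 0.5 * (a * w1**2 + b * w2**2)."""
    downhill = (-a * w[0], -b * w[1])   # negative gradient
    to_minimum = (-w[0], -w[1])         # where we actually want to go
    dot = downhill[0] * to_minimum[0] + downhill[1] * to_minimum[1]
    cos_angle = dot / (math.hypot(*downhill) * math.hypot(*to_minimum))
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.degrees(math.acos(cos_angle))

# Elongated bowl: steepest descent points mostly across the ellipse.
angle_elongated = steepest_descent_angle((1.0, 10.0))

# Circular bowl (a == b): steepest descent points straight at the minimum.
angle_circular = steepest_descent_angle((1.0, 10.0), a=1.0, b=1.0)

print(f"elongated bowl: {angle_elongated:.1f} degrees off target")
print(f"circular bowl:  {angle_circular:.1f} degrees off target")
```

With these assumed curvatures the downhill direction is far closer to a right angle than to the direction of the minimum, while in a circular bowl the two directions coincide.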
Now you might think that studying linear
systems like this is not a good idea if
you want to optimize big non-linear nets.
But even for these non-linear multi-layer
nets, this kind of problem arises.
It's a very similar problem that arises
even though the error surfaces aren't
globally quadratic bowls.
Locally they have the same kind of
properties.
That is, they tend to be very curved in
some directions, and very uncurved in
other directions.
So the way the learning goes wrong if you
use a big learning rate is that you slosh
to and fro in the directions in which the
error surface is very curved.
So we'll call that sloshing across a
ravine.
And with the learning rate too big you'll
actually diverge.
What we want to achieve is that we go
quickly along the ravine in directions
that have small, but very consistent
gradients.
And we move slowly in directions with
these big, but very inconsistent
gradients.
That is, if you go in that direction for a
short distance, the gradient will reverse
sign.
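The sloshing and the divergence can be reproduced with plain gradient descent on a hypothetical two-weight bowl E = 0.5 * (a * w1^2 + b * w2^2). This is a sketch under assumed values: the curvatures, starting point, and learning rates are illustrative. For this particular bowl, each step multiplies w1 by (1 - learning_rate * a), so the steep direction is unstable once the learning rate exceeds 2 / a.

```python
def descend(learning_rate, steps, a=100.0, b=1.0, start=(1.0, 10.0)):
    """Run plain gradient descent on E = 0.5 * (a*w1**2 + b*w2**2)
    and return the list of visited (w1, w2) points."""
    w1, w2 = start
    path = [(w1, w2)]
    for _ in range(steps):
        w1 -= learning_rate * a * w1   # steep direction, across the ravine
        w2 -= learning_rate * b * w2   # shallow direction, along the ravine
        path.append((w1, w2))
    return path

# Just below the stability limit 2 / a = 0.02: w1 flips sign every step
# (sloshing) but shrinks, while w2 creeps slowly towards the minimum.
sloshing = descend(learning_rate=0.019, steps=50)

# Just above the limit: w1 flips sign and grows, so learning diverges.
diverging = descend(learning_rate=0.021, steps=50)

print("sloshing  first steps of w1:", [round(p[0], 3) for p in sloshing[:4]])
print("sloshing  final point:", sloshing[-1])
print("diverging final point:", diverging[-1])
```

In both runs the weight along the ravine has barely moved after 50 steps, which is the other half of the problem: a learning rate small enough for the steep direction is far too small for the shallow one.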
Before we go into how we achieve that, I
need to talk a little bit about stochastic
gradient descent, and the motivation for
using it.
If you have a data set that's highly
redundant, then if you compute the
gradient for a weight on the first half of
the data set, you'll get almost exactly
the same answer as you get if you compute
the gradient on the second half.
So it's a complete waste of time to
compute the gradient on the whole data
set.
You'd be much better off computing the
gradient on a subset of the data, then
updating the weights, and then computing
the gradient for the updated weights on
the remaining data.
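The redundancy argument can be checked on a toy problem. Everything in this sketch is an assumption made for illustration: a one-weight linear neuron, a synthetic data set where every case follows t = 2x plus a little noise, and a learning rate of 0.5.

```python
import random

def mean_gradient(w, cases):
    """Gradient of the mean of 0.5 * (t - w*x)**2 over the given cases."""
    return sum(-(t - w * x) * x for x, t in cases) / len(cases)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical, highly redundant data: all cases follow the same rule.
cases = []
for _ in range(10000):
    x = rng.uniform(-1.0, 1.0)
    cases.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))

first_half, second_half = cases[:5000], cases[5000:]

w = 0.0
g_first = mean_gradient(w, first_half)
g_second = mean_gradient(w, second_half)
print("gradient on first half: ", g_first)
print("gradient on second half:", g_second)

# Use the first half for one update, then compute the next gradient
# on the second half at the updated weight.
learning_rate = 0.5
w_after_first = w - learning_rate * g_first
w_after_second = w_after_first - learning_rate * mean_gradient(
    w_after_first, second_half
)
print("weight after one update: ", w_after_first)
print("weight after two updates:", w_after_second)
```

The two half-set gradients agree closely, so one pass over the data yields two useful updates instead of one.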
We can take that to extremes and say we're
going to compute the gradient on a single
training case, we're going to update the
weights and then we're going to compute
the gradient on the next training case
using those new weights.
That's called online learning.
In general, we don't want to go quite that
far.
It's usually better to use small
mini-batches, typically 10, 100, or even
1000 examples. One advantage of a small
mini-batch is that less computation is
used for actually updating the weights,
because you do that less often, compared
with online learning.
Another advantage is that when you compute
the gradient, you can compute the gradient
for a whole bunch of cases in parallel.
Most computers are very good at doing
matrix-matrix multiplies, and that will
allow you to consider a whole bunch of
training cases and apply the weights to a
whole bunch of training cases at the same
time to figure out the activities going
into the next layer for all of those
training cases.
That gives you a matrix-matrix multiply,
and it's very efficient, especially on a
graphics processing unit.
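The structure of that computation can be sketched in plain Python. The weight matrix and the mini-batch below are made-up numbers, and the function names are hypothetical. Pure Python gives none of the speed benefit. The point is only that stacking the training cases as rows turns many matrix-vector products into one matrix-matrix product, which in practice would be handed to an optimized linear algebra library.

```python
def layer_input_one_case(weights, x):
    """Matrix-vector product: total input to each unit for one case."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def layer_input_batch(weights, batch):
    """Matrix-matrix product: one row of total inputs per training case.
    Equivalent to batch (cases x inputs) times weights transposed."""
    return [
        [sum(w * xi for w, xi in zip(row, x)) for row in weights]
        for x in batch
    ]

# Assumed layer with 3 inputs and 2 units (one row of weights per unit).
weights = [[1, 2, 3],
           [4, 5, 6]]

# Assumed mini-batch of 4 training cases, one case per row.
batch = [[1, 0, 0],
         [0, 1, 0],
         [1, 1, 1],
         [2, 0, 1]]

activities = layer_input_batch(weights, batch)
print(activities)  # [[1, 4], [2, 5], [6, 15], [5, 14]]
```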
One point about using mini batches is you
wouldn't want to have a mini batch in
which the answer is always the same and
then on the next mini batch have a
different answer that's always the same.
That would cause the weights to slosh
unnecessarily.
The ideal, if you have say ten classes,
would be to have a mini batch with say ten
examples or 100 examples, that has exactly
the same number from each class in the
mini batch.
One way to approximate that is simply to
take all your data and just put it in
random order and grab random mini-batches.
But you must avoid having mini batches
that are very uncharacteristic of the
whole set of data because the mini-batches
are all of one class.
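The effect of putting the data in random order can be sketched as follows. The data set is hypothetical: 10 classes with 100 cases each, stored sorted by class, which is the bad situation described above. The helper names are made up for this example.

```python
import random

def sequential_minibatches(cases, batch_size):
    """Cut the data into mini-batches in its stored order."""
    return [cases[i:i + batch_size] for i in range(0, len(cases), batch_size)]

def shuffled_minibatches(cases, batch_size, rng):
    """Put the data in random order first, then cut it into mini-batches."""
    order = list(cases)   # copy so the caller's list is left untouched
    rng.shuffle(order)
    return sequential_minibatches(order, batch_size)

def classes_in(batch):
    """Number of distinct class labels present in a mini-batch."""
    return len({label for _, label in batch})

# Hypothetical data set stored sorted by class: (case id, class label).
cases = [(f"case_{c}_{i}", c) for c in range(10) for i in range(100)]

unshuffled = sequential_minibatches(cases, 100)
shuffled = shuffled_minibatches(cases, 100, random.Random(0))

print("classes per batch, stored order:", [classes_in(b) for b in unshuffled])
print("classes per batch, random order:", [classes_in(b) for b in shuffled])
```

Random order only approximates the ideal of exactly equal class counts in every mini-batch, but it removes the single-class batches that make the weights slosh.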
So basically there are two types of learning
algorithms for neural nets.
There are full-gradient algorithms, where
you compute the gradient from all of the
training cases.
And once you've done that, there's a lot
of clever ways to speed up learning.
There's things like nonlinear versions of
a method called conjugate gradient.
The optimization community has been
studying the general problem of how you
optimize smooth nonlinear functions for
many years.
Now multi-layer neural networks are pretty
untypical of the kinds of problems they
study.
So applying the methods they developed may
need a lot of modification to make them
work for these multi-layer neural
networks.
But when you have highly redundant and
large training sets, it's nearly always
better to use mini batch learning.
The mini-batches may need to be quite big,
but that's not so bad because big
mini-batches are more computationally
efficient.
I'm now going to describe a basic
mini-batch gradient descent learning
algorithm.
This is what most people would use when
they started training a big neural net on
a big redundant data set.
You start by guessing an initial learning
rate,
and you look to see if the network learns
satisfactorily or if the error keeps
getting worse or oscillates wildly.
If that happens, you reduce the learning
rate.
You also look to see if the error is
falling too slowly.
You expect that the error might fluctuate
a bit if you measure it on a validation
set, because the gradient on a
mini-batch is just a rough estimate of the
overall gradient.
So you don't want to reduce the learning
rate every time the error rises.
But what you're hoping is that the error
will fall fairly consistently.
And if it is falling fairly consistently
and very slowly, you can probably increase
the learning rate.
Once you've got that working, you can then
write a simple program to automate that
way of adjusting the learning rate.
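One possible shape for such a program is sketched below. The function name, the window of five recent measurements, the "too slow" threshold, and the scaling factors are all assumptions chosen for illustration. The lecture does not specify any of them.

```python
def adjust_learning_rate(lr, errors, window=5, slow=0.01, down=0.5, up=1.2):
    """Return a new learning rate given the history of measured errors.

    Hypothetical rule of thumb:
    - error worse than `window` checks ago, or rising on most recent
      checks (oscillating wildly): reduce the learning rate;
    - error falling on every recent check but by less than the fraction
      `slow` in total: increase the learning rate;
    - otherwise leave it alone, so one noisy rise changes nothing.
    """
    if len(errors) < window + 1:
        return lr                              # not enough evidence yet
    recent = errors[-(window + 1):]
    changes = [later - earlier for earlier, later in zip(recent, recent[1:])]
    rises = sum(1 for change in changes if change > 0)

    if recent[-1] > recent[0] or rises > window // 2:
        return lr * down
    if rises == 0 and recent[0] > 0:
        relative_drop = (recent[0] - recent[-1]) / recent[0]
        if relative_drop < slow:
            return lr * up
    return lr

getting_worse = [1.0, 1.2, 1.1, 1.4, 1.3, 1.6]
slow_but_steady = [1.0, 0.999, 0.998, 0.997, 0.996, 0.995]
healthy_with_noise = [1.0, 0.9, 0.92, 0.8, 0.7, 0.6]

print(adjust_learning_rate(0.1, getting_worse))       # reduced
print(adjust_learning_rate(0.1, slow_but_steady))     # increased
print(adjust_learning_rate(0.1, healthy_with_noise))  # unchanged
```

The third history contains a single rise and leaves the learning rate alone, which matches the advice not to react every time the error goes up.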
One thing that nearly always helps,
towards the end of learning with
mini-batches, is to turn down the
learning rate.
That's because you're going to get
fluctuations in the weights caused by the
fluctuations in the gradients that come
from the mini batches.
And you'd like a final set of weights
that's a good compromise.
So, when you turn down the learning rate,
you're smoothing away those fluctuations,
and getting a final set of weights that's
good for many mini-batches.
So a good time to turn down the learning
rate is when the error stops decreasing
consistently.
And a good criterion for saying the error
stopped decreasing is to use the error on
a separate validation set.
That is, it's a bunch of examples that you
are not using for training and that are
also not going to be used for your
final test.
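A minimal sketch of that criterion follows. The class name, the patience of three checks, the minimum improvement, and the halving factor are assumptions made for illustration. The validation errors fed in below are made-up numbers, not results.

```python
class PlateauDecay:
    """Turn the learning rate down when the validation error has not
    improved on its best value for `patience` consecutive checks."""

    def __init__(self, lr, factor=0.5, patience=3, min_delta=1e-3):
        self.lr = lr
        self.factor = factor        # multiply lr by this on a plateau
        self.patience = patience    # checks to wait before turning down
        self.min_delta = min_delta  # smallest drop that counts as progress
        self.best = float("inf")
        self.stale = 0              # checks since the last real improvement

    def step(self, validation_error):
        """Record one validation measurement and return the learning rate."""
        if validation_error < self.best - self.min_delta:
            self.best = validation_error
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0
        return self.lr

schedule = PlateauDecay(lr=0.1)
# Made-up validation errors: steady progress, then a plateau.
history = [schedule.step(e) for e in [1.0, 0.8, 0.7, 0.71, 0.70, 0.705]]
print(history)
```

The learning rate stays fixed while the validation error keeps improving and is halved only after three checks in a row without progress, so ordinary mini-batch fluctuations do not trigger a reduction.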