In this video, we are going to look at a number of issues that arise when using stochastic gradient descent with mini-batches. There are a large number of tricks that make things work much better; these are the kind of black art of neural networks, and I'm going to go over some of the main tricks in this video. The first issue I want to talk about is initializing the weights in your neural network. If two hidden units have exactly the same weights and the same biases, with the same incoming and outgoing connections, then they can never become different from one another, because they will always get exactly the same gradient. So, to allow them to learn different feature detectors, you need to start them off different from one another. We do this by initializing the weights to small random values; that breaks the symmetry. Those small random weights shouldn't all necessarily be the same size as each other. If you've got a hidden unit with a very big fan-in, using quite big weights will tend to saturate it, so you can afford to use much smaller weights; if a hidden unit has a very small fan-in, you want to use bigger weights. Since the weights are random, the size of the total input to a unit scales with the square root of the number of incoming weights, and so a good principle is to make the size of the initial weights inversely proportional to the square root of the fan-in. We can also scale the learning rates for the weights the same way.
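Here is a minimal sketch of that initialization rule in NumPy; the particular layer sizes and the Gaussian draw are just my illustrative choices, not something specified in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(fan_in, fan_out):
    # Small random weights break the symmetry between hidden units.
    # Dividing by sqrt(fan_in) keeps the typical total input to a unit
    # about the same size whether it has ten incoming weights or a thousand.
    return rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

# A unit with a big fan-in gets proportionally smaller initial weights.
W_small_fan_in = init_weights(10, 50)     # std of each weight is about 0.32
W_big_fan_in   = init_weights(1000, 50)   # std of each weight is about 0.03
```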
One thing that has a surprisingly big effect on the speed with which a neural network will learn is shifting the inputs; that is, adding a constant to each component of the inputs. It seems surprising that that could make much difference, but when you're using steepest descent, shifting an input value by adding a constant can make a very big difference. It usually helps to shift each component of the input so that, averaged over all of the training data, it has a value of zero; that is, make sure its mean is zero. So suppose we have a little neuron-like unit, just a linear neuron with two weights, and suppose we have two training cases. The first training case says that when the inputs are 101 and 101, you should give an output of two. The second one says that when the inputs are 101 and 99, you should output zero. I'm using color here to indicate which training case I'm talking about. If you look at the error surface you get for those two training cases, it looks like this: the green line is the line along which the weights will satisfy the first training case, and the red line is the line along which the weights will satisfy the second training case. What we notice is that they're almost parallel, and so when you combine them you get a very elongated ellipse. One way to think about what's going on here is that, because we're using a squared error measure, we get a parabolic trough along the red line: the red line is the bottom of the parabolic trough that tells us the squared error we'll get on the red case. And there's another parabolic trough with the green line along its bottom. It turns out, although this may surprise your spatial intuition, that if you add together two parabolic troughs you get a quadratic bowl; a very elongated quadratic bowl, in this case. So that's where that error surface came from. Now look what happens if we subtract a hundred from each of those two input components. We get a completely different error surface: in this case it's a circle, which is ideal. The green line is now the line along which the weights add to two: we're going to take the first weight and multiply it by one, take the second weight and multiply it by one, and we need to get two, so the weights had better add to two. The red line is the line along which the two weights are equal, because we're going to take the first weight and multiply it by one, and the second weight and multiply it by minus one, so if the weights are equal we'll get the zero that we need. So the error surface in this case is a nice circle where gradient descent is really easy, and all we did was subtract 100 from every input. If you're thinking about what happens not with the inputs but with the hidden units, it makes sense to have hidden units that are hyperbolic tangents, which go between minus one and one. The hyperbolic tangent of x is simply twice the logistic of 2x, minus one. The reason that makes sense is that the activities of the hidden units are then roughly zero mean, and that should make the learning faster in the next layer. Of course, that's only true if the inputs to the hyperbolic tangents are distributed sensibly around zero. In that respect, a hyperbolic tangent is better than a logistic. However, there are other respects in which a logistic is better. For example, a logistic gives you a rug to sweep things under: for a big negative input it gives an output of essentially zero, and if you make the input even more negative, the output is still essentially zero, so fluctuations in big negative inputs are ignored by the logistic. For a hyperbolic tangent you have to go out to the end of its plateaus before it can ignore anything. Another thing that makes a big difference is scaling the inputs. When we use steepest descent, scaling the input values is a very simple thing to do: we transform them so that each component of the input has unit variance over the whole training set, so that it has a typical value of one or minus one. Again, if we take this simple net with two weights and look at the error surface when the first component of the input is very small and the second component is much bigger, we get an ellipse with very high curvature along the axis where the input component is big, because small changes in that weight make a big difference to the output, and very low curvature along the axis where the input component is small, because small changes to that weight hardly make any difference to the error. The color here indicates which axis we're using, not which training example, as it did on the previous slide. If we simply change the variance of the inputs, just rescale them, making the first component ten times as big and the second component ten times as small, we get a nice circular error surface. Shifting and scaling the inputs are very simple things to do. Something that's a bit more complicated actually works even better, because it's guaranteed to give you a circular error surface, at least for a linear neuron: we try to decorrelate the components of the input vectors. In other words, if you take two components and look at how they're correlated with one another over the whole training set (remember the earlier example, where the number of portions of chips and the number of portions of ketchup might be highly correlated), we want to try to get rid of those correlations, and that will make learning much easier. There are actually many ways to decorrelate things. For those of you who know about principal components analysis, a very sensible thing to do is to apply it: remove the components that have the smallest eigenvalues, which already achieves some dimensionality reduction, and then scale the remaining components by dividing them by the square roots of their eigenvalues. For a linear system, that will give you a circular error surface. If you don't know about principal components analysis, we'll cover it later in the course. Once you've got a circular error surface, the gradient points straight towards the minimum, so learning is really easy.
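Here is a minimal sketch of these three preprocessing steps (shift, scale, decorrelate with PCA) applied to a matrix of training inputs; the function name, the n_keep parameter, and the data layout (one training case per row) are my own illustrative assumptions.

```python
import numpy as np

def preprocess(X, n_keep):
    """X: array of shape (num_cases, num_components) of training inputs."""
    # Shift: give every input component zero mean over the training set.
    X = X - X.mean(axis=0)
    # Scale: give every input component unit variance over the training set
    # (assumes no component is constant).
    X = X / X.std(axis=0)
    # Decorrelate with PCA: project onto the eigenvectors of the covariance,
    # drop the directions with the smallest eigenvalues, and divide each
    # remaining component by the square root of its eigenvalue.
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = eigvecs[:, -n_keep:]               # keep the n_keep largest directions
    return (X @ top) / np.sqrt(eigvals[-n_keep:])
```

With inputs preprocessed like this, the error surface for a single linear neuron with squared error is circular, so plain gradient descent heads straight for the minimum.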
Now let's talk about a few of the common problems that people encounter. One thing that can happen, if you start with a learning rate that's much too big, is that you drive the hidden units to be either firmly on or firmly off; that is, their incoming weights become very big and positive or very big and negative, and their state no longer depends on the input. That means the error derivatives coming back from the output won't affect them, because they are out on the plateaus where the derivative is essentially zero, and so learning will stop. Because people are expecting to see local minima, when learning stops they say, "Oh, I'm at a local minimum and the error's terrible, so there are these really bad local minima." Usually that's not true; usually it's because you got stuck out on the end of a plateau. A second problem occurs when you are classifying things and you're using either a squared error or a cross-entropy error. The best guessing strategy is normally to make the output unit equal to the proportion of the time that it should be one. The network will fairly quickly find that strategy, and so the error will fall quickly, but, particularly if the network has many layers, it may take a long time before it improves much on that, because to improve on the guessing strategy it has to get sensible information from the input through all the hidden layers to the output, and that can take a long time to learn if you start with small weights. So again, the error falls quickly and then stops decreasing, and it looks like a local minimum, but actually it's another plateau. I mentioned earlier that towards the end of learning you should turn down the learning rate. You should also be careful about turning down the learning rate too soon. When you turn down the learning rate, you reduce the random fluctuations in the error due to the different gradients on different mini-batches, but of course you also reduce the rate of learning. So if you look at the red curve, you see that when we turn the learning rate down we get a quick win: the error falls, but after that we get slower learning. If we do that too soon, we're going to lose relative to the green curve. So don't turn down the learning rate too soon, and don't turn it down too much. I'm now going to talk about four main ways to speed up mini-batch learning a lot. The previous things I talked about were a kind of bag of tricks for making things work better; these are four methods all explicitly designed to make the learning go much faster. The first is the momentum method. In this method we don't use the gradient to change the position of the weights. That is, if you think of the weights as a ball on the error surface, standard gradient descent uses the gradient to change the position of that ball: you simply multiply the gradient by a learning rate and change the position of the ball by that vector. In the momentum method, we use the gradient to accelerate this ball; that is, the gradient changes its velocity.
And then the velocity is what changes the position of the ball. The reason that's different is that the ball can have momentum: it remembers previous gradients in its velocity. A second method for speeding up mini-batch learning is to use a separate adaptive learning rate for each parameter, and to slowly adjust that learning rate based on empirical measurements. The obvious empirical measurement is: do we keep making progress by changing the weight in the same direction, or does the gradient keep oscillating around so that its sign keeps changing? If the sign of the gradient keeps changing, we reduce the learning rate, and if it keeps staying the same, we increase the learning rate. A third method is what I now call rmsprop. In this method we divide the learning rate for each weight by a running average of the magnitudes of the recent gradients for that weight. So if the gradients are big, you divide by a large number, and if the gradients are small, you divide by a small number. That deals very nicely with a wide range of different gradient magnitudes. It's really a mini-batch version of just using the sign of the gradient, which is a method called rprop that was designed for full-batch learning. The final way of speeding up learning, which is what optimization people would naturally recommend, is to use full-batch learning with a fancy method that takes curvature information into account, to adapt that method to work for neural nets, and then maybe to adapt it some more so that it works with mini-batches. I'm not going to talk about that in this lecture.
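Here is a rough sketch of what the momentum and rmsprop updates look like for one weight vector; the function names, the decay constant, and the small epsilon in the denominator are my own illustrative choices rather than values given in the lecture.

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    # The gradient accelerates the "ball": it changes the velocity,
    # and the velocity is what changes the position of the weights.
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

def rmsprop_step(w, grad, mean_sq, lr=0.001, decay=0.9, eps=1e-8):
    # Keep a running average of the squared gradient for each weight, then
    # divide the gradient by the square root of that average, so big and
    # small gradients end up producing similar-sized weight changes.
    mean_sq = decay * mean_sq + (1 - decay) * grad ** 2
    return w - lr * grad / (np.sqrt(mean_sq) + eps), mean_sq

# Usage: carry the velocity / mean-square state from one mini-batch to the next.
w = np.zeros(3)
velocity = np.zeros_like(w)
mean_sq = np.zeros_like(w)
grad = np.array([0.5, -0.1, 2.0])   # gradient from one mini-batch (made up)
w, velocity = momentum_step(w, grad, velocity)
w, mean_sq = rmsprop_step(w, grad, mean_sq)
```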