In this video, we're going to look at a method that was developed in the late 1980s by Robbie Jacobs and then improved by a number of other people. The idea is that each connection in the neural net should have its own adaptive learning rate, which we set empirically by observing what happens to the weight on that connection when we update it. If the gradient for the weight keeps reversing sign, we turn down the learning rate; if the gradient stays consistent, we turn it up.

So, let's start by thinking about why having a separate adaptive learning rate for each connection is a good idea. The problem is that, in a deep multilayer net, the appropriate learning rates can vary widely between different weights, especially between weights in different layers. If, for example, we start with small weights, the gradients tend to be much smaller in the early layers than in the later layers. Another factor that makes different learning rates appropriate for different weights is the fan-in of the unit. The fan-in determines the size of the overshoot effects you get when you simultaneously change many of the incoming weights to fix up the same error: it may be that the unit didn't get enough input, but when you change all those weights at the same time to fix up the error, it now gets too much input. Obviously, that effect is going to be bigger if there's a bigger fan-in. The net in the diagram on the right has more or less the same fan-in for both layers, but that's very different in some nets.

So, the idea is that we're going to use a global learning rate, which we set by hand, and then multiply it by a local gain that is determined empirically for each weight. A simple way to determine what those local gains should be is to start with a local gain of one for every weight. So, initially, we change the weight w_ij by the learning rate times the gain g_ij of one times the error derivative for that weight. Then we adapt g_ij: we increase g_ij if the gradient for the weight does not change sign, and we use small additive increases and multiplicative decreases. If the gradient for the weight at time t has the same sign as the gradient at time t minus one, where t refers to weight updates, then their product will be positive, because you've either got two negative gradients or two positive gradients, and we increase g_ij by a small additive amount. If the gradients have opposite signs, we decrease g_ij. And because we want to damp down g_ij quickly if it's already big, we decrease it multiplicatively. That ensures that big gains will decay very rapidly if oscillations start.

It's interesting to ask what would happen if the gradient were totally random, so that on each weight update we picked a random gradient. Then you'd get an equal number of increases and decreases, because the new gradient would equally often have the same sign as the previous gradient or the opposite sign. So you'd get a bunch of additive 0.05 increases and multiplicative 0.95 decreases, and those have an equilibrium point at a gain of one. If the gain is bigger than one, multiplying by 0.95 reduces it by more than adding 0.05 increases it; if the gain is smaller than one, adding 0.05 increases it by more than multiplying by 0.95 decreases it. In other words, the expected change per update is 0.5 × 0.05 − 0.5 × 0.05g, which is zero exactly when g equals one. So, with random gradients, the gain will hover around one. And if the gradient is consistently in the same direction, the gain can get much bigger than one; if it's consistently in opposite directions, which means we're oscillating across a ravine, the gain can get much smaller than one.
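To make that rule concrete, here is a minimal NumPy sketch of one full-batch update with per-weight gains, using the 0.05 additive increase and 0.95 multiplicative decrease from the lecture. The function name, the learning rate value, the flat-array representation of the weights, and the `compute_gradient` helper are all illustrative assumptions, not part of the method's original specification.

```python
import numpy as np

def adaptive_gain_step(w, grad, prev_grad, gains, lr=0.01):
    """One full-batch weight update with a separate adaptive gain per weight.

    Gains start at one. If a weight's gradient has the same sign as on the
    previous update, its gain grows by a small additive amount (+0.05);
    if the sign flips, the gain shrinks multiplicatively (*0.95), so big
    gains decay rapidly once oscillations start.
    """
    same_sign = grad * prev_grad > 0             # elementwise sign agreement
    gains = np.where(same_sign, gains + 0.05, gains * 0.95)
    w = w - lr * gains * grad                    # effective rate = lr * gain
    return w, gains

# Usage sketch:
#   gains = np.ones_like(w); prev_grad = np.zeros_like(w)
#   then on each full-batch update:
#       grad = compute_gradient(w)               # hypothetical gradient function
#       w, gains = adaptive_gain_step(w, grad, prev_grad, gains)
#       prev_grad = grad
```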
There are a number of tricks for making these adaptive learning rates work better. It's important to limit the size of the gains: a reasonable range is 0.1 to 10, or 0.1 to 100. You don't want the gains to get huge, because then you can easily run into instability, the gains won't die down fast enough, and you'll destroy all the weights.

Adaptive learning rates were designed for full-batch learning. You can also apply them with mini-batches, but they had better be pretty big mini-batches. That ensures the sign changes in the gradient aren't due to the sampling error of a mini-batch, but really reflect crossing to the other side of the ravine.

There's also nothing to prevent you from combining adaptive learning rates with momentum. Jacobs suggests that, instead of using the agreement in sign between the current gradient and the previous gradient, you use the agreement in sign between the current gradient and the velocity for that weight, that is, the accumulated gradient. If you do that, you get a nice combination of the advantages of momentum and the advantages of adaptive learning rates: adaptive learning rates only deal with axis-aligned effects, whereas momentum doesn't care about the alignment of the axes. Momentum can deal with diagonal ellipses, moving quickly in that diagonal direction, which adaptive learning rates can't do.
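Here is a hedged sketch of those last two tricks combined. It assumes one particular reading of the lecture: that "velocity" means a decaying accumulation of past gradients, that agreement is checked against the velocity before it is updated, and that gains are clipped to the 0.1-to-10 range mentioned above. The function name and hyperparameter values are illustrative.

```python
import numpy as np

def adaptive_gain_momentum_step(w, grad, velocity, gains, lr=0.01,
                                momentum=0.9, gain_min=0.1, gain_max=10.0):
    """Per-weight gains adapted by comparing the current gradient's sign
    with the sign of the velocity (the accumulated gradient), as Jacobs
    suggests, with the gains clipped to a limited range for stability.
    """
    agree = grad * velocity > 0                  # gradient agrees with accumulated gradient?
    gains = np.where(agree, gains + 0.05, gains * 0.95)
    gains = np.clip(gains, gain_min, gain_max)   # limit the size of the gains
    velocity = momentum * velocity + grad        # accumulate the gradient
    w = w - lr * gains * velocity                # momentum step scaled per weight
    return w, velocity, gains
```

Checking agreement against the velocity rather than the previous gradient means a weight's gain keeps growing as long as the weight makes consistent progress in one direction, even if individual gradients are noisy.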