In this video we're going to look at the error surface for a linear neuron. By understanding the shape of this error surface, we can understand a lot about what happens as a linear neuron learns. We can get a nice geometrical understanding of what's happening when we learn the weights of a linear neuron by considering a space that's very like the weight space we used to understand perceptrons, but with one extra dimension. So we imagine a space in which all the horizontal dimensions correspond to the weights, and there's one vertical dimension that corresponds to the error. In this space, points on the horizontal plane correspond to different settings of the weights, and the height corresponds to the error you're making with that set of weights, summed over all training cases.

For a linear neuron, the errors you make for each setting of the weights define an error surface, and this error surface is a quadratic bowl. That is, if you take a vertical cross-section, it's always a parabola, and if you take a horizontal cross-section, it's always an ellipse. This is only true for linear systems with a squared error. As soon as we go to multilayer, nonlinear neural nets, the error surface gets more complicated. As long as the weights aren't too big, the error surface will still be smooth, but it may have many local minima.

Using this error surface, we can get a picture of what's happening as we do gradient descent learning with the delta rule. What the delta rule does is compute the derivative of the error with respect to the weights. If you change the weights in proportion to that derivative, that's equivalent to doing steepest descent on the error surface. To put it another way, if we look at the error surface from above, we get elliptical contour lines, and the delta rule will take us at right angles to those elliptical contour lines, as shown in the picture. That's what happens with what's called batch learning, where we get the gradient summed over all training cases.

But we could also do online learning, where after each training case we change the weights in proportion to the gradient for that single training case. That's much more like what we do in perceptrons, and, as you can see, the change in the weights moves us towards one of these constraint planes. So in the picture on the right, there are two training cases. To get the first training case correct, the weights must lie on one of those blue lines, and to get the second training case correct, the weights must lie on the other blue line. If we start at one of those red points and compute the gradient on the first training case, the delta rule will move us perpendicularly towards that line. If we then consider the other training case, we'll move perpendicularly towards the other line. And if we alternate between the two training cases, we'll zigzag backwards and forwards, moving towards the solution point, which is where those two lines intersect. That's the set of weights that is correct for both training cases.
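To make that zigzag concrete, here is a minimal sketch of the online delta rule alternating between two training cases for a two-weight linear neuron. The particular inputs, targets, learning rate, and starting weights are made-up values for illustration; they are not numbers from the lecture.

```python
# A small sketch (with invented numbers) of the online delta rule zigzagging
# between the two constraint lines in weight space.
import numpy as np

# Two training cases for a linear neuron y = w . x, each with a target t.
# Each case defines a line in weight space: all (w1, w2) with w . x = t.
cases = [(np.array([1.0, 0.2]), 1.0),    # first training case (one blue line)
         (np.array([0.2, 1.0]), 1.0)]    # second training case (the other blue line)

w = np.array([2.0, -1.5])   # an arbitrary starting point (a "red point")
eps = 0.6                   # learning rate

for step in range(20):
    x, t = cases[step % 2]        # alternate between the two training cases
    y = w @ x                     # the neuron's output on this case
    w = w + eps * (t - y) * x     # delta rule: the update is along x, which is
                                  # perpendicular to the line w . x = t
    print(step, w)
# The weights zigzag between the two lines and settle at their intersection,
# the setting of the weights that is right for both cases.
```

The reason each step is perpendicular to a blue line is that the delta rule's update is proportional to that case's input vector, and the input vector is the normal of that case's constraint line.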
Using this picture of the error surface, we can also understand the conditions that will make learning very slow. If that ellipse is very elongated, which is going to happen if the lines that correspond to the training cases are almost parallel, then the gradient has a nasty property. If you look at the red arrow in the picture, the gradient is big in the direction in which we don't want to move very far, and it's small in the direction in which we want to move a long way. So the gradient will quickly take us across the bottom of the ravine, corresponding to the narrow axis of the ellipse, and it will take a long time to move us along the ravine, corresponding to the long axis of the ellipse. That's just the opposite of what we want. We'd like the gradient to be small across the ravine and big along the ravine, but that's not what we get. And so simple steepest descent, in which you change each weight in proportion to a learning rate times the error derivative, is going to have great difficulty with very elongated error surfaces like the one shown in the picture.
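To see that slowdown in numbers, here is a similar sketch of batch steepest descent on the elongated bowl you get from two almost-parallel training cases. Again, the inputs, targets, and learning rate are invented for illustration, and the learning rate is chosen to be about as big as it can be without the weights diverging.

```python
# A small sketch (with invented numbers) of batch steepest descent on a very
# elongated error surface: two almost-parallel input vectors give a quadratic
# bowl whose contours are long, thin ellipses.
import numpy as np

X = np.array([[1.0, 1.00],      # two nearly parallel training inputs ->
              [1.0, 1.05]])     # an ill-conditioned (elongated) bowl
t = np.array([1.0, 1.2])        # targets; the bottom of the bowl solves X w = t
w_star = np.linalg.solve(X, t)  # the solution point

w = np.array([4.0, 4.0])        # arbitrary starting weights
eps = 0.4                       # near the largest stable learning rate here

for step in range(200):
    err = X @ w - t             # residuals on the two training cases
    grad = X.T @ err            # gradient of 0.5 * summed squared error
    w = w - eps * grad          # steepest descent: big steps across the
                                # ravine, tiny steps along it
    if step % 50 == 0:
        print(step, np.linalg.norm(w - w_star))
# The distance to the solution drops quickly at first (crossing the ravine)
# and then creeps down very slowly (moving along the ravine).
```

The design point is just what the lecture describes: the gradient is dominated by the steep, narrow direction of the ellipse, so a learning rate small enough to be stable across the ravine makes progress along the ravine painfully slow.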