This video introduces the learning algorithm for a linear neuron. It is quite like the learning algorithm for a perceptron, but it achieves something different. In a perceptron, the weights are always getting closer to a good set of weights. In a linear neuron, the outputs are always getting closer to the target outputs.

The perceptron convergence procedure works by ensuring that every time we change the weights, we get closer to a good set of weights. That kind of guarantee cannot be extended to more complex networks, because in more complex networks, when you average two good sets of weights you might get a bad set of weights. So for multilayer neural networks we don't use the perceptron learning procedure, and to prove that something is improving as they learn, we don't use the same kind of proof at all. They should never have been called multilayer perceptrons. It's partly my fault and I'm sorry.

For multilayer nets we're gonna need a different way to show that the learning procedure makes progress. Instead of showing that the weights get closer to a good set of weights, we're gonna show that the actual output values get closer to the target output values. This can be true even for non-convex problems, in which averaging the weights of two good solutions does not give you a good solution. It's not true for perceptron learning: in perceptron learning, the outputs as a whole can get further away from the target outputs even though the weights are getting closer to a good set of weights.

The simplest example of learning in which you make the outputs get closer to the target outputs is learning in a linear neuron with a squared error measure. Linear neurons, which are also called linear filters in electrical engineering, have a real-valued output that's simply the weighted sum of their inputs. So the output y, which is the neuron's estimate of the target value, is the sum over all the inputs i of the weight w_i times the input x_i. We can write that in summation form, or we can write it in vector notation as the scalar product of the weight vector and the input vector. The aim of the learning is to minimize the error summed over all training cases. We need a measure of that error, and to keep life simple, we use the squared difference between the target output and the actual output.

So one question is, why don't we just solve this analytically? It's straightforward to write down a set of equations, one equation per training case, and to solve for the best set of weights. That's the standard engineering approach, so why don't we use it? The first answer, the scientific answer, is that we'd like to understand what real neurons might be doing, and they're probably not solving a set of equations symbolically. An engineering answer is that we want a method we can then generalize to multilayer, nonlinear networks. The analytic solution relies on the system being linear and having a squared error measure. An iterative method, which we're gonna see next, is usually less efficient, but much easier to generalize to more complex systems.

So I'm now gonna go through a toy example that illustrates the iterative method for finding the weights of a linear neuron. Suppose that every day you get lunch at a cafeteria, and your diet consists entirely of fish, chips, and ketchup. Each day you order several portions of each, but on different days it's different numbers of portions.
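Before working through that example, here is a minimal sketch in Python of the linear neuron and the squared error measure described above. The training cases and initial weights below are made-up illustrative values, not numbers from the lecture.

```python
# Minimal linear neuron: the output y is the weighted sum of the inputs, and
# the error is half the squared difference between target and output, summed
# over training cases. All numbers below are illustrative, not from the lecture.

def neuron_output(weights, inputs):
    """y = sum_i w_i * x_i"""
    return sum(w * x for w, x in zip(weights, inputs))

def squared_error(weights, cases):
    """E = 1/2 * sum over cases of (t - y)^2"""
    return 0.5 * sum((t - neuron_output(weights, x)) ** 2 for x, t in cases)

cases = [([0.5, 1.0, 2.0], 3.0),     # (input vector, target output)
         ([1.5, 0.0, 1.0], 1.0)]
weights = [0.2, 0.2, 0.2]            # an initial guess at the weights
print(neuron_output(weights, cases[0][0]))   # ~0.7
print(squared_error(weights, cases))         # ~2.77
```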
The cashier only shows you the total price of the meal, but after a few days you ought to be able to figure out what the price is for each portion of each kind of thing.

In the iterative approach, you start with random guesses for the prices of the portions, and then you adjust those guesses so that you get a better fit to the prices the cashier tells you, the observed prices of whole meals. Each meal gives you a price, and that gives you a linear constraint on the prices of the individual portions. It looks like this: the price of the whole meal is the number of portions of fish, x_fish, times the cost of a portion of fish, w_fish, plus the same terms for chips and ketchup. So the prices of the portions are like the weights of a linear neuron, and we can think of the whole weight vector as being the price of a portion of fish, the price of a portion of chips, and the price of a portion of ketchup. We're going to start with guesses for these prices and then adjust the guesses slightly, so that we agree better with what the cashier says.

Let's suppose that the true weights the cashier is using to figure out the price are 150 for a portion of fish, 50 for a portion of chips, and 100 for a portion of ketchup. For a meal with two portions of fish, five of chips, and three of ketchup, that leads to a price of 850, so that's going to be our target value. Now suppose we start with guesses that each portion costs 50. For that meal of two portions of fish, five of chips, and three of ketchup, we'll initially think the price should be 500. That gives us a residual error of 350, where the residual error is the difference between what the cashier says and what we think the price should be with our current weights.

We're then gonna use the delta rule for revising our prices of portions. We make the change in a weight, delta w_i, equal to a learning rate epsilon, times the number of portions of the i-th thing, x_i, times the residual error, the difference between the target and our estimate. If we make the learning rate one over 35, so the maths stays simple, then the learning rate times the residual error for this particular example is ten. So our change in the weight for fish will be two times ten: we'll increase that weight by twenty. Our change in the weight for chips will be five times ten, and our change in the weight for ketchup will be three times ten. That gives us new weights of 70, 100, and 80. And notice that the weight for chips actually got worse. There's no guarantee with this kind of learning that the individual weights will keep getting better. What's getting better is the difference between what the cashier says and our estimate.

So now we're going to derive the delta rule. We start by defining the error measure, which is simply our squared residual summed over all training cases: that is, the difference between the target and what the linear neuron predicts, squared, summed over all training cases. And we put a one half in front, which will cancel the two when we differentiate. We now differentiate that error measure with respect to one of the weights, w_i. To do that differentiation we need the chain rule. The chain rule says that how the error changes as we change a weight is how the output changes as we change the weight, times how the error changes as we change the output. The chain rule is easy to remember: you just cancel those two dy's, but you can only do that when there are no mathematicians looking.
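Written out (the superscript n indexing training cases is notation added here, not from the lecture), the error measure and the chain-rule step just described are:

```latex
% The error measure summed over training cases n, and the chain-rule
% factorization: how E changes with w_i is how y changes with w_i
% times how E changes with y.
\[
  E = \frac{1}{2}\sum_{n}\bigl(t^{(n)} - y^{(n)}\bigr)^{2},
  \qquad
  \frac{\partial E}{\partial w_i}
    = \sum_{n}\frac{\partial y^{(n)}}{\partial w_i}\,\frac{dE}{dy^{(n)}}
\]
```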
The reason the first term, dy by dw_i, is written with a curly d is that it's a partial derivative: there are many different weights you can change to change the output, and here we're just considering a change to weight i. Now, dy by dw_i is equal to x_i, because y is the sum over i of w_i times x_i, so the only term involving w_i is w_i times x_i. And dE by dy is minus (t minus y), because when we differentiate that half of (t minus y) squared, the half cancels the two and we get t minus y, with a minus sign because we're differentiating with respect to y rather than t.

So our learning rule is: we change the weights by an amount equal to the learning rate epsilon times the derivative of the error with respect to the weight, dE by dw_i, with a minus sign in front because we want the error to go down. That minus sign cancels the minus sign in the line above, and we get that the change in a weight is the sum over all training cases of the learning rate times the input value times the difference between the target and the actual output.

Now we can ask how this learning procedure, this delta rule, behaves. Does it, for example, eventually get the right answer? There may be no perfect answer. It may be that we give the linear neuron a bunch of training cases with desired answers, and there's no set of weights that gives the desired answer on every case. But there's still a set of weights that gives the best approximation over all those training cases, the set that minimizes the error measure summed over all training cases. And if we make the learning rate small enough and we learn for long enough, we can get as close as we like to that best answer.

Another question is how quickly we get towards the best answer. Even for a linear system, the learning can be quite slow with this kind of iterative learning. If two input dimensions are highly correlated, it's easy to determine the sum of the weights on those two dimensions, but very hard to decide how that sum should be divided between them. So if, for example, you almost always get the same number of portions of ketchup as of chips, we can't decide how much of the price is due to the ketchup and how much is due to the chips. And if they're almost always the same, it can take a long time for the learning to correctly attribute the price between the ketchup and the chips.

There's an interesting relationship between the delta rule and the learning rule for perceptrons. If you use the online version of the delta rule, where we change the weights after each training case, it's quite similar to the perceptron learning rule. In perceptron learning, we increment or decrement the weight vector by the input vector, but we only change the weights when the perceptron makes an error. In the online version of the delta rule, we increment or decrement the weight vector by the input vector, but we scale that by both the residual error and the learning rate. And one annoying thing about the delta rule is that we have to choose a learning rate. If we choose a learning rate that's too big, the system will be unstable, and if we choose one that's too small, it will take an unnecessarily long time to learn a sensible set of weights.
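To make the rule concrete, here is a small Python sketch that first reproduces the single worked update from the cafeteria example (all-50 initial guesses, learning rate 1/35, new weights 70, 100, 80), and then runs the online version of the delta rule. The extra meals and the learning rate used in the loop are made-up choices for illustration, not values from the lecture.

```python
# Delta-rule learning for the cafeteria example. The true prices (150, 50, 100),
# the meal (2 fish, 5 chips, 3 ketchup), the all-50 initial guess and the
# learning rate of 1/35 come from the lecture; the extra meals and the loop's
# learning rate are made-up for illustration.

TRUE_PRICES = (150.0, 50.0, 100.0)               # fish, chips, ketchup

def price(portions, weights):
    """Predicted meal price: the weighted sum of the portion counts."""
    return sum(x * w for x, w in zip(portions, weights))

# --- The single worked update from the lecture ---
weights = [50.0, 50.0, 50.0]
meal = (2, 5, 3)
target = price(meal, TRUE_PRICES)                # 850, what the cashier says
residual = target - price(meal, weights)         # 850 - 500 = 350
eps = 1.0 / 35.0
weights = [w + eps * x * residual for w, x in zip(weights, meal)]
print([round(w, 1) for w in weights])            # [70.0, 100.0, 80.0]; chips got worse

# --- Online delta rule: update the weights after every training case ---
meals = [(2, 5, 3), (1, 2, 1), (3, 1, 2), (2, 2, 4), (1, 4, 0)]   # made-up meals
cases = [(m, price(m, TRUE_PRICES)) for m in meals]

weights = [50.0, 50.0, 50.0]
eps = 0.01            # too big -> unstable, too small -> unnecessarily slow
for _ in range(1000):
    for x, t in cases:
        residual = t - price(x, weights)
        weights = [w + eps * xi * residual for w, xi in zip(weights, x)]
print([round(w, 1) for w in weights])            # close to [150.0, 50.0, 100.0]
```

The last line inside the loop is exactly the online update described above: the weight vector moves along the input vector, scaled by both the learning rate and the residual error.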