To extend the learning rule for a linear neuron to a learning rule we can use for multilayer nets of nonlinear neurons, we need two steps. First, we need to extend the learning rule to a single nonlinear neuron. We're going to use logistic neurons, although many other kinds of nonlinear neurons could be used instead. So we're now going to generalize the learning rule for a linear neuron to a logistic neuron, which is a nonlinear neuron.

A logistic neuron computes its logit, z, which is its total input: its bias plus the sum over all its input lines of the value on an input line, xi, times the weight on that line, wi. It then gives an output y that's a smooth nonlinear function of that logit, y = 1 / (1 + e^(-z)). As shown in the graph here, that function is approximately zero when z is big and negative, approximately one when z is big and positive, and in between it changes smoothly and nonlinearly. The fact that it changes continuously gives it nice derivatives, which make learning easy.

So to get the derivative of the output of a logistic neuron with respect to a weight, which is what we need for learning, we first need to compute the derivative of the logit itself, that is the total input, with respect to that weight. That's very simple: the logit is just the bias plus the sum over all the input lines of the value on the input line times the weight. So when we differentiate with respect to wi, we just get xi. The derivative of the logit with respect to wi is xi, and similarly, the derivative of the logit with respect to xi is wi.

The derivative of the output with respect to the logit is also simple if you express it in terms of the output. The output is y = 1 / (1 + e^(-z)), and dy/dz is just y(1 - y). That's not obvious; for those of you who like to see the math, I've put it on the next slide. The math is tedious but perfectly straightforward, so you can go through it by yourself.

Now that we've got the derivative of the output with respect to the logit and the derivative of the logit with respect to the weight, we can figure out the derivative of the output with respect to the weight. We just use the chain rule again: dy/dwi is dz/dwi times dy/dz. And dz/dwi, as we just saw, is xi, and dy/dz is y(1 - y).

So we now have dy/dwi, and to get the learning rule for a logistic neuron all we need to do is use the chain rule once more and multiply by dE/dy. We get something that looks very like the delta rule. The way the error changes as we change the weight, dE/dwi, is just minus the sum over all the training cases, n, of the value on the input line, xi^n, times the residual, the difference between the target and the actual output of the neuron. But it's got this extra term in it, which comes from the slope of the logistic function: y^n (1 - y^n). So a slight modification of the delta rule gives us the gradient descent learning rule for training a logistic unit.
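To make that rule concrete, here is a minimal sketch in Python/NumPy of gradient descent for a single logistic neuron trained with squared error, following the derivation above. The toy dataset, learning rate, and number of epochs are illustrative choices, not values from the lecture.

```python
import numpy as np

def logistic(z):
    # y = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy training set: 4 cases, 2 input lines each.
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])
t = np.array([0.0, 1.0, 1.0, 1.0])   # target outputs

w = np.zeros(2)        # one weight per input line
b = 0.0                # bias
learning_rate = 0.5    # illustrative choice

for epoch in range(1000):
    z = X @ w + b                     # logit: z = b + sum_i x_i * w_i
    y = logistic(z)                   # output of the logistic neuron
    # dE/dw_i = -sum_n x_i^n (t^n - y^n) y^n (1 - y^n)
    # for squared error E = 1/2 sum_n (t^n - y^n)^2
    delta = (t - y) * y * (1.0 - y)   # residual times slope of the logistic
    grad_w = -X.T @ delta
    grad_b = -np.sum(delta)
    w -= learning_rate * grad_w       # move down the gradient
    b -= learning_rate * grad_b

print(w, b)   # learned weights and bias on the toy data
```

The only difference from the delta rule for a linear neuron is the extra factor y(1 - y) in `delta`, the slope of the logistic function at the current output.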