To extend the learning rule for a linear neuron into a learning rule we can use for multilayer nets of nonlinear neurons, we need two steps. First, we need to extend the learning rule to a single nonlinear neuron. We're going to use logistic neurons, although many other kinds of nonlinear neuron could be used instead.

So we're now going to generalize the learning rule for a linear neuron to a logistic neuron, which is a nonlinear neuron. A logistic neuron computes its logit, z, which is its total input: its bias plus the sum over all its input lines of the value on an input line, xi, times the weight on that line, wi. It then gives an output, y, that is a smooth nonlinear function of that logit. As shown in the graph here, that function is approximately zero when z is big and negative, approximately one when z is big and positive, and in between it changes smoothly and nonlinearly. The fact that it changes continuously gives it nice derivatives, which make learning easy.

To get the derivatives of a logistic neuron with respect to the weights, which is what we need for learning, we first need to compute the derivative of the logit itself, the total input, with respect to a weight. That's very simple: the logit is just the bias plus the sum over all the input lines of the value on the input line times the weight, so when we differentiate with respect to wi we just get xi. So the derivative of the logit with respect to wi is xi, and similarly the derivative of the logit with respect to xi is wi.

The derivative of the output with respect to the logit is also simple if you express it in terms of the output. The output is y = 1 / (1 + e^(-z)), and dy/dz is just y(1 - y). That's not obvious; for those of you who like to see the math, I've put it on the next slide (it is also written out below). The math is tedious but perfectly straightforward, so you can go through it by yourself.

Now that we've got the derivative of the output with respect to the logit and the derivative of the logit with respect to the weight, we can figure out the derivative of the output with respect to the weight. We just use the chain rule again: dy/dwi is dz/dwi times dy/dz, and dz/dwi, as we just saw, is xi, while dy/dz is y(1 - y), so dy/dwi = xi y(1 - y). To get the learning rule for a logistic neuron, all we need to do is use the chain rule once more and multiply dy/dwi by dE/dy.
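The derivation of dy/dz that the lecture defers to the next slide is the following standard calculation, written out here for completeness rather than copied from the slide:

\[
y = \frac{1}{1 + e^{-z}}
\qquad\Rightarrow\qquad
\frac{dy}{dz}
  = \frac{e^{-z}}{(1 + e^{-z})^{2}}
  = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
  = y\,(1 - y),
\quad\text{since}\quad
1 - y = \frac{(1 + e^{-z}) - 1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}}.
\]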
Multiplying dy/dwi by dE/dy, we get something that looks very like the delta rule. The way the error changes as we change the weight, dE/dwi, is just the sum over all the training cases, n, of the value on the input line, xi^n, times the residual, that is, the difference between the target and the actual output of the neuron. But it has an extra term in it, which comes from the slope of the logistic function: y^n(1 - y^n). So a slight modification of the delta rule gives us the gradient descent learning rule for training a logistic unit.
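To make that rule concrete, here is a minimal NumPy sketch of batch gradient descent for a single logistic unit, assuming a squared error E = (1/2) sum_n (t^n - y^n)^2, so that dE/dwi = -sum_n xi^n (t^n - y^n) y^n (1 - y^n). The function name, learning rate, and epoch count are illustrative choices, not from the lecture.

import numpy as np

def train_logistic_unit(X, t, epochs=1000, lr=0.1):
    # X: (n_cases, n_inputs) array of input values x_i^n
    # t: (n_cases,) array of target values t^n
    n_inputs = X.shape[1]
    w = np.zeros(n_inputs)                  # weights w_i
    b = 0.0                                 # bias
    for _ in range(epochs):
        z = X @ w + b                       # logit: bias plus weighted inputs
        y = 1.0 / (1.0 + np.exp(-z))        # logistic output
        # residual (t - y) scaled by the slope of the logistic, y(1 - y)
        delta = (t - y) * y * (1.0 - y)
        # step downhill: w -= lr * dE/dw, i.e. w += lr * sum_n x_i^n * delta^n
        w += lr * (X.T @ delta)
        b += lr * delta.sum()
    return w, b

Apart from the extra y(1 - y) factor in delta, this is exactly the delta rule update for a linear neuron.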