Figuring out how to get the error derivatives for all of the weights in a multilayer network is the key to being able to learn multilayer networks efficiently. But there are a number of other issues that have to be addressed before we actually get a learning procedure that's fully specified. For example, we need to decide how often to update the weights, and we need to decide how to prevent the network from overfitting very badly if we use a large network.

The backpropagation algorithm is an efficient way to compute, for a single training case, the derivatives of the error with respect to each weight. But that's not a learning algorithm; you have to specify a number of other things to get a proper learning procedure. Some of these decisions are about how we're going to optimize, that is, how we're going to use the error derivatives on the individual cases to discover a good set of weights. Those will be described in detail in Lecture six. Another set of issues is how we ensure that the weights we've learned will generalize well, that is, how we make sure they work on cases we didn't see during training; Lecture seven will be devoted to that issue. What I'm going to do now is give you a very brief overview of these two sets of issues.

Optimization issues are about how you use the weight derivatives. The first question is how often you should update the weights. We could try updating the weights after each training case: you compute the error derivatives on a training case using backpropagation, and then you make a small change to the weights. Obviously, this is going to zigzag around, because on each training case you'll get different error derivatives. But on average, if we make the weight changes small enough, it'll go in the right direction.

What seems more sensible is to use full-batch training, where you do a full sweep through all of the training data, add together all of the error derivatives you get on the individual cases, and then take a small step in that direction. A problem with this is that we start off with a bad set of weights and we might have a very big training set, and we don't want to do all the work of going through the whole training set just to fix up some weights that we know are pretty bad.
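Before going further, here is a minimal sketch contrasting these two update schedules, using a toy linear model with squared error; the model, the helper names, and the learning rate are illustrative assumptions, not anything specified in the lecture:

```python
import numpy as np

def grad_single_case(w, x, y):
    """Error derivative for one training case of a toy linear model
    with squared error: E = 0.5 * (w.x - y)^2."""
    return (w @ x - y) * x

def online_update(w, X, Y, lr=0.01):
    """Online learning: change the weights after every single training case."""
    for x, y in zip(X, Y):
        w = w - lr * grad_single_case(w, x, y)
    return w

def full_batch_update(w, X, Y, lr=0.01):
    """Full-batch learning: add together the derivatives from all cases,
    then take one small step in that summed direction."""
    g = sum(grad_single_case(w, x, y) for x, y in zip(X, Y))
    return w - lr * g
```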
In practice, we only need to look at a few training cases before we get a reasonable idea of what direction we want to move the weights in, and we don't need to look at a large number of training cases until we get towards the end of learning. That gives us mini-batch learning, where we take a small random sample of the training cases and go in that direction. We'll do a little bit of zigzagging, but not nearly as much as if we did online learning, where we use one training case at a time. Mini-batch learning is what people typically do when they're training big neural networks on big data sets.

Then there's the issue of how much we update the weights, that is, how big a change we make. We could just pick some fixed learning rate by hand and then learn the weights by changing each weight by the derivative we've computed times that learning rate. It seems more sensible to adapt the learning rate: we could get the computer to adapt it so that if we're oscillating around, with the error going up and down, we reduce the learning rate, but if we're making steady progress, we increase it. We might even have a separate learning rate for each connection in the network, so that some weights learn rapidly and other weights learn more slowly. Or we might go even further and say we don't really want to go in the direction of steepest descent at all. If you look at the figure on the right, when we have a very elongated ellipse, the direction of steepest descent is almost at right angles to the direction to the minimum that we want to find. This is typical of most learning problems, particularly towards the end of learning. So there are much better directions to go in than the direction of steepest descent; the problem is that it's quite hard to figure out what they are.
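To make mini-batch learning and a simple adaptive learning rate concrete, here is a minimal sketch; the `loss_and_grad` callable, the batch size, and the adjustment factors are assumptions made up for the example, and real implementations usually adapt the rate using a smoother estimate of the error than a single mini-batch:

```python
import numpy as np

def minibatch_train(w, X, Y, loss_and_grad, lr=0.01, batch_size=32, steps=1000):
    """Mini-batch learning with a crude adaptive learning rate:
    shrink the rate when the error goes up, grow it slightly when it goes down.
    X and Y are assumed to be NumPy arrays with one training case per row."""
    rng = np.random.default_rng(0)
    prev_loss = np.inf
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)  # small random sample
        loss, g = loss_and_grad(w, X[idx], Y[idx])
        w = w - lr * g                # small step in the mini-batch direction
        if loss > prev_loss:
            lr *= 0.5                 # oscillating: reduce the learning rate
        else:
            lr *= 1.05                # steady progress: increase it a little
        prev_loss = loss
    return w
```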
The second set of issues is to do with how well the network generalizes to cases it didn't see during training. The problem here is that the training data contains information about the regularities in the mapping from input to output, but it also contains two types of noise. The first type of noise is that the target values may be unreliable, and for a neural network that's usually only a minor worry. The second type of noise is sampling error: if we take any particular training set, especially if it's a small one, there will be accidental regularities that are caused by the particular cases that we chose.

So, for example, suppose you show someone some polygons. If you're a bad teacher, you might choose to show them a square and a rectangle. Those are both polygons, but there's no way for someone to realize from that that polygons might have three sides or seven sides, and no way for them to understand that the angles don't have to be right angles. If you're a slightly better teacher, you might show them a triangle and a hexagon. But again, from that they can't tell whether polygons are always convex, and they can't tell whether the angles in polygons are always multiples of 60 degrees. However carefully you choose examples, for any finite set of examples there will be accidental regularities.

Now, when we fit a model, there's no way it can tell the difference between an accidental regularity, which is only there because of the particular samples we chose, and a real regularity that will generalize properly to new cases. So what the model will do is fit both kinds of regularity. And if you've got a big, powerful model, it'll be very good at fitting the sampling error, and that will be a real disaster: it will cause it to generalize really badly.

This is best understood by looking at a little example. Here we've got six data points shown in black. We can fit a straight line to them; that model has two degrees of freedom, and it's fitting the six y values given the six x values. Or we can fit a polynomial that has six degrees of freedom. By hand, I've drawn in red my idea of a polynomial with six degrees of freedom fitting this data. You'll see the polynomial goes through the data points exactly, so it's a much better fit to the data. But which model do you trust? The complicated model certainly fits the data much better, but it's not economical.

For a model to be convincing, what you want is a simple model that explains a lot of data surprisingly well, and the polynomial doesn't do that. It explains these six data points, but it's got six degrees of freedom, so whatever these six data points had been, it would have been able to fit them. We're not surprised that a model this complicated can fit the data very well, and so it doesn't convince us that this is a good model.
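As a concrete version of this example, here is a small sketch using made-up data; with six points, a degree-five polynomial (six degrees of freedom) passes through them exactly, while a straight line (two degrees of freedom) does not:

```python
import numpy as np

# Six data points, invented for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.2, 1.1, 1.9, 3.2, 3.8, 5.1])

line  = np.polyfit(x, y, deg=1)   # 2 degrees of freedom: slope and intercept
poly5 = np.polyfit(x, y, deg=5)   # 6 degrees of freedom: fits all six points exactly

x_new = 6.5                       # an input outside the training range
print(np.polyval(line,  x_new))   # modest extrapolation from the straight line
print(np.polyval(poly5, x_new))   # the degree-5 polynomial can swing wildly out here
```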
So, if you look at the arrow, which output value do you predict for this input value? Well, you'd have to have a lot of faith in the polynomial model to predict a value that's outside the range of values in all of the training data you've seen so far. I think almost everybody would prefer to predict the blue circle that's on the green line rather than the one on the red line. However, if we had ten times as much data, and all of those data points lay very close to the red line, then we would certainly prefer the red line.

There are a number of ways to reduce overfitting that have been developed for neural networks and for many other models, and I'm going to give just a brief survey of them here. There's weight decay, where you try and keep the weights of the network small, or try and keep many of the weights at zero; the idea is that this will make the model simpler. There's weight sharing, where again you make the model simpler by insisting that many of the weights have exactly the same value as each other. You don't know what that value is and you're going to learn it, but it has to be exactly the same for many of the weights. We'll see in the next lecture how weight sharing is used. There's early stopping, where you make yourself a fake test set, and as you're training the net you peek at what's happening on this fake test set; once the performance on the fake test set starts getting worse, you stop training. There's model averaging, where you train lots of different neural nets and average them together in the hope that this will reduce the errors you're making. There's Bayesian fitting of neural nets, which is just a fancy form of model averaging. There's dropout, where you try and make your model more robust by randomly omitting hidden units when you're training it. And there's generative pre-training, which is somewhat more complicated and which I'll describe towards the end of the course.
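To illustrate two of these, here is a minimal sketch combining L2 weight decay with early stopping on a held-out "fake test" set; the `grad` and `error` callables and all the hyper-parameter values are assumptions made up for the example, not anything prescribed in the lecture:

```python
import numpy as np

def train(w, train_data, fake_test_data, grad, error,
          lr=0.01, decay=1e-4, patience=5, max_steps=10_000):
    """Gradient descent with L2 weight decay and early stopping on a held-out
    'fake test' set. w is assumed to be a NumPy array of weights."""
    best_w, best_err, bad_steps = w.copy(), np.inf, 0
    for _ in range(max_steps):
        g = grad(w, train_data) + decay * w   # weight decay pulls weights toward zero
        w = w - lr * g
        err = error(w, fake_test_data)        # peek at the held-out set
        if err < best_err:
            best_w, best_err, bad_steps = w.copy(), err, 0
        else:
            bad_steps += 1
            if bad_steps >= patience:         # it keeps getting worse: stop training
                break
    return best_w                             # the weights that did best on the fake test set
```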