In this video, I'm going to talk about improving generalization by reducing the overfitting that occurs when a network has too much capacity for the amount of data it's given during training. I'll describe various ways of controlling the capacity of a network, and I'll also describe how we determine how to set the meta-parameters when we use a method for controlling capacity. I'll then go on to give an example where we control capacity by stopping the learning early.

Just to remind you, the reason we get overfitting is that, as well as containing information about the true regularities in the mapping from the input to the output, any finite set of training data also contains sampling error. There are accidental regularities in the training set, just because of the particular training cases that were chosen. When we fit the model, it can't tell which of the regularities are real, and would also exist if we sampled the training set again, and which are caused by the sampling error. So the model fits both kinds of regularity, and if the model is too flexible, it will fit the sampling error really well and then generalize badly.

So we need a way to prevent this overfitting. The first method I'll describe is by far the best, and it's simply to get more data. There's no point coming up with fancy schemes to prevent overfitting if you can get yourself more data. Data has exactly the right characteristics to prevent overfitting, and the more of it you have the better, assuming your computer is fast enough to use it.

A second method is to try to judiciously limit the capacity of the model, so that it has enough capacity to fit the true regularities but not enough capacity to fit the spurious regularities caused by the sampling error. This, of course, is very difficult to do, and in the rest of this lecture I'll describe various approaches to regulating the capacity appropriately.

In the next lecture, I'll talk about averaging together many different models. If we average models that have different forms and make different mistakes, the average will do better than the individual models. We could make the models different just by training them on different subsets of the training data; this is a technique called bagging. There are also other ways to mess with the training data to make the models as different as possible.

And the fourth approach, which is the Bayesian approach, is to use a single neural network architecture, but to find many different sets of weights that do a good job of predicting the output. Then, on test data, you average the predictions made by all those different weight vectors.
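As a rough illustration of the bagging idea previewed above, here is a minimal sketch (not from the lecture) that trains several models on bootstrap resamples of the training data and averages their predictions; the choice of scikit-learn decision trees as the base model is just an assumption for the example.

```python
# Minimal bagging sketch (illustrative only): train base models on bootstrap
# resamples of the training data and average their predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # assumed base model, not from the lecture

def bagged_predict(X_train, y_train, X_test, n_models=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_train)
    predictions = []
    for _ in range(n_models):
        # Sample training cases with replacement, so each model sees a different subset.
        idx = rng.integers(0, n, size=n)
        model = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        predictions.append(model.predict(X_test))
    # The average of models that make different mistakes tends to beat the individual models.
    return np.mean(predictions, axis=0)
```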
So, there are many ways to control the capacity of a model. The most obvious is via the architecture: you limit the number of hidden layers and the number of units per layer, and this controls the number of connections in the network, i.e. the number of parameters.

A second method, which is often very convenient, is to start with small weights and then stop the learning before it has time to overfit, again on the assumption that it finds the true regularities before it finds the spurious regularities that have to do with the particular training set we have. I'll describe that method at the end of this video.

A very common way to control the capacity of a neural network is to give it a number of hidden layers, or a number of units per layer, that is a little too large, but then to penalize the weights, using penalties or constraints on the squared values of the weights or on the absolute values of the weights.

And finally, we can control the capacity of a model by adding noise to the weights, or by adding noise to the activities. Typically, we use a combination of several of these different capacity control methods.
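As a concrete sketch of the weight-penalty idea just mentioned, here is one way an L2 penalty on the squared weights might be added to a training cost in plain NumPy; the tiny network, the loss, and the penalty size are assumptions for illustration, not details from the lecture.

```python
# Sketch of an L2 weight penalty (weight decay): the training cost is the data
# error plus a penalty proportional to the sum of the squared weights.
import numpy as np

def penalized_cost(W1, W2, X, y, weight_cost=0.01):
    """Squared error of a tiny one-hidden-layer net plus an L2 weight penalty.

    W1, W2: weight matrices; weight_cost: the meta-parameter controlling how
    strongly large weights are punished (its value here is just an example).
    """
    hidden = 1.0 / (1.0 + np.exp(-X @ W1))   # logistic hidden units
    outputs = hidden @ W2                    # linear output units
    data_error = 0.5 * np.sum((outputs - y) ** 2)
    penalty = 0.5 * weight_cost * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return data_error + penalty
```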
Now, for most of these methods, there are meta-parameters that you have to set, like the number of hidden units, the number of layers, or the size of the weight penalty. An obvious way to set those meta-parameters is to try lots of different values of one of them, for example the number of hidden units, and see which gives the best performance on the test set. But there's something deeply wrong with that: it gives a false impression of how well the method will work if you give it another test set. The settings that work best for one particular test set are unlikely to work as well on a new test set drawn from the same distribution, because they've been tuned to that particular test set. And that means you get a false impression of how well you would do on a new test set.

Let me give you an extreme example of that. Suppose the test set really is random; quite a lot of financial data seems to be like that, so the answers just don't depend on the inputs, or can't be predicted from the inputs. If you choose the model that does best on your test set, it will obviously do better than chance, because you selected it to do better than chance. But if you take that model and try it on new data that's also random, you can't expect it to do better than chance. So by selecting a model, you got a false impression of how well a model will do on new data, and the question is: is there a way around that?

So here's a better way to choose the meta-parameters. You start by dividing the total data set into three subsets. You have the training data, which is what you're going to use to train your model. You hold back some validation data, which isn't going to be used for training but is going to be used for deciding how to set the meta-parameters. In other words, you're going to look at how well the model does on the validation data to decide what's an appropriate number of hidden units or an appropriate size of weight penalty. But then, once you've done that and trained your model with what looks like the best number of hidden units and the best weight penalty, you're going to see how well it does on the final set of data that you've held back, which is the test data, and you must only use that once. That will give you an unbiased estimate of how well the network works, and in general that estimate will be a little worse than on the validation data. Nowadays in competitions, the people organizing the competitions have learned to hold back that true test data and get people to send in predictions, so they can see whether people really can predict on true test data, or whether they're just overfitting to the validation data by selecting meta-parameters that do particularly well on the validation data but won't generalize to new test sets.

One way we can get a better estimate of our weight penalties, or number of hidden units, or anything else we're trying to fix using the validation data, is to rotate the validation set. So we hold back a final test set to get our final unbiased estimate, but then we divide the other data into N equal-sized subsets, and we train on all but one of those N and use the held-out one as a validation set. Then we can rotate, holding back a different subset as the validation set each time, and so we get many different estimates of what the best weight penalty is, or what the best number of hidden units is. This is called N-fold cross-validation. It's important to remember that the N different estimates we get are not independent of one another. If, for example, we were really unlucky and all the examples of one class fell into one of those subsets, we'd expect to generalize very badly, and we'd expect to generalize very badly whether that subset was the validation subset or whether it was in the training data.
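Here is a minimal sketch of that procedure, choosing a weight penalty by N-fold cross-validation while a test set is held back and used only once at the end; the ridge-regression model and the candidate penalty values are assumptions for illustration, not part of the lecture.

```python
# N-fold cross-validation sketch: pick a weight penalty on rotated validation
# folds, then touch the held-back test set exactly once afterwards.
import numpy as np

def ridge_fit(X, y, penalty):
    # Closed-form ridge regression: (X^T X + penalty * I)^-1 X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(d), X.T @ y)

def cross_validate_penalty(X, y, penalties, n_folds=5):
    folds = np.array_split(np.random.permutation(len(X)), n_folds)
    avg_errors = []
    for penalty in penalties:
        errors = []
        for i in range(n_folds):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            w = ridge_fit(X[train_idx], y[train_idx], penalty)
            errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
        # The N estimates for this penalty are averaged, but they are not independent.
        avg_errors.append(np.mean(errors))
    return penalties[int(np.argmin(avg_errors))]

# Usage: split off a test set first, run cross_validate_penalty on the rest,
# retrain with the chosen penalty, and evaluate on the test set exactly once.
```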
So now I'm going to describe one particularly easy-to-use method for preventing overfitting. It's good when you have a big model on a small computer and you don't have the time to train the model many different times with different numbers of hidden units or different sizes of weight penalty. What you do is start with small weights, and as the model trains, they grow. You watch the performance on the validation set, and as soon as it starts to get worse, you stop training. Now, the performance on the validation set may fluctuate, particularly if you're measuring error rate rather than a squared error or a cross-entropy error, so it's hard to decide when to stop. What you typically do is keep going until you're sure things are getting worse, and then go back to the point at which things were best. The reason this controls the capacity of the model is that models with small weights generally don't have as much capacity, and the weights haven't had time to grow big.
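Below is a minimal sketch of that early-stopping loop; the train_one_epoch and validation_error helpers, the patience value, and the weight-saving details are hypothetical placeholders, not part of the lecture.

```python
# Early stopping sketch: keep training while validation error improves; once it
# has clearly got worse, go back to the weights that were best on validation.
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    # train_one_epoch(model) and validation_error(model) are assumed helpers.
    best_error = float("inf")
    best_weights = copy.deepcopy(model)
    epochs_since_best = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_error:
            best_error = err
            best_weights = copy.deepcopy(model)   # remember the best point so far
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:     # sure things are getting worse
                break
    return best_weights                            # the point at which things were best
```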
It's interesting to ask why small weights lower the capacity. So consider a model with some input units, some hidden units, and some output units. When the weights are very small, if the hidden units are logistic units, their total inputs will be close to zero and they'll be in the middle of their linear range. That is, they'll behave very like linear units. What that means is that, when the weights are small, the whole network is the same as a linear network that maps the inputs straight to the outputs. So if you multiply that weight matrix W1 by that weight matrix W2, you get a weight matrix that you can use to connect the inputs directly to the outputs, and provided the weights are small, a net with a layer of logistic hidden units will behave pretty much the same as that linear net, provided we also divide the weights in the linear net by four, to take into account the fact that the hidden units, in their linear region, have a slope of a quarter.

So it's got no more capacity than the linear net: even though the network I'm showing you has 3×6 + 6×2 = 30 weights, it really has no more capacity than a network with 3×2 = 6 weights. As the weights grow, we start using the nonlinear region of the sigmoid, and then we start making use of all those parameters. So if the network effectively has six weights at the beginning of learning and 30 weights at the end of learning, then we can think of the capacity as changing smoothly from six parameters to 30 parameters as the weights get bigger. And what's happening in early stopping is that we're stopping the learning when the network has the right number of parameters to do as well as possible on the validation data, that is, when it has optimized the trade-off between fitting the true regularities in the data and fitting the spurious regularities that are just there because of the particular training examples we chose.
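To make that argument concrete, here is a small numerical check (not from the lecture) that a logistic hidden layer with tiny weights matches, up to a constant offset, the linear net whose combined weight matrix is divided by four; the sizes 3, 6, and 2 match the network described above, but the random weights are just an assumption.

```python
# Numerical check: with very small weights, a logistic hidden layer behaves like
# a linear net whose combined weight matrix W2 @ W1 is divided by four,
# because the logistic function has slope 1/4 at zero: sigma(z) ~= 1/2 + z/4.
import numpy as np

rng = np.random.default_rng(0)
W1 = 1e-3 * rng.standard_normal((6, 3))   # 3 inputs -> 6 hidden units (18 weights)
W2 = 1e-3 * rng.standard_normal((2, 6))   # 6 hidden -> 2 outputs      (12 weights)
x = rng.standard_normal(3)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
y_net = W2 @ sigmoid(W1 @ x)              # the real net with logistic hidden units
y_lin = W2 @ (0.5 + (W1 @ x) / 4.0)       # constant offset plus (W2 @ W1 / 4) @ x

print(np.max(np.abs(y_net - y_lin)))      # tiny: the two nets agree when weights are small
```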