In this video, I'm going to talk about improving generalization by reducing the overfitting that occurs when a network has too much capacity for the amount of data it's given during training. I'll describe various ways of controlling the capacity of a network, and I'll also describe how we determine how to set the meta-parameters when we use a method for controlling capacity. I'll then go on to give an example where we control capacity by stopping the learning early.

Just to remind you, the reason we get overfitting is that, as well as containing information about the true regularities in the mapping from input to output, any finite set of training data also contains sampling error. There are accidental regularities in the training set, just because of the particular training cases that were chosen. So when we fit the model, it can't tell which of the regularities are real, and would also exist if we sampled the training set again, and which are caused by the sampling error. So the model fits both kinds of regularity, and if the model is too flexible, it will fit the sampling error really well and then it will generalize badly. So we need a way to prevent this overfitting.

The first method I'll describe is by far the best: simply get more data. There's no point coming up with fancy schemes to prevent overfitting if you can get yourself more data. Data has exactly the right characteristics to prevent overfitting, and the more of it you have the better, assuming your computer is fast enough to use it.

A second method is to try to judiciously limit the capacity of the model, so that it has enough capacity to fit the true regularities but not enough capacity to fit the spurious regularities caused by the sampling error. This, of course, is very difficult to do, and in the rest of this lecture I'll describe various approaches to trying to regulate the capacity appropriately.

In the next lecture, I'll talk about averaging together many different models. If we average models that have different forms and make different mistakes, the average will do better than the individual models. We can make the models different just by training them on different subsets of the training data; this is a technique called bagging. There are also other ways to mess with the training data to make the models as different as possible.

And the fourth approach, which is the Bayesian approach, is to use a single neural network architecture, but to find many different sets of weights that do a good job of predicting the output, and then on test data to average the predictions made by all those different weight vectors.

So, there are many ways to control the capacity of a model. The most obvious is via the architecture: you limit the number of hidden layers and the number of units per layer, and this controls the number of connections in the network, i.e. the number of parameters. A second method, which is often very convenient, is to start with small weights and then stop the learning before it has time to overfit, on the assumption that it finds the true regularities before it finds the spurious regularities that have to do with the particular training set we have. I'll describe that method at the end of this video. A very common way to control the capacity of a neural network is to give it a number of hidden layers or units per layer that is a little too large, but then to penalize the weights, using penalties or constraints on the squared values of the weights or on the absolute values of the weights.
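To make the weight-penalty idea concrete, here is a minimal sketch, not code from the lecture; the array names and the penalty strength `lam` are illustrative assumptions. A squared weight penalty just adds an extra term to the loss and a matching term to the gradient:

```python
import numpy as np

def loss_and_grad(W, X, y, lam=0.01):
    """Squared error plus a squared (L2) weight penalty lam * sum(W**2).

    W: weight vector, X: inputs, y: targets. Illustrative names only.
    """
    pred = X @ W                      # linear predictions
    err = pred - y
    loss = 0.5 * np.sum(err ** 2) + lam * np.sum(W ** 2)
    grad = X.T @ err + 2 * lam * W    # penalty adds 2*lam*W to the gradient
    return loss, grad
```

An absolute-value penalty would work the same way, with `lam * np.sum(np.abs(W))` in the loss and `lam * np.sign(W)` in the gradient.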
And finally, we can control the capacity of a model by adding noise to the weights, or by adding noise to the activities. Typically, we use a combination of several of these different capacity-control methods.

Now, for most of these methods there are meta-parameters that you have to set, like the number of hidden units, or the number of layers, or the size of the weight penalty. An obvious way to set those meta-parameters is to try lots of different values of one of them, for example the number of hidden units, and see which gives the best performance on the test set. But there's something deeply wrong with that: it gives a false impression of how well the method will work if you give it another test set. The settings that work best for one particular test set are unlikely to work as well on a new test set drawn from the same distribution, because they've been tuned to that particular test set. And that means you get a false impression of how well you would do on new data.

Let me give you an extreme example of that. Suppose the test set really is random; quite a lot of financial data seems to be like that, so the answers just don't depend on the inputs and can't be predicted from the inputs. If you choose the model that does best on your test set, it will obviously do better than chance, because you selected it to do better than chance. But if you take that model and try it on new data that's also random, you can't expect it to do better than chance. So by selecting a model, you got a false impression of how well a model will do on new data, and the question is, is there a way around that?

So here's a better way to choose the meta-parameters. You start by dividing the total data set into three subsets. You have the training data, which is what you're going to use to train your model. You hold back some validation data, which isn't going to be used for training, but is going to be used for deciding how to set the meta-parameters. In other words, you're going to look at how well the model does on the validation data to decide what's an appropriate number of hidden units or an appropriate size of weight penalty. Then, once you've done that and trained your model with what looks like the best number of hidden units and the best weight penalty, you see how well it does on the final set of data that you've held back, which is the test data. And you must only use that once. That will give you an unbiased estimate of how well the network works, and in general that estimate will be a little worse than on the validation data. Nowadays in competitions, the organizers have learned to hold back that true test data and get people to send in predictions, so they can see whether entrants really can predict on true test data, or whether they're just overfitting to the validation data by selecting meta-parameters that do particularly well on the validation data but won't generalize to new test sets.
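Here is a minimal sketch of that protocol, not code from the lecture; the toy data, the ridge-regression model, and the candidate penalty values are illustrative assumptions. The point is only the split: the validation set picks the meta-parameter, and the held-back test set is touched exactly once at the end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data; the model doesn't matter here, only the protocol does.
X = rng.normal(size=(300, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=300)

# Split into training / validation / test subsets.
X_tr, y_tr = X[:200], y[:200]
X_va, y_va = X[200:250], y[200:250]
X_te, y_te = X[250:], y[250:]

def fit_ridge(X, y, lam):
    """Least squares with a squared weight penalty (closed form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(W, X, y):
    return np.mean((X @ W - y) ** 2)

# Choose the weight penalty (a meta-parameter) on the validation set only.
candidates = [0.01, 0.1, 1.0, 10.0]
best_lam = min(candidates,
               key=lambda lam: mse(fit_ridge(X_tr, y_tr, lam), X_va, y_va))

# The held-back test set is used exactly once, for an unbiased estimate.
W = fit_ridge(X_tr, y_tr, best_lam)
print("chosen penalty:", best_lam, "test MSE:", mse(W, X_te, y_te))
```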
One way we can get a better estimate of our weight penalties, or number of hidden units, or anything else we're trying to fix using the validation data, is to rotate the validation set. We still hold back a final test set to get our final unbiased estimate, but then we divide the other data into N equal-sized subsets, and we train on all but one of those N and use the held-out subset as the validation set. Then we can rotate and hold back a different subset as the validation set, and so we can get many different estimates of what the best weight penalty is, or the best number of hidden units is. This is called N-fold cross-validation. It's important to remember that the N different estimates we get are not independent of one another. If, for example, we were really unlucky and all the examples of one class fell into one of those subsets, we'd expect to generalize very badly, whether that subset was the validation subset or whether it was in the training data.

So now I'm going to describe one particularly easy-to-use method for preventing overfitting. It's good when you have a big model on a small computer and you don't have the time to train the model many different times with different numbers of hidden units or different sizes of weight penalty. What you do is start with small weights, and as the model trains they grow, and you watch the performance on the validation set. As soon as it starts to get worse, you stop training. Now, the performance on the validation set may fluctuate, particularly if you're measuring error rate rather than a squared error or a cross-entropy error, so it's hard to decide when things are really getting worse. So what you typically do is keep going until you're sure things are getting worse, and then go back to the point at which things were best. The reason this controls the capacity of the model is that models with small weights generally don't have as much capacity, and by stopping early the weights haven't had time to grow big.

It's interesting to ask why small weights lower the capacity. Consider a model with some input units, some hidden units, and some output units. When the weights are very small, if the hidden units are logistic units, their total inputs will be close to zero and they'll be in the middle of their linear range; that is, they'll behave very like linear units. What that means is that when the weights are small, the whole network is the same as a linear network that maps the inputs straight to the outputs. So if you multiply the weight matrix W1 by the weight matrix W2, you get a weight matrix that you can use to connect the inputs directly to the outputs, and provided the weights are small, a net with a layer of logistic hidden units will behave pretty much the same as that linear net, provided we also divide the weights of the linear net by four. That factor takes into account the fact that the logistic hidden units, in their linear region, have a slope of a quarter. So the net has no more capacity than the linear net: even though the network I'm showing you has 3×6 + 6×2 weights, it really has no more capacity than a network with 3×2 weights. As the weights grow, the hidden units start using the non-linear region of the logistic function, and then we start making use of all those parameters. So if the network effectively has six weights at the beginning of learning and 30 weights at the end of learning, we can think of the capacity as changing smoothly from six parameters to 30 parameters as the weights get bigger. What's happening in early stopping is that we're stopping the learning when it has the right number of parameters to do as well as possible on the validation data, that is, when it has optimized the trade-off between fitting the true regularities in the data and fitting the spurious regularities that are only there because of the particular training examples we chose.
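Here is a minimal sketch of the early-stopping recipe described above, not code from the lecture; the toy data, the network sizes, the learning rate, and the patience threshold are illustrative assumptions. It starts with small weights, watches the validation error, remembers the best weights seen so far, and goes back to them once things have clearly gotten worse.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data split into training and validation sets (a final test set would
# still be held back, as discussed above; it is omitted here for brevity).
X = rng.normal(size=(200, 3))
y = np.sin(X[:, :1]) + 0.3 * rng.normal(size=(200, 1))
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

# Start with small weights: the net initially behaves like a linear net.
W1 = 0.01 * rng.normal(size=(3, 6))
W2 = 0.01 * rng.normal(size=(6, 1))

def forward(X, W1, W2):
    h = sigmoid(X @ W1)      # logistic hidden units
    return h, h @ W2         # linear output

best = (np.inf, W1.copy(), W2.copy())
patience, since_best, lr = 50, 0, 0.01

for epoch in range(5000):
    # One gradient step on the training squared error.
    h, pred = forward(X_tr, W1, W2)
    err = pred - y_tr
    dW2 = h.T @ err
    dW1 = X_tr.T @ ((err @ W2.T) * h * (1 - h))
    W1 -= lr * dW1 / len(X_tr)
    W2 -= lr * dW2 / len(X_tr)

    # Watch validation error; remember the best weights seen so far.
    _, pred_va = forward(X_va, W1, W2)
    va_err = np.mean((pred_va - y_va) ** 2)
    if va_err < best[0]:
        best, since_best = (va_err, W1.copy(), W2.copy()), 0
    else:
        since_best += 1
        if since_best > patience:   # sure things are getting worse
            break

# Go back to the point at which validation performance was best.
va_err, W1, W2 = best
print("best validation MSE:", va_err)
```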