Suppose you'd like to decide what degree of polynomial to fit to a data set, that is, what features to include to give you a learning algorithm. Or suppose you'd like to choose the regularization parameter lambda for the learning algorithm. How do you do that? These are called model selection problems. And in our discussion of how to do this, we'll talk about not just how to split your data into a train and test set, but how to split your data into what we'll discover is called the train, validation, and test sets. We'll see in this video just what these things are and how to use them to do model selection.

We've already seen many times the problem of overfitting, in which just because a learning algorithm fits a training set well, that doesn't mean it's a good hypothesis. More generally, this is why the training set error is not a good predictor of how well the hypothesis will do on new examples. Concretely, if you fit some set of parameters - theta 0, theta 1, theta 2, and so on - to your training set, then the fact that your hypothesis does well on the training set doesn't mean much in terms of predicting how well it will generalize to new examples not seen in the training set. The more general principle is that once your parameters are fit to some set of data - maybe the training set, maybe something else - then the error of your hypothesis as measured on that same data set, such as the training error, is unlikely to be a good estimate of your actual generalization error, that is, of how well the hypothesis will generalize to new examples.

Now let's consider the model selection problem. Let's say you're trying to choose what degree polynomial to fit to data. Should you choose a linear function, a quadratic function, a cubic function, all the way up to a 10th-order polynomial? So it's as if there's one extra parameter in this algorithm, which I'm going to denote d: what degree of polynomial do you want to pick? In addition to the theta parameters, it's as if there's one more parameter, d, that you're trying to determine using your data set. The first option is d equals 1, which is the linear function; we can choose d equals 2, d equals 3, all the way up to d equals 10. So we would like to fit this extra sort of parameter, which I'm denoting by d. Concretely, let's say you want to choose a model, that is, choose a degree of polynomial, choose one of these ten models, fit that model, and also get some estimate of how well your fitted hypothesis will generalize to new examples.

Here's one thing you could do: you could take your first model and minimize the training error, and this would give you some parameter vector theta. You can then take your second model, the quadratic function, fit that to your training set, and this will give you some other parameter vector theta. In order to distinguish between these different parameter vectors, I'm going to use a superscript 1, superscript 2, and so on, where theta superscript 1 just means the parameters I get by fitting the linear model to my training data, and theta superscript 2 means the parameters I get by fitting the quadratic function to my training data, and so on. By fitting a cubic model I get parameters theta superscript 3, and so on up to, say, theta superscript 10. One thing we could do is then take these parameters and look at the test set error. So I can compute, on my test set, J test of theta 1, J test of theta 2, J test of theta 3, and so on. A sketch of this procedure is below.
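As a rough illustration of that procedure (which, as we're about to see, has a flaw), here is a minimal sketch in Python using NumPy. The synthetic data, the helper name squared_error, and the exact split are all illustrative assumptions, not part of the lecture; np.polyfit simply plays the role of "minimize the training error" for each candidate degree d.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic 1-D data set, purely for illustration.
x = rng.uniform(-1, 1, size=100)
y = x**3 - 0.5 * x + 0.1 * rng.standard_normal(100)
x_train, y_train = x[:70], y[:70]
x_test, y_test = x[70:], y[70:]

def squared_error(theta, x, y):
    # Average squared error (1 / 2m) * sum((h_theta(x) - y)^2) of a polynomial hypothesis.
    return np.mean((np.polyval(theta, x) - y) ** 2) / 2.0

degrees = range(1, 11)  # candidate values of d: 1 through 10

# theta^(d): parameters obtained by minimizing the training error for a degree-d polynomial.
thetas = {d: np.polyfit(x_train, y_train, d) for d in degrees}

# J_test(theta^(d)): test set error of each fitted hypothesis.
j_test = {d: squared_error(thetas[d], x_test, y_test) for d in degrees}
print(min(j_test, key=j_test.get))  # degree with the lowest test set error
```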
So I'm going to take each of my hypotheses with the corresponding parameters and just measure the performance on the test set. Now, one thing I could do then, in order to select one of these models, is see which model has the lowest test set error. Let's just say for this example that I ended up choosing the fifth-order polynomial. So this seems reasonable so far. But now let's say I want to take my fitted hypothesis, this fifth-order model, and ask how well it generalizes. One thing I could do is look at how well my fifth-order polynomial hypothesis did on my test set. But the problem is that this will not be a fair estimate of how well my hypothesis generalizes. And the reason is that we've fit this extra parameter d, that is, the degree of polynomial, and we fit that parameter d using the test set. Namely, we chose the value of d that gave us the best possible performance on the test set, and so the performance of my parameter vector theta 5 on the test set is likely to be an overly optimistic estimate of generalization error. Right? Because I have fit this parameter d to my test set, it is no longer fair to evaluate my hypothesis on this test set. I've chosen the degree d of polynomial using the test set, and so my hypothesis is likely to do better on this test set than it would on new examples that it hasn't seen before, and those new examples are what we actually care about.

So, just to reiterate, on the previous slide we saw that if we fit some set of parameters, say theta 0, theta 1, and so on, to some training set, then the performance of the fitted model on the training set is not predictive of how well the hypothesis will generalize to new examples; that's because these parameters were fit to the training set, so they are likely to do well on the training set even if they don't do well on other examples. And in the procedure I've just described on this slide, we've done the same thing; specifically, we fit this parameter d to the test set, and by having fit the parameter to the test set, the performance of the hypothesis on that test set may not be a fair estimate of how well the hypothesis is likely to do on examples we haven't seen before.

To address this problem in a model selection setting, if we want to evaluate a hypothesis, this is usually what we do instead. Given the data set, instead of just splitting it into a train and test set, what we're going to do is split it into three pieces. The first piece is going to be called the training set as usual. The second piece of data I'm going to call the cross-validation set, and I'm going to abbreviate cross-validation as CV. Sometimes it's also called the validation set, instead of the cross-validation set. And the last part I'm going to call my usual test set. A pretty typical ratio in which to split these would be to send 60% of your data to your training set, maybe 20% to your cross-validation set, and 20% to your test set. These numbers can vary a little bit, but this sort of ratio is pretty typical.
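Here is one possible sketch of that kind of split in Python with NumPy. The 60/20/20 proportions follow the lecture; the function name, the shuffling, and the fixed seed are just illustrative choices.

```python
import numpy as np

def train_cv_test_split(X, y, seed=0):
    # Shuffle the m examples, then split them 60% / 20% / 20%
    # into training, cross-validation, and test sets.
    rng = np.random.default_rng(seed)
    m = len(y)
    order = rng.permutation(m)
    n_train = int(0.6 * m)
    n_cv = int(0.2 * m)
    train_idx = order[:n_train]
    cv_idx = order[n_train:n_train + n_cv]
    test_idx = order[n_train + n_cv:]  # remaining ~20% of the examples
    return (X[train_idx], y[train_idx]), (X[cv_idx], y[cv_idx]), (X[test_idx], y[test_idx])
```

With m examples, the number of cross-validation examples here is int(0.2 * m) and the test set gets whatever remains, so every example ends up in exactly one of the three pieces.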
And so our training set will now be only maybe 60% of the data, and our cross-validation set, or validation set, will have some number of examples. I'm going to denote that m subscript cv, so that's the number of cross-validation examples. Following our earlier notational convention, I'm going to use (x(i)cv, y(i)cv) to denote the ith cross-validation example. And finally, we also have a test set over here, with m subscript test being the number of test examples.

So, now that we've defined the training, validation (or cross-validation), and test sets, we can also define the training error, cross-validation error, and test error. Here's my training error, which I'm just writing as J subscript train of theta. This is pretty much the same thing as the J of theta that we've been writing so far; it's just the squared error as measured on your training set. Then J subscript cv is my cross-validation error, which is pretty much what you'd expect: just like the training error, except measured on the cross-validation data set. And here's my test set error, same as before.

So when faced with a model selection problem like this, instead of using the test set to select the model, we're going to use the validation set, or the cross-validation set, to select the model. Concretely, we're going to first take our first model and minimize the cost function, and this will give me some parameter vector theta for the linear model; as before, I'm going to put a superscript 1 just to denote that this is the parameter vector for the linear model. We do the same thing for the quadratic model and get some parameter vector theta 2, get some parameter vector theta 3, and so on down to, say, the tenth-order polynomial. But instead of testing these hypotheses on the test set, I'm going to test them on the cross-validation set: I'm going to measure J subscript cv to see how well each of these hypotheses does on my cross-validation set, and then I'm going to pick the hypothesis with the lowest cross-validation error. So for this example, let's say for the sake of argument that it was my fourth-order polynomial that had the lowest cross-validation error. In that case, I'm going to pick this fourth-order polynomial model. Finally, what this means is that that parameter d - remember, d was the degree of polynomial, d equals 1, d equals 2, and so on up to d equals 10 - we fit that parameter d, setting d equals 4, and we did so using the cross-validation set. So this degree-of-polynomial parameter is no longer fit to the test set, and we've now saved away the test set, and we can use the test set to measure, or to estimate, the generalization error of the model that was selected by this algorithm. So, that was model selection, and how you can take your data and split it into a train, validation, and test set, and use your cross-validation data to select the model and evaluate it on the test set.
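Putting the pieces together, here is a hedged sketch of that selection procedure, reusing the illustrative data x, y, the squared_error helper, and train_cv_test_split from the earlier sketches (all assumed names, not part of the lecture). Only the cross-validation errors are used to pick d; the test set is touched once, at the end, to estimate generalization error.

```python
import numpy as np

# Reusing x, y, squared_error, and train_cv_test_split from the sketches above.
(x_train, y_train), (x_cv, y_cv), (x_test, y_test) = train_cv_test_split(x, y)

degrees = range(1, 11)

# Fit each candidate degree on the training set only.
thetas = {d: np.polyfit(x_train, y_train, d) for d in degrees}

# J_cv(theta^(d)): cross-validation error of each fitted hypothesis.
j_cv = {d: squared_error(thetas[d], x_cv, y_cv) for d in degrees}

# d is fit using the cross-validation set, not the test set.
best_d = min(j_cv, key=j_cv.get)

# The test set was saved away until now, so this is a fair estimate of generalization error.
j_test_estimate = squared_error(thetas[best_d], x_test, y_test)
```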
One final note: I should say that in machine learning practice today, there are many people who will do the thing I talked about earlier and said isn't such a good idea: selecting your model using the test set and then using that same test set to report the error, that is, selecting your degree of polynomial on the test set and then reporting the error on the test set as though that were a good estimate of generalization error. Unfortunately, many people do that, and if you have a massive, massive test set it's maybe not a terrible thing to do, but most practitioners of machine learning tend to advise against it, and it's considered better practice to have separate training, validation, and test sets. I'll just warn you that sometimes people do use the same data for the purpose of the validation set and for the purpose of the test set, so they only have a training set and a test set, and that's not really good practice. You will see some people do it, but if possible, I would recommend against doing that yourself.