In this lesson, we'll discuss how to validate models: how to check whether they generalize, whether they work well not only on the training set but also on new data. And we'll start with a discussion of the overfitting problem.

So suppose that we have a classification problem, and we've just trained a classifier with an accuracy of 80%, so it gives correct answers on 80% of our training data. It looks good, but actually we have no guarantee that our model will work well on new data. Maybe it's overfitted; maybe it just memorized the answers for the training set and doesn't generalize at all.

So let's consider two examples of overfitting. Suppose that we have a problem where each example is described by only one feature, and the data looks like this. The green line is the true target function, the one we want to estimate. And suppose that we use a linear regression model. If we fit it, it looks like the blue line. It's underfitted, it has poor quality: it's too simple for our data, because the dependency between y and x is not linear. To overcome this, we can use a polynomial model. We add features to our examples, using not only x but also x squared, x to the 3rd degree, and x to the 4th degree. If we fit this model, we get a blue line like the one on this picture. It's a very good model, it fits the true target function almost perfectly. Here we have just as many parameters as we need, so this is a nice model. But if we continue to increase the number of features, for example up to x to the 15th degree, then we get this blue model. It's too complex for our data, it's overfitted, and while it may have good performance on the training examples, it has very poor performance on new data points.

Okay, here is another example. Suppose that we have eight data points, x = 0.2, 0.4, and so on up to 1.6, and the target value is calculated as sin(x) plus some small normal noise. And we use, once again, a polynomial model, with eight parameters plus a bias. If we fit this model, we see a picture like this one: our model goes through every training example, through every blue point. So it gives a perfect prediction for the training set, and we have, for example, zero loss on the training set. But this model is overfitted. If we take any data point not from our training set, the quality will be very poor. And if we look at the parameter vector, the values are very large. The range of the target value is from 0 to 1, but the values of our parameters are in the hundreds. So this model just incorporates the target values into its parameters: when we apply a model with these parameters to the training data points, we get the correct answers.
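To make the second example concrete, here is a minimal sketch of the sin(x) experiment. The lesson doesn't specify the exact noise level or any tooling, so the use of NumPy's polyfit, the noise scale of 0.05, and a degree-7 polynomial (whose 8 coefficients are just enough to pass through all 8 points) are assumptions here, not the instructor's original setup:

```python
import numpy as np

# Eight training points: x = 0.2, 0.4, ..., 1.6, target = sin(x) + small normal noise.
rng = np.random.default_rng(0)
x_train = np.linspace(0.2, 1.6, 8)
y_train = np.sin(x_train) + rng.normal(scale=0.05, size=x_train.shape)

# A degree-7 polynomial has 8 coefficients, so it can pass through all 8 points exactly.
coeffs = np.polyfit(x_train, y_train, deg=7)
print("fitted coefficients:", coeffs)  # compare their magnitude with the targets, which lie in [0, 1]

# The fit is essentially perfect on the training points...
train_error = np.abs(np.polyval(coeffs, x_train) - y_train).max()
print("max error on training points:", train_error)

# ...but on points between the training ones, the prediction is noticeably worse.
x_new = np.linspace(0.3, 1.5, 7)
test_error = np.abs(np.polyval(coeffs, x_new) - np.sin(x_new)).max()
print("max error on new points:", test_error)
```

Running this shows the gap the lesson describes: near-zero error on the training points and a much larger error on points the model has never seen.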
How can we validate our model and check whether it's overfitted or not? Well, we can take all our labeled examples and split them into two parts, a training set and a holdout set. We use the training set to learn our model, like a classifier or a regression model, and we use the holdout set to calculate the quality, for example accuracy or cross-entropy for classification, or mean squared error for regression. Then, if the loss on the holdout set is not very high, the model is good. But if we see that the loss increases on the holdout set, then maybe the model is overfitted.

And, of course, there is a question of how to split our data into two parts: should the training set be large, or should the holdout set be large? If the holdout set is small, then the training set is representative: it contains almost all the data points from our labeled set. But the holdout set is too small, so the estimate of quality based on the holdout set may have large variance. If instead we select a large holdout set and a small training set, then the training set is not representative: it contains far fewer data points than we'll have in practice, so our estimate of quality will be biased. But since the holdout set is large, our estimate of quality will have low variance. In practice, we usually put 70% of our data into the training set and 30% into the holdout set, or maybe 80% into the training set and 20% into the holdout set.

There are some problems with this holdout-set approach. For example, if the sample is small, we want to see what happens when each example is in the training set, and what happens when that example is in the holdout set. To achieve this, we can split our data into a training set and a holdout set K times, and then just average our estimates from all the holdout sets. But there is still no guarantee that every example will be both in a training set and in a holdout set at some split.

A much better way to do it is cross-validation. In it, we split our data into K blocks of approximately equal size, and we call these blocks folds. Then we take the first fold and use it as a holdout set, and all other folds as the training set: we train a model, we assess its quality, we calculate the metric on this first fold, our holdout set. Then we use the second fold as our holdout set and repeat the same procedure, and so on. At the last step, we use the last fold as the holdout set and all other folds as the training set. And then we just take the average of our estimates from all iterations of this procedure. Cross-validation guarantees that each example will be both in a holdout set and in a training set at some iteration of this procedure.

But cross-validation is quite hard to perform, because it requires training our model K times. And if we are talking about deep neural networks, training one network can take one, two, or four weeks on several GPUs, so it would be quite hard to train them five or ten times. So in deep learning, we usually use just a holdout set. That's okay, because we usually work with large samples, where even one holdout set is representative, so there is no need to use multiple holdout sets.

In this video, we discussed that models can easily overfit if they have enough parameters for it, and we discussed some ways to assess model quality and validate them, like a holdout set or cross-validation. In the next video, we'll discuss how to modify our training procedure so that our models cannot overfit, how to reduce their complexity.
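Before moving on, here is a minimal sketch of the two validation strategies from this lesson. The lesson itself doesn't prescribe any tooling, so the use of scikit-learn, logistic regression, and a synthetic classification dataset are all assumptions here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic labeled data standing in for a real classification dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Holdout validation: 70% of the data for training, 30% for measuring quality on unseen data.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("train accuracy:  ", model.score(X_train, y_train))
print("holdout accuracy:", model.score(X_holdout, y_holdout))

# K-fold cross-validation: each example ends up in the holdout fold exactly once,
# but the model has to be trained K times.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("cross-validation accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

A large gap between the training and holdout accuracies is the sign of overfitting discussed above, and the spread of the cross-validation scores gives a rough idea of the variance of the quality estimate.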