[MUSIC] So in modules one and two we described how to fit different models, and in module two in particular we described how to fit very complex models. But up until our third module, we had no idea how to assess whether that fitted model was going to perform well on our prediction tasks. So in module three, our emphasis was on assessing the performance of our fitted model and thinking about how we can select between different models to get good predictive performance.

The first notion that we introduced in order to measure how well our fit was performing was the measure of loss. This is a kind of negative measure of performance, where we want to lose as little as possible from making poor predictions, under the assumption that our predictions will never be perfect. And we discussed two examples of loss metrics that are very commonly used: absolute error and squared error.

Then, with this loss function, we talked about defining three different measures of error. The first was our training error, which we said was not a good assessment of the predictive performance of our model. Then we defined something called our generalization, or true, error, which is what we really want: we want to say how well we are predicting every possible observation that we might see out there. And we said, okay, well, we can't actually compute that, so we defined something called our test error, which looks at the subset of our data that was not included in the training set, takes the model that was fit on the training data set, and makes predictions on these held-out points. And we said that test error was a noisy approximation to our generalization error.

For these three different measures of error, we talked about how they varied as a function of model complexity. Training error, we know, goes down with increasing model complexity, but that doesn't indicate that we get better and better predictions as we increase our model complexity. In contrast, if we look at generalization error, the true error, it tends to increase after a certain point. We say that that point is where models start to become overfit, because they perform very well on the training data set but don't generalize well to new data that we have not yet seen. And again, although we discussed this in the context of regression, this notion of training, test, and generalization error, and their variation with model complexity, is a much more general concept that we'll see again in the specialization.

We then characterized three different sources that contribute to our prediction error. The first is the noise that's inherent in the data. This is our irreducible error: we have no control over it, and it has nothing to do with our model or our estimation procedure. But then we talked about this idea of bias and variance. We described bias as saying how well our model can fit the true relationship, averaging over all possible training data sets that we might see. Variance, in contrast, describes how much a fitted function can vary from training data set to training data set, all of the same size N. So of course noise in the data can contribute to our errors in prediction, but if our model can't adequately describe the true relationship, that's also a source of error, as is this variability from training set to training set. So of course we want low bias and low variance to have good predictive performance, but we saw that there's this bias-variance trade-off.
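To make these error curves concrete, here is a minimal sketch, not from the lecture itself, that fits polynomials of increasing degree to synthetic data and prints training versus test error under squared-error loss. The synthetic function, the sample sizes, and the degree grid are all illustrative assumptions:

```python
# Minimal sketch (illustrative assumptions, not course code): training error
# falls with model complexity, while test error -- our noisy approximation
# to generalization error -- eventually rises, the symptom of overfitting.
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)  # the "true relationship" (assumed)

# Noisy observations; the noise is the irreducible error we can't control.
x_train = rng.uniform(0, 1, 30)
y_train = true_f(x_train) + rng.normal(0, 0.3, x_train.size)
x_test = rng.uniform(0, 1, 200)   # held-out points, never used for fitting
y_test = true_f(x_test) + rng.normal(0, 0.3, x_test.size)

for degree in [1, 2, 3, 5, 9]:    # increasing model complexity
    coefs = np.polyfit(x_train, y_train, degree)  # fit on training data only
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Running this, the training MSE keeps shrinking as the degree grows, while the test MSE typically bottoms out and then climbs past the sweet spot.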
So as you increase model complexity, your bias goes down, but your variance goes up. There's a sweet spot that trades off between bias and variance and results in the lowest mean squared error, and that's what we're seeking to find. And like we've said multiple times, machine learning is all about exploring this bias-variance trade-off.

We then concluded this module by asking how we are going to both select our model and assess its performance. For this we said, well, we need to form something called a validation set. We fit our model on the training data set; we select between different models, or think about selecting a tuning parameter describing these different models, on our validation set; and then we assess performance on our test set, where we never touched the test data. And, as we describe in later modules, if you don't have enough data to form this validation set, you can think about doing cross-validation instead.

Then, in our fourth module, we talked about ridge regression. Remember that as our models become more and more complex, we can become overfit, and what we saw is that the symptom of overfitting was that the magnitudes of our estimated coefficients just exploded. So what ridge regression does is trade off between a measure of fit of our function to our training data and a measure of the magnitude of the coefficients. And implicitly, by balancing these two terms, we're doing a bias-variance trade-off. In particular, we saw that our ridge regression objective sought to minimize our residual sum of squares plus lambda times the squared L2 norm of our coefficients, and we talked about what the coefficient path of our ridge solution looked like as we varied this tuning parameter lambda, the penalty strength on this L2 norm term. And we saw that as you increase this penalty parameter, the magnitudes of our coefficients become smaller and smaller and smaller.

Then, for our ridge objective, just like we did for our standard least squares objective, we computed the gradient and set it equal to zero to get our closed-form solution, and this looks very similar to the solution we had before, except with an additional term. What we talked about in this module is the fact that adding this lambda times the identity matrix allowed us to have a solution even when the number of features was larger than the number of observations, and it allowed for a much more, quote unquote, regularized solution. That's why it's called a regularized regression technique. But the complexity of the solution was exactly the same as we had for least squares: cubic in the number of features that we have. We also talked about a gradient descent implementation of ridge. And as we saw, a key question in what solution we get out of ridge is determined by this lambda penalty strength. So for this, instead of talking about cutting out a validation set to select this tuning parameter, we talked about cases where you might not have enough data to do that, and instead described this cross-validation procedure. [MUSIC]
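To close out the recap, here is a minimal sketch, again on assumed synthetic data rather than the course's own code, of the ridge closed-form solution w = (H^T H + lambda I)^{-1} H^T y. It illustrates the two facts discussed above: a solution exists even when the number of features D exceeds the number of observations N (where plain least squares breaks down), and the coefficient magnitudes shrink as the penalty strength lambda grows:

```python
# Minimal sketch (assumptions, not course code): ridge closed-form solution
# w = (H^T H + lambda * I)^{-1} H^T y on synthetic data with D > N.
import numpy as np

rng = np.random.default_rng(1)
N, D = 20, 30                       # more features than observations
H = rng.normal(size=(N, D))         # assumed feature matrix
y = H @ rng.normal(size=D) + rng.normal(0, 0.1, N)

for lam in [0.0, 0.01, 0.1, 1.0, 10.0]:
    if lam == 0.0:
        # lambda = 0 is plain least squares; with D > N, H^T H is singular,
        # so there is no unique solution -- the problem ridge fixes.
        print("lambda =  0.00: H^T H is singular for D > N, no unique solution")
        continue
    # Solving the linear system is preferred to an explicit matrix inverse,
    # but the cost is still cubic in the number of features D, as noted above.
    w = np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)
    print(f"lambda = {lam:5.2f}: ||w||_2 = {np.linalg.norm(w):.3f}")
```

As lambda increases, the printed L2 norm of the coefficients drops, matching the coefficient paths discussed in the module.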