So, the first measure of error of our predictions that we can look at is something called training error. We discussed this at a high level in the first course of the specialization, but now let's go through it in a little more detail. To define training error, we first have to define training data. Typically you have some dataset, which I've shown here as these blue circles, and we choose our training dataset as some subset of these points. The greyed circles are the ones not included in the training set; the blue circles are the ones we're keeping. Then we take our training data and, as we've discussed in previous modules of this course, use it to fit our model, that is, to estimate our model parameters. As an example, with this dataset maybe we choose to fit some quadratic function, and as we've talked about, to fit that quadratic function we're gonna minimize the residual sum of squares on these training data points.

So now we have our estimated model parameters, w hat, and we want to assess the training error of that estimated model. The way we do that is to first define some loss function, maybe squared error or absolute error, any one of the many possibilities. Training error is then defined simply as the average loss over the training points. Mathematically, that's 1/N, where N is the total number of observations in my training set, times the sum of the loss over each of those training observations. And just to be very clear: the estimated parameters were fit on the training set, by minimizing the residual sum of squares on the very same training points we're now using to define training error.

We can go through this pictorially in the following example, where we're specifically using squared error as our loss function. In this case, our training error is simply 1/N times the sum, over all houses in our training dataset, of the squared difference between the actual house sales price and the predicted house sales price. And what we see is that when we choose squared error as our loss function, the form of training error is exactly 1/N times our residual sum of squares. I want to note here that conventions differ on whether the 1/N appears in the definition of training error or not, so just be aware of that when you're computing training error and reporting these numbers. Here we're defining it as the average loss.

More formally, we can write our training error as follows, and then define something commonly referred to as RMSE, whose full name is root mean square error. RMSE is simply the square root of our average loss on the training houses, that is, the square root of our training error. The reason one might consider looking at root mean square error is that its units, in this case, are just dollars, whereas the units of training error are dollars squared. Remember, we're squaring differences measured in dollars, so the result is dollars squared, which is a little less intuitive as an error metric than an error in terms of dollars themselves.
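To make these definitions concrete, here is a minimal Python sketch (not from the course itself; the price and prediction arrays are made-up example values) computing training error and RMSE under squared-error loss:

```python
import numpy as np

def training_error(y_true, y_pred):
    # Average squared-error loss over the training set: (1/N) * RSS.
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    # Square root of training error, so the units match the target
    # (dollars rather than dollars squared).
    return np.sqrt(training_error(y_true, y_pred))

# Hypothetical sale prices and model predictions, in dollars:
prices = np.array([310_000.0, 450_000.0, 520_000.0, 280_000.0])
predicted = np.array([335_000.0, 428_000.0, 500_000.0, 295_000.0])

print(training_error(prices, predicted))  # dollars squared
print(rmse(prices, predicted))            # dollars
```

Note that dropping the `np.mean` in favor of `np.sum` gives the alternative convention mentioned above, where training error is just the residual sum of squares without the 1/N.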
Now that we've defined training error, we can look at how it behaves as model complexity increases. To start, let's look at the simplest possible model you might fit, which is just a constant model. This is the simplest model we're gonna consider, and you see that it has pretty significant training error; let's just say it has some value here, the training error of the constant model. Then let's say I fit a linear model. Well, these are all linear models we're looking at, it's linear regression, but here I mean literally fitting a line to the data. And you see that my training error has gone down, to some other value that I'm showing with this pink circle. Then I fit a quadratic function, and again training error goes down. And what I see is that as I increase my model complexity to, say, some higher-order polynomial, I have very low training error, just this one small pink bar here.

So training error decreases quite significantly with model complexity, and now that we've gone through these examples, we can look at what the plot of training error versus model complexity tends to look like: training error decreases as you increase your model complexity. And why is that? Well, it's pretty intuitive: the model was fit on the training points, and then I'm asking how well it fits them. As I increase the model complexity, I'm better and better able to fit my training data points, so when I go to assess training error with these high-complexity models, it's very low.

So a natural question is whether training error is a good measure of predictive performance. What we're showing here is one of our high-complexity, high-order polynomial models that had very low training error; it really fit those training data points well. But how is it gonna perform on some new house? In particular, maybe we're looking at a house in this grey region, with this range of square feet. The question is: is there something particularly wrong with having Xt square feet? Because what our fitted function is saying is that I'm predicting houses with roughly Xt square feet to be less valuable than houses with fewer square feet, because there's this dip down in the function. Do we really believe this is a true dip in value, that these houses are just less desirable than houses with fewer or more square feet? Probably not.

So what's going wrong here? The issue is that training error is overly optimistic when we use it to assess predictive performance. That's because the parameters w hat were fit on the training data; they were fit to minimize the residual sum of squares, which is directly related to training error. Using training error to then assess predictive performance is going to be very, very optimistic, as this picture shows. So, in general, having small training error does not imply good predictive performance, unless your training dataset is really representative of everything you might see out there in the world.
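Here is a small illustrative sketch, again with synthetic data rather than the course's dataset, showing how training error shrinks as polynomial degree grows when the fit is done by least squares on the training points themselves:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: square feet (in thousands, which keeps
# polyfit numerically well-conditioned) versus noisy sale price.
sqft = np.sort(rng.uniform(0.5, 3.5, size=15))
price = 100_000 + 150_000 * sqft + rng.normal(0, 40_000, size=15)

# Least-squares polynomial fits of increasing complexity; training
# error (average squared loss on the same points) only goes down.
for degree in [0, 1, 2, 8]:
    coeffs = np.polyfit(sqft, price, deg=degree)
    preds = np.polyval(coeffs, sqft)
    err = np.mean((price - preds) ** 2)
    print(f"degree {degree}: training error = {err:.3e}")
```

Evaluating the degree-8 fit on fresh houses drawn from the same process would tell a very different story; that gap between fit on the training points and performance on new points is exactly the optimism described above.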