So, instead of using training error to assess our predictive performance, what we'd really like to do is analyze something that's called generalization or true error. In particular, we really want an estimate of the loss averaged over all houses that we might ever see in our neighborhood. But in our dataset we only have a few examples of houses that were sold. There are lots of other houses in our neighborhood that aren't in our dataset, and other houses that you might imagine having been sold.

Okay, so to compute this estimate over all houses that we might see, we'd like to weight these house pairs, the pair of house attributes and the house sale price, by how likely that pair is to occur. To do this, we can define a distribution, in this case over square feet of houses in our neighborhood. What this little cartoon is trying to show is a distribution over the real line of square feet. You can think of it as a really dense histogram, in a sense, counting how many houses we might see at every possible square-footage value. And what this picture is showing is a distribution that says we're very unlikely to see very small houses, and we're also very unlikely to see really, really massive houses. There's some bell curve to this, some sweet spot of typical houses in our neighborhood, and the likelihood drops off from there.

Likewise, we can define a distribution that says: for a given square footage of a house, what's the distribution over the sale price of that house? Let's say the house has 2,640 square feet. Maybe I expect the range of house prices to be somewhere between $680,000 and maybe $950,000. That might be a typical range, but of course you might see much lower or higher valued houses, depending on the quality of that house. That's what this distribution here is representing.

Okay, so formally, when we go to define our generalization error, we're saying that we're taking the average value of our loss, weighted by how likely those pairs are. Specifically, we estimate our model parameters on our training dataset, and that's what gives us w hat. That defines the model we're using for prediction. Then we have our loss function, assessing the cost of predicting f sub w hat at our square footage x when the true value was y. And then what we're gonna do is average over all possible (x, y) pairs, weighted by how likely they are according to those distributions over square feet and value given square feet.

Okay, so let's go back to these plots of error versus model complexity, but in this case let's quantify our generalization error as a function of this complexity. To do this, what I'm showing with this shaded blue region, with its gradation going from white to darker blue, is the distribution of houses that I'm likely to see. This white region here, these are the houses that I'm very, very likely to see, and as I go further away from it, I get to less and less likely house sale prices given a specific square-foot value.
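Before walking through these plots, it may help to write the definition from a moment ago out symbolically. This is standard notation consistent with the description above: p(x) is the distribution over square feet, p(y | x) the distribution over sale price given square feet, L the loss, and f with parameters w hat the model fit on the training data:

\[
\text{generalization error}
= \mathbb{E}_{x,y}\!\left[\, L\big(y,\, f_{\hat{w}}(x)\big) \,\right]
= \iint L\big(y,\, f_{\hat{w}}(x)\big)\, p(y \mid x)\, p(x)\, dy\, dx
\]

The two weighting distributions in the integral are exactly the two cartoons just described: the bell curve over square feet, and the spread of prices for a house of a given size.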
And so, when I think about generalization error, I'm gonna take my fitted function, where remember this green line was fit on the training data, which are these blue circles. Then I'm gonna ask: how well does it predict houses in this shaded blue region, weighted by how likely they are, how close they are to that white region? If you imagine this in 3D, there are distributions popping up off of this shaded grey and shaded blue area. The distribution at a given square footage, for the houses with xt square feet, looks something like this. So when I think about how well my prediction is doing at xt, this x here, I'm looking at the difference between my prediction and all points along this line, weighted by how likely they are in the general population of houses I might see. And then I do that across this entire region of possible square feet.

Okay, so what I see here is this constant model, which really doesn't approximate things well except maybe in this region here. So overall it has a reasonably high generalization error. Then I can go to my more complex model, just fitting a line through the data, and I see I have better performance, but still not doing great in these regions. So my generalization error drops a bit. When I get to this higher-complexity quadratic fit, things are starting to look a bit better, maybe not great out in these regions here, so again the generalization error drops.

Then I get to this much higher-order polynomial. When we were looking at training error, the training error was lower, right? But now, when we think about generalization error, we actually see that it's gonna go up relative to the simpler model, because if we look at this region here, it's doing really horribly. So we might get a generalization error that's actually larger than the quadratic's. And then we can fit an even higher-order polynomial, and we get this really, really crazy fit. It's doing horribly basically everywhere, except maybe at these very, very small regions where it's doing okay. So in this case we get dramatically bad generalization error.

Okay, so this is starting to match a lot more of our intuition about what might be a good fit to this data. So, now that we've fit these few specific points, let's think about drawing the curve over all possible models. Our generalization error in general will have some shape where it's going down, and then we get to a point where the error starts increasing, because we're getting to these overly complex models that fit the training data really well but don't generalize to other houses that we might see.

But importantly, in contrast to training error, we can't actually compute generalization error, because everything was relative to this true distribution, the true way in which the world works: how likely houses are to appear, over all possible square feet and all possible house values. And of course, we don't know what that is. So this is our ideal picture, our cartoon of what would happen, but we can't actually go along and compute these different points.
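Although we can never compute generalization error on real data, we can watch this cartoon play out in simulation, where we invent the true distribution ourselves and can therefore approximate the average over "all houses we might ever see" by sampling a huge fresh dataset. Here is a minimal sketch; every number and the form of the true price curve are made up for illustration and are not the course's dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_houses(n):
    """Draw (sqft, price) pairs from an assumed 'true' distribution.
    p(x): bell curve over square feet; p(y|x): noise around a smooth trend."""
    sqft = rng.normal(loc=2640, scale=700, size=n).clip(500, 6000)
    price = 100 + 0.35 * sqft - 4e-5 * sqft**2 + rng.normal(0, 60, size=n)
    return sqft, price  # price in thousands of dollars (illustrative units)

def standardize(x):
    return (x - 2640.0) / 700.0  # rescale sqft so high-degree fits stay numerically stable

# A small training set (the blue circles) and a huge fresh sample that stands
# in for the true distribution -- a Monte Carlo estimate of the average over
# all houses we might see.
train_x, train_y = sample_houses(15)
test_x, test_y = sample_houses(100_000)

for degree in [0, 1, 2, 5, 10]:
    w_hat = np.polyfit(standardize(train_x), train_y, degree)  # fit on training data only
    train_mse = np.mean((np.polyval(w_hat, standardize(train_x)) - train_y) ** 2)
    gen_mse = np.mean((np.polyval(w_hat, standardize(test_x)) - test_y) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:10.1f} | ~generalization MSE {gen_mse:10.1f}")
```

Running this, you should typically see the training error keep falling as the degree grows, while the estimated generalization error falls at first and then blows up for the high-degree fits: the same down-then-up curve as in the cartoon, just computable here only because we chose the true distribution ourselves.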