So, instead of using training error to assess our predictive performance, what we'd really like to analyze is something called generalization error, or true error.

In particular, we really want an estimate of the loss averaged over all houses that we might ever see in our neighborhood. But our dataset only has a few examples of houses that were sold; there are lots of other houses in our neighborhood that aren't in our dataset, or other houses that you might imagine having been sold.

Okay, so to compute this estimate over all houses we might see, we'd like to weight each house pair, the pair of house attributes and house sale price, by how likely that pair is to occur.

To do this, we can think about defining a distribution, in this case over square feet of houses in our neighborhood. What this little cartoon is trying to show is a distribution over the real line of square feet. You can think of it as a really dense histogram, counting how many houses we might see with a given square footage, for every possible square footage value. What this picture is showing is a distribution that says we're very unlikely to see houses with a very low number of square feet, very small houses, and we're also very unlikely to see really, really massive houses. So there's some bell curve to this, some sweet spot of typical houses in our neighborhood, and then the likelihood drops off from there.

Likewise, we can define a distribution that says, for a given square footage of a house, what is the distribution over the sale price of that house? So let's say the house has 2,640 square feet. Maybe I expect the range of house prices to be somewhere between $680,000 and maybe $950,000; that might be a typical range. But of course, you might see much lower-valued or higher-valued houses, depending on the quality of that house, and that's what this distribution here is representing.

Okay, so formally, when we go to define our generalization error, we're saying that we're taking the average value of our loss, weighted by how likely those (square feet, price) pairs are to occur.
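Written out in symbols (just restating the idea above, with L the loss, f_ŵ the fitted model, and p the true distributions over square feet and over price given square feet), the generalization error is an expectation of the loss; in principle it could be approximated by averaging the loss over samples drawn from those distributions:

    \text{generalization error}
        = E_{x,y}\!\left[ L\!\left(y, f_{\hat{w}}(x)\right) \right]
        = \int L\!\left(y, f_{\hat{w}}(x)\right)\, p(x)\, p(y \mid x)\, dy\, dx
        \approx \frac{1}{M} \sum_{m=1}^{M} L\!\left(y_m, f_{\hat{w}}(x_m)\right),
      \qquad (x_m, y_m) \sim p(x)\, p(y \mid x).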
So specifically, we estimate our model parameters on our training dataset, and that's what gives us ŵ (w hat). That defines the model we're using for prediction. Then we have our loss function, assessing the cost of predicting f_ŵ at a given square footage x when the true value was y. And then we average over all possible (x, y) pairs, weighted by how likely they are according to those distributions over square feet and over value given square feet.

Okay, so let's go back to these plots of error versus model complexity, but in this case let's quantify our generalization error as a function of this complexity. To do this, what I'm showing with this shaded blue region, with a gradation going from white to darker blue, is the distribution of houses I'm likely to see. This white region (well, it's not quite white anymore, but hopefully you can still see it) contains the houses I'm very, very likely to see, and as I move further away from it I get to less and less likely house sale prices given a specific square footage.

So when I think about generalization error, I take my fitted function, and remember this green line was fit on the training data, these blue circles. Then I ask: how well does it predict houses in this shaded blue region, weighted by how likely they are, by how close they are to that white region? If you imagine this in 3D, there are distributions popping up off of this shaded grey and shaded blue area. Maybe I can try to draw it. Maybe the distribution at a given square footage... okay, that doesn't look good at all, let me try it again. It looks something like this: the distribution of prices for houses with x_t square feet. So when I think about how well my prediction is doing at x_t, this x here, I'm looking at the difference between my prediction and all points along this line, weighted by how likely they are in the general population of houses I might see. And then I do that across this entire region of possible square feet.
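To make that weighted average concrete, here is a minimal Python sketch. The distributions over square feet and over price given square feet are made up for illustration (the lecture never specifies them), squared error stands in for the loss L, and a simple linear fit stands in for the fitted function; the point is only that averaging over samples from the distribution automatically weights each (square feet, price) pair by how likely it is.

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed distributions, for illustration only: p(sqft) is bell-shaped
    # around a typical house, and p(price | sqft) scatters around an assumed
    # true relationship.
    def sample_houses(n):
        sqft = rng.normal(2500, 600, size=n)
        price = 250 * sqft + 150000 + rng.normal(0, 105000, size=n)
        return sqft, price

    # A fitted model f_w_hat: a simple linear fit learned from a small
    # training sample (standing in for the green line fit to the blue circles).
    sqft_train, price_train = sample_houses(20)
    w_hat = np.polyfit(sqft_train, price_train, 1)

    # Monte Carlo version of the generalization error: average the squared
    # loss over many (sqft, price) pairs drawn from the assumed distributions,
    # so likely pairs contribute more than unlikely ones.
    sqft_all, price_all = sample_houses(500000)
    gen_error = np.mean((np.polyval(w_hat, sqft_all) - price_all) ** 2)
    print(gen_error)

With a sampler like this, the constant, linear, quadratic, and higher-order fits discussed next can all be scored in exactly the same way.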
Okay, so what I see here is that this constant model really doesn't approximate things well, except maybe in this region here, so overall it has a reasonably high generalization error. Then I can go to my more complex model, just fitting a line through the data, and I see I have better performance, but it's still not doing great in these regions. So my generalization error dropped a bit. When I get to this higher-complexity quadratic fit, things are starting to look a bit better, maybe not great out in these regions here, so again the generalization error drops.

Then I get to this much higher-order polynomial. When we were looking at training error, its training error was lower, right? But now, when we think about generalization error, we actually see that the generalization error is going to go up relative to the simpler model, because if we look at this region here, it's doing really horribly. So we might get a generalization error that's actually larger than the quadratic's. Then we can fit an even higher-order polynomial, and we get this really, really crazy fit that's doing horribly basically everywhere, except maybe in these very, very small regions where it's doing okay. So in this case we get dramatically bad generalization error.

Okay, so this is starting to match a lot more of our intuition about what might be a good fit to this data. So let's think about drawing the curve over all possible model complexities, now that we've fit these few specific points. Our generalization error in general will have some shape where it goes down, and then we get to a point where the error starts increasing. (Sorry, that should have been a smoother curve.) The error starts increasing because we're getting to these overly complex models that fit the training data really well but don't generalize to other houses that we might see.

But importantly, in contrast to training error, we can't actually compute generalization error, because everything was relative to this true distribution, the true way in which the world works: how likely houses are to appear, over all possible square feet and all possible house values. And of course, we don't know what that is. So this is our ideal picture, our cartoon of what would happen, but we can't actually go along and compute these different points.
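As a rough numerical counterpart to this cartoon, here is a sketch that sweeps over polynomial degree under the same made-up distributions as before. Training error keeps shrinking as complexity grows, while the estimate of generalization error (computed on a large fresh sample, which we can only do here because the distribution is assumed known, not in the real-world setting just described) typically falls at first and then blows up for the high-degree fits.

    import numpy as np

    rng = np.random.default_rng(1)

    def sample_houses(n):
        # Assumed distributions, for illustration only (not from the lecture).
        sqft = rng.normal(2500, 600, size=n)
        price = 250 * sqft + 150000 + rng.normal(0, 105000, size=n)
        # Rescale square feet so high-degree polyfit stays numerically stable.
        return (sqft - 2500) / 600, price

    def mse(w, x, y):
        # Squared-error loss averaged over the given houses.
        return np.mean((np.polyval(w, x) - y) ** 2)

    x_train, y_train = sample_houses(15)     # small training set (the blue circles)
    x_all, y_all = sample_houses(200000)     # stand-in for "all houses we might see"

    for degree in [0, 1, 2, 5, 10]:
        w_hat = np.polyfit(x_train, y_train, degree)
        print(f"degree {degree:2d}: "
              f"train {mse(w_hat, x_train, y_train):.2e}  "
              f"generalization (est.) {mse(w_hat, x_all, y_all):.2e}")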