1 00:00:00,000 --> 00:00:03,387 [MUSIC] 2 00:00:03,387 --> 00:00:08,377 Okay, so the issue we're facing here with this crazy 13th order polynomial fit 3 00:00:08,377 --> 00:00:10,561 is something called overfitting. 4 00:00:10,561 --> 00:00:15,152 So in particular, what we've done is we've taken a model and really, 5 00:00:15,152 --> 00:00:18,901 really, really honed in on our actual observations, but 6 00:00:18,901 --> 00:00:23,134 it doesn't generalize well to thinking about new predictions. 7 00:00:23,134 --> 00:00:29,334 And so the issues actually go beyond just making really crazy predictions. 8 00:00:29,334 --> 00:00:33,251 And we're gonna discuss this in a lot more detail in the regression course. 9 00:00:33,251 --> 00:00:37,372 But I wanna mention that this is a real problem with any machine learning model or 10 00:00:37,372 --> 00:00:40,150 statistical model that you might consider. 11 00:00:40,150 --> 00:00:44,290 So in these cases, we wanna fit a model to data, but 12 00:00:44,290 --> 00:00:48,780 we don't want that model to be so closely tailored to the one data set that 13 00:00:48,780 --> 00:00:52,560 we have that it doesn't generalize well to new observations we might get. 14 00:00:53,630 --> 00:00:58,690 Okay, so let's go back to this 13th order polynomial fit. 15 00:00:58,690 --> 00:01:02,020 And a question is, do we actually believe this? 16 00:01:02,020 --> 00:01:04,730 Do we believe that this might be a reasonable fit to the data? 17 00:01:04,730 --> 00:01:08,350 And I think as I alluded to before, probably not. 18 00:01:08,350 --> 00:01:12,280 So although it minimizes the residual sum of squares, 19 00:01:12,280 --> 00:01:15,320 it ends up leading to very bad predictions. 
20 00:01:17,240 --> 00:01:22,167 Because I'm sitting here and thinking, well, this quadratic fit that we had, 21 00:01:22,167 --> 00:01:27,023 even though it didn't minimize my residual sum of squares as much as that 13th 22 00:01:27,023 --> 00:01:32,034 order polynomial, it still, my gut feeling is, somehow this is a better model. 23 00:01:32,034 --> 00:01:35,717 Okay, so a question is, what's going on here, and 24 00:01:35,717 --> 00:01:40,926 how do we think about choosing the right model order or model complexity? 25 00:01:40,926 --> 00:01:44,003 Well, what we want, is we want good predictions. 26 00:01:44,003 --> 00:01:45,610 Of course, that's what we're aiming for. 27 00:01:47,050 --> 00:01:49,020 But we can't actually observe the future. 28 00:01:49,020 --> 00:01:52,250 Right, so we can't actually observe that prediction that we want to make and 29 00:01:52,250 --> 00:01:55,940 say did we do a good job or not until we actually go ahead and do it. 30 00:01:55,940 --> 00:01:57,760 So when we're thinking about choosing our model, 31 00:01:57,760 --> 00:02:00,690 somehow we have to work just with the data that we have. 32 00:02:00,690 --> 00:02:04,820 So how can we think about trying to choose a good model in this case? 33 00:02:04,820 --> 00:02:09,460 Well, what we can do is we can think about simulating predictions. 34 00:02:09,460 --> 00:02:11,170 So we're gonna take our data set that we have, 35 00:02:11,170 --> 00:02:13,760 and we're gonna remove some of the houses. 36 00:02:13,760 --> 00:02:16,080 So those are the grayed-out houses here. 37 00:02:16,080 --> 00:02:20,909 These are gonna be removed temporarily. 38 00:02:20,909 --> 00:02:24,090 And we're gonna fit our model on the remaining houses. 39 00:02:24,090 --> 00:02:28,910 So all of these guys we're gonna use to fit our model using 40 00:02:28,910 --> 00:02:33,738 exactly the kind of methods that we talked about before. 
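As a rough illustration of the idea just described (temporarily removing some houses and fitting on the rest), here is a minimal sketch. The function name `split_train_test`, the random selection, and the 30% hold-out fraction are all assumptions made for this example; the lecture does not say how the held-out houses are chosen or how many there are.

```python
import numpy as np

def split_train_test(n_houses, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx): disjoint index arrays covering all houses.

    Hypothetical helper: houses are held out at random, which is an
    assumption of this sketch rather than something the lecture specifies.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_houses)          # random order of house indices
    n_test = int(round(test_fraction * n_houses)) # how many houses to hold out
    test_idx = np.sort(shuffled[:n_test])         # houses removed temporarily
    train_idx = np.sort(shuffled[n_test:])        # houses used to fit the model
    return train_idx, test_idx

# Hold out 3 of 10 houses; the model would be fit on the remaining 7.
train_idx, test_idx = split_train_test(n_houses=10, test_fraction=0.3)
```

The held-out indices play the role of the grayed-out houses: they are never shown to the fitting step, so predictions on them can stand in for predictions on data not yet collected.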
41 00:02:33,738 --> 00:02:37,406 And then what we're gonna do is we're gonna predict. 42 00:02:37,406 --> 00:02:41,820 So I'll go through and erase these x's now and put question marks. 43 00:02:43,270 --> 00:02:48,672 And say from the model that I just learned on the circled houses, 44 00:02:48,672 --> 00:02:52,984 what values do I predict for these question marks? 45 00:02:52,984 --> 00:02:55,847 And then I can compare to the actual observed values, 46 00:02:55,847 --> 00:02:58,020 because these houses are in my data set. 47 00:02:59,610 --> 00:03:04,400 Okay, so I can use this as a proxy for doing the types of 48 00:03:04,400 --> 00:03:07,830 real predictions that I wanna do on data that I haven't yet collected. 49 00:03:09,020 --> 00:03:12,640 Of course, this type of method only is gonna work well if I have enough 50 00:03:12,640 --> 00:03:20,270 observations to think about fitting on versus testing my predictions on. 51 00:03:20,270 --> 00:03:23,620 Okay, so let's introduce a little bit of terminology. 52 00:03:23,620 --> 00:03:28,340 Well, the houses that we use to fit our model, we call that the training set. 53 00:03:28,340 --> 00:03:30,370 And the houses that we're using as a proxy for 54 00:03:30,370 --> 00:03:33,840 our predictions, those that we're holding out, we call the test set. 55 00:03:35,260 --> 00:03:39,790 Okay, so let's dig a little bit more into how we're gonna do this analysis. 56 00:03:40,920 --> 00:03:45,180 And the first thing that we can do is look at something called the training error. 57 00:03:45,180 --> 00:03:48,550 So we're gonna examine every house in our training data set. 58 00:03:48,550 --> 00:03:51,690 So let's look at this red color here. 59 00:03:51,690 --> 00:03:57,850 So all of our training houses are represented 60 00:03:57,850 --> 00:04:02,720 with these blue circles here, and these are the only houses 61 00:04:02,720 --> 00:04:06,640 we're gonna look at when we're thinking about defining our training error. 
62 00:04:06,640 --> 00:04:07,880 So in particular, 63 00:04:07,880 --> 00:04:12,080 we're gonna look at what are the errors that we make on these houses? 64 00:04:12,080 --> 00:04:16,741 So this is just the residual sum of squares on the houses in our training 65 00:04:16,741 --> 00:04:20,139 data set, and that's called the training error. 66 00:04:20,139 --> 00:04:24,399 So in particular, the training error looks exactly like what we had for 67 00:04:24,399 --> 00:04:27,168 our residual sum of squares calculation, but 68 00:04:27,168 --> 00:04:31,714 we're only including the houses that are present in our training data set. 69 00:04:31,714 --> 00:04:36,071 Okay, so then for any given model, such as a linear fit through the data, 70 00:04:36,071 --> 00:04:40,849 quadratic fit, or so on, what we can do is we can think about estimating our model 71 00:04:40,849 --> 00:04:44,300 parameters as those that minimize the training error. 72 00:04:44,300 --> 00:04:47,740 So that's equivalent to what we talked about before of minimizing the residual 73 00:04:47,740 --> 00:04:48,880 sum of squares. 74 00:04:48,880 --> 00:04:52,440 But again, here we're only looking at the houses in our training data set. 75 00:04:53,770 --> 00:04:58,830 Okay, so then that's how we get our estimated w hat, 76 00:04:58,830 --> 00:05:01,800 our estimated model parameters. 77 00:05:01,800 --> 00:05:05,000 But then what we wanna do is we wanna take these estimated model parameters, and 78 00:05:05,000 --> 00:05:06,800 we wanna say how well are we doing? 79 00:05:06,800 --> 00:05:08,660 And remember what we said, what we're gonna do, 80 00:05:08,660 --> 00:05:12,120 is we're gonna look at our held-out observations, okay? 81 00:05:12,120 --> 00:05:20,680 So here, these gray circles are the houses that are in our test set. 82 00:05:20,680 --> 00:05:23,380 So these are houses that were not used to fit this model. 
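The training error and the estimate w hat described above can be written compactly. The notation here is an assumption for illustration: $x_i$ and $y_i$ stand for the input and observed sale price of house $i$, and $f_w(x_i)$ for the model's prediction under parameters $w$ (for example, the point on the fitted line or polynomial).

```latex
\text{Training error}(w) \;=\; \sum_{i \,\in\, \text{training set}} \bigl( y_i - f_w(x_i) \bigr)^2,
\qquad
\hat{w} \;=\; \arg\min_{w} \; \text{Training error}(w)
```

This is the same residual sum of squares as before, with the sum restricted to the training houses only.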
83 00:05:24,840 --> 00:05:29,412 And we're gonna say how well are we predicting these actual house sales? 84 00:05:29,412 --> 00:05:31,860 Okay, and so what were our predictions? 85 00:05:31,860 --> 00:05:34,530 Well, remember when we thought about making a prediction, 86 00:05:34,530 --> 00:05:39,930 we just used the value of the fit, so just the point on the line. 87 00:05:40,930 --> 00:05:44,670 So to assess how well we're predicting these held-out observations, 88 00:05:44,670 --> 00:05:48,610 our test data, we're gonna look at something that, again, 89 00:05:48,610 --> 00:05:51,310 looks exactly like residual sum of squares. 90 00:05:51,310 --> 00:05:53,440 But it's called our test error, 91 00:05:53,440 --> 00:05:58,140 where we take these estimated model parameters w hat, and we sum 92 00:05:58,140 --> 00:06:03,220 the squared residuals over all houses that are in our test data set. 93 00:06:04,520 --> 00:06:06,930 Okay, so that's our test error. 94 00:06:06,930 --> 00:06:11,612 But what we can think about is we can think about how do test error and 95 00:06:11,612 --> 00:06:15,666 training error vary as a function of the model complexity? 96 00:06:15,666 --> 00:06:19,249 [MUSIC]
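To make the two error definitions concrete, here is a minimal sketch that fits polynomials of several orders on a training set and computes the residual sum of squares on both the training and the test houses. Everything specific in it is an assumption for illustration: the data is synthetic (randomly generated, not the lecture's house sales), the split is an arbitrary every-fourth-house hold-out, the helper name `rss` is hypothetical, and NumPy's `Polynomial.fit` is used as one convenient least-squares fitter rather than the course's own tooling.

```python
import numpy as np
from numpy.polynomial import Polynomial

def rss(model, x, y):
    """Residual sum of squares of `model` on the houses (x, y)."""
    residuals = y - model(x)
    return float(np.sum(residuals ** 2))

# Synthetic stand-in data (an assumption, not real sales): price vs. square feet.
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=40)
price = 50_000 + 150 * sqft + 0.01 * sqft**2 + rng.normal(0, 40_000, size=40)

# Hold out every fourth house as the test set; fit on the rest.
train = np.arange(40) % 4 != 0
test = ~train

errors = {}  # degree -> (training error, test error)
for degree in (1, 2, 5, 13):
    # Least-squares fit using the training houses only; this gives w hat.
    model = Polynomial.fit(sqft[train], price[train], degree)
    errors[degree] = (
        rss(model, sqft[train], price[train]),  # training error
        rss(model, sqft[test], price[test]),    # test error
    )

for degree, (train_err, test_err) in errors.items():
    print(f"degree {degree:2d}: training error {train_err:.3e}, test error {test_err:.3e}")
```

Because each higher-order polynomial can reproduce any lower-order fit, the training error cannot go up as the order increases. The test error carries no such guarantee, which is why it is the more useful proxy for how well predictions will generalize.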