Okay, so we can't compute generalization error, but we want some better measure of our predictive performance than training error gives us. And this takes us to something called test error, and what test error is going to allow us to do is approximate generalization error.

The way we're gonna do this is by approximating the error using houses that aren't in our training set. So to do that, we have to hold out some houses. Instead of including all these colored houses in our training set, where the colored houses are our entire recorded data set, we're gonna shade out some of them, these shaded gray houses, and we're gonna make these into what's called a test set.

Okay, so here we have houses that are not included in our training set; the training set is the remaining colored houses here. And when we go to fit our models, we're just going to fit our models on the training data set. But then, when we go to assess the performance of that model, we can look at these test houses, and these are hopefully going to serve as a proxy for everything out there in the world. So hopefully our test data set is a good representation of other houses that we might see, or at least good enough to gauge how well a given model is performing.

Okay, so test error is gonna be our average loss computed over the houses in our test data set. Formally, we write it as follows: we have one over N_test, where N_test is the number of houses in our test data set, times the sum of the loss over those test set houses.

But I wanna emphasize, and this is really, really important, that the estimated parameters w hat were fit on the training data set. So even though this function looks very, very much like training error, the sum is over the test houses, but the function we're looking at was fit on training data. Okay, so these parameters in this fitted function never saw the test data.

So just to illustrate this, as in our previous example, we might think of fitting a quadratic function through this data, where we're gonna minimize the residual sum of squares on the training points, those blue circles, to get our estimated parameters w hat.
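To pin down the formula described in words above, here is a minimal sketch of the test-error expression using the lecture's quantities; the specific symbols (y_i for the true price, f with parameters w hat for the fitted prediction function) are notation I'm supplying, not fixed by the transcript:

```latex
% Test error: average loss over the held-out test houses,
% with \hat{w} estimated from the training set only.
\text{TestError}(\hat{w})
  = \frac{1}{N_{\text{test}}}
    \sum_{i \in \text{test set}} L\!\left(y_i,\, f_{\hat{w}}(x_i)\right)

% With squared-error loss, as in the quadratic-fit example:
\text{TestError}(\hat{w})
  = \frac{1}{N_{\text{test}}}
    \sum_{i \in \text{test set}} \left(y_i - f_{\hat{w}}(x_i)\right)^2
```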
Then, when we go to compute our test error, where in this case again we're gonna use squared error as an example, we're computing this error over the test points, all these gray circles here. So test error is one over N_test times the sum of the squared differences between our true house sales prices and our predicted prices, summing over all houses in our test data set. Okay, so this is where the difference arises: this function was fit with the blue circles, but the performance we're assessing is on these gray circles.

Okay, so let's summarize our measures of error as a function of model complexity. And what we saw was that our training error decreased with increasing model complexity. So here, this is our training error.

And in contrast, our generalization error went down for some period of time, but then we started getting to overly complex models that didn't generalize well, and the generalization error started increasing. So here we have generalization error, or true error.

And what is our test error? Well, our test error is a noisy approximation of generalization error. Because if our test data set included everything we might ever see in the world, in proportion to how likely it was to be seen, then that would be exactly our generalization error. But of course, our test data set is just some finite data set, and we're using it to approximate generalization error, so it's gonna be some noisy version of this curve here. So this is our test error.

Okay, so test error is the thing that we can actually compute, and generalization error is the thing that we really want.
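As a concrete companion to this workflow, here is a small runnable sketch of the hold-out procedure: fit a quadratic by least squares on training houses only, then evaluate average squared error on held-out houses. Everything in it (the synthetic data, the variable and function names) is illustrative and assumed, not taken from the course materials:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "house" data: price as a noisy quadratic function of size.
sqft = rng.uniform(500, 4000, size=100)
price = 50_000 + 120 * sqft - 0.01 * sqft**2 + rng.normal(0, 20_000, size=100)

# Hold out some houses: the first 80 form the training set, the rest the test set.
idx = rng.permutation(len(sqft))
train_idx, test_idx = idx[:80], idx[80:]

# Fit a quadratic by minimizing residual sum of squares on TRAINING data only.
w_hat = np.polyfit(sqft[train_idx], price[train_idx], deg=2)

def avg_squared_error(w, x, y):
    """Average squared-error loss between true prices y and predicted prices."""
    return np.mean((y - np.polyval(w, x)) ** 2)

# Training error: average loss over the houses the fit has already seen.
train_err = avg_squared_error(w_hat, sqft[train_idx], price[train_idx])

# Test error: same formula and same w_hat, but averaged over held-out houses
# the fitted parameters never saw -- a noisy proxy for generalization error.
test_err = avg_squared_error(w_hat, sqft[test_idx], price[test_idx])

print(f"training error: {train_err:,.0f}")
print(f"test error:     {test_err:,.0f}")
```

Run as-is, the test error typically comes out somewhat higher than the training error, which is exactly the gap between fitting the data you have and predicting data you haven't seen.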