Okay. Let's wrap up by talking about two really important tasks when you're doing regression. And through this discussion, it's gonna motivate another important concept of thinking about validation sets.

So, the two important tasks in regression are: first, we need to choose a specific model complexity. So for example, when we're talking about polynomial regression, what's the degree of that polynomial? And then, for our selected model, we assess its performance. And actually, these two steps aren't specific to just regression. We're gonna see this in all different aspects of machine learning, where we have to specify our model and then we need to assess the performance of that model. So, what we're gonna talk about in this portion of this module generalizes well beyond regression.

And for this first task, where we're talking about choosing the specific model, we're gonna talk about it in terms of some set of tuning parameters, lambda, which control the model complexity. Again, for example, lambda might specify the degree of the polynomial in polynomial regression.

So, let's first talk about how we can think about choosing lambda. And then, for a given model specified by lambda, a given model complexity, let's think about how we're gonna assess the performance of that model.

Well, one really naive approach is to do what we've described before, where you take your data set and split it into a training set and a test set. And then, what we're gonna do for our model selection portion, where we're choosing the model complexity lambda, is: for every possible choice of lambda, we're gonna estimate the model parameters associated with that lambda on the training set. And then we're gonna test the performance of that fitted model on the test set. And we're gonna tabulate that for every lambda that we're considering. And we're gonna choose our tuning parameters as the ones that minimize this test error, so the ones that perform best on the test data. And we're gonna call those parameters lambda star.

So, now I have my model. I have my specific degree of polynomial that I'm gonna use. And I wanna go and assess the performance of this specific model.
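As a rough sketch of this naive procedure, here is what it might look like in Python, assuming a synthetic 1-D dataset, squared-error loss, and polynomial degree as the tuning parameter lambda; all of the names, sizes, and candidate degrees here are illustrative, not from the lecture.

```python
# Naive procedure: choose the degree (lambda) that minimizes test error,
# then reuse that same test error as the performance estimate.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)                              # synthetic inputs
y = np.sin(3 * x) + rng.normal(scale=0.3, size=x.shape)  # synthetic noisy outputs

# Naive split: training set and test set only.
train_x, test_x = x[:160], x[160:]
train_y, test_y = y[:160], y[160:]

def mse(w, xs, ys):
    # Mean squared error of a polynomial with coefficients w.
    return np.mean((np.polyval(w, xs) - ys) ** 2)

test_error = {}
for degree in range(1, 11):                      # candidate lambdas
    w = np.polyfit(train_x, train_y, degree)     # fit parameters on the training set
    test_error[degree] = mse(w, test_x, test_y)  # tabulate test error per lambda

best_degree = min(test_error, key=test_error.get)   # lambda* chosen on the TEST set
print("chosen degree (lambda*):", best_degree)
print("reported error:", test_error[best_degree])   # reused test error (overly optimistic)
```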
And the way I'm gonna do this is I'm gonna take my test data again. And I'm gonna say, well, okay, I know that test error is an approximation of generalization error. So, I'm just gonna compute the test error for this lambda star fitted model, and I'm gonna use that as my approximation of the performance of this model.

Well, what's the issue with this? Is this gonna perform well? No, it's really overly optimistic. So, this issue is just like what we saw when we weren't dealing with this notion of choosing model complexity. We just assumed that we had a specific model, like a specific degree polynomial, but we wanted to assess the performance of the model. And the naive approach we took there was saying, well, we fit the model to the training data, and then we're gonna use training error to assess the performance of the model. And we said that was overly optimistic because we were double dipping. We had already used the data to fit our model, and so that error was not a good measure of how we're gonna perform on new data.

Well, it's exactly the same notion here, and let's walk through why. More specifically, when we were thinking about choosing our model complexity, we were using our test data to compare between different lambda values. And we chose the lambda value that minimized the error on that test data, the one that performed the best there. So, you could think of this as having fit lambda, this model complexity tuning parameter, on the test data. And now, we're thinking about using test error as a notion of approximating how well we'll do on new data. But the issue is, unless our test data represents everything we might see out there in the world, that's gonna be way too optimistic. Because lambda was chosen, the model was chosen, to do well on the test data, and so that won't generalize well to new observations.

So, what's our solution? Well, we can just create two test data sets. They won't both be called test sets; we're gonna call one of them a validation set. So, we're gonna take our entire data set, just to be clear, and now we're gonna split it into three data sets. One will be our training data set, one will be what we call our validation set, and the other will be our test set.
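A small sketch of how that three-way split might look in code, assuming the rows of the dataset are exchangeable so a random shuffle gives representative subsets; the 80-10-10 proportions used here are just one of the common choices discussed below.

```python
# One way to carve a dataset into training, validation, and test sets.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)                               # synthetic stand-in for the data
y = np.sin(3 * x) + rng.normal(scale=0.3, size=x.shape)

idx = rng.permutation(len(x))                             # shuffle before splitting
n_train, n_valid = int(0.8 * len(x)), int(0.1 * len(x))

train_idx = idx[:n_train]
valid_idx = idx[n_train:n_train + n_valid]
test_idx  = idx[n_train + n_valid:]

train_x, train_y = x[train_idx], y[train_idx]
valid_x, valid_y = x[valid_idx], y[valid_idx]
test_x,  test_y  = x[test_idx],  y[test_idx]
```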
And then what we're gonna do is, we're gonna fit our model parameters, always on our training data, for every given model complexity that we're considering. But then we're gonna select our model complexity as the model that performs best on the validation set, the one that has the lowest validation error. And then we're gonna assess the performance of that selected model on the test set. And we're gonna say that that test error is now an approximation of our generalization error. Because that test set was never used in either fitting our parameters, w hat, or selecting our model complexity lambda, that other tuning parameter. So, that data was completely held out, never touched, and it now forms a fair estimate of our generalization error.

So in summary, we're gonna fit our model parameters, for any given complexity, on our training set. Then, for every fitted model and for every model complexity, we're gonna assess the performance and tabulate this on our validation set. And we're gonna use that to select the optimal set of tuning parameters, lambda star. And then, for that resulting model, that w hat sub lambda star, we're gonna assess a notion of the generalization error using our test set.

And so a question is, how can we think about doing the split between our training set, validation set, and test set? And there's no hard and fast rule here; there's no one answer that's the right answer. But typical splits that you see out there are something like an 80-10-10 split. So, 80% of your data for training, 10% for validation, 10% for test. Or another common split is 50%, 25%, 25%. But again, this is assuming that you have enough data to do this type of split and still get reasonable estimates of your model parameters, and reasonable notions of how different model complexities compare, because you have a large enough validation set, and you still have a large enough test set in order to assess the generalization error of the resulting model. And if this isn't the case, we're gonna talk about other methods that allow us to do these same types of things, but without this type of hard division between training, validation, and test.
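Putting the whole procedure together, here is a minimal end-to-end sketch under the same illustrative assumptions as before (synthetic 1-D data, squared error, polynomial degree standing in for the tuning parameter lambda, and an 80-10-10 split); it is meant only to mirror the three steps just summarized.

```python
# Fit on training, select lambda on validation, assess once on test.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=x.shape)

# 80-10-10 split into training, validation, and test indices.
idx = rng.permutation(len(x))
n_tr, n_va = int(0.8 * len(x)), int(0.1 * len(x))
tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

def mse(w, i):
    # Mean squared error of polynomial coefficients w on the rows indexed by i.
    return np.mean((np.polyval(w, x[i]) - y[i]) ** 2)

# 1) Fit parameters (w hat) on the TRAINING set for every candidate degree (lambda).
fits = {d: np.polyfit(x[tr], y[tr], d) for d in range(1, 11)}

# 2) Select lambda* as the degree with the lowest VALIDATION error.
best_degree = min(fits, key=lambda d: mse(fits[d], va))

# 3) Report the TEST error of the selected model: an approximation of generalization
#    error, since the test set was never used for fitting parameters or choosing lambda.
print("lambda* (degree):", best_degree)
print("estimated generalization error:", mse(fits[best_degree], te))
```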