1 00:00:00,131 --> 00:00:04,504 [MUSIC] 2 00:00:04,504 --> 00:00:07,127 So we've said to assess the performance of our model, 3 00:00:07,127 --> 00:00:11,060 we really need to have a test data set carved out from our full data set. 4 00:00:11,060 --> 00:00:13,040 So, this raises the question of, 5 00:00:13,040 --> 00:00:17,110 how do I think about dividing the data set into training data versus test data? 6 00:00:18,610 --> 00:00:23,920 So, in pictures, how many points do I put in this blue space here, 7 00:00:23,920 --> 00:00:27,960 this training set, versus this pink space, this test set? 8 00:00:29,540 --> 00:00:32,730 Well, if I put too few points in my training set, 9 00:00:32,730 --> 00:00:35,130 then I'm not going to estimate my model well. 10 00:00:35,130 --> 00:00:38,830 And so, I'm going to have clearly bad predictor performance because of that. 11 00:00:39,960 --> 00:00:43,940 But on the other hand, if I put too few points in my test set, 12 00:00:43,940 --> 00:00:47,720 that's gonna be a bad approximation to generalization error. 13 00:00:47,720 --> 00:00:51,190 Because it's not gonna represent a wide enough range of things 14 00:00:51,190 --> 00:00:52,450 I might see out there in the world. 15 00:00:53,970 --> 00:00:55,780 So there's no perfect formula for 16 00:00:55,780 --> 00:00:59,250 how to split a data set into training versus test. 17 00:00:59,250 --> 00:01:04,260 But a general rule of thumb, if you can figure out how to do this, is typically you 18 00:01:04,260 --> 00:01:09,800 want just enough points in your test set to approximate generalization error well. 19 00:01:09,800 --> 00:01:13,300 And you want all your other points in your training data set. 20 00:01:13,300 --> 00:01:15,320 Because you want to have as many points as possible 21 00:01:16,570 --> 00:01:20,060 in your training data set to learn a good model. 22 00:01:20,060 --> 00:01:24,086 Especially when you're looking at very complex models. 
23 00:01:24,086 --> 00:01:29,036 But you still, like we've said before, wanna have enough points in your test 24 00:01:29,036 --> 00:01:32,337 set to analyze the performance of the fitted model. 25 00:01:32,337 --> 00:01:37,191 Okay, well this is assuming that you have enough data to do this type of split, 26 00:01:37,191 --> 00:01:41,580 so that you can leave enough points in both the training and test sets. 27 00:01:41,580 --> 00:01:44,712 But if that isn't the case, there are other methods that we're gonna talk 28 00:01:44,712 --> 00:01:46,712 about in this course, like cross-validation. 29 00:01:46,712 --> 00:01:51,269 [MUSIC]
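
The split described in this segment can be sketched in a few lines of Python. This is only a minimal illustration, not a method from the lecture: the function name `train_test_split`, the `seed` parameter, and the 20% test fraction are all assumptions chosen for the example. The lecture itself says there is no perfect formula for the split, so the fraction is something you would tune to your own data.

```python
import random


def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of `data` and return (train, test) lists.

    test_fraction is an illustrative default, not a recommended value.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    if len(data) < 2:
        raise ValueError("need at least two points to form both sets")

    shuffled = list(data)
    # Shuffle so the test set is a random sample rather than the
    # first or last rows of the original ordering.
    random.Random(seed).shuffle(shuffled)

    # Hold out at least one point, but never every point.
    n_test = round(len(shuffled) * test_fraction)
    n_test = min(max(1, n_test), len(shuffled) - 1)

    # Every point that is not held out for testing goes to training.
    return shuffled[n_test:], shuffled[:n_test]


points = list(range(100))  # stand-in for 100 observations
train, test = train_test_split(points, test_fraction=0.2)
print(len(train), len(test))
```

Raising `test_fraction` gives a steadier estimate of generalization error but leaves fewer points to fit the model, which is the trade-off discussed above.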