[MUSIC] So we've said that to assess the performance of our model, we really need to have a test data set carved out from our full data set. So, this raises the question of, how do I think about dividing the data set into training data versus test data? So, in pictures, how many points do I put in this blue space here, the training set, versus this pink space, the test set?

Well, if I put too few points in my training set, then I'm not going to estimate my model well, and so I'm clearly going to have bad predictive performance because of that. But on the other hand, if I put too few points in my test set, that's gonna be a bad approximation to generalization error, because it's not gonna represent a wide enough range of things I might see out there in the world.

So there's no perfect formula for how to split a data set into training versus test. But a general rule of thumb, if you can figure out how to do this, is that typically you want just enough points in your test set to approximate generalization error well, and you want all the remaining points in your training set, because you want as many points as possible in your training set to learn a good model, especially when you're looking at very complex models. But, like we've said before, you still wanna have enough points in your test set to assess the performance of the fitted model.

Okay, well, this is assuming that you have enough data to do this type of split, so that you can leave enough points in both the training and test sets. But if that isn't the case, there are other methods that we're gonna talk about in this course, like cross validation. [MUSIC]
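To make the train/test split concrete, here is a minimal sketch in plain Python. The lecture doesn't prescribe a specific split ratio, so the `test_fraction=0.2` default below is a hypothetical rule-of-thumb choice, and the function name and interface are illustrative, not from the course:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle the data and carve out a held-out test set.

    test_fraction is a hypothetical rule-of-thumb value: just enough
    points to approximate generalization error, with the rest kept
    in the training set to learn a good model.
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    indices = list(range(len(data)))
    rng.shuffle(indices)               # shuffle so the split is random
    n_test = int(len(data) * test_fraction)
    test = [data[i] for i in indices[:n_test]]
    train = [data[i] for i in indices[n_test:]]
    return train, test

# Example: 100 data points, 80/20 split
points = list(range(100))
train, test = train_test_split(points, test_fraction=0.2)
print(len(train), len(test))  # 80 20
```

Note that every point lands in exactly one of the two sets; as the lecture says, when the full data set is too small to leave enough points on both sides of this split, techniques like cross validation are the alternative.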