Suppose you'd like to decide what degree of polynomial to fit to a data set, that is, what features to include to give you a learning algorithm. Or suppose you'd like to choose the regularization parameter lambda for the learning algorithm. How do you do that? These are called model selection problems. And in our discussion of how to do this, we'll talk about not just how to split your data into a train and test set, but how to split your data into what we'll discover is called the train, validation, and test sets. We'll see in this video just what these things are and how to use them to do model selection.

We've already seen many times the problem of overfitting, in which just because a learning algorithm fits a training set well, that doesn't mean it's a good hypothesis. More generally, this is why the training set error is not a good predictor of how well the hypothesis will do on new examples. Concretely, if you fit some set of parameters - theta 0, theta 1, theta 2, and so on - to your training set, then the fact that your hypothesis does well on the training set doesn't mean much in terms of predicting how well it will generalize to new examples not seen in the training set. The more general principle is that once your parameters are fit to some set of data - maybe the training set, maybe something else - then the error of your hypothesis as measured on that same data set, such as the training error, is unlikely to be a good estimate of your actual generalization error, that is, of how well the hypothesis will generalize to new examples.

Now let's consider the model selection problem. Let's say you're trying to choose what degree polynomial to fit to data. Should you choose a linear function, a quadratic function, a cubic function, all the way up to a 10th-order polynomial? So it's as if there's one extra parameter in this algorithm, which I'm going to denote d: what degree of polynomial do you want to pick? In addition to the theta parameters, it's as if there's one more parameter, d, that you're trying to determine using your data set. The first option is d equals 1, which is the linear function; we can choose d equals 2, d equals 3, all the way up to d equals 10. So we would like to fit this extra sort of parameter, which I'm denoting by d. Concretely, let's say you want to choose a model, that is, choose a degree of polynomial, choose one of these ten models, fit that model, and also get some estimate of how well your fitted hypothesis will generalize to new examples.

Here's one thing you could do: you could take your first model and minimize the training error, and this would give you some parameter vector theta. You can then take your second model, the quadratic function, fit that to your training set, and this will give you some other parameter vector theta. In order to distinguish between these different parameter vectors, I'm going to use a superscript 1, superscript 2, and so on, where theta superscript 1 just means the parameters I get by fitting the linear model to my training data, and theta superscript 2 means the parameters I get by fitting the quadratic function to my training data, and so on. By fitting a cubic model I get parameters theta superscript 3, and so on up to, say, theta superscript 10. One thing we could do is then take these parameters and look at the test set error. So I can compute, on my test set, J test of theta 1, J test of theta 2, J test of theta 3, and so on. A sketch of this procedure is below.
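As a rough illustration of that procedure (which, as we're about to see, has a flaw), here is a minimal sketch in Python using NumPy. The synthetic data, the helper name squared_error, and the exact split are all illustrative assumptions, not part of the lecture; np.polyfit simply plays the role of "minimize the training error" for each candidate degree d.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic 1-D data set, purely for illustration.
x = rng.uniform(-1, 1, size=100)
y = x**3 - 0.5 * x + 0.1 * rng.standard_normal(100)
x_train, y_train = x[:70], y[:70]
x_test, y_test = x[70:], y[70:]

def squared_error(theta, x, y):
    # Average squared error (1 / 2m) * sum((h_theta(x) - y)^2) of a polynomial hypothesis.
    return np.mean((np.polyval(theta, x) - y) ** 2) / 2.0

degrees = range(1, 11)  # candidate values of d: 1 through 10

# theta^(d): parameters obtained by minimizing the training error for a degree-d polynomial.
thetas = {d: np.polyfit(x_train, y_train, d) for d in degrees}

# J_test(theta^(d)): test set error of each fitted hypothesis.
j_test = {d: squared_error(thetas[d], x_test, y_test) for d in degrees}
print(min(j_test, key=j_test.get))  # degree with the lowest test set error
```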
So I'm going to take each of my hypotheses with the corresponding parameters and just measure the performance on the test set. Now, one thing I could do then, in order to select one of these models, is see which model has the lowest test set error. Let's just say for this example that I ended up choosing the fifth-order polynomial. So this seems reasonable so far. But now let's say I want to take my fitted hypothesis, this fifth-order model, and ask how well it generalizes. One thing I could do is look at how well my fifth-order polynomial hypothesis did on my test set. But the problem is that this will not be a fair estimate of how well my hypothesis generalizes. And the reason is that we've fit this extra parameter d, that is, the degree of polynomial, and we fit that parameter d using the test set. Namely, we chose the value of d that gave us the best possible performance on the test set, and so the performance of my parameter vector theta 5 on the test set is likely to be an overly optimistic estimate of generalization error. Right? Because I have fit this parameter d to my test set, it is no longer fair to evaluate my hypothesis on this test set. I've chosen the degree d of polynomial using the test set, and so my hypothesis is likely to do better on this test set than it would on new examples that it hasn't seen before, and those new examples are what we actually care about.

So, just to reiterate, on the previous slide we saw that if we fit some set of parameters, say theta 0, theta 1, and so on, to some training set, then the performance of the fitted model on the training set is not predictive of how well the hypothesis will generalize to new examples; that's because these parameters were fit to the training set, so they are likely to do well on the training set even if they don't do well on other examples. And in the procedure I've just described on this slide, we've done the same thing; specifically, we fit this parameter d to the test set, and by having fit the parameter to the test set, the performance of the hypothesis on that test set may not be a fair estimate of how well the hypothesis is likely to do on examples we haven't seen before.

To address this problem in a model selection setting, if we want to evaluate a hypothesis, this is usually what we do instead. Given the data set, instead of just splitting it into a train and test set, what we're going to do is split it into three pieces. The first piece is going to be called the training set as usual. The second piece of data I'm going to call the cross-validation set, and I'm going to abbreviate cross-validation as CV. Sometimes it's also called the validation set, instead of the cross-validation set. And the last part I'm going to call my usual test set. A pretty typical ratio in which to split these would be to send 60% of your data to your training set, maybe 20% to your cross-validation set, and 20% to your test set. These numbers can vary a little bit, but this sort of ratio is pretty typical.
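Here is one possible sketch of that kind of split in Python with NumPy. The 60/20/20 proportions follow the lecture; the function name, the shuffling, and the fixed seed are just illustrative choices.

```python
import numpy as np

def train_cv_test_split(X, y, seed=0):
    # Shuffle the m examples, then split them 60% / 20% / 20%
    # into training, cross-validation, and test sets.
    rng = np.random.default_rng(seed)
    m = len(y)
    order = rng.permutation(m)
    n_train = int(0.6 * m)
    n_cv = int(0.2 * m)
    train_idx = order[:n_train]
    cv_idx = order[n_train:n_train + n_cv]
    test_idx = order[n_train + n_cv:]  # remaining ~20% of the examples
    return (X[train_idx], y[train_idx]), (X[cv_idx], y[cv_idx]), (X[test_idx], y[test_idx])
```

With m examples, the number of cross-validation examples here is int(0.2 * m) and the test set gets whatever remains, so every example ends up in exactly one of the three pieces.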
And so our training set will now be only maybe 60% of the data, and our cross-validation set, or validation set, will have some number of examples. I'm going to denote that m subscript cv, so that's the number of cross-validation examples. Following our earlier notational convention, I'm going to use (x(i)cv, y(i)cv) to denote the ith cross-validation example. And finally, we also have a test set over here, with m subscript test being the number of test examples.

So, now that we've defined the training, validation (or cross-validation), and test sets, we can also define the training error, cross-validation error, and test error. Here's my training error, which I'm just writing as J subscript train of theta. This is pretty much the same thing as the J of theta that we've been writing so far; it's just the squared error as measured on your training set. Then J subscript cv is my cross-validation error, which is pretty much what you'd expect: just like the training error, except measured on the cross-validation data set. And here's my test set error, same as before.

So when faced with a model selection problem like this, instead of using the test set to select the model, we're going to use the validation set, or the cross-validation set, to select the model. Concretely, we're going to first take our first model and minimize the cost function, and this will give me some parameter vector theta for the linear model; as before, I'm going to put a superscript 1 just to denote that this is the parameter vector for the linear model. We do the same thing for the quadratic model and get some parameter vector theta 2, get some parameter vector theta 3, and so on down to, say, the tenth-order polynomial. But instead of testing these hypotheses on the test set, I'm going to test them on the cross-validation set: I'm going to measure J subscript cv to see how well each of these hypotheses does on my cross-validation set, and then I'm going to pick the hypothesis with the lowest cross-validation error. So for this example, let's say for the sake of argument that it was my fourth-order polynomial that had the lowest cross-validation error. In that case, I'm going to pick this fourth-order polynomial model. Finally, what this means is that that parameter d - remember, d was the degree of polynomial, d equals 1, d equals 2, and so on up to d equals 10 - we fit that parameter d, setting d equals 4, and we did so using the cross-validation set. So this degree-of-polynomial parameter is no longer fit to the test set, and we've now saved away the test set, and we can use the test set to measure, or to estimate, the generalization error of the model that was selected by this algorithm. So, that was model selection, and how you can take your data and split it into a train, validation, and test set, and use your cross-validation data to select the model and evaluate it on the test set.
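Putting the pieces together, here is a hedged sketch of that selection procedure, reusing the illustrative data x, y, the squared_error helper, and train_cv_test_split from the earlier sketches (all assumed names, not part of the lecture). Only the cross-validation errors are used to pick d; the test set is touched once, at the end, to estimate generalization error.

```python
import numpy as np

# Reusing x, y, squared_error, and train_cv_test_split from the sketches above.
(x_train, y_train), (x_cv, y_cv), (x_test, y_test) = train_cv_test_split(x, y)

degrees = range(1, 11)

# Fit each candidate degree on the training set only.
thetas = {d: np.polyfit(x_train, y_train, d) for d in degrees}

# J_cv(theta^(d)): cross-validation error of each fitted hypothesis.
j_cv = {d: squared_error(thetas[d], x_cv, y_cv) for d in degrees}

# d is fit using the cross-validation set, not the test set.
best_d = min(j_cv, key=j_cv.get)

# The test set was saved away until now, so this is a fair estimate of generalization error.
j_test_estimate = squared_error(thetas[best_d], x_test, y_test)
```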
One final note: I should say that in machine learning practice today, there are many people who will do the thing I talked about earlier and said isn't such a good idea: selecting your model using the test set and then using that same test set to report the error, that is, selecting your degree of polynomial on the test set and then reporting the error on the test set as though that were a good estimate of generalization error. Unfortunately, many people do that, and if you have a massive, massive test set it's maybe not a terrible thing to do, but most practitioners of machine learning tend to advise against it, and it's considered better practice to have separate training, validation, and test sets. I'll just warn you that sometimes people do use the same data for the purpose of the validation set and for the purpose of the test set, so they only have a training set and a test set, and that's not really good practice. You will see some people do it, but if possible, I would recommend against doing that yourself.