You've seen how regularization can help prevent overfitting, but how does it affect the bias and variance of a learning algorithm? In this video, I'd like to go deeper into the issue of bias and variance, and talk about how it interacts with, and is affected by, the regularization of your learning algorithm. Suppose we're fitting a linear regression model with a very high order polynomial, like that shown here, but to prevent overfitting, we're going to use regularization, like that shown here, so we have this regularization term to try to keep the values of the parameters small. And as usual, the regularization sum starts from j equals 1 rather than j equals 0, so theta 0 is not penalized.
Let's consider three cases. The first is the case of a very large value of the regularization parameter lambda, such as if lambda were equal to 10,000, some huge value. In this case, all of these parameters, theta 1, theta 2, theta 3 and so on, will be heavily penalized, and so we end up with most of these parameter values being close to 0, and the hypothesis will be roughly h of x just equal, or approximately equal, to theta 0. So we end up with a hypothesis that more or less looks like that: more or less a flat, constant straight line. And so this hypothesis has high bias and it badly underfits this data set; the horizontal straight line is just not a very good model for this data set. At the other extreme is if we have a very small value of lambda, such as if lambda were equal to 0.
In that case, given that we're fitting a high order polynomial, basically without regularization or with very minimal regularization, we end up with our usual high variance, overfitting setting, because if lambda is equal to zero, we are just fitting without regularization, and so the hypothesis overfits. It is only if we have some intermediate value of lambda, neither too large nor too small, that we end up with parameters theta that give us a reasonable fit to this data.
So how can we automatically choose a good value for the regularization parameter lambda? Just to reiterate, here is our model and here is our learning algorithm's objective. For the setting where we're using regularization, let me define J train of theta to be something different: the optimization objective, but without the regularization term. Previously, in an earlier video when we were not using regularization, I defined J train of theta to be the same as J of theta, the cost function. But when we are using regularization, with this extra lambda term, we're going to define J train, my training set error, to be just my sum of squared errors on the training set, or my average squared error on the training set, without taking into account that regularization term. And similarly, I'm also going to define the cross-validation set error and the test set error, as before, to be the average sum of squared errors on the cross-validation and the test sets. So just to summarize, my definitions of J train, J cv and J test are just the average squared error, or one half of the average squared error, on my training, validation and test sets, without the extra regularization term. So, this is how we can automatically choose the regularization parameter lambda. What I usually do is maybe have some range of values of lambda I want to try. So I might be considering not using regularization, or here are a few values I might try.
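Written out, the definitions just summarized look roughly as follows. This is a sketch under assumptions: the set sizes m, m_cv and m_test and the lambda over 2m scaling of the regularization term follow the usual convention and are not spelled out in the transcript; the regularization sum runs over every parameter except theta 0.

```latex
\begin{align*}
J(\theta) &= \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2
             + \frac{\lambda}{2m}\sum_{j \ge 1}\theta_j^2
             && \text{(what the learning algorithm minimizes)} \\
J_{\text{train}}(\theta) &= \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2 \\
J_{\text{cv}}(\theta) &= \frac{1}{2m_{\text{cv}}}\sum_{i=1}^{m_{\text{cv}}}\bigl(h_\theta(x_{\text{cv}}^{(i)}) - y_{\text{cv}}^{(i)}\bigr)^2 \\
J_{\text{test}}(\theta) &= \frac{1}{2m_{\text{test}}}\sum_{i=1}^{m_{\text{test}}}\bigl(h_\theta(x_{\text{test}}^{(i)}) - y_{\text{test}}^{(i)}\bigr)^2
\end{align*}
```

Only J of theta contains lambda; the three error measures do not, so they stay comparable across different values of lambda.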
I might be considering lambda equals 0.01, 0.02, 0.04 and so on. And you know, I usually step these up in multiples of two until some maybe larger value. If you do this in multiples of two, you actually end up with 10.24; it's not 10 exactly, but this is close enough, and the extra decimal places won't affect your result that much. So this gives me maybe 12 different models that I'm trying to select amongst, corresponding to 12 different values of the regularization parameter lambda. And of course, you can also go to values less than 0.01 or values larger than 10, but I've just truncated it here for convenience. Given each of these 12 models, what we can do is the following: we take this first model with lambda equals 0 and minimize my cost function J of theta, and this gives me some parameter vector theta. Similar to the earlier video, let me just denote this as theta superscript 1. Then I can take my second model, with lambda set to 0.01, and minimize my cost function, now using lambda equals 0.01 of course, to get some different parameter vector theta; let me denote that theta 2. For my third model I end up with theta 3, and so on, until for my final model, with lambda set to 10, or 10.24, I end up with theta 12. Next I can take all of these hypotheses, all of these parameters, and use my cross-validation set to evaluate them. So I can look at my first model, my second model, fit with these different values of the regularization parameter, and evaluate them on my cross-validation set; basically, measure the average squared error of each of these parameter vectors theta on my cross-validation set. And I would then pick whichever one of these 12 models gives me the lowest error on the cross-validation set. And let's say, for the sake of this example, that I end up picking theta 5, the fifth model, because that has the lowest cross-validation error.
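The selection procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: it assumes a straight-line model h(x) = theta0 + theta1 * x instead of the high order polynomial (so the regularized fit has a simple closed form), it assumes the regularization term is scaled as lambda over 2m, and the function names and the synthetic data are made up for the example.

```python
import random


def fit_line(xs, ys, lam):
    """Regularized least squares for h(x) = theta0 + theta1 * x.

    Minimizes (1/2m) * sum((h(x) - y)^2) + (lam/2m) * theta1^2 in closed form.
    theta0 is not penalized (assumed convention).
    """
    m = len(xs)
    x_mean, y_mean = sum(xs) / m, sum(ys) / m
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    theta1 = sxy / (sxx + lam)
    return y_mean - theta1 * x_mean, theta1


def squared_error(theta, xs, ys):
    """One half of the average squared error, with no regularization term."""
    theta0, theta1 = theta
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))


def select_lambda(lambdas, train, cv, fit=fit_line, error=squared_error):
    """Fit one theta per lambda on the training set, score each on the CV set."""
    thetas = [fit(*train, lam) for lam in lambdas]
    cv_errors = [error(theta, *cv) for theta in thetas]
    best = min(range(len(lambdas)), key=cv_errors.__getitem__)
    return best, thetas, cv_errors


# No regularization, then 0.01 doubled repeatedly up to 10.24: 12 candidates.
LAMBDAS = [0.0] + [0.01 * 2 ** k for k in range(11)]


def make_data(n, rng):
    """Synthetic noisy line, purely for illustration."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0, 0.5) for x in xs]
    return xs, ys


rng = random.Random(0)
train, cv = make_data(15, rng), make_data(15, rng)
best, thetas, cv_errors = select_lambda(LAMBDAS, train, cv)
print(f"chosen lambda = {LAMBDAS[best]:g}, J_cv = {cv_errors[best]:.4f}")
```

Passing `fit` and `error` as arguments keeps the loop independent of the model, so the same selection step would apply to a polynomial model with its own fitting routine.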
Having done that, finally, what I would do if I want to report a test set error is to take the parameter theta 5 that I've selected and look at how well it does on my test set. Once again, here it is as if we've fit this parameter theta to my cross-validation set, which is why I am setting aside a separate test set that I am going to use to get a better estimate of how well my parameter vector theta will generalize to previously unseen examples. So that's model selection applied to selecting the regularization parameter lambda.
The last thing I'd like to do in this video is get a better understanding of how cross-validation and training error vary as we vary the regularization parameter lambda. Just as a reminder, that was our original cost function J of theta, but for this purpose we're going to define training error without using the regularization term, and cross-validation error without using the regularization term. What I'd like to do is plot this J train and plot this J cv, meaning just how well does my hypothesis do on the training set and how well does my hypothesis do on the cross-validation set, as I vary my regularization parameter lambda. As we saw earlier, if lambda is small, then we're not using much regularization and we run a larger risk of overfitting. Whereas if lambda is large, that is, if we were on the right part of this horizontal axis, then we run a higher risk of having a bias problem. So if you plot J train and J cv, what you find is that for small values of lambda you can fit the training set relatively well because you're not regularizing. For small values of lambda, the regularization term basically goes away and you're just minimizing pretty much your squared error. So when lambda is small, you end up with a small value for J train, whereas if lambda is large, then you have a high bias problem and you might not fit your training set so well.
So you end up with a value up there. So J train of theta will tend to increase when lambda increases, because a large value of lambda corresponds to high bias, where you might not even fit your training set well, whereas a small value of lambda corresponds to being able to freely fit, let's say, a very high degree polynomial to your data. As for the cross-validation error, we end up with a figure like this. Over here on the right, if we have a large value of lambda, we may end up underfitting. So this is the bias regime, and the cross-validation error will be high. Let me just label that: that's J cv of theta. With high bias we won't be fitting well, so we won't be doing well on the cross-validation set. Whereas here on the left, this is the high-variance regime, where if we have too small a value of lambda, then we may be overfitting the data, and by overfitting the data, the cross-validation error will also be high. And so this is what the cross-validation error and the training error may look like as we vary the regularization parameter lambda. Once again, it will often be some intermediate value of lambda that is just right, or that works best in terms of having a small cross-validation error or a small test set error. Of course, the curves I've drawn here are somewhat cartoonish and somewhat idealized, so on a real data set the curves you get may end up looking a little bit more messy and a little bit more noisy than this. But for some data sets you will really see these sorts of trends, and by looking at the plot of the hold-out cross-validation error, you can either manually or automatically try to select a point that minimizes the cross-validation error, and select the value of lambda corresponding to low cross-validation error.
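The two curves described above can be tabulated with a short script. The following is a sketch under stated assumptions rather than a reference implementation: it fits a degree 8 polynomial to synthetic data, minimizes the regularized cost by solving the normal equations with a small hand-written Gaussian elimination (in practice a numerical library would do this), and assumes the lambda over 2m scaling with theta 0 left unpenalized. All names are hypothetical.

```python
import math
import random


def poly_features(xs, degree):
    """Rows [1, x, x^2, ..., x^degree]; column 0 is the intercept feature."""
    return [[x ** j for j in range(degree + 1)] for x in xs]


def solve(a, b):
    """Solve the linear system a * t = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    t = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * t[c] for c in range(i + 1, n))
        t[i] = (aug[i][n] - tail) / aug[i][i]
    return t


def fit(rows, ys, lam):
    """Minimize the regularized cost via (X^T X + lam * L) theta = X^T y,
    where L is the identity with a 0 in the intercept position."""
    n = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) + (lam if i == j and i > 0 else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(n)]
    return solve(a, b)


def error(theta, rows, ys):
    """One half of the average squared error, with no regularization term."""
    preds = (sum(t * v for t, v in zip(theta, r)) for r in rows)
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / (2 * len(ys))


def sweep(lambdas, train, cv):
    """Return (J_train, J_cv) lists, one entry per lambda."""
    thetas = [fit(*train, lam) for lam in lambdas]
    return [error(t, *train) for t in thetas], [error(t, *cv) for t in thetas]


def make_data(n, degree, rng):
    """Synthetic noisy curve, purely for illustration."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [math.sin(3 * x) + rng.gauss(0, 0.3) for x in xs]
    return poly_features(xs, degree), ys


LAMBDAS = [0.0] + [0.01 * 2 ** k for k in range(11)]
rng = random.Random(0)
train, cv = make_data(15, 8, rng), make_data(15, 8, rng)
j_train, j_cv = sweep(LAMBDAS, train, cv)
for lam, jt, jc in zip(LAMBDAS, j_train, j_cv):
    print(f"lambda={lam:<6g} J_train={jt:.4f} J_cv={jc:.4f}")
best = min(range(len(LAMBDAS)), key=j_cv.__getitem__)
print("lambda with lowest J_cv:", LAMBDAS[best])
```

The printed columns are the points of the two curves. J_train should not decrease as lambda grows, while J_cv has no such guarantee and may look noisy on a small data set like this one.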
When I'm trying to pick the regularization parameter lambda for a learning algorithm, often I find that plotting a figure like the one shown here helps me understand better what's going on and helps me verify that I am indeed picking a good value for the regularization parameter lambda. So hopefully that gives you more insight into regularization and its effects on the bias and variance of a learning algorithm. By now you've seen bias and variance from a lot of different perspectives. And what I'd like to do in the next video is take a lot of the insights that we've gone through and build on them to put together a diagnostic that's called learning curves, which is a tool that I often use to try to diagnose if a learning algorithm may be suffering from a bias problem or a variance problem or a little bit of both.