For linear regression, we had previously worked out two learning algorithms, one based on gradient descent and one based on the normal equation. In this video we will take those two algorithms and generalize them to the case of regularized linear regression.

Here's the optimization objective that we came up with last time for regularized linear regression. The first part is our usual objective for linear regression, and we now have this additional regularization term, where lambda is our regularization parameter, and we'd like to find parameters theta that minimize this regularized cost function J of theta.

Previously, we were using gradient descent for the original cost function without the regularization term, and we had the following algorithm for regular linear regression, without regularization: we repeatedly update the parameters theta j as follows, for j equals 0, 1, 2, up through n. Let me take this and just write the case for theta zero separately from the updates for the parameters theta 1, theta 2, theta 3, and so on up to theta n. I haven't changed anything yet, right? This is just writing the update for theta zero separately from the updates for theta 1 through theta n. And the reason I want to do this is that you may remember that for regularized linear regression, we penalize the parameters theta 1, theta 2, and so on up to theta n, but we don't penalize theta zero. So when we modify this algorithm for regularized linear regression, we're going to end up treating theta zero slightly differently.

Concretely, if we want to take this algorithm and modify it to use the regularized objective, all we need to do is take this term at the bottom and modify it as follows: we add lambda over m, times theta j. If you implement this, then you have gradient descent for minimizing the regularized cost function J of theta. I'm not going to do the calculus to prove it, but if you look at the term written in square brackets, it's possible to prove that that term is the partial derivative with respect to theta j of J of theta, using the new definition of J of theta with the regularization term. And similarly, the term up on top, which I'm highlighting in this box, is still the partial derivative with respect to theta zero of J of theta.

If you look at the update for theta j, it's possible to show something pretty interesting. Concretely, theta j gets updated as theta j, minus alpha times this other term that depends on theta j. So if you group together all the terms that depend on theta j, we can show that this update can be written equivalently as follows: theta j here is theta j times 1, and this term is lambda over m; there's also an alpha, so you end up with alpha times lambda over m, multiplying theta j. And this term, one minus alpha times lambda over m, has a pretty interesting effect. Concretely, it's going to be a number that's usually a little bit less than 1, because alpha times lambda over m is positive, and if your learning rate is small and m is large, it's usually pretty small.

So think of one minus alpha lambda over m as a number like 0.99, let's say. Then the effect of our update is that theta j gets replaced by theta j times 0.99, which has the effect of shrinking theta j a little bit towards 0; more formally, the squared norm of theta j becomes smaller. And the second term is exactly the same as the original gradient descent update that we had before we added all this regularization stuff. So hopefully this update makes sense: when we're using regularized linear regression, on every iteration we multiply theta j by a number that is a little bit less than one, shrinking the parameter a little bit, and then we perform a similar update as before. Of course, that's just the intuition behind what this particular update is doing. Mathematically, what it's doing is exactly gradient descent on the cost function J of theta that we defined on the previous slide, the one that uses the regularization term.
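To make this concrete, here's a minimal Octave sketch of that regularized update. The data, learning rate, regularization parameter, and iteration count below are all made up purely for illustration; and since Octave indexes from 1, theta(1) plays the role of theta zero, the one parameter we don't penalize.

    X = [ones(5, 1), (1:5)'];   % made-up design matrix: a column of ones plus one feature
    y = [2; 4; 6; 8; 10];       % made-up labels
    theta = zeros(2, 1);        % initial parameters [theta_0; theta_1]
    alpha = 0.05;               % learning rate (illustrative value)
    lambda = 1;                 % regularization parameter (illustrative value)
    m = length(y);
    for iter = 1:1000
      h = X * theta;                              % hypothesis on all m examples
      grad = (1/m) * (X' * (h - y));              % the usual partial derivatives
      grad(2:end) += (lambda/m) * theta(2:end);   % add (lambda/m)*theta_j for j >= 1 only
      theta -= alpha * grad;                      % same effect as multiplying theta_j by
    end                                           %   (1 - alpha*lambda/m), then the usual step
    disp(theta)

Notice that grad(2:end) leaves theta(1) out of the extra shrinkage, matching the separate theta-zero update above.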
Gradient descent was just one of our two algorithms for fitting a linear regression model. The second algorithm was the one based on the normal equation, where what we did was create the design matrix X, in which each row corresponds to a separate training example, and a vector y, an m-dimensional vector containing the labels from the training set. So whereas X is an m by n plus 1 dimensional matrix, y is an m-dimensional vector. In order to minimize the cost function J, we found that one way to do so is to set theta equal to X transpose X, inverse, times X transpose y. I am leaving room here to fill in stuff, of course. And what this value of theta does is minimize the cost function J of theta when we were not using regularization.

Now that we are using regularization, if you were to derive what the minimum is, the way you derive it is to take partial derivatives with respect to each parameter, set them to zero, and then do a bunch of math; you can then show that there is a formula like this that minimizes the cost function. Concretely, if you are using regularization, then this formula changes as follows: inside the parentheses, you end up with a matrix like this: zero, then one, one, one, and so on, down to the bottom. So this thing over here is a matrix whose upper leftmost entry is zero, with ones down the rest of the diagonal and zeros everywhere else. I am drawing this a little bit sloppily, but as a concrete example, if n equals 2, then this matrix is going to be a three by three matrix; more generally, it is an n plus one by n plus one dimensional matrix. So if n equals two, that matrix becomes something that looks like this: zero, then ones down the rest of the diagonal, and zeros everywhere else.

Once again, I'm not going to go through this derivation, which is frankly somewhat long and involved, but it is possible to prove that if you are using the new definition of J of theta, with the regularization objective, then this new formula for theta is the one that gives you the global minimum of J of theta.
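Here too, a minimal Octave sketch may help. The data and lambda are made up as before, and L below is the zero-and-ones matrix just described:

    X = [ones(5, 1), (1:5)'];   % made-up design matrix, as in the earlier sketch
    y = [2; 4; 6; 8; 10];       % made-up labels
    lambda = 1;                 % regularization parameter (illustrative value)
    n = size(X, 2) - 1;         % number of features
    L = eye(n + 1);
    L(1, 1) = 0;                % zero the upper-left entry so theta_0 is not penalized
    theta = (X' * X + lambda * L) \ (X' * y);
    disp(theta)

Using the backslash operator rather than an explicit inverse is the usual Octave idiom for solving a linear system, but it computes the same theta as the inverse formula on the slide.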
So finally, I want to quickly describe the issue of non-invertibility. This is relatively advanced material, so you should consider it optional; feel free to skip it, and if you listen to it and it doesn't quite make sense, don't worry about that either. Earlier, when I talked about the normal equation method, we also had an optional video on the non-invertibility issue, so this is another optional part, sort of an add-on to that earlier optional video.

Now consider a setting where m, the number of examples, is less than or equal to n, the number of features. If you have fewer examples than features, then this matrix X transpose X will be non-invertible, or singular; another term for this is that the matrix will be degenerate. If you implement this in Octave anyway and use the pinv function to take the pseudo-inverse, it will kind of do the right thing, though it's not clear that it will give you a very good hypothesis, even though numerically the Octave pinv function will give you a result that kind of makes sense. But if you were instead taking the regular inverse, which in Octave is denoted by the function inv, then in this setting you would find that X transpose X is singular, non-invertible; and if you're doing this in a different programming language and use some linear algebra library to try to take the inverse of this matrix, it just might not work, because the matrix is non-invertible, or singular.

Fortunately, regularization also takes care of this for us. Concretely, so long as the regularization parameter lambda is strictly greater than zero, it is actually possible to prove that this matrix, X transpose X plus lambda times this funny matrix here, will not be singular, that is, it will be invertible. So using regularization also takes care of any non-invertibility issues of the X transpose X matrix as well.

So, you now know how to implement regularized linear regression. Using this, you'll be able to avoid overfitting, even if you have lots of features and a relatively small training set, and this should let you get linear regression to work much better for many problems. In the next video, we'll take this regularization idea and apply it to logistic regression, so that you'll be able to get logistic regression to avoid overfitting and perform much better as well.
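As a quick footnote to that non-invertibility point, here is a small Octave check on a made-up case with fewer examples than parameters (m equals 2, n equals 3); the specific numbers are purely illustrative:

    X = [1 1 2 3;
         1 4 5 6];              % m = 2 examples: a column of ones plus three features
    lambda = 1;
    L = eye(4);
    L(1, 1) = 0;                % the same zero-and-ones matrix as before
    rank(X' * X)                % prints ans = 2: the 4-by-4 matrix X'X is singular
    rank(X' * X + lambda * L)   % prints ans = 4: with lambda > 0 it becomes invertible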