For logistic regression, we previously talked about two types of optimization algorithms. We talked about how to use gradient descent to optimize the cost function J of theta. And we also talked about advanced optimization methods, ones that require that you provide a way to compute the cost function J of theta and a way to compute its derivatives. In this video, we'll show how you can adapt both of those techniques, gradient descent and the more advanced optimization techniques, so that they work for regularized logistic regression.

So, here's the idea. We saw earlier that logistic regression can also be prone to overfitting if you fit it with very high order polynomial features like this, where g is the sigmoid function. In particular, you end up with a hypothesis whose decision boundary is an overly complex and extremely contorted function that really isn't such a great hypothesis for this training set. More generally, if you have logistic regression with a lot of features, not necessarily polynomial ones, you can end up with overfitting.

This was our cost function for logistic regression. If we want to modify it to use regularization, all we need to do is add to it the following term: plus lambda over 2m times the sum from j equals 1 to n of theta j squared. As usual, the sum starts from j equals 1 rather than from j equals 0. This therefore has the effect of penalizing the parameters theta 1, theta 2, and so on up to theta n for being too large. And if you do this, then even though you're fitting a very high order polynomial with a lot of parameters, so long as you apply regularization and keep the parameters small, you're more likely to get a decision boundary that maybe looks more like this, one that looks more reasonable for separating the positive and the negative examples.
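Written out, the regularized cost described above would look roughly like the following. This assumes the usual cross-entropy cost for logistic regression from earlier in the course, with m training examples, n features, and g the sigmoid function; the slide itself is not reproduced in this transcript, so treat the notation as a reconstruction.

```latex
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h_\theta(x^{(i)})
          + (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big) \Big]
          + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^{2},
\qquad
h_\theta(x) = g(\theta^{T}x) = \frac{1}{1+e^{-\theta^{T}x}}
```

Note that the penalty sum runs over j = 1, ..., n, so theta 0 is not regularized.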
So, when using regularization, even when you have a lot of features, the regularization can help take care of the overfitting problem. How do we actually implement this? Well, for the original gradient descent algorithm, this was the update we had. We repeatedly perform the following update to theta j. This slide looks a lot like the previous one for linear regression. But what I'm going to do is write the update for theta 0 separately. So the first line is the update for theta 0, and the second line is now my update for theta 1 up to theta n, because I'm going to treat theta 0 separately. In order to modify this algorithm to use a regularized cost function, all I need to do, much as we did for linear regression, is modify this second update rule as follows.

Once again, this cosmetically looks identical to what we had for linear regression. But of course it is not the same algorithm as regularized linear regression, because now the hypothesis is defined using this, the sigmoid function. The hypothesis is different, even though the update that I wrote down looks cosmetically the same as what we had earlier, when we were working out gradient descent for regularized linear regression. And just to wrap up this discussion, this term here in the square brackets is, of course, the new partial derivative with respect to theta j of the new cost function J of theta, where J of theta here is the cost function we defined on a previous slide that does use regularization. So, that's gradient descent for regularized logistic regression.

Let's talk about how to get regularized logistic regression to work using the more advanced optimization methods.
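The gradient descent update described above can be sketched in a few lines. This is an illustrative sketch in plain Python rather than Octave; the function names (`sigmoid`, `hypothesis`, `gradient_descent_step`) and the argument names `alpha` (learning rate) and `lam` (lambda) are assumptions made for the example, not names from the lecture. Each row of `X` is assumed to start with the constant feature x_0 = 1.

```python
import math


def sigmoid(z):
    """Sigmoid function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, x):
    """Logistic regression hypothesis h_theta(x) = g(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def gradient_descent_step(theta, X, y, alpha, lam):
    """One simultaneous update of all parameters.

    theta[0] gets the unregularized update; theta[1:] get the extra
    (lam / m) * theta[j] term that comes from the regularization penalty.
    """
    m = len(X)
    # Prediction errors are computed once, using the old theta for every j.
    errors = [hypothesis(theta, X[i]) - y[i] for i in range(m)]
    new_theta = []
    for j in range(len(theta)):
        grad_j = sum(errors[i] * X[i][j] for i in range(m)) / m
        if j >= 1:  # theta_0 is not regularized
            grad_j += (lam / m) * theta[j]
        new_theta.append(theta[j] - alpha * grad_j)
    return new_theta


# Tiny made-up dataset: the first column is x_0 = 1.
X = [[1.0, 1.0], [1.0, -1.0]]
y = [1, 0]
theta = [0.0, 0.0]
for _ in range(100):
    theta = gradient_descent_step(theta, X, y, alpha=0.5, lam=1.0)
print(theta)
```

The loop body is the same shape as the update for regularized linear regression; the only difference is that `hypothesis` applies the sigmoid.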
And just to remind you, for those methods what we needed to do was define a function called costFunction that takes as input the parameter vector theta. Once again, in the equations we've been writing here we used 0-indexed vectors, so we had theta 0 up to theta n. But because Octave indexes vectors starting from 1, theta 0 is written in Octave as theta(1), theta 1 is written in Octave as theta(2), and so on down to theta(n+1). What we needed to do was provide a function called costFunction that we would then pass in to what we saw earlier: we use fminunc with @costFunction, and so on. Recall that fminunc was the "fmin unconstrained" function, and fminunc is what will take the cost function and minimize it for us.

So the two main things that the cost function needed to return were, first, jVal. For that, we need to write code to compute the cost function J of theta. Now, when we're using regularized logistic regression, of course the cost function J of theta changes, and in particular the cost function now needs to include this additional regularization term at the end as well. So when you compute J of theta, be sure to include that term at the end.

The other thing that this cost function needs to return is the gradient. So gradient(1) needs to be set to the partial derivative of J of theta with respect to theta 0, gradient(2) needs to be set to the partial derivative with respect to theta 1, and so on. Once again, the index is off by one because of the indexing from one that Octave uses. Looking at these terms, this term over here, which we actually worked out on a previous slide, is equal to this. It doesn't change, because the derivative for theta 0 doesn't change compared to the version without regularization. The other terms do change. In particular, take the derivative with respect to theta 1. We worked this out on the previous slide as well.
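A cost function of the kind described here, one that returns both the cost and the gradient, might look like the sketch below. It is written in plain Python instead of Octave, so the vectors are 0-indexed: `gradient[0]` here corresponds to gradient(1) in Octave. The names `cost_function`, `j_val`, and `lam` are assumptions chosen to mirror the lecture's jVal and lambda. The `(lam / m) * theta[j]` term for j >= 1 is what you get from differentiating the lambda over 2m penalty.

```python
import math


def sigmoid(z):
    """Sigmoid function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def cost_function(theta, X, y, lam):
    """Return (j_val, gradient) for regularized logistic regression.

    Each row of X is assumed to start with the constant feature x_0 = 1.
    """
    m = len(X)
    h = [sigmoid(sum(t * xi for t, xi in zip(theta, row))) for row in X]

    # Unregularized cross-entropy cost.
    j_val = -sum(
        y[i] * math.log(h[i]) + (1 - y[i]) * math.log(1 - h[i])
        for i in range(m)
    ) / m
    # Regularization term: the sum starts at j = 1, so theta[0] is skipped.
    j_val += (lam / (2 * m)) * sum(t * t for t in theta[1:])

    gradient = []
    for j in range(len(theta)):
        g_j = sum((h[i] - y[i]) * X[i][j] for i in range(m)) / m
        if j >= 1:  # the derivative for theta_0 is unchanged
            g_j += (lam / m) * theta[j]
        gradient.append(g_j)
    return j_val, gradient


# Made-up data for illustration only.
X = [[1.0, 1.0], [1.0, -1.0], [1.0, 2.0]]
y = [1, 0, 1]
j_val, gradient = cost_function([0.3, -0.7], X, y, lam=2.0)
print(j_val, gradient)
```

In Octave, a function with this shape is what would be handed to fminunc; the Python version above only computes the two return values and leaves the choice of optimizer open.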
The derivative with respect to theta 1 is equal to the original term plus lambda over m times theta 1. Just to make sure we parse this correctly, we can add parentheses here, so the summation doesn't extend over the extra term. And similarly, this other term here looks like this, with the additional term that we had on the previous slide, which corresponds to the gradient from the regularization objective. So if you implement this cost function and pass it into fminunc or to one of those advanced optimization techniques, that will minimize the new regularized cost function J of theta. And the parameters you get out will be the ones that correspond to logistic regression with regularization. So, now you know how to implement regularized logistic regression.

When I walk around Silicon Valley, and I live here in Silicon Valley, there are a lot of engineers who are, frankly, making a ton of money for their companies using machine learning algorithms. I know we've only been studying this stuff for a little while. But if you understand linear regression, the advanced optimization algorithms, and regularization, then by now you probably know quite a lot more machine learning than, frankly, many of the Silicon Valley engineers out there having very successful careers, making tons of money for their companies or building products using machine learning algorithms. So, congratulations. You've actually come a long way, and you actually know enough to apply this stuff and get it to work for many problems. So congratulations for that. But of course, there's still a lot more that we want to teach you, and in the next set of videos after this, we'll start to talk about a very powerful class of non-linear classifiers.
So whereas with linear regression and logistic regression you can form polynomial terms, it turns out that there are much more powerful nonlinear classifiers than polynomial regression. And in the next set of videos after this one, I'll start telling you about them, so that you have even more powerful learning algorithms than you have now to apply to different problems.