[MUSIC] We've now seen how regularization can play a role in logistic regression to find much better fits of the data and better assessments of probability. Let's finally talk about how we can learn the coefficients from data using gradient ascent. It's going to be a very, very tiny change on what we did for learning the coefficients in logistic regression. So with a tiny change of code, we now address, alleviate, all those overfitting problems that you had before.

So again, same setting as before: training data, features, same model. Now we have L2-regularized logistic regression, with the regularized log-likelihood as the quality metric, and we're going to talk about the ML algorithm that optimizes it to get w hat. We'll be using the same kind of gradient ascent algorithm that we used before: we start from some point and take these little steps of size eta until we get to our solution w hat. It's the same kind of approach, we take the old coefficients, add eta times the gradient, and get the new coefficients w(t+1). So the only thing we have to ask ourselves is: what is the gradient equal to, now that we've added this extra regularization term? We somehow need the gradient of the regularized log-likelihood. Let's see what that looks like.

We've seen that our total quality is the log-likelihood of the data, which is a measure of fit, minus lambda times our regularization penalty, which is the L2 norm squared. So what is the derivative of this thing? This is what we need in order to walk in that hill-climbing direction. The derivative of a sum is the sum of the derivatives, so the total derivative is the derivative of the first term, the derivative of the log-likelihood, which, thankfully, we've seen in the previous module, minus lambda times the derivative of the quadratic term. The derivative of the quadratic term we already covered in the regression course, but we're going to do a quick review here. As you can see, it's just a small change to your code from before: we just have to add this lambda times the derivative of the quadratic term.

As a review, the derivative of the log-likelihood is the sum over my data points of the difference between the indicator of whether it's a positive example and the probability of it being positive, weighted by the value of the feature. We talked about this piece last module and interpreted it in quite a bit of detail, so I'm not going to go over it again; we're going to focus on the second part, which is the derivative of the L2 penalty. In other words, what's the partial derivative with respect to some parameter wj of w0 squared plus w1 squared plus w2 squared, plus dot, dot, dot, plus wj squared, plus dot, dot, dot, plus wD squared? Now if you look at all of these terms, w0 squared, w1 squared, and so on, they don't play any role in the derivative. The only thing that plays a role is wj squared. And what's the derivative of wj squared? It's just 2wj. So that's all that's going to change in our code, it's just 2wj. In fact, our total derivative is the same derivative that we've implemented in the past, minus 2 lambda times wj: 2 times the regularization penalty, the parameter lambda, times the value of that coefficient.
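As a reference, here is a sketch of the pieces just described, written out in standard notation; the feature notation h_j(x_i) and the indicator 1[y_i = +1] follow the previous module.

    % L2-regularized quality metric: log-likelihood minus lambda times the L2 penalty
    \ell(\mathbf{w}) - \lambda \lVert \mathbf{w} \rVert_2^2

    % Derivative of the log-likelihood (from the previous module)
    \frac{\partial \ell(\mathbf{w})}{\partial w_j}
      = \sum_{i=1}^{N} h_j(\mathbf{x}_i)\left( \mathbf{1}[y_i = +1] - P(y = +1 \mid \mathbf{x}_i, \mathbf{w}) \right)

    % Derivative of the L2 penalty: only the w_j^2 term contributes
    \frac{\partial}{\partial w_j}\left( w_0^2 + w_1^2 + \cdots + w_j^2 + \cdots + w_D^2 \right) = 2 w_j

    % Total derivative used in the gradient ascent update
    \frac{\partial \ell(\mathbf{w})}{\partial w_j} - 2 \lambda w_j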
So let's interpret what this extra term does for us. What does the minus 2 lambda wj do to the derivative? If wj is positive, this minus 2 lambda wj is a negative term. It makes a negative contribution to the derivative, which means it decreases wj, because you're adding a negative term to it. It was positive, and you're decreasing it, so wj becomes closer to 0. So if the weight is positive, you add a negative number and it becomes less positive, closer to 0. In fact, if lambda is bigger, that term becomes more negative, and wj goes to 0 faster. And if wj is very positive, the decrement is also larger, so again it goes towards 0 even faster. Now if wj is negative, then -2 lambda wj is going to be greater than 0, because lambda is also greater than 0. And what impact does that have? You're adding something positive, so you're increasing wj, which implies that wj becomes, again, closer to 0. It was negative, and you add a positive number to it, so it gets a little closer to 0. So this is extremely intuitive: the regularization takes positive coefficients and decreases them a little bit, and takes negative coefficients and increases them a little bit. It tries to push coefficients to 0. That's the effect it has on the gradient, exactly what you'd expect.

Finally, this is exactly the code that we described in the last module to learn the coefficients of a logistic regression model. You start with some coefficients equal to 0, or some other randomly initialized or smartly initialized parameters. Then, for each iteration, you go coefficient by coefficient and compute the partial derivative, which is this really long term here: the sum over data points of the feature value times the difference between the indicator of whether it's a positive data point and the predicted probability of it being positive; we call that partial j. And you have the same update: wj(t+1) is wj(t) plus the step size times the partial derivative, just as before, which is the derivative of the likelihood function with respect to wj. And there's only one little thing to change in your code. You have this little term here, which is our only change. In other words, take all the code you had before, add -2 lambda wj in the computation of the derivative, and now you have a solver for L2-regularized logistic regression. And this is going to help you a tremendous amount in practice. [MUSIC]
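For concreteness, here is a minimal sketch of that one-line change in code, assuming NumPy arrays for the features and labels; the function and parameter names (predict_probability, l2_penalty, step_size, max_iterations) are illustrative, not from the lecture.

import numpy as np

def predict_probability(features, coefficients):
    # P(y = +1 | x, w): sigmoid of the score h(x) . w
    scores = np.dot(features, coefficients)
    return 1.0 / (1.0 + np.exp(-scores))

def l2_logistic_regression(features, labels, step_size, l2_penalty, max_iterations):
    # features: (N, D) array of feature values h_j(x_i)
    # labels:   (N,) array of +1 / -1 class labels
    coefficients = np.zeros(features.shape[1])  # start from all-zero coefficients
    indicator = (labels == +1)                  # 1[y_i = +1]
    for _ in range(max_iterations):
        predictions = predict_probability(features, coefficients)
        errors = indicator - predictions        # 1[y_i = +1] - P(y = +1 | x_i, w)
        for j in range(len(coefficients)):
            # Derivative of the log-likelihood, exactly as in the previous module.
            partial_j = np.dot(errors, features[:, j])
            # The only change for L2 regularization: subtract 2 * lambda * w_j.
            partial_j -= 2.0 * l2_penalty * coefficients[j]
            # Gradient ascent step: w_j(t+1) = w_j(t) + eta * partial_j
            coefficients[j] += step_size * partial_j
    return coefficients

Everything here is the unregularized gradient ascent loop from the last module; the single line subtracting 2.0 * l2_penalty * coefficients[j] is the entire change.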