[MUSIC] We've now seen how regularization can play a role in logistic regression to find much better fits of the data and better assessments of probability. Let's finally talk about how we can learn the coefficients from data using gradient ascent. It's going to be a very, very tiny change on what we did for learning the coefficients in logistic regression. So with a tiny change of code, we now address, alleviate, all those overfitting problems that you had before.

So again, same setting as before: training data, features, same model. Now we have L2-regularized logistic regression, with the regularized log-likelihood as the quality metric, and we're going to talk about the ML algorithm that optimizes it to get w hat. We'll be using the same kind of gradient ascent algorithm that we used before: we start from some point and take these little steps of size eta until we get to our solution w hat. It's the same kind of approach, we take the old coefficients, add eta times the gradient, and get the new coefficients w(t+1). So the only thing we have to ask ourselves is: what is the gradient equal to, now that we've added this extra regularization term? We somehow need the gradient of the regularized log-likelihood. Let's see what that looks like.

We've seen that our total quality is the log-likelihood of the data, which is a measure of fit, minus lambda times our regularization penalty, which is the L2 norm squared. So what is the derivative of this thing? This is what we need in order to walk in that hill-climbing direction. The derivative of a sum is the sum of the derivatives, so the total derivative is the derivative of the first term, the derivative of the log-likelihood, which, thankfully, we've seen in the previous module, minus lambda times the derivative of the quadratic term. The derivative of the quadratic term we already covered in the regression course, but we're going to do a quick review here. As you can see, it's just a small change to your code from before: we just have to add this lambda times the derivative of the quadratic term.

As a review, the derivative of the log-likelihood is the sum over my data points of the difference between the indicator of whether it's a positive example and the probability of it being positive, weighted by the value of the feature. We talked about this piece last module and interpreted it in quite a bit of detail, so I'm not going to go over it again; we're going to focus on the second part, which is the derivative of the L2 penalty. In other words, what's the partial derivative with respect to some parameter wj of w0 squared plus w1 squared plus w2 squared, plus dot, dot, dot, plus wj squared, plus dot, dot, dot, plus wD squared? Now if you look at all of these terms, w0 squared, w1 squared, and so on, they don't play any role in the derivative. The only thing that plays a role is wj squared. And what's the derivative of wj squared? It's just 2wj. So that's all that's going to change in our code, it's just 2wj. In fact, our total derivative is the same derivative that we've implemented in the past, minus 2 lambda times wj: 2 times the regularization penalty, the parameter lambda, times the value of that coefficient.
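As a reference, here is a sketch of the pieces just described, written out in standard notation; the feature notation h_j(x_i) and the indicator 1[y_i = +1] follow the previous module.

    % L2-regularized quality metric: log-likelihood minus lambda times the L2 penalty
    \ell(\mathbf{w}) - \lambda \lVert \mathbf{w} \rVert_2^2

    % Derivative of the log-likelihood (from the previous module)
    \frac{\partial \ell(\mathbf{w})}{\partial w_j}
      = \sum_{i=1}^{N} h_j(\mathbf{x}_i)\left( \mathbf{1}[y_i = +1] - P(y = +1 \mid \mathbf{x}_i, \mathbf{w}) \right)

    % Derivative of the L2 penalty: only the w_j^2 term contributes
    \frac{\partial}{\partial w_j}\left( w_0^2 + w_1^2 + \cdots + w_j^2 + \cdots + w_D^2 \right) = 2 w_j

    % Total derivative used in the gradient ascent update
    \frac{\partial \ell(\mathbf{w})}{\partial w_j} - 2 \lambda w_j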
So let's interpret what this extra term does for us. What does the minus 2 lambda wj do to the derivative? If wj is positive, this minus 2 lambda wj is a negative term. It makes a negative contribution to the derivative, which means it decreases wj, because you're adding a negative term to it. It was positive, and you're decreasing it, so wj becomes closer to 0. So if the weight is positive, you add a negative number and it becomes less positive, closer to 0. In fact, if lambda is bigger, that term becomes more negative, and wj goes to 0 faster. And if wj is very positive, the decrement is also larger, so again it goes towards 0 even faster. Now if wj is negative, then -2 lambda wj is going to be greater than 0, because lambda is also greater than 0. And what impact does that have? You're adding something positive, so you're increasing wj, which implies that wj becomes, again, closer to 0. It was negative, and you add a positive number to it, so it gets a little closer to 0. So this is extremely intuitive: the regularization takes positive coefficients and decreases them a little bit, and takes negative coefficients and increases them a little bit. It tries to push coefficients to 0. That's the effect it has on the gradient, exactly what you'd expect.

Finally, this is exactly the code that we described in the last module to learn the coefficients of a logistic regression model. You start with some coefficients equal to 0, or some other randomly initialized or smartly initialized parameters. Then, for each iteration, you go coefficient by coefficient and compute the partial derivative, which is this really long term here: the sum over data points of the feature value times the difference between the indicator of whether it's a positive data point and the predicted probability of it being positive; we call that partial j. And you have the same update: wj(t+1) is wj(t) plus the step size times the partial derivative, just as before, which is the derivative of the likelihood function with respect to wj. And there's only one little thing to change in your code. You have this little term here, which is our only change. In other words, take all the code you had before, add -2 lambda wj in the computation of the derivative, and now you have a solver for L2-regularized logistic regression. And this is going to help you a tremendous amount in practice. [MUSIC]
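For concreteness, here is a minimal sketch of that one-line change in code, assuming NumPy arrays for the features and labels; the function and parameter names (predict_probability, l2_penalty, step_size, max_iterations) are illustrative, not from the lecture.

import numpy as np

def predict_probability(features, coefficients):
    # P(y = +1 | x, w): sigmoid of the score h(x) . w
    scores = np.dot(features, coefficients)
    return 1.0 / (1.0 + np.exp(-scores))

def l2_logistic_regression(features, labels, step_size, l2_penalty, max_iterations):
    # features: (N, D) array of feature values h_j(x_i)
    # labels:   (N,) array of +1 / -1 class labels
    coefficients = np.zeros(features.shape[1])  # start from all-zero coefficients
    indicator = (labels == +1)                  # 1[y_i = +1]
    for _ in range(max_iterations):
        predictions = predict_probability(features, coefficients)
        errors = indicator - predictions        # 1[y_i = +1] - P(y = +1 | x_i, w)
        for j in range(len(coefficients)):
            # Derivative of the log-likelihood, exactly as in the previous module.
            partial_j = np.dot(errors, features[:, j])
            # The only change for L2 regularization: subtract 2 * lambda * w_j.
            partial_j -= 2.0 * l2_penalty * coefficients[j]
            # Gradient ascent step: w_j(t+1) = w_j(t) + eta * partial_j
            coefficients[j] += step_size * partial_j
    return coefficients

Everything here is the unregularized gradient ascent loop from the last module; the single line subtracting 2.0 * l2_penalty * coefficients[j] is the entire change.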