We're now down to our final practical issue with stochastic gradient, which again is going to be an optional section. The question is, how do we introduce regularization, and what impact does it have on the updates? This is going to be pretty significant if you want to implement it yourself in a general way. Again, it's an optional section because it's pretty detailed. So let's remind ourselves of the regularized likelihood. We have some total quality metric defined on the fit to our data, which is the log likelihood, and some measure of the complexity of the parameters, which in our case is the squared L2 norm of the parameters, ||w||^2. We want to compute the gradient of this regularized likelihood to make progress and avoid overfitting. So the total derivative is the derivative of the first term, the quality term, which is the sum over the data points of the contribution of each data point, the thing that got really expensive. And the contribution of the second term, once we introduce that parameter lambda to trade off the two things (we've talked about this a lot), is -2 lambda wj. We derived this update in an earlier module on regularized logistic regression. Now, this is how we do straight-up gradient updates with regularization. The question is, what do we do if we want to do stochastic gradient with regularization? If you remember stochastic gradient, we said we take the contribution of a single data point, and if we add up those contributions we get exactly the gradient. That's why it worked: the sum of the stochastic gradients equals the full gradient. So, to mimic that, we need to think about how to set up the regularization term such that the sum of the stochastic gradients is also equal to the gradient. And one way to do this is to take the regularization term and divide it by N. In practice, this is what you want to do.
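To make the idea concrete, here is a minimal sketch of a single stochastic gradient ascent step for regularized logistic regression, where the -2 lambda wj regularization contribution is divided by N so that summing the stochastic gradients over all N points recovers the full gradient. The function name and signature are illustrative, not from the lecture's code:

```python
import numpy as np

def sgd_step(w, x_i, y_i, n, lam, eta):
    """One stochastic gradient ascent step on the regularized log
    likelihood for logistic regression (illustrative sketch).

    y_i is +1 or -1; n is the TOTAL number of data points, so the
    regularization term -2*lam*w is split as -(2/n)*lam*w per point.
    """
    # P(y = +1 | x_i, w) under the logistic model
    p = 1.0 / (1.0 + np.exp(-x_i @ w))
    # Contribution of this single data point to the gradient:
    # x_i * (indicator[y_i = +1] - P(y = +1 | x_i, w))
    indicator = 1.0 if y_i == 1 else 0.0
    grad_point = x_i * (indicator - p)
    # Regularization contribution, divided by n so that adding it up
    # over all n data points gives back the full -2*lam*w term
    grad = grad_point - (2.0 / n) * lam * w
    return w + eta * grad
```

With w initialized at zero, the predicted probability is 0.5 and the regularization term vanishes, so the first step moves w in the direction of x_i scaled by eta * 0.5.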
You want to say that the total derivative for stochastic gradient is the contribution of a single data point minus 2/N lambda wj. And so if you were to add this up over all N data points, you'd get back the full gradient. So with regularization, the algorithm stays the same, but each data point's contribution is its gradient minus 2/N lambda wj; that's the contribution from the regularization term. If you're using mini-batches of size B, you'd adapt that term: you take the sum of the contributions from each data point in the batch, and then the regularization contribution becomes minus 2B/N lambda wj. So this is how we take care of regularization. Again, it's a very small change, and it's going to behave way better.
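The mini-batch variant above can be sketched the same way: sum the data-point contributions over the batch, then scale the regularization contribution by 2B/N, where B is the batch size and N is the total dataset size, so that one full pass over the N/B batches again recovers the full -2 lambda w term. Names here are illustrative, not from the lecture's code:

```python
import numpy as np

def minibatch_step(w, X_batch, y_batch, n, lam, eta):
    """One mini-batch gradient ascent step (illustrative sketch).

    X_batch has shape (B, d); y_batch holds labels in {+1, -1};
    n is the TOTAL dataset size, not the batch size. The
    regularization contribution is scaled by 2*B/n so that summing
    over all batches in one pass recovers the full -2*lam*w term.
    """
    B = X_batch.shape[0]
    # P(y = +1 | x, w) for every point in the batch, shape (B,)
    p = 1.0 / (1.0 + np.exp(-X_batch @ w))
    ind = (y_batch == 1).astype(float)
    # Sum of the per-point gradient contributions over the batch
    grad_batch = X_batch.T @ (ind - p)
    # Regularization contribution scaled by batch size over n
    grad = grad_batch - (2.0 * B / n) * lam * w
    return w + eta * grad
```

Note that with B = 1 this reduces to the single-point stochastic update, and with B = n it reduces to the full gradient update with the plain -2 lambda w regularization term.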