[MUSIC] >> [SOUND] This has taken a little while, but we're ready for that final step. This is a very, very optional derivation, you don't have to watch it; it's for those who are super interested in finishing it off and seeing how to compute that gradient. And so this is just the end game here. We took the log likelihood function for a single data point, and showed that it can be written in this particular form. And now we're going to take the derivative of that with respect to parameter wj. That's the last thing that we'll do here. So what's the partial derivative of the log likelihood function with respect to parameter wj? Well, the log likelihood is the sum of two terms, so its derivative is just the sum of the two derivatives. So let's look at the first term. The factor minus, 1 minus the indicator, doesn't depend on wj, so it can be pulled out, and we get minus, 1 minus the indicator that yi is equal to plus 1, times whatever the partial derivative with respect to wj of w transpose h of xi turns out to be. And now we have the second term. For the second term we're going to use blue, so change of colors. It's going to be minus the partial derivative with respect to the parameter wj of log of 1 plus e to the minus w transpose h of xi. Now, this looks a little scary, lots of partial derivatives, lots of logs, but it turns out that the answer is really simple, so I'm going to walk you through it, and then we'll get to that beautiful solution at the end. So let's take the first term: what is the partial derivative with respect to wj of w transpose h of xi? What is that equal to? Well, w transpose h is really a sum over every possible feature of the coefficient times the feature. Most of those terms don't depend on wj; there's only one term that depends on wj, namely wj times hj, so that derivative is just going to be hj of xi. I'm not doing that very explicitly, but you can convince yourselves at home. Now let's look at the second term.
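That first step, that the partial derivative of the score w transpose h of xi with respect to wj is just hj of xi, is easy to check numerically with a finite difference. Here is a minimal sketch; the feature vector h and coefficients w below are made-up values, not anything from the lecture:

```python
import numpy as np

# Score s(w) = w^T h(x_i): the only term involving w_j is w_j * h_j(x_i),
# so d s / d w_j should equal h_j(x_i).
h = np.array([1.0, 2.5, -0.7])   # hypothetical feature vector h(x_i)
w = np.array([0.3, -1.2, 0.8])   # hypothetical coefficients

def score(w):
    return w @ h

eps = 1e-6
for j in range(len(w)):
    w_plus = w.copy()
    w_plus[j] += eps
    numeric = (score(w_plus) - score(w)) / eps  # finite-difference derivative
    assert abs(numeric - h[j]) < 1e-4           # matches h_j(x_i)
```

Each coordinate's finite-difference slope lands on the corresponding feature value, which is the "only one term depends on wj" argument in numerical form.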
So, the second term is where the beauty of the universe comes into play. You're going to end up with something extremely simple, extremely beautiful. The question here is: what is the partial derivative with respect to wj of this crazy function, log of 1 plus e to the minus w transpose h of xi? Now, let me warn you that deriving this takes many steps, and I'm not going to go through all of them. You have to use the fact that the derivative of the log of some function is the derivative of that function divided by the function itself. There's a bunch of math steps that you may not remember, so I'll just write out what the result is. And the result is minus hj of xi. Note that, just like up here, there's an hj of xi. That multiplies a really exciting, beautiful term: e to the power of minus w transpose h of xi, divided by 1 plus e to the minus w transpose h of xi. Please take a moment to enjoy the beauty of the logistic regression model. >> We're going to change colors here because it's just too much fun. Now, what is this term here equal to? It's the probability that y is equal to minus 1 when the input is xi and you're using parameter w. That's how that probability showed up in the derivation in the first place. You went through a lot of math and then you got that probability back. So here's the result if you plug these two terms in. From the first term, the red term here, you get minus, 1 minus the indicator that yi is plus 1, times hj of xi. And from the second term, the blue term here, you get a minus and then there's another minus, so you end up with minus minus, a plus hj of xi, that multiplies the probability that y is equal to minus 1, given xi and w. I'll leave a few steps for you to do at home. One of the core things is to plug in that the probability that y equals plus 1 is 1 minus the probability that y equals minus 1.
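The claimed result for the second term, that the derivative of log of 1 plus e to the minus w transpose h of xi with respect to wj is minus hj of xi times P(y = -1 | xi, w), can also be verified with a finite difference. A small sketch, again with made-up values for h and w:

```python
import numpy as np

h = np.array([1.0, 2.5, -0.7])   # hypothetical feature vector h(x_i)
w = np.array([0.3, -1.2, 0.8])   # hypothetical coefficients

def term(w):
    # log(1 + e^{-w^T h(x_i)}), the second term of the log likelihood
    return np.log1p(np.exp(-(w @ h)))

s = w @ h
# P(y = -1 | x_i, w) = e^{-s} / (1 + e^{-s}) for logistic regression
p_minus = np.exp(-s) / (1.0 + np.exp(-s))

eps = 1e-6
for j in range(len(w)):
    w_plus = w.copy()
    w_plus[j] += eps
    numeric = (term(w_plus) - term(w)) / eps
    analytic = -h[j] * p_minus   # the result stated in the lecture
    assert abs(numeric - analytic) < 1e-4
```

The numerical slope agrees with minus hj of xi times the probability of the negative class, which is exactly the "that probability showed up in the derivation" observation.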
After just moving a bunch of stuff around, you'll end up with exactly what we were hoping for, which is that the derivative of the log likelihood with respect to parameter wj is simply hj of xi times the difference between the indicator that yi is equal to plus 1, so that it's a positive example, minus the probability that y is equal to plus 1, given the input xi and the parameters w. And this is exactly what I showed you earlier as the answer. So that is really, really cool. Now we're almost done. So for one data point, the contribution to the derivative is hj times the difference between the indicator and the predicted probability. Now if you want to compute the derivative of the likelihood function for all of the data points with respect to wj, all you have to do is sum over data points. So you sum over i equals 1 through N of the term above: hj of xi, times the difference between the indicator that yi equals plus 1 and the probability that yi equals plus 1, given xi and w. >> Very cool. For those brave enough to have gone through this advanced material, I really commend you. This is exciting stuff. It is how that derivative gets computed, and it's also how many other derivatives in machine learning are computed, so this is just one example of that. I hope you enjoyed it. Now we can go back to the main thread, thanks. >> [MUSIC]
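The whole derivation can be confirmed end to end: implement the per-point log likelihood in the form used above, sum over data points, and check that the gradient formula, hj of xi times the indicator minus the predicted probability, matches a finite-difference derivative. A sketch on small random made-up data (the data, weights, and sizes below are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 3
H = rng.normal(size=(N, D))            # row i is the feature vector h(x_i)
y = rng.choice([-1, 1], size=N)        # labels in {-1, +1}
w = rng.normal(size=D)                 # hypothetical coefficients

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # P(y = +1 | x, w) = sigmoid(w^T h(x))

def log_likelihood(w):
    scores = H @ w
    ind = (y == 1).astype(float)       # indicator that y_i = +1
    # ll_i = -(1 - 1[y_i=+1]) * w^T h(x_i) - log(1 + e^{-w^T h(x_i)})
    return np.sum(-(1.0 - ind) * scores - np.log1p(np.exp(-scores)))

# Gradient from the derivation:
# d ll / d w_j = sum_i h_j(x_i) * (1[y_i = +1] - P(y = +1 | x_i, w))
ind = (y == 1).astype(float)
grad = H.T @ (ind - sigmoid(H @ w))

eps = 1e-6
for j in range(D):
    w_plus = w.copy()
    w_plus[j] += eps
    numeric = (log_likelihood(w_plus) - log_likelihood(w)) / eps
    assert abs(numeric - grad[j]) < 1e-3
```

The same vectorized expression, features transposed times indicator minus probability, is what a gradient ascent implementation of logistic regression would compute at each step.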