[MUSIC] >> [SOUND] This has taken a little while, but we're ready for that final step. This is a very, very optional derivation, you don't have to watch it; it's for those who are super interested in finishing it off and seeing how to compute that gradient. And so this is just the end game here. We took the log likelihood function for a single data point, and showed that it can be written in this particular form. And now we're going to take the derivative of that with respect to parameter wj. That's the last thing that we'll do here. So what's the partial derivative of the log likelihood function with respect to parameter wj? Well, the log likelihood is the sum of two terms, so its derivative is just the sum of the two derivatives. So let's look at the first term. The factor minus, 1 minus the indicator, doesn't depend on wj, so it can be pulled out, and we get minus, 1 minus the indicator that yi is equal to plus 1, times whatever the partial derivative with respect to wj of w transpose h of xi turns out to be. And now we have the second term. For the second term we're going to use blue, so change of colors. It's going to be minus the partial derivative with respect to the parameter wj of log of 1 plus e to the minus w transpose h of xi. Now, this looks a little scary, lots of partial derivatives, lots of logs, but it turns out that the answer is really simple, so I'm going to walk you through it, and then we'll get to that beautiful solution at the end. So let's take the first term: what is the partial derivative with respect to wj of w transpose h of xi? What is that equal to? Well, w transpose h is really a sum over every possible feature of the coefficient times the feature. Most of those terms don't depend on wj; there's only one term that depends on wj, namely wj times hj, so that derivative is just going to be hj of xi. I'm not doing that very explicitly, but you can convince yourselves at home. Now let's look at the second term.
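That first step, that the partial derivative of the score w transpose h of xi with respect to wj is just hj of xi, is easy to check numerically with a finite difference. Here is a minimal sketch; the feature vector h and coefficients w below are made-up values, not anything from the lecture:

```python
import numpy as np

# Score s(w) = w^T h(x_i): the only term involving w_j is w_j * h_j(x_i),
# so d s / d w_j should equal h_j(x_i).
h = np.array([1.0, 2.5, -0.7])   # hypothetical feature vector h(x_i)
w = np.array([0.3, -1.2, 0.8])   # hypothetical coefficients

def score(w):
    return w @ h

eps = 1e-6
for j in range(len(w)):
    w_plus = w.copy()
    w_plus[j] += eps
    numeric = (score(w_plus) - score(w)) / eps  # finite-difference derivative
    assert abs(numeric - h[j]) < 1e-4           # matches h_j(x_i)
```

Each coordinate's finite-difference slope lands on the corresponding feature value, which is the "only one term depends on wj" argument in numerical form.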
So, the second term is where the beauty of the universe comes into play. You're going to end up with something extremely simple, extremely beautiful. The question here is: what is the partial derivative with respect to wj of this crazy function, log of 1 plus e to the minus w transpose h of xi? Now, let me warn you that deriving this takes many steps, and I'm not going to go through all of them. You have to use the fact that the derivative of the log of some function is the derivative of that function divided by the function itself. There's a bunch of math steps that you may not remember, so I'll just write out what the result is. And the result is minus hj of xi. Note that, just like up here, there's an hj of xi. That multiplies a really exciting, beautiful term: e to the power of minus w transpose h of xi, divided by 1 plus e to the minus w transpose h of xi. Please take a moment to enjoy the beauty of the logistic regression model. >> We're going to change colors here because it's just too much fun. Now, what is this term here equal to? It's the probability that y is equal to minus 1 when the input is xi and you're using parameter w. That's how that probability showed up in the derivation in the first place. You went through a lot of math and then you got that probability back. So here's the result if you plug these two terms in. From the first term, the red term here, you get minus, 1 minus the indicator that yi is plus 1, times hj of xi. And from the second term, the blue term here, you get a minus and then there's another minus, so you end up with minus minus, a plus hj of xi, that multiplies the probability that y is equal to minus 1, given xi and w. I'll leave a few steps for you to do at home. One of the core things is to plug in that the probability that y equals plus 1 is 1 minus the probability that y equals minus 1.
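The claimed result for the second term, that the derivative of log of 1 plus e to the minus w transpose h of xi with respect to wj is minus hj of xi times P(y = -1 | xi, w), can also be verified with a finite difference. A small sketch, again with made-up values for h and w:

```python
import numpy as np

h = np.array([1.0, 2.5, -0.7])   # hypothetical feature vector h(x_i)
w = np.array([0.3, -1.2, 0.8])   # hypothetical coefficients

def term(w):
    # log(1 + e^{-w^T h(x_i)}), the second term of the log likelihood
    return np.log1p(np.exp(-(w @ h)))

s = w @ h
# P(y = -1 | x_i, w) = e^{-s} / (1 + e^{-s}) for logistic regression
p_minus = np.exp(-s) / (1.0 + np.exp(-s))

eps = 1e-6
for j in range(len(w)):
    w_plus = w.copy()
    w_plus[j] += eps
    numeric = (term(w_plus) - term(w)) / eps
    analytic = -h[j] * p_minus   # the result stated in the lecture
    assert abs(numeric - analytic) < 1e-4
```

The numerical slope agrees with minus hj of xi times the probability of the negative class, which is exactly the "that probability showed up in the derivation" observation.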
After just moving a bunch of stuff around, you'll end up with exactly what we were hoping for, which is that the derivative of the log likelihood with respect to parameter wj is simply hj of xi times the difference between the indicator that yi is equal to plus 1, so that it's a positive example, minus the probability that y is equal to plus 1, given the input xi and the parameters w. And this is exactly what I showed you earlier as the answer. So that is really, really cool. Now we're almost done. So for one data point, the contribution to the derivative is hj times the difference between the indicator and the predicted probability. Now if you want to compute the derivative of the likelihood function for all of the data points with respect to wj, all you have to do is sum over data points. So you sum over i equals 1 through N of the term above: hj of xi, times the difference between the indicator that yi equals plus 1 and the probability that yi equals plus 1, given xi and w. >> Very cool. For those brave enough to have gone through this advanced material, I really commend you. This is exciting stuff. It is how that derivative gets computed, and it's also how many other derivatives in machine learning are computed, so this is just one example of that. I hope you enjoyed it. Now we can go back to the main thread, thanks. >> [MUSIC]
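The whole derivation can be confirmed end to end: implement the per-point log likelihood in the form used above, sum over data points, and check that the gradient formula, hj of xi times the indicator minus the predicted probability, matches a finite-difference derivative. A sketch on small random made-up data (the data, weights, and sizes below are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 3
H = rng.normal(size=(N, D))            # row i is the feature vector h(x_i)
y = rng.choice([-1, 1], size=N)        # labels in {-1, +1}
w = rng.normal(size=D)                 # hypothetical coefficients

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # P(y = +1 | x, w) = sigmoid(w^T h(x))

def log_likelihood(w):
    scores = H @ w
    ind = (y == 1).astype(float)       # indicator that y_i = +1
    # ll_i = -(1 - 1[y_i=+1]) * w^T h(x_i) - log(1 + e^{-w^T h(x_i)})
    return np.sum(-(1.0 - ind) * scores - np.log1p(np.exp(-scores)))

# Gradient from the derivation:
# d ll / d w_j = sum_i h_j(x_i) * (1[y_i = +1] - P(y = +1 | x_i, w))
ind = (y == 1).astype(float)
grad = H.T @ (ind - sigmoid(H @ w))

eps = 1e-6
for j in range(D):
    w_plus = w.copy()
    w_plus[j] += eps
    numeric = (log_likelihood(w_plus) - log_likelihood(w)) / eps
    assert abs(numeric - grad[j]) < 1e-3
```

The same vectorized expression, features transposed times indicator minus probability, is what a gradient ascent implementation of logistic regression would compute at each step.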