[MUSIC] Very good. We've now seen the simple gradient ascent algorithm for logistic regression: how we update the parameters, how we implement it, and we talked a little bit about how to set that step size parameter, the [INAUDIBLE] parameter, and what impact it has on the progress of our algorithm. Now I'm going to take a little bit of time to show you how to derive the derivative of the likelihood function for logistic regression, how that gradient is computed. The material we're going to talk about here is quite mathematical. This is really PhD-level material, and it can be a little annoying, but for some folks it might be exciting to go through every detail. For others, this could be something that you skip, and it doesn't change anything. Up to you; you've been warned, but it's there for when you want to learn more about it. We're going to jump into deriving the gradient of the likelihood function for logistic regression. Again, PhD-level material. Here we go.

As we said, our goal is to pick the coefficients w to maximize the likelihood function, which is the product over our data points of the probability of yi given xi and w. Now, it turns out that all the math we need to do becomes quite a lot simpler if you take the log of that likelihood function. I'm going to call that ll(w), and this is the natural log, ln, of that product. It turns out that in most of machine learning, especially for this kind of derivation, you often take the log, and it usually makes your math a lot simpler. Logs are your friends.

Let me do a quick review of the natural log function, ln, which is the function I'm showing here. On the x-axis you have some value z, and on the y-axis this is what the log of z looks like. You see that at zero it would actually be minus infinity; it grows quickly at first and then much, much slower later, but that's what the log does.

The reason it's useful is that the big product we had actually becomes a sum. In general, the log of a times b is equal to the log of a plus the log of b. Similarly, the log of a over b is the log of a minus the log of b. Here's a side note. I remember, in high school I think, I learned about the log function, and that the log of a times b is log a plus log b, and I thought, my god, this is the most useless thing I've ever seen, why are they spending time teaching it? It really took about six years before I actually saw it was useful for something. It's actually extremely useful for machine learning. A funny side note.

But anyway, the log has some very interesting properties. The other property it has is that if you take a function f and compute its maximum, taking the log doesn't change anything. If we denote w hat here to be the arg max over w of f(w), this notation just means the w that makes f(w) largest. And if you were to take the log of that function, let's call w hat_ln the thing that maximizes the log of f(w), so w hat_ln = arg max over w of log(f(w)). Because the log is what's called a monotonically increasing function, this transformation doesn't change the optimum: it turns out that w hat is going to be equal to w hat_ln.

So what we did in the previous slide, just taking the log of the likelihood function, still keeps the optimum in exactly the same place, and it's going to make our math quite a bit easier. [MUSIC]
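To make the two steps in this segment concrete, here they are written out as equations. This is just a restatement of what was said above, assuming the data points are indexed i = 1, ..., N:

```latex
% Log-likelihood: the natural log turns the product over data points into a sum
\ell\ell(\mathbf{w})
  = \ln \prod_{i=1}^{N} P(y_i \mid \mathbf{x}_i, \mathbf{w})
  = \sum_{i=1}^{N} \ln P(y_i \mid \mathbf{x}_i, \mathbf{w})

% Because ln is monotonically increasing, taking the log does not move the maximizer
\hat{\mathbf{w}}
  = \arg\max_{\mathbf{w}} \prod_{i=1}^{N} P(y_i \mid \mathbf{x}_i, \mathbf{w})
  = \arg\max_{\mathbf{w}} \ell\ell(\mathbf{w})
  = \hat{\mathbf{w}}_{\ln}
```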
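And here is a minimal numerical sketch of those same two facts, not part of the lecture itself: the tiny dataset, the grid of candidate coefficients, and the function names are all made up for illustration, and the per-point probability is assumed to be the standard sigmoid model P(y | x, w) = 1 / (1 + exp(-y w·x)) for labels y in {+1, -1}. It checks that the log-likelihood is a sum of logs and that the same w maximizes both the likelihood and its log.

```python
import numpy as np

def prob_y_given_x(y, x, w):
    # Per-point probability P(y_i | x_i, w) under a sigmoid model with labels y in {+1, -1}
    return 1.0 / (1.0 + np.exp(-y * np.dot(w, x)))

def likelihood(w, X, y):
    # Product over data points of P(y_i | x_i, w)
    return np.prod([prob_y_given_x(yi, xi, w) for xi, yi in zip(X, y)])

def log_likelihood(w, X, y):
    # ll(w): the natural log turns that product into a sum of logs
    return np.sum([np.log(prob_y_given_x(yi, xi, w)) for xi, yi in zip(X, y)])

# Tiny synthetic dataset (made up for illustration)
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([+1, -1, +1])

# Scan a grid of candidate coefficients and check that the same w
# maximizes both the likelihood and the log-likelihood (the argmax is unchanged).
candidates = [np.array([w0, w1])
              for w0 in np.linspace(-2, 2, 21)
              for w1 in np.linspace(-2, 2, 21)]
best_lik = max(candidates, key=lambda w: likelihood(w, X, y))
best_ll = max(candidates, key=lambda w: log_likelihood(w, X, y))
print(best_lik, best_ll)  # the two maximizers coincide
```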