[MUSIC] Very good. We've now seen the simple gradient ascent algorithm for logistic regression: how we update the parameters, how we implement it, and we talked a little bit about how to set that step size parameter, the [INAUDIBLE] parameter, and what impact it has on the progress of our algorithm. Now I'm going to take a little bit of time to show you how to derive the derivative of the likelihood function for logistic regression, how that gradient is computed. The material we're going to talk about here is quite mathematical. This is really PhD-level material, and it can be a little annoying, but for some folks it might be exciting to go through every detail. For others, this could be something that you skip, and it doesn't change anything. Up to you; you've been warned, but it's there for when you want to learn more about it. We're going to jump into deriving the gradient of the likelihood function for logistic regression. Again, PhD-level material. Here we go.

As we said, our goal is to pick the coefficients w to maximize the likelihood function, which is the product over our data points of the probability of yi given xi and w. Now, it turns out that all the math we need to do becomes quite a lot simpler if you take the log of that likelihood function. I'm going to call that ll(w), and this is the natural log, ln, of that product. It turns out that in most of machine learning, especially for this kind of derivation, you often take the log, and it usually makes your math a lot simpler. Logs are your friends.

Let me do a quick review of the natural log function, ln, which is the function I'm showing here. On the x-axis you have some value z, and on the y-axis this is what the log of z looks like. You see that at zero it would actually be minus infinity; it grows quickly at first and then much, much slower later, but that's what the log does.

The reason it's useful is that the big product we had actually becomes a sum. In general, the log of a times b is equal to the log of a plus the log of b. Similarly, the log of a over b is the log of a minus the log of b. Here's a side note. I remember, in high school I think, I learned about the log function, and that the log of a times b is log a plus log b, and I thought, my god, this is the most useless thing I've ever seen, why are they spending time teaching it? It really took about six years before I actually saw it was useful for something. It's actually extremely useful for machine learning. A funny side note.

But anyway, the log has some very interesting properties. The other property it has is that if you take a function f and compute its maximum, taking the log doesn't change anything. If we denote w hat here to be the arg max over w of f(w), this notation just means the w that makes f(w) largest. And if you were to take the log of that function, let's call w hat_ln the thing that maximizes the log of f(w), so w hat_ln = arg max over w of log(f(w)). Because the log is what's called a monotonically increasing function, this transformation doesn't change the optimum: it turns out that w hat is going to be equal to w hat_ln.

So what we did in the previous slide, just taking the log of the likelihood function, still keeps the optimum in exactly the same place, and it's going to make our math quite a bit easier. [MUSIC]
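To make the two steps in this segment concrete, here they are written out as equations. This is just a restatement of what was said above, assuming the data points are indexed i = 1, ..., N:

```latex
% Log-likelihood: the natural log turns the product over data points into a sum
\ell\ell(\mathbf{w})
  = \ln \prod_{i=1}^{N} P(y_i \mid \mathbf{x}_i, \mathbf{w})
  = \sum_{i=1}^{N} \ln P(y_i \mid \mathbf{x}_i, \mathbf{w})

% Because ln is monotonically increasing, taking the log does not move the maximizer
\hat{\mathbf{w}}
  = \arg\max_{\mathbf{w}} \prod_{i=1}^{N} P(y_i \mid \mathbf{x}_i, \mathbf{w})
  = \arg\max_{\mathbf{w}} \ell\ell(\mathbf{w})
  = \hat{\mathbf{w}}_{\ln}
```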
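And here is a minimal numerical sketch of those same two facts, not part of the lecture itself: the tiny dataset, the grid of candidate coefficients, and the function names are all made up for illustration, and the per-point probability is assumed to be the standard sigmoid model P(y | x, w) = 1 / (1 + exp(-y w·x)) for labels y in {+1, -1}. It checks that the log-likelihood is a sum of logs and that the same w maximizes both the likelihood and its log.

```python
import numpy as np

def prob_y_given_x(y, x, w):
    # Per-point probability P(y_i | x_i, w) under a sigmoid model with labels y in {+1, -1}
    return 1.0 / (1.0 + np.exp(-y * np.dot(w, x)))

def likelihood(w, X, y):
    # Product over data points of P(y_i | x_i, w)
    return np.prod([prob_y_given_x(yi, xi, w) for xi, yi in zip(X, y)])

def log_likelihood(w, X, y):
    # ll(w): the natural log turns that product into a sum of logs
    return np.sum([np.log(prob_y_given_x(yi, xi, w)) for xi, yi in zip(X, y)])

# Tiny synthetic dataset (made up for illustration)
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([+1, -1, +1])

# Scan a grid of candidate coefficients and check that the same w
# maximizes both the likelihood and the log-likelihood (the argmax is unchanged).
candidates = [np.array([w0, w1])
              for w0 in np.linspace(-2, 2, 21)
              for w1 in np.linspace(-2, 2, 21)]
best_lik = max(candidates, key=lambda w: likelihood(w, X, y))
best_ll = max(candidates, key=lambda w: log_likelihood(w, X, y))
print(best_lik, best_ll)  # the two maximizers coincide
```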