[MUSIC] We've now seen how to express
the probability that y equals plus 1 and the probability that y equals minus 1, and now I'm going to plug those into the log
likelihood before we take its gradient. This is still part of that very, very optional derivation that we're doing
of the gradient of logistic regression. Now we have probability y=+1,
probability y=-1. We can go ahead and plug those
into our log likelihood function. And with the indicators and all that
good stuff, let's see what happens. So I'm going to go ahead and plug in
the definitions of probability y=+1 and probability y=-1 into the log likelihood. It turns out
the log likelihood is going to simplify to a pretty cool term, and we're
going to take the derivative of it and get the gradient
that we're hoping for. So here we go. So I'm just going to first plug in
the probability y=+1 into the first term. So we're going to have that the log
likelihood function is the indicator that yi = +1 (we're only
going to deal with one data point for now, and then we'll get the sum
over the data points later) times the log of that ratio,
1 / (1 + e to the -w transpose h(xi)), where xi is the particular
data point we're dealing with. So that was easy, that was the first one. Now for the second term here,
we need to plug in the probability y=-1. But it's a little bit annoying for a
derivation that we have an indicator that yi is +1 and another indicator that
yi is -1, so we're going to take this indicator that yi = -1 and
substitute it with something else. So let's do a change of colors
transformation here and just remind ourselves that
the indicator that yi = -1, which takes value 1 when yi = -1, can be written as 1 minus the indicator that yi = +1. So if you think about it, when yi = -1, the left side here is 1 and
the right side is also 1. If yi = +1, the left side is 0 and
the right side is 0. So let's plug that in,
what we just learned. And so for the coefficient of the second term here, we get 1 minus the indicator that yi = +1. And we're going to plug in the definition
of the probability that y = -1. So that's the log of e to the -w transpose h(xi) / (1 + e to the -w transpose h(xi)). Great, so now we have our two terms and
now we're going to move a couple of things around and simplify
the equations pretty significantly. So let's go ahead and do that. Okay, let's go back to our
change of colors transformation and deal with the first term. The first term is the red term,
so let's go back to the red term. And let's see what the log of 1 / (1 + e to the -w transpose h(xi)) looks like. The log of 1 over something turns
out to be minus the log of that something. And the something here
is 1 + e to the -w transpose h(xi). So plugging that in, we get the indicator that yi = +1 multiplying minus the log of (1 + e to the -w transpose h(xi)). So we're going to write here the log of (1 + e to the -w transpose h(xi)), and put the minus sign out front. That was for the first term, so now let's look at that famous
second term and expand it out. So let's go back to our blue. And so
the coefficient here stays the same: 1 minus the indicator
that yi is positive, times the log of that ratio. So let's explore what the log
of the ratio looks like. So, the log of e to the -w transpose h(xi) / (1 + e to the -w transpose h(xi)). And so what does that look like? It's
the log of a ratio, so it's the difference of the logs: it's the log of e to the power
of -w transpose h(xi) minus the log of 1 + e to
the power of -w transpose h(xi). Let's note two things. First, the second
term is exactly what we had up here, so things are starting
to look very similar. And what about the first term? Here's another log trick: the log of e to the something,
say e to the a, is exactly equal to a. So in this case, the log of e to the -w transpose h(xi) is just -w transpose h(xi). So plugging that in, we get the coefficient multiplying the quantity -w transpose h(xi) minus the same term as on the other side, the log of (1 + e to the -w transpose h(xi)). Okay, that was going a little slowly, so now you
can shake things around, move things, lots of stuff cancels out, I'm not
going to go through that in detail, but you're welcome to do it. I'm just going to do a change
of color transformation, go to purple which is a color I love,
and just write out what the answer is. The answer here is that the log
likelihood can be written, for [INAUDIBLE] point, as minus the quantity 1 minus the indicator that it
is a positive example, so yi = +1, multiplying w transpose h(xi), minus the log of that crazy term 1 + e to the -w transpose h(xi). So, we started from the log
likelihood function, we went through a bunch of derivations and math,
which you can explore if you want, and we ended up with this much simpler form. And now we're going to take the simpler
form and take its derivative. [MUSIC]
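
As a quick numerical sanity check of the simplification walked through above, here is a minimal Python sketch. It is not from the lecture: the function names and the variable `score` (an illustrative stand-in for w transpose h(xi)) are assumptions made for this example. It compares the original two-indicator form of the log likelihood for one data point against the simplified form, written here as -(1 - 1[yi=+1]) * score - log(1 + e^(-score)).

```python
import math

def log_likelihood_two_terms(y, score):
    """Per-point log likelihood with one indicator per class.

    `score` is an illustrative stand-in for w transpose h(xi).
    """
    p_pos = 1.0 / (1.0 + math.exp(-score))               # P(y = +1)
    p_neg = math.exp(-score) / (1.0 + math.exp(-score))  # P(y = -1)
    is_pos = 1 if y == +1 else 0                         # indicator yi = +1
    is_neg = 1 if y == -1 else 0                         # indicator yi = -1
    return is_pos * math.log(p_pos) + is_neg * math.log(p_neg)

def log_likelihood_simplified(y, score):
    """Simplified form: -(1 - 1[y=+1]) * score - log(1 + e^(-score))."""
    is_pos = 1 if y == +1 else 0
    return -(1 - is_pos) * score - math.log(1.0 + math.exp(-score))

# Compare the two forms on a few arbitrary scores and both labels.
for score in (-3.0, -0.5, 0.0, 2.0):
    for y in (+1, -1):
        a = log_likelihood_two_terms(y, score)
        b = log_likelihood_simplified(y, score)
        assert math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
print("both forms agree")
```

This sketch only checks moderate scores. For very large negative scores, `math.exp(-score)` overflows, so a more careful version would be needed there.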