Now that we've learned about the likelihood function, the thing that we're trying to maximize, let's talk about the gradient ascent algorithm that tries to make it as large as possible. In this section, we're going to go through a little bit of math and a little bit of detail, but in the end, the gradient ascent algorithm for learning a logistic regression classifier is going to be extremely simple and extremely intuitive. Even if the likelihood function is a little bit fuzzy for you and the gradient stuff is not totally clear, in the end the algorithm that you're going to implement only requires a few lines of code. In fact, you'll be able to do it extremely easily. Good. We defined the model we want to fit, the logistic regression model, and we talked about the quality metric, the likelihood function. Now we're going to define the gradient ascent algorithm, which is the machine learning algorithm that tries to make the likelihood function as large as possible, to find that famous W hat that fits our data really well.

Now, we can go back to this picture that we've seen a few times, where we have multiple lines, each with its own likelihood, and we're trying to find the one with the best likelihood, like this line here with W0 = 1, W1 = 0.5, W2 = -1.5. We now know that the likelihood function is exactly this function up here: the product over my data points of the probability of the true label given the input sentence that we have. Our goal is to take this likelihood l and optimize it with gradient ascent. That's what we're going to go after right now.

As a quick review, we have our likelihood function, and we want to find the parameter values that maximize it, so this is a function of the three parameters W0, W1, W2 in this little example over here. We're trying to find the maximum over all possible values of W0, W1, and W2, and there are infinitely many of those, so if you try to enumerate them, it will be impossible to try them all. But gradient ascent is this magically simple but wonderful algorithm where you start from some point over here in the parameter space, which might be the weight for awful is 0 and the weight for awesome is -6, and you slowly climb up the hill in order to find the optimum, the top of the hill here, which is going to be our famous W hat. There, we might find that the weight for awesome is probably going to be a positive number, so maybe somewhere like this, say 0.5, and the weight for awful is maybe -1. Now, in this plot I've only shown two of the coordinates, W1 and W2. I didn't show W0 because it's really hard to plot in four-dimensional space, so I'm just showing you three out of those four dimensions. Now, let's discuss the gradient ascent algorithm to go ahead and do that.
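To make the "few lines of code" claim concrete, here is a minimal sketch of gradient ascent for logistic regression in Python with NumPy. This is an illustration under assumptions, not the course's reference implementation: the function names (predict_probability, gradient_ascent), the toy feature matrix counting "awesome" and "awful", the step size, and the iteration count are all made up for this example. The update it uses is the standard log-likelihood gradient for logistic regression with +1/-1 labels, which is derived later; here it is just a preview of where we're headed.

```python
import numpy as np

def predict_probability(X, w):
    """P(y = +1 | x, w) under the logistic regression model."""
    return 1.0 / (1.0 + np.exp(-X.dot(w)))

def gradient_ascent(X, y, step_size=0.1, num_iterations=500):
    """Climb the log-likelihood surface to find W hat.

    X: (N, D) feature matrix (first column of 1s plays the role of W0).
    y: (N,) labels coded as +1 / -1.
    """
    w = np.zeros(X.shape[1])              # start somewhere in the parameter space
    indicator = (y == +1).astype(float)   # 1 if the true label is +1, else 0
    for _ in range(num_iterations):
        errors = indicator - predict_probability(X, w)  # how far off each prediction is
        gradient = X.T.dot(errors)        # gradient of the log-likelihood w.r.t. w
        w = w + step_size * gradient      # take a small step uphill
    return w

# Tiny made-up example: intercept, #awesome, #awful per sentence
X = np.array([[1.0, 2.0, 0.0],
              [1.0, 0.0, 3.0],
              [1.0, 1.0, 1.0],
              [1.0, 3.0, 0.0]])
y = np.array([+1, -1, -1, +1])
w_hat = gradient_ascent(X, y)
print(w_hat)   # learned [W0, W1, W2]
```

The loop is exactly the hill-climbing picture described above: compute how wrong the current predictions are, move the weights a small step in the direction that increases the likelihood, and repeat until you reach the top of the hill.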