[MUSIC] In order to learn the coefficients w hat from data, we have to define some kind of quality metric, so that the coefficients that give you the highest quality are what w hat is supposed to be. So if we go back to our learning loop, we get some training data, we push it through feature generation, or feature extraction, which gives us h(x). Our model is a logistic regression model, and now we're going to talk about the quality metric, which feeds into the machine learning algorithm that outputs w hat. So we're zooming into this orange box over here.

In particular, we're given some training data as input, and the data looks like the one I'm showing you: if it only had two features, it would have the number of awesomes, the number of awfuls, and the output label or sentiment, which might be plus one or minus one. From all this data, the N data points, we use some kind of learning algorithm that optimizes the quality metric to give us w hat.

So what does the quality metric look like? Let's get right into it and try to understand what the likelihood function that we hinted at in the previous module is really about. Let's take our data set, just for intuition, and split it into the data points that have positive sentiment on the right and the data points that have negative sentiment on the left. So we have two tables, with slightly more negative sentiments than positive sentiments in this example.

So what do we want w hat to satisfy? What does a good w satisfy? For all the data points with positive sentiment, in the extreme we want the probability that the sentiment is positive to go all the way to one. For all the negative ones, we want that probability to go all the way to zero: it's not positive, so it must be negative. Our goal is to find a w hat that makes this happen, or comes as close as possible. In other words, if we take the negative examples and the positive examples, there might not be a w hat that achieves exactly zero for the negatives and one for the positives for all of them. So the quality metric, the likelihood function, measures, kind of on average over all the data points, how well the coefficients w make these extremes happen.

Now, if I have the likelihood function, I can evaluate multiple lines, or multiple classifiers. For example, for the green line here the likelihood function may have a certain value, let's say 10 to the minus 6. For this other line, where instead of having w0 be 0, now w0 is 1, but the w1 and w2 coefficients are the same, the likelihood is slightly higher, say 10 to the minus 5. But for the best line, which maybe sets w0 to be 1, w1 to be 0.5, and w2 to be -1.5, the likelihood is biggest, in this case 10 to the minus 4. Now, you see these numbers are kind of weird, 10 to the minus something, but this is what likelihoods come out to. They're going to be very, very small numbers, less than one. But the higher you get, the closer you get to one, the better. So the question is, how do we find the best w's, the best classifiers? We find the ones that make this likelihood function, which we're going to talk about, as big as possible. So we're going to define this function l(w), and then we're going to use gradient ascent to find w hat.
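To make the quality metric concrete, here is a minimal sketch in Python of evaluating the likelihood l(w) on a tiny two-feature sentiment data set, computed as the product over data points of P(y = +1 | x, w) for positive examples and 1 - P(y = +1 | x, w) for negative ones. The toy data points and the first two coefficient settings are illustrative assumptions; only the last setting (w0 = 1, w1 = 0.5, w2 = -1.5) comes from the example above.

```python
import numpy as np

def sigmoid(score):
    """Probability that the sentiment is +1, given the score w0 + w1*#awesome + w2*#awful."""
    return 1.0 / (1.0 + np.exp(-score))

def likelihood(w0, w1, w2, data):
    """Likelihood l(w): product over data points of the probability of the observed label."""
    total = 1.0
    for n_awesome, n_awful, label in data:
        p_positive = sigmoid(w0 + w1 * n_awesome + w2 * n_awful)
        total *= p_positive if label == +1 else (1.0 - p_positive)
    return total

# Toy data: (#awesome, #awful, sentiment) -- illustrative, not from the lecture
data = [(2, 1, +1), (0, 2, -1), (3, 3, -1), (4, 1, +1)]

# Compare a few candidate coefficient settings, as with the lines in the lecture;
# higher likelihood means the coefficients fit the observed sentiments better.
for w in [(0.0, 1.0, -1.5), (1.0, 1.0, -1.5), (1.0, 0.5, -1.5)]:
    print(w, likelihood(*w, data))
```

In practice the product of many probabilities gets extremely small, which is why likelihood values come out as 10 to the minus something, and why implementations usually maximize the log of the likelihood instead.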
And you should have some fond memories, maybe some sad memories, from the regression course, where we talked about gradient descent and explored the idea of using the gradient to find the best possible parameters to optimize the quality metric. We're going to go through that again in this case. [MUSIC]
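As a quick reminder of the mechanics, here is a minimal sketch of gradient ascent in Python: repeatedly step in the direction of the gradient to climb the objective. The toy objective -(w - 3)^2 is just for illustration; in this module the objective will be (the log of) the likelihood l(w).

```python
import numpy as np

def gradient_ascent(gradient, w_init, step_size=0.1, n_iterations=200):
    """Climb the objective by repeatedly stepping in the direction of its gradient."""
    w = np.array(w_init, dtype=float)
    for _ in range(n_iterations):
        w = w + step_size * gradient(w)
    return w

# Toy illustration: maximize -(w - 3)^2, whose gradient is -2(w - 3).
# Gradient ascent converges toward the maximizer w = 3.
w_hat = gradient_ascent(lambda w: -2.0 * (w - 3.0), w_init=[0.0])
print(w_hat)  # close to [3.]
```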