[MUSIC] In order to learn the coefficients w hat from data, we have to define some kind of quality metric, so that the coefficients that give you the highest quality are what w hat is supposed to be. So if we go back to our learning loop, we get some training data, we push it through feature generation, or feature extraction, which gives us h(x). Our model is a logistic regression model, and now we're going to talk about the quality metric, which feeds into the machine learning algorithm that outputs w hat. So we're zooming into this orange box over here.

In particular, we're given some training data as input, and the data looks like the one I'm showing you: if it only had two features, it would have the number of awesomes, the number of awfuls, and the output label or sentiment, which might be plus one or minus one. From all this data, the N data points, we use some kind of learning algorithm that optimizes the quality metric to give us w hat.

So what does the quality metric look like? Let's get right into it and try to understand what the likelihood function that we hinted at in the previous module is really about. Let's take our data set, just for intuition, and split it into the data points that have positive sentiment on the right and the data points that have negative sentiment on the left. So we have two tables, with slightly more negative sentiments than positive sentiments in this example.

So what do we want w hat to satisfy? What does a good w satisfy? For all the data points with positive sentiment, in the extreme we want the probability that the sentiment is positive to go all the way to one. For all the negative ones, we want that probability to go all the way to zero: it's not positive, so it must be negative. Our goal is to find a w hat that makes this happen, or comes as close as possible. In other words, if we take the negative examples and the positive examples, there might not be a w hat that achieves exactly zero for the negatives and one for the positives for all of them. So the quality metric, the likelihood function, measures, kind of on average over all the data points, how well the coefficients w make these extremes happen.

Now, if I have the likelihood function, I can evaluate multiple lines, or multiple classifiers. For example, for the green line here the likelihood function may have a certain value, let's say 10 to the minus 6. For this other line, where instead of having w0 be 0, now w0 is 1, but the w1 and w2 coefficients are the same, the likelihood is slightly higher, say 10 to the minus 5. But for the best line, which maybe sets w0 to be 1, w1 to be 0.5, and w2 to be -1.5, the likelihood is biggest, in this case 10 to the minus 4. Now, you see these numbers are kind of weird, 10 to the minus something, but this is what likelihoods come out to. They're going to be very, very small numbers, less than one. But the higher you get, the closer you get to one, the better. So the question is, how do we find the best w's, the best classifiers? We find the ones that make this likelihood function, which we're going to talk about, as big as possible. So we're going to define this function l(w), and then we're going to use gradient ascent to find w hat.
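To make the quality metric concrete, here is a minimal sketch in Python of evaluating the likelihood l(w) on a tiny two-feature sentiment data set, computed as the product over data points of P(y = +1 | x, w) for positive examples and 1 - P(y = +1 | x, w) for negative ones. The toy data points and the first two coefficient settings are illustrative assumptions; only the last setting (w0 = 1, w1 = 0.5, w2 = -1.5) comes from the example above.

```python
import numpy as np

def sigmoid(score):
    """Probability that the sentiment is +1, given the score w0 + w1*#awesome + w2*#awful."""
    return 1.0 / (1.0 + np.exp(-score))

def likelihood(w0, w1, w2, data):
    """Likelihood l(w): product over data points of the probability of the observed label."""
    total = 1.0
    for n_awesome, n_awful, label in data:
        p_positive = sigmoid(w0 + w1 * n_awesome + w2 * n_awful)
        total *= p_positive if label == +1 else (1.0 - p_positive)
    return total

# Toy data: (#awesome, #awful, sentiment) -- illustrative, not from the lecture
data = [(2, 1, +1), (0, 2, -1), (3, 3, -1), (4, 1, +1)]

# Compare a few candidate coefficient settings, as with the lines in the lecture;
# higher likelihood means the coefficients fit the observed sentiments better.
for w in [(0.0, 1.0, -1.5), (1.0, 1.0, -1.5), (1.0, 0.5, -1.5)]:
    print(w, likelihood(*w, data))
```

In practice the product of many probabilities gets extremely small, which is why likelihood values come out as 10 to the minus something, and why implementations usually maximize the log of the likelihood instead.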
And you should have some fond memories, maybe some sad memories, from the regression course, where we talked about gradient descent and explored the idea of using the gradient to find the best possible parameters to optimize the quality metric. We're going to go through that again in this case. [MUSIC]
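As a quick reminder of the mechanics, here is a minimal sketch of gradient ascent in Python: repeatedly step in the direction of the gradient to climb the objective. The toy objective -(w - 3)^2 is just for illustration; in this module the objective will be (the log of) the likelihood l(w).

```python
import numpy as np

def gradient_ascent(gradient, w_init, step_size=0.1, n_iterations=200):
    """Climb the objective by repeatedly stepping in the direction of its gradient."""
    w = np.array(w_init, dtype=float)
    for _ in range(n_iterations):
        w = w + step_size * gradient(w)
    return w

# Toy illustration: maximize -(w - 3)^2, whose gradient is -2(w - 3).
# Gradient ascent converges toward the maximizer w = 3.
w_hat = gradient_ascent(lambda w: -2.0 * (w - 3.0), w_init=[0.0])
print(w_hat)  # close to [3.]
```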