[MUSIC] Now let's take a deep dive and understand how we're going to predict the probability that a sentence is positive or negative using linear classifiers, or as we call them, linear models. So we're taking some input data x, we're going to compute some features h, and we're going to output P hat, the probability that the label is positive or negative.

Now, if you go back to our example of the decision boundary, we computed the score of every data point as w transpose h of x, or w0 h0 + w1 h1 + w2 h2 + w3 h3, and so on. Everything below the line had a score greater than 0; everything above the line had a score less than 0. But we don't know how far: a score could be a lot less than 0 or a lot greater than 0. So how do we relate these scores, which could be anywhere between minus infinity and plus infinity, to the probability that the output is +1 — the probability that the sentence is positive? That's the task we're going to tackle today.

The score, w transpose h, can range from minus infinity to plus infinity. If it's positive, greater than 0, we're going to output +1, and if it's negative, less than 0, we're going to output -1. Now, what we want to say is: if that score is really, really big — like plus infinity — then we're very sure that y hat is +1, so we're going to say that the probability that y = +1 given this input is 1. On the other end of the spectrum, if the score is very, very negative — minus infinity — we're very sure that y hat is -1, and we should output that the probability y = +1 is 0 for this particular input x. Now, if the score is 0, we're right at the decision boundary, where we're neither positive nor negative. We're indifferent about predicting whether y hat is +1 or -1, and with probabilities we can interpret that indifference: we say the probability that y = +1 given the input is 0.5, 50-50. It could go either way.

So that's our goal: to predict those probabilities from the scores. The scores range from minus infinity to plus infinity, and they're a weighted combination of the features. Probabilities are between 0 and 1. If the score is minus infinity, I want the predicted probability to be 0. If the score is plus infinity, I want the predicted probability to be 1.0. And if the score is 0.0, I want to say the probability is 0.5. The question is: how do I relate a score between minus infinity and plus infinity to a probability between 0 and 1? How do I link these two things?

And now we're going to see some magic. [LAUGH] The magic that glues, that links the range minus infinity to plus infinity to the range 0 to 1, is called a link function — it links the two. I'm going to take the score, which is between minus infinity and plus infinity, and push it through a function g that squeezes that whole real line into the interval 0 to 1, and use the result to predict the probability that y equals +1. And when you take a linear model, w transpose h, which ranges from minus infinity to plus infinity, and squeeze it into 0 to 1 using a link function, you are building what's called a generalized linear model. So if somebody stops you in the street today and asks you, what's a generalized linear model? Say, no problem: it's just like a regression model, but you squeeze its output into 0 to 1 by pushing it through a link function. It's a little abstract.
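To make this concrete, here's a minimal sketch in Python of the idea so far: a linear score w transpose h(x) pushed through a link function g that squeezes the real line into (0, 1). The weights and feature values below are made up for illustration, and the link used is the sigmoid (logistic) function — the specific link behind logistic regression, which we turn to next.

```python
import math

def sigmoid(score):
    # Link function g: squeezes (-inf, +inf) into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-score))

def predict_probability(weights, features):
    # Linear score: w^T h(x) = w0*h0 + w1*h1 + w2*h2 + ...
    score = sum(w * h for w, h in zip(weights, features))
    # Push the score through the link function to get P(y = +1 | x).
    return sigmoid(score)

# Hypothetical weights and features (e.g., counts of words like "awesome").
weights = [1.0, 1.5, -2.1]
features = [1.0, 2.0, 1.0]   # h0 (constant feature), h1, h2

print(predict_probability(weights, features))  # probability that y = +1

# Sanity checks on the link function's behavior:
print(sigmoid(0.0))    # 0.5  -> on the decision boundary, 50-50
print(sigmoid(20.0))   # ~1.0 -> very large score, very sure y = +1
print(sigmoid(-20.0))  # ~0.0 -> very negative score, very sure y = -1
```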
We're going to talk about it in the context of logistic regression, which uses a specific link function. Now, I talked about generalized linear models as squeezing minus infinity to plus infinity into the interval 0 to 1. That's true for classifiers, and for most kinds of classifiers; there are other types of generalized linear models that don't squeeze into 0 and 1, but for our purposes you can think about them in that context. So in this context, our goal now becomes taking your training data and pushing it through some feature extraction, which gives us the h's — TF-IDF, the number of "awesome"s, or whatever else you use to represent the data. Then we build a linear model, w transpose h, push it through the link function that squeezes it into the interval 0 to 1, and use that to predict the probability that the sentiment of your review is positive given the input sentence. [MUSIC]
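Putting that whole pipeline together — feature extraction, linear model, link function — a rough sketch might look like the following. The vocabulary, weights, and simple count-based features here are invented for illustration; in real logistic regression the weights are learned from the training data rather than hand-picked.

```python
import math

# Hypothetical vocabulary and weights (real weights come from training).
WORDS = ["awesome", "great", "terrible"]
WEIGHTS = {"awesome": 1.2, "great": 0.9, "terrible": -2.0}
INTERCEPT = 0.1  # w0, the weight on the constant feature

def extract_features(sentence):
    # h(x): simple word counts; TF-IDF or other features would also work.
    tokens = sentence.lower().split()
    return {word: tokens.count(word) for word in WORDS}

def predict_sentiment_probability(sentence):
    h = extract_features(sentence)
    # Score = w0 + sum_j w_j * h_j(x), ranging over (-inf, +inf).
    score = INTERCEPT + sum(WEIGHTS[w] * h[w] for w in WORDS)
    # The link function squeezes the score into (0, 1): P(y = +1 | x).
    return 1.0 / (1.0 + math.exp(-score))

print(predict_sentiment_probability("the sushi was awesome awesome"))  # ~0.92
print(predict_sentiment_probability("the food was terrible"))          # ~0.13
```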