[MUSIC] Now let's take a deep dive and understand how we're going to predict the probability that a sentence is positive or negative using linear classifiers, or as we call them, linear models. So we're taking some input data x, we're going to compute some features h, and we're going to output P hat, the probability that the label is positive or negative.

Now, if you go back to our example of the decision boundary, we computed the score of every data point as w transpose h of x, or w0 h0 + w1 h1 + w2 h2 + w3 h3, and so on. Everything below the line had a score greater than 0; everything above the line had a score less than 0. But we don't know how far: a score could be a lot less than 0 or a lot greater than 0. So how do we relate these scores, which could be anywhere between minus infinity and plus infinity, to the probability that the output is +1 — the probability that the sentence is positive? That's the task we're going to tackle today.

The score, w transpose h, can range from minus infinity to plus infinity. If it's positive, greater than 0, we're going to output +1, and if it's negative, less than 0, we're going to output -1. Now, what we want to say is: if that score is really, really big — like plus infinity — then we're very sure that y hat is +1, so we're going to say that the probability that y = +1 given this input is 1. On the other end of the spectrum, if the score is very, very negative — minus infinity — we're very sure that y hat is -1, and we should output that the probability y = +1 is 0 for this particular input x. Now, if the score is 0, we're right at the decision boundary, where we're neither positive nor negative. We're indifferent about predicting whether y hat is +1 or -1, and with probabilities we can interpret that indifference: we say the probability that y = +1 given the input is 0.5, 50-50. It could go either way.

So that's our goal: to predict those probabilities from the scores. The scores range from minus infinity to plus infinity, and they're a weighted combination of the features. Probabilities are between 0 and 1. If the score is minus infinity, I want the predicted probability to be 0. If the score is plus infinity, I want the predicted probability to be 1.0. And if the score is 0.0, I want to say the probability is 0.5. The question is: how do I relate a score between minus infinity and plus infinity to a probability between 0 and 1? How do I link these two things?

And now we're going to see some magic. [LAUGH] The magic that glues, that links the range minus infinity to plus infinity to the range 0 to 1, is called a link function — it links the two. I'm going to take the score, which is between minus infinity and plus infinity, and push it through a function g that squeezes that whole real line into the interval 0 to 1, and use the result to predict the probability that y equals +1. And when you take a linear model, w transpose h, which ranges from minus infinity to plus infinity, and squeeze it into 0 to 1 using a link function, you are building what's called a generalized linear model. So if somebody stops you in the street today and asks you, what's a generalized linear model? Say, no problem: it's just like a regression model, but you squeeze its output into 0 to 1 by pushing it through a link function. It's a little abstract.
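To make this concrete, here's a minimal sketch in Python of the idea so far: a linear score w transpose h(x) pushed through a link function g that squeezes the real line into (0, 1). The weights and feature values below are made up for illustration, and the link used is the sigmoid (logistic) function — the specific link behind logistic regression, which we turn to next.

```python
import math

def sigmoid(score):
    # Link function g: squeezes (-inf, +inf) into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-score))

def predict_probability(weights, features):
    # Linear score: w^T h(x) = w0*h0 + w1*h1 + w2*h2 + ...
    score = sum(w * h for w, h in zip(weights, features))
    # Push the score through the link function to get P(y = +1 | x).
    return sigmoid(score)

# Hypothetical weights and features (e.g., counts of words like "awesome").
weights = [1.0, 1.5, -2.1]
features = [1.0, 2.0, 1.0]   # h0 (constant feature), h1, h2

print(predict_probability(weights, features))  # probability that y = +1

# Sanity checks on the link function's behavior:
print(sigmoid(0.0))    # 0.5  -> on the decision boundary, 50-50
print(sigmoid(20.0))   # ~1.0 -> very large score, very sure y = +1
print(sigmoid(-20.0))  # ~0.0 -> very negative score, very sure y = -1
```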
We're going to talk about it in the context of logistic regression, which uses a specific link function. Now, I talked about generalized linear models as squeezing minus infinity to plus infinity into the interval 0 to 1. That's true for classifiers, and for most kinds of classifiers; there are other types of generalized linear models that don't squeeze into 0 and 1, but for our purposes you can think about them in that context. So in this context, our goal now becomes taking your training data and pushing it through some feature extraction, which gives us the h's — TF-IDF, the number of "awesome"s, or whatever else you use to represent the data. Then we build a linear model, w transpose h, push it through the link function that squeezes it into the interval 0 to 1, and use that to predict the probability that the sentiment of your review is positive given the input sentence. [MUSIC]
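Putting that whole pipeline together — feature extraction, linear model, link function — a rough sketch might look like the following. The vocabulary, weights, and simple count-based features here are invented for illustration; in real logistic regression the weights are learned from the training data rather than hand-picked.

```python
import math

# Hypothetical vocabulary and weights (real weights come from training).
WORDS = ["awesome", "great", "terrible"]
WEIGHTS = {"awesome": 1.2, "great": 0.9, "terrible": -2.0}
INTERCEPT = 0.1  # w0, the weight on the constant feature

def extract_features(sentence):
    # h(x): simple word counts; TF-IDF or other features would also work.
    tokens = sentence.lower().split()
    return {word: tokens.count(word) for word in WORDS}

def predict_sentiment_probability(sentence):
    h = extract_features(sentence)
    # Score = w0 + sum_j w_j * h_j(x), ranging over (-inf, +inf).
    score = INTERCEPT + sum(WEIGHTS[w] * h[w] for w in WORDS)
    # The link function squeezes the score into (0, 1): P(y = +1 | x).
    return 1.0 / (1.0 + math.exp(-score))

print(predict_sentiment_probability("the sushi was awesome awesome"))  # ~0.92
print(predict_sentiment_probability("the food was terrible"))          # ~0.13
```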