We started with the intuition of a linear classifier using the sentiment analysis example. Let's take a bit of a deeper dive and understand more about what a linear classifier model really captures. In particular, we're going to take some data set, which is going to feed us x's. We're going to extract features from those, just like we did in the regression course, and we're going to feed them into the machine learning model for classification, which is going to output predictions y hat. And that model depends on some parameters w hat, which we're going to train from data.

Now let's go back to that example that we had with just two features with nonzero coefficients, awesome and awful. Now suppose that we had a third feature with a nonzero coefficient, let's say the word great. In this case every data point is associated with a point in this three-dimensional space. So for example, this data point over here might have five awesomes, three awfuls, and two greats associated with it. And what a linear classifier model is going to do is try to build a hyperplane that separates the positive examples from the negative examples. And the hyperplane is associated with the score function. The score function is a weighted combination of the features: w0 + w1 times the number of awesomes, which in our case was 5, + w2 times the number of awfuls, which in our case is 3, and finally w3 times the number of greats, which in our case was 2. So for this data point over here, the score of xi is w0 + 5w1 + 3w2 + 2w3, and depending on the coefficients, that score may be positive or negative. Since this is a positive training example, we want to choose the w's that make that score positive.

Now that we've set up the classification problem and the task that we're after, let's do a quick review of notation for the course. In this course we're going to use the same notation that we used in the regression course, which was the second course in the specialization. Here we have an output y, which is the thing you're trying to predict. In the regression case that was a real value, but in our case it's a class. And we have a set of inputs, x, which has little d dimensions: x[1], x[2], through x[d]. So x is really a d-dimensional vector, and y is the output we're trying to predict, which in our case is either minus one or plus one in the binary classification setting, which is where we're starting out today. Now, we use x[j] to denote the jth input, which is a scalar value. We're going to use hj of x to denote the jth feature. And then we'll use x sub i, very importantly, to denote the ith data point, and x sub i of j to denote the jth input of the ith data point. It's a little bit of a handful, but it's exactly the same as what we did in the regression course.

Now, armed with this notation, we can go back and define our simple hyperplane, the one we just saw with awfuls and awesomes, and just say that, in this case, y hat, our prediction, is the sign of the score that we have for this particular input. And this sign function just says that if the score is greater than 0, predict plus 1. If the score is less than 0, predict minus 1. And at zero, you have the choice of either predicting minus 1 or plus 1; you can make an arbitrary choice. The way I think about it is if it's 0, I predict plus 1.
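To make this concrete, here is a minimal Python sketch (my own illustration, not from the lecture) that computes the score for the example data point with 5 awesomes, 3 awfuls, and 2 greats, using made-up coefficient values, and then applies the sign rule to produce y hat:

```python
# Illustrative sketch: these coefficient values are made up, not learned.
w = {"(intercept)": 1.0, "awesome": 1.2, "awful": -2.1, "great": 0.9}

# Word counts for the example data point: 5 awesomes, 3 awfuls, 2 greats.
x = {"awesome": 5, "awful": 3, "great": 2}

# Score(x) = w0 + w1*(#awesome) + w2*(#awful) + w3*(#great)
score = w["(intercept)"] + sum(w[word] * count for word, count in x.items())

# Sign rule: predict +1 if the score is greater than 0, otherwise -1
# (breaking the tie at exactly 0 by predicting +1, as in the lecture).
y_hat = +1 if score >= 0 else -1

print(score, y_hat)  # 2.5, +1 with these made-up coefficients
```

With these particular (hypothetical) coefficients the score is positive, so this review would be predicted as positive; a different choice of w's could flip that prediction.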
Now, the score of an input xi is w0 + w1 times xi[1] + w2 times xi[2], all the way to wd, the dth coefficient, times xi[d], the dth entry in the xi vector. And here, for the inputs, the first feature is the constant 1, like we did with regression; x[1] could be the number of awesomes, x[2] could be the number of awfuls, and say the last one, x[d], could be the number of times the word ramen shows up. Which to me might be associated with a negative review, but it might be kind of indifferent; it depends on what coefficient you have there. So our goal is to fit the score, to learn those coefficients from data. And I'm going to use w transpose xi as a shorthand so I don't have to always write w0 plus w1x1 plus w2x2 and so on. So we use this transpose notation, which is the same one that we talked about in the regression course. [MUSIC]
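As a quick illustration of the w-transpose-xi shorthand, here is a small NumPy sketch (again my own example, with made-up numbers) that prepends the constant feature 1 to the input vector so the intercept w0 is handled uniformly, and then computes the score as a single dot product:

```python
import numpy as np

# Made-up coefficients: w0 (intercept), then one weight per input feature.
w = np.array([1.0, 1.2, -2.1, 0.3])

# Inputs for one data point x_i: x_i[1] = #awesome, x_i[2] = #awful,
# x_i[3] = #ramen (illustrative counts only).
x_i = np.array([5.0, 3.0, 1.0])

# Prepend the constant feature 1; then Score(x_i) = w^T x_i is a dot product.
h_x_i = np.concatenate(([1.0], x_i))
score = w @ h_x_i

# Predict the sign of the score, breaking the tie at 0 toward +1.
y_hat = 1 if score >= 0 else -1
print(score, y_hat)
```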