We started with the intuition of a linear classifier using the sentiment analysis example. Let's take a bit of a deeper dive and understand more about what a linear classifier model really captures. In particular, we're going to take some data set, which is going to feed us x's. We're going to extract features from those, just like we did in the regression course, and we're going to feed them into the machine learning model for classification, which is going to output predictions y hat. And that model depends on some parameters w hat, which we're going to train from data.

Now let's go back to that example that we had with just two features with nonzero coefficients, awesome and awful. Now suppose that we had a third feature with a nonzero coefficient, let's say the word great. In this case every data point is associated with a point in this three-dimensional space. So for example, this data point over here might have five awesomes, three awfuls, and two greats associated with it. And what a linear classifier model is going to do is try to build a hyperplane that separates the positive examples from the negative examples. And the hyperplane is associated with the score function. The score function is a weighted combination of the features: w0 + w1 times the number of awesomes, which in our case was 5, + w2 times the number of awfuls, which in our case is 3, and finally w3 times the number of greats, which in our case was 2. So for this data point over here, the score of xi is w0 + 5w1 + 3w2 + 2w3, and depending on the coefficients, that score may be positive or negative. Since this is a positive training example, we want to choose the w's that make that score positive.

Now that we've set up the classification problem and the task that we're after, let's do a quick review of notation for the course. In this course we're going to use the same notation that we used in the regression course, which was the second course in the specialization. Here we have an output y, which is the thing you're trying to predict. In the regression case that was a real value, but in our case it's a class. And we have a set of inputs, x, which has little d dimensions: x[1], x[2], through x[d]. So x is really a d-dimensional vector, and y is the output we're trying to predict, which in our case is either minus one or plus one in the binary classification setting, which is where we're starting out today. Now, we use x[j] to denote the jth input, which is a scalar value. We're going to use hj of x to denote the jth feature. And then we'll use x sub i, very importantly, to denote the ith data point, and x sub i of j to denote the jth input of the ith data point. It's a little bit of a handful, but it's exactly the same as what we did in the regression course.

Now, armed with this notation, we can go back and define our simple hyperplane, the one we just saw with awfuls and awesomes, and just say that, in this case, y hat, our prediction, is the sign of the score that we have for this particular input. And this sign function just says that if the score is greater than 0, predict plus 1. If the score is less than 0, predict minus 1. And at zero, you have the choice of either predicting minus 1 or plus 1; you can make an arbitrary choice. The way I think about it is if it's 0, I predict plus 1.
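To make this concrete, here is a minimal Python sketch (my own illustration, not from the lecture) that computes the score for the example data point with 5 awesomes, 3 awfuls, and 2 greats, using made-up coefficient values, and then applies the sign rule to produce y hat:

```python
# Illustrative sketch: these coefficient values are made up, not learned.
w = {"(intercept)": 1.0, "awesome": 1.2, "awful": -2.1, "great": 0.9}

# Word counts for the example data point: 5 awesomes, 3 awfuls, 2 greats.
x = {"awesome": 5, "awful": 3, "great": 2}

# Score(x) = w0 + w1*(#awesome) + w2*(#awful) + w3*(#great)
score = w["(intercept)"] + sum(w[word] * count for word, count in x.items())

# Sign rule: predict +1 if the score is greater than 0, otherwise -1
# (breaking the tie at exactly 0 by predicting +1, as in the lecture).
y_hat = +1 if score >= 0 else -1

print(score, y_hat)  # 2.5, +1 with these made-up coefficients
```

With these particular (hypothetical) coefficients the score is positive, so this review would be predicted as positive; a different choice of w's could flip that prediction.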
Now, the score of an input xi is w0 + w1 times xi[1] + w2 times xi[2], all the way to wd, the dth coefficient, times xi[d], the dth entry in the xi vector. And here, for the inputs, the first feature is the constant 1, like we did with regression; x[1] could be the number of awesomes, x[2] could be the number of awfuls, and say the last one, x[d], could be the number of times the word ramen shows up. Which to me might be associated with a negative review, but it might be kind of indifferent; it depends on what coefficient you have there. So our goal is to fit the score, to learn those coefficients from data. And I'm going to use w transpose xi as a shorthand so I don't have to always write w0 plus w1x1 plus w2x2 and so on. So we use this transpose notation, which is the same one that we talked about in the regression course. [MUSIC]
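As a quick illustration of the w-transpose-xi shorthand, here is a small NumPy sketch (again my own example, with made-up numbers) that prepends the constant feature 1 to the input vector so the intercept w0 is handled uniformly, and then computes the score as a single dot product:

```python
import numpy as np

# Made-up coefficients: w0 (intercept), then one weight per input feature.
w = np.array([1.0, 1.2, -2.1, 0.3])

# Inputs for one data point x_i: x_i[1] = #awesome, x_i[2] = #awful,
# x_i[3] = #ramen (illustrative counts only).
x_i = np.array([5.0, 3.0, 1.0])

# Prepend the constant feature 1; then Score(x_i) = w^T x_i is a dot product.
h_x_i = np.concatenate(([1.0], x_i))
score = w @ h_x_i

# Predict the sign of the score, breaking the tie at 0 toward +1.
y_hat = 1 if score >= 0 else -1
print(score, y_hat)
```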