1 00:00:04,443 --> 00:00:09,144 We started with the intuition of a linear classifier using the sentiment 2 00:00:09,144 --> 00:00:10,820 analysis example. 3 00:00:10,820 --> 00:00:13,040 Let's take a little bit of a deeper dive and 4 00:00:13,040 --> 00:00:18,350 understand more about what a linear classifier model really captures. 5 00:00:18,350 --> 00:00:21,574 In particular, we're going to take some data set, 6 00:00:21,574 --> 00:00:25,150 some data set which is going to feed us x's. 7 00:00:25,150 --> 00:00:28,470 We're going to get features out of that just like we did in the regression course. 8 00:00:28,470 --> 00:00:31,632 And we're going to feed those into the machine learning model for 9 00:00:31,632 --> 00:00:35,610 classification, which is going to output predictions y hat. 10 00:00:35,610 --> 00:00:38,420 And that model depends on some parameters w hat which 11 00:00:38,420 --> 00:00:39,750 we're going to train from data. 12 00:00:41,190 --> 00:00:44,180 Now let's go back to that example that we had with 13 00:00:44,180 --> 00:00:49,060 just two features with non-zero coefficients, awesome and awful. 14 00:00:49,060 --> 00:00:52,320 Now suppose that we had a third feature with a non-zero coefficient. 15 00:00:52,320 --> 00:00:53,410 Let's say it's the word great. 16 00:00:54,420 --> 00:00:59,220 Now in this case every data point corresponds to a point 17 00:00:59,220 --> 00:01:00,900 in this three-dimensional space. 18 00:01:00,900 --> 00:01:08,200 So for example this data point over here might have five awesomes, 19 00:01:08,200 --> 00:01:13,750 it might have three awfuls, and 20 00:01:13,750 --> 00:01:19,340 it might have two greats associated with it. 21 00:01:19,340 --> 00:01:22,630 And what a linear classifier model is going to do is try to 22 00:01:22,630 --> 00:01:26,820 build a hyperplane that separates the positive examples from the negative examples. 23 00:01:26,820 --> 00:01:30,153 And the hyperplane is associated with a score function. 24 00:01:30,153 --> 00:01:34,367 The score function is going to be a weighted combination of 25 00:01:34,367 --> 00:01:38,928 the coefficients with the features that we have. 26 00:01:38,928 --> 00:01:43,603 So w0 + w1 times the number of awesomes, which in 27 00:01:43,603 --> 00:01:48,646 our case is 5, + w2 times the number of awfuls, 28 00:01:48,646 --> 00:01:52,336 which in our case is 3, and finally, 29 00:01:52,336 --> 00:01:57,890 w3 times the number of greats, which in our case is 2. 30 00:01:57,890 --> 00:02:02,118 So for this data point over here, 31 00:02:02,118 --> 00:02:06,950 the score of xi is going to be defined by 32 00:02:06,950 --> 00:02:11,631 w0 + 5w1 + 3w2 + 2w3, and 33 00:02:11,631 --> 00:02:17,069 depending on the coefficients, 34 00:02:17,069 --> 00:02:22,220 that score may be positive or negative. 35 00:02:22,220 --> 00:02:24,860 So, since this is a positive training example, 36 00:02:24,860 --> 00:02:28,990 we want to choose the w's that make that score positive. 37 00:02:28,990 --> 00:02:30,710 Now that we've set up the classification problem 38 00:02:30,710 --> 00:02:35,300 and the task that we're after, let's do a quick review of notation for the course. 39 00:02:35,300 --> 00:02:38,130 In this course we're going to use the same notation that we used 40 00:02:38,130 --> 00:02:43,030 in the regression course, which was the second course in the specialization. 41 00:02:43,030 --> 00:02:47,920 And here we have an output y, which is the thing you're trying to predict.
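Before moving on to the notation review, here is a minimal Python sketch of the score computation just described. The coefficient values below are made up purely for illustration; only the feature counts (5 awesomes, 3 awfuls, 2 greats) come from the example.

# Hypothetical coefficients: [w0 (intercept), w1 (#awesome), w2 (#awful), w3 (#great)]
w = [1.0, 1.2, -2.0, 0.5]

# Feature counts for the example data point: 5 awesomes, 3 awfuls, 2 greats
x = [5, 3, 2]

# Score(x) = w0 + 5*w1 + 3*w2 + 2*w3
score = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
print(score)  # 2.0 with these made-up weights, so the point lands on the positive side

With a different choice of w's the same counts could produce a negative score, which is exactly why, for this positive training example, we want w's that make the score positive.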
42 00:02:47,920 --> 00:02:53,600 In the regression case that was a real value, but in our case it's a class. 43 00:02:53,600 --> 00:02:57,906 And we have a set of inputs, x, which has little d dimensions: 44 00:02:57,906 --> 00:03:01,520 x[1], x[2], through x[d]. 45 00:03:01,520 --> 00:03:05,380 So x is really a d-dimensional vector, and y is the output we're trying to predict, 46 00:03:05,380 --> 00:03:07,920 which in our case is either minus one or plus one in 47 00:03:07,920 --> 00:03:11,338 the binary classification setting, which is where we're starting out today. 48 00:03:11,338 --> 00:03:20,210 Now we use x[j] to denote the jth input, which is a scalar value. 49 00:03:20,210 --> 00:03:26,020 We're going to use hj of x to denote the jth feature. 50 00:03:26,020 --> 00:03:31,160 And then we'll use x sub i, very importantly, to denote the ith data point. 51 00:03:32,820 --> 00:03:38,600 And x sub i of j denotes the jth input of the ith data point. 52 00:03:38,600 --> 00:03:41,627 It's a little bit of a handful, but 53 00:03:41,627 --> 00:03:47,160 it's exactly the same as what we did in the regression course. 54 00:03:47,160 --> 00:03:51,905 Now, equipped with this notation, we can go back and 55 00:03:51,905 --> 00:03:56,050 define our simple hyperplane, the one we just saw with awfuls and awesomes. 56 00:03:56,050 --> 00:04:00,560 And just say that, in this case, y hat, our prediction, is the sign 57 00:04:00,560 --> 00:04:04,840 of the score that we have for this particular input. 58 00:04:04,840 --> 00:04:09,360 And this sign function just says that if the score is greater than 0, 59 00:04:09,360 --> 00:04:11,080 predict plus 1. 60 00:04:11,080 --> 00:04:13,510 If the score is less than 0, predict minus 1. 61 00:04:13,510 --> 00:04:19,070 And then at zero, you have the choice of either predicting minus 1 or plus 1. 62 00:04:19,070 --> 00:04:20,660 You can make an arbitrary choice. 63 00:04:20,660 --> 00:04:25,140 The way I think about it is if it's 0, I predict plus 1. 64 00:04:25,140 --> 00:04:30,262 Now the score of an input xi is w0 + 65 00:04:30,262 --> 00:04:35,914 w1 times xi[1] + w2 times xi[2], 66 00:04:35,914 --> 00:04:41,921 all the way to wd, the dth coefficient, 67 00:04:41,921 --> 00:04:47,420 times xi[d], the dth entry of the xi vector. 68 00:04:50,200 --> 00:04:54,060 And here, among the features of the input, the first one is 1, 69 00:04:54,060 --> 00:04:59,640 the constant feature, like we did with regression, and x[1] could 70 00:04:59,640 --> 00:05:04,460 be the number of awesomes, x[2] could be the number of awfuls, and 71 00:05:04,460 --> 00:05:09,040 say the last one, x[d], could be the number of times the word ramen shows up. 72 00:05:09,040 --> 00:05:12,790 Which to me might be associated with a negative review, but 73 00:05:12,790 --> 00:05:14,390 it might be kind of indifferent. 74 00:05:14,390 --> 00:05:15,880 It depends on what coefficient you have there. 75 00:05:18,230 --> 00:05:23,010 So our goal is to fit the score, that is, to learn those coefficients from data. 76 00:05:23,010 --> 00:05:27,456 And I'm going to use w transpose xi as a shorthand so 77 00:05:27,456 --> 00:05:33,435 I don't have to always write w0 plus w1 times x[1] plus w2 times x[2] and so on. 78 00:05:33,435 --> 00:05:37,646 So we use this transpose notation, which is the same one that we 79 00:05:37,646 --> 00:05:40,354 talked about in the regression course. 80 00:05:40,354 --> 00:05:44,839 [MUSIC]
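Putting the sign rule and the score together, here is a minimal sketch of the prediction y hat = sign(w transpose x), assuming NumPy, a constant feature of 1 paired with w0, and the convention from the lecture that a score of exactly zero maps to plus 1. The coefficient values are again hypothetical, not learned from data.

import numpy as np

def predict(w, x):
    # Score(x) = w0*1 + w1*x[1] + ... + wd*x[d]; w[0] pairs with the constant feature
    score = w[0] + np.dot(w[1:], x)
    # sign of the score, with the tie at 0 broken toward +1 as in the lecture
    return 1 if score >= 0 else -1

# Hypothetical coefficients for [constant, #awesome, #awful, #great]
w_hat = np.array([1.0, 1.2, -2.0, 0.5])

print(predict(w_hat, np.array([5, 3, 2])))  # +1: the example review scores positive
print(predict(w_hat, np.array([0, 4, 0])))  # -1: many awfuls push the score negative

In practice, of course, w hat is not hand-picked like this; it is trained from data, which is what the rest of the course builds toward.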