[MUSIC]

In order to learn the coefficients w hat from data, we have to define some kind of quality metric, so that the coefficients that give you the highest quality are what w hat is supposed to be. So if we go back to our learning loop: we take some training data, we push it through a feature generation, or feature extraction, system, and that gets us h(x). Our model is a logistic regression model, and now we're going to talk about the quality metric, which is going to feed into the machine learning algorithm that outputs w hat. So we're turning to this orange box over here.

In particular, we're now given some training data as the input, and the data is kind of like the one I'm showing you. If it only had two features, it would have, for example, the number of awesomes, the number of awfuls, and the output label, or sentiment, which might be plus one or negative one. All this data, the N data points, we feed into some kind of learning algorithm that optimizes the quality metric to give us w hat. So what does the quality metric look like? Let's get right into it and try to understand what the likelihood function that we hinted at in the previous module is really about.

Let's take our data set, just for intuition, and split it into the data points that have positive sentiment on the right and the data points that have negative sentiment on the left. So we have two tables, with slightly more negative sentiments than positive sentiments in this example. So what do we want w hat to satisfy? What would a good w satisfy? For all the data points with positive sentiment, in the extreme we want the probability that the sentiment is positive to go all the way to plus one. For all the negative ones, we want that probability to go all the way to zero: it's not positive, so it must be negative. So our goal is to find a w hat that makes this happen, or gets as close as possible.

In other words, if we take the negative examples and the positive examples, there might not be a w hat that achieves exactly zero for the negatives and one for the positives, for all of them. So the quality metric, the likelihood function, measures the quality on average throughout all the data points: with respect to the coefficients w, how well we're making these extremes happen.
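To make this concrete, here is a minimal sketch in Python of what evaluating such a quality metric could look like: the likelihood of a coefficient vector w is the product, over the data points, of the probability the logistic regression model assigns to each observed sentiment. The toy data set and the function names here are made up for illustration; they are not the course's actual code.

```python
import numpy as np

def sigmoid(score):
    # P(y = +1 | x, w) under the logistic regression model
    return 1.0 / (1.0 + np.exp(-score))

def likelihood(w, H, y):
    """Product over all N data points of P(y_i | x_i, w).

    H: (N, D) matrix of features h(x_i), one row per data point
    y: (N,) sentiment labels, +1 or -1
    w: (D,) coefficient vector
    """
    p_plus = sigmoid(H @ w)                         # P(y = +1) for every point
    per_point = np.where(y == 1, p_plus, 1 - p_plus)
    return per_point.prod()

# Toy data: feature columns are [constant, #awesome, #awful]
H = np.array([[1.0, 3.0, 0.0],
              [1.0, 0.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 1.0, 4.0]])
y = np.array([+1, -1, +1, -1])

print(likelihood(np.array([0.0, 1.0, -1.0]), H, y))
```

A w that pushes P(y = +1) toward one on the positive rows and toward zero on the negative rows makes every factor, and therefore the whole product, larger.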
Now, if I have the likelihood function, I can evaluate multiple lines, or multiple classifiers. So for example, for the green line here, the likelihood function may have a certain value, let's say 10 to the minus 6. For this other line, where instead of having w0 be 0, now w0 is 1, but the w1 and w2 coefficients are the same, the likelihood is slightly higher, say 10 to the minus 5. But for the best line, which maybe sets w0 to be 1, w1 to be 0.5, and w2 to be -1.5, the likelihood is biggest, which in this case is 10 to the minus 4.

Now, you see these numbers, and they're kind of weird: 10 to the minus something. But this is what likelihoods will come out to. They're going to be very, very small numbers, less than one. But the higher you get, the closer you get to one, the better. And so the question is, how do we find the best w's, the best classifiers? We'll find the ones that make this likelihood function, which we're going to talk about, as big as possible. So we're going to define this function l(w), and then we're going to use gradient ascent to find w hat. And you should have some fond memories, maybe some sad, sad memories, from the regression course, where we talked about gradient descent and we explored the idea of using the gradient to find the best possible parameters to optimize the quality metric. And we're going to go through that in this case again.
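Since we'll pick up gradient ascent in detail later, here is just a minimal preview sketch, assuming the standard logistic regression log-likelihood gradient. In practice we climb the log of the likelihood, which turns those tiny products into manageable sums and has the same maximizer. The step size, iteration count, and toy data below are arbitrary illustrative choices, not values from the course.

```python
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

def gradient_ascent(H, y, step_size=0.1, n_iters=200):
    """Maximize the log-likelihood of logistic regression.

    Uses the indicator form of the gradient:
      dl/dw_j = sum_i h_j(x_i) * (1[y_i = +1] - P(y = +1 | x_i, w))
    """
    w = np.zeros(H.shape[1])              # start at w = 0
    indicator = (y == 1).astype(float)
    for _ in range(n_iters):
        errors = indicator - sigmoid(H @ w)
        w += step_size * (H.T @ errors)   # step uphill along the gradient
    return w

# Same toy data as above: [constant, #awesome, #awful] features
H = np.array([[1.0, 3.0, 0.0],
              [1.0, 0.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 1.0, 4.0]])
y = np.array([+1, -1, +1, -1])

w_hat = gradient_ascent(H, y)
print(w_hat)  # coefficients that approximately maximize the likelihood
```

Each iteration nudges w in the direction that most increases the log-likelihood, which is exactly the uphill analogue of the gradient descent steps from the regression course.

[MUSIC]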