Now let's take a deep dive and understand how we're going to predict the probability that a sentence is positive or negative using linear classifiers, or, as we call them, linear models.

We take some input data x, we compute some features h, and we output P hat, the estimated probability that the label is positive or negative.

Going back to our decision-boundary example, we computed the score of every data point as w transpose h of x, that is, w0 h0 + w1 h1 + w2 h2 + w3 h3, and so on. Everything below the line had a score greater than 0; everything above the line had a score less than 0. But we don't know how far: the score could be a lot less than 0 or a lot greater than 0. So how do we relate these scores, which could be anywhere between minus infinity and positive infinity, to the probability that the output is plus one, the probability that the sentence is positive? That's the task we're going to take on today.

In fact, w transpose h, the score, can range from minus infinity to plus infinity. If it's positive, greater than 0, we're going to output +1, and if it's negative, less than 0, we're going to output -1.
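As a small concrete sketch of the score computation described above, here is the dot product w transpose h for one data point. The weights and feature values are made-up numbers for illustration, not the course's actual model:

```python
# Hypothetical learned weights and extracted features (made-up values).
w = [0.0, 1.0, -1.5]   # w0 (intercept), w1 (e.g. #awesome), w2 (e.g. #awful)
h = [1.0, 2.0, 1.0]    # h0 = 1 (constant), h1 = 2, h2 = 1

# Score = w^T h = w0*h0 + w1*h1 + w2*h2
score = sum(wi * hi for wi, hi in zip(w, h))
print(score)           # 0.5

# Positive score -> predict +1; negative score -> predict -1.
y_hat = +1 if score > 0 else -1
print(y_hat)           # 1
```

The sign of the score gives the predicted label, but the magnitude (0.5 here versus, say, 50) is what we still need to turn into a probability.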
Now, what we want to say is: if that score is really, really, really big, like infinity, then we're very sure that y hat is +1. So we're going to say that the probability that y = +1 given this input is 1. On the other end of the spectrum, if the score is very, very, very negative, like minus infinity, we're very sure that y hat is -1, and we should output probability 0 that y is +1 for this particular input x.

Now, if the score is 0, we're right at the decision boundary, where we're neither positive nor negative. So we're indifferent between predicting y hat as +1 or -1, and with probabilities we can express that indifference: we say the probability that y is +1 given the input is 0.5, 50-50. It could go either way. So that's our goal: to predict those probabilities from the scores.

So we have the scores. The scores range from minus infinity to plus infinity, and they're a weighted combination of the features. Probabilities are between 0 and 1. If the score is minus infinity, I want the predicted probability to be 0. If the score is plus infinity, I want the predicted probability to be 1.0. And if the score is 0.0, I want to say the probability is 0.5.
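The three requirements above (a very negative score maps near 0, a score of 0 maps to exactly 0.5, a very positive score maps near 1) can be checked directly. The logistic sigmoid shown here is one standard function that satisfies all three; this is a sketch of the desired behavior, not yet the course's formal derivation:

```python
import math

def squash(score):
    # Logistic sigmoid: maps any real-valued score into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-score))

print(squash(-100.0))  # ~0.0: very negative score -> P(y = +1) near 0
print(squash(0.0))     # 0.5: on the decision boundary -> 50-50
print(squash(100.0))   # ~1.0: very positive score -> P(y = +1) near 1
```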
Scores range from minus infinity to plus infinity; probabilities range between 0 and 1. The question is: how do I relate a score between minus infinity and plus infinity to a probability between 0 and 1? How do I link these two things?

And now we're going to see some magic. [LAUGH] The magic that glues, that links, the range minus infinity to plus infinity to the range 0 to 1 is called a link function, so named because it links the two. I'm going to take the score, which is between minus infinity and plus infinity, push it through a function g that squeezes that huge line into the interval 0 to 1, and use the result to predict the probability that y equals +1.

And when you take a linear model, w transpose h, ranging from minus infinity to plus infinity, and squeeze it into 0 to 1 using a link function, you're building what's called a generalized linear model. So if somebody stops you in the street today and asks what a generalized linear model is, say: no problem, it's just like a regression model, but you squeeze the output into 0 to 1 by pushing it through a link function. It's a little abstract, so we're going to talk about it in the context of logistic regression, which uses a specific link function.
Now, I talked about generalized linear models as squeezing minus infinity to plus infinity into the interval 0 to 1. That's true for classifiers, and for most kinds of classifiers. There are other types of generalized linear models that don't squeeze between 0 and 1, but for our purposes you can think about them in that context.

So in this context, our goal now becomes: take your training data and push it through some feature extraction, which gives us the h's: TF-IDF, the number of "awesome"s, or however else you represent the data. Then we build a linear model, w transpose h, push it through the link function that squeezes it into the interval 0 to 1, and use that to predict the probability that the sentiment of your review is positive given the input sentence.
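Putting the whole pipeline together, here is a hedged end-to-end sketch. The word-count feature extraction and weight values are invented for illustration, and the link function is the logistic sigmoid used by logistic regression:

```python
import math

# Hypothetical learned weights: (intercept, per-'awesome' weight, per-'awful' weight)
WEIGHTS = [0.0, 1.2, -1.5]

def extract_features(sentence):
    # Toy feature extraction: a constant feature plus two word counts.
    words = sentence.lower().split()
    return [1.0, float(words.count("awesome")), float(words.count("awful"))]

def predict_probability(sentence):
    h = extract_features(sentence)
    score = sum(w * hi for w, hi in zip(WEIGHTS, h))  # linear model: w^T h
    return 1.0 / (1.0 + math.exp(-score))             # link function g

print(predict_probability("awesome awesome product"))  # > 0.5: likely positive
print(predict_probability("awful awful awful"))        # < 0.5: likely negative
```

A sentence containing none of the tracked words gets score 0 and probability exactly 0.5, matching the decision-boundary case described earlier.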