Before we get to gradient descent, though, we need to figure out what quality metric we're trying to optimize. That quality metric is called the data likelihood. So let's dive in and explore exactly that concept.

To better understand the data likelihood, let's start from some simple examples and try to understand what w is trying to do, what we're trying to do here in the learning problem. Say we have a data point with two awesomes and one awful; let's call that input x1. Its output y1 is +1: this is a review with positive sentiment. If our classifier were really good, if w were good, what would happen? For this particular input it would output y hat 1 = +1. In other words, y hat agrees with the true label y.

So what should we do so that y hat agrees with the true label y? We're using logistic regression, so we're going to try to find a w that makes the probability that y = +1, when the input is x1 and the parameters are w, as big as possible. In other words, we want to make the probability that y = +1, when the number of awesomes is 2 and the number of awfuls is 1, as big as possible for the parameters w.

But that was for a positive example, where we make the probability that y = +1 as big as possible. Now let's take another example. Let's call this example x2, with 0 awesomes and 2 awfuls; it must be an awful review of an awful restaurant. So the sentiment y2 is -1. If our classifier is good, if w is good, in this case it should predict y hat 2 = -1 and agree with the true label. So in this case we're not maximizing the probability that y = +1; we're maximizing the probability that y = -1 when the input is x2, for the parameters w. In other words, when the input is x1 we maximize the probability that y = +1, and when the input is x2 we maximize the probability that y = -1. That's what the data likelihood tries to do, but let's dig in and understand it a little bit better.
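As a quick aside, here is a minimal sketch of those two probabilities in Python, assuming the usual logistic regression model P(y = +1 | x, w) = 1 / (1 + exp(-w·h(x))) with features h(x) = [#awesome, #awful]; the weight vector is made up purely for illustration.

```python
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

# Hypothetical weights for the two features [#awesome, #awful].
w = np.array([1.0, -1.5])

x1 = np.array([2, 1])   # 2 awesomes, 1 awful  -> true label y1 = +1
x2 = np.array([0, 2])   # 0 awesomes, 2 awfuls -> true label y2 = -1

p_plus_given_x1 = sigmoid(w @ x1)         # P(y = +1 | x1, w): want this big
p_minus_given_x2 = 1.0 - sigmoid(w @ x2)  # P(y = -1 | x2, w): want this big
print(p_plus_given_x1, p_minus_given_x2)
```

A w that scores awesome positively and awful negatively makes both of these probabilities large, which is exactly the behavior we want from good parameters.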
Now, we don't just have two training examples; we have a ton of examples, a big dataset. So what are we trying to do? Let's look at the first example. Since it's a positive example, we want to maximize the probability that y = +1 when the input is x1 and the parameters are w. In this case, that's the probability that y = +1 when the number of awesomes is 2 and the number of awfuls is 1, given the parameters w. So we try to make the probability that y = +1 given x1 as big as possible. For the next example, which is a negative example, we try to make the probability that y = -1 given x2 and w as big as possible. The third one is also a negative example, so we try to make the probability that y = -1 given x3 and w as big as possible. The fourth one is a positive example, so we try to make the probability that y = +1 given x4 and w as large as possible. In other words, for the positive examples we try to make the probability that y = +1 as big as possible, and for the negative examples we try to make the probability that y = -1 as big as possible, which is pretty natural. And we want to do that for each example, for every single one of those.

Now the question is: how do we combine these into a single quality metric? You can imagine multiple ways of combining them, averages, all sorts of ideas out there. The way you typically combine them, when you're doing what's called maximum likelihood estimation, or maximizing the likelihood, is by multiplying these things together. So you multiply. In other words, you take the probability that y = +1 given x1 and w, from the first line, times the probability that y = -1 given x2 and w, from the second line, and the third line is also negative, so times the probability that y = -1 given x3 and w, and so on. You just multiply them all together. Here's a little side note for those who know about probabilities: the reason you multiply is that you assume every row is independent of the others. So there's an independence assumption that comes into play, but don't worry too much about this.
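To make that multiplication concrete, here is a rough sketch for a tiny four-review dataset, using the same hypothetical model and weights as in the sketch above; the feature counts for x3 and x4 are invented for illustration.

```python
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

w = np.array([1.0, -1.5])        # hypothetical weights for [#awesome, #awful]

# (features, true label) pairs; the counts for x3 and x4 are made up
data = [
    (np.array([2, 1]), +1),      # x1: positive review
    (np.array([0, 2]), -1),      # x2: negative review
    (np.array([3, 3]), -1),      # x3: negative review
    (np.array([4, 1]), +1),      # x4: positive review
]

likelihood = 1.0
for x, y in data:
    p_plus = sigmoid(w @ x)                             # P(y = +1 | x, w)
    likelihood *= p_plus if y == +1 else 1.0 - p_plus   # multiply them all together
print(likelihood)
```

Finding the w that makes this product as large as possible is the maximum likelihood estimation just mentioned.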
Just think of it as a multiplication.

Okay, so let's do that a little bit more explicitly. Now we have these data points 1 through 4, and for each one of them we're trying to maximize a specific probability. For the positive examples, we're trying to maximize the probability that y = +1; for the negative examples, the probability that y = -1, given the parameters w. We're going to find the w that makes those as big as possible. So the likelihood function, the thing we want to optimize, is the product of each one of these probabilities.

I like that animation. Isn't it pretty cool?

So, we're going to use a shorthand notation here. For the first example, y1 up here was +1, so we denote that by the probability of y1 given x1 and w. The y1 comes from this +1, and the x1 comes from this representation of x. And similarly for the other ones, since the notation is going to get long and heavy: we just write the probability of y1 given x1 and w, times the probability of y2 given x2 and w, times the probability of y3 given x3 and w, and we just multiply them together.

And finally, so that the line doesn't get really long, because if we had a million or two million examples we would have a million of these entries, we use the product notation. That's the little product symbol over here, and it just says I'm going to write the same function: l(w) is equal to the product, ranging from the first data point to N, the number of data points, of the probability of whatever label yi has, +1 or -1, given the input xi, which is the sentence of that review, and the parameters w. This is the likelihood function that we're trying to optimize. So our goal here is to pick w to make this crazy thing, I mean, this function, as large as possible. That's our goal.
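Putting that product notation into a minimal sketch, here is the likelihood l(w) = P(y1 | x1, w) x ... x P(yN | xN, w) as a single function over the same tiny illustrative dataset; a real implementation would typically work with the log of this product for numerical stability, but that detail comes later.

```python
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

def data_likelihood(w, X, y):
    """l(w) = product over i of P(yi | xi, w) under the logistic model (sketch)."""
    p_plus = sigmoid(X @ w)                         # P(y = +1 | xi, w) for every row
    per_example = np.where(y == +1, p_plus, 1.0 - p_plus)
    return np.prod(per_example)

# Same tiny illustrative dataset as before: columns are [#awesome, #awful].
X = np.array([[2, 1], [0, 2], [3, 3], [4, 1]])
y = np.array([+1, -1, -1, +1])

print(data_likelihood(np.array([1.0, -1.5]), X, y))  # the quantity we want to make large
print(data_likelihood(np.array([0.0,  0.0]), X, y))  # a worse w gives a smaller value
```

The gradient method mentioned at the start of the lecture is how we will actually search for the w that makes this function as large as possible.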