Let's take a moment to explore the decision boundary that we played with quite a bit in this lecture. This is the one where 1.0 times the number of #awesomes minus 1.5 times the number of #awfuls is equal to 0. In this case, all the points below the line have a score of xi greater than 0, so they're all labeled as positive. And all the ones on the other side have a score of xi less than 0, so we label them as negative.

But let's think about the probabilities a little bit more. If we take this point over here, it's close to the boundary but it's on the positive side, so it should be getting a probability that is greater than 0.5, but not that much greater than 0.5. So let's see what the score looks like for this one, which we can measure directly. It has four #awesomes, so those count as 4. And it has two #awfuls, so those count as -1.5 times 2, which is -3. So the score is 4 - 1.5 times 2, which gives you a total of 1. And if you push that through the sigmoid, you see that the probability that y = +1 given this particular xi is 0.73. You can read that from the graph as well: if I take the score of 1 and trace it over to the curve, I get 0.73.

Now take another example that's further away from the boundary, like this one over here, where I feel more sure, more confident. The probability that y = +1 should be higher. In this case, the score of xi is 3: it has three #awesomes and no #awfuls, which implies that our prediction, p hat of y = +1 given xi, is 0.95, which is much larger than for the one that was closer to the boundary.

Very good. Now let's take a final datapoint, on the other side of the boundary. If you take this datapoint over here and compute its score, it has one #awesome, so that counts as 1. But it has three #awfuls, so those count as -1.5 times 3. The total score is -3.5, so it's very negative. We should be very sure that this is a negative example, so the probability that y = +1 given this particular xi should be really low.
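If you want to check these three worked examples yourself, here is a minimal Python sketch. The sigmoid and the coefficients 1.0 and -1.5 are straight from the lecture; the function and variable names are just illustrative.

```python
import math

def sigmoid(score):
    # Logistic function: maps a real-valued score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-score))

# Coefficients of the lecture's decision boundary:
# Score(x) = 1.0 * #awesomes - 1.5 * #awfuls
W_AWESOME, W_AWFUL = 1.0, -1.5

def prob_positive(n_awesome, n_awful):
    score = W_AWESOME * n_awesome + W_AWFUL * n_awful
    return score, sigmoid(score)

# The three reviews discussed above:
for n_awesome, n_awful in [(4, 2), (3, 0), (1, 3)]:
    score, p = prob_positive(n_awesome, n_awful)
    print(f"#awesomes={n_awesome}, #awfuls={n_awful}: "
          f"score={score:+.1f}, P(y=+1|x)={p:.2f}")
# -> score=+1.0, P=0.73;  score=+3.0, P=0.95;  score=-3.5, P=0.03
```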
So we're at a score of -3.5 here, and the probability should be really low. You can see that from the graph, and it turns out the probability is 0.03 when you push the score through the sigmoid. An extremely low probability, and it makes sense: a review with three #awfuls is just awful.

Now that we've explored the decision boundaries that come from a logistic regression model, and the notions of probability, let's explore a little bit more the effect of the coefficients that we learn from data, both on the boundary and on the confidence, or sureness, that we have in a particular prediction, that is, on the actual probabilities that we predict. So take a very simple model with just two features, the number of #awesomes and the number of #awfuls, and look at the coefficients of those features as well as the constant w0, and let's see what happens.

If the constant w0 is 0, then the probability of y = +1 given xi and w is given by the curve below. We see that if the number of #awesomes is exactly equal to the number of #awfuls, the two contributions cancel out, and the score of xi is exactly equal to 0. That gives you a predicted probability of 0.5, just like we saw. Now, if you have one more #awesome than you have #awfuls, that difference becomes 1, and your predicted probability of being a positive review, just from having that extra #awesome, is 0.73.

Now let's see what happens if you change the constant. If you change the constant w0 to -2 and keep everything else the same, you see that the curve has shifted to the right. So now I need the number of #awesomes to be two more than the number of #awfuls for the prediction to be 0.5, that is, for the probability that y = +1 given xi and w to be 50-50. So in this case, the #awfuls count a lot more heavily against a positive prediction, or at least that extra negative constant does.

Finally, if I keep w0 at 0 but increase the magnitude of the coefficients, I get the curve on the right, which is similar to the curve in the middle: if the difference between the two counts is 0, I still predict 0.5. However, the curve rises much, much more steeply. In other words, if you have just one more #awesome than you have #awfuls, you're going to say that the probability of y = +1 given xi and w is almost 1. So the bigger you make the coefficients in magnitude, the more quickly you become sure; and when you change the constant, you shift that curve to the left or to the right.
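Here is a small sketch of those three effects. With opposite-sign coefficients of equal magnitude on the two counts, the score depends only on the difference d = #awesomes - #awfuls. The constant -2 and the 0.5 and 0.73 readings come from the lecture; the per-count magnitudes 1.0 and 3.0 are illustrative assumptions, not values read off the slide.

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def prob_positive(d, w0, w):
    # Score = w0 + w * (#awesomes - #awfuls) = w0 + w * d
    return sigmoid(w0 + w * d)

for d in range(-3, 4):
    middle  = prob_positive(d, w0=0.0,  w=1.0)  # curve in the middle
    shifted = prob_positive(d, w0=-2.0, w=1.0)  # constant -2: shifts right
    steeper = prob_positive(d, w0=0.0,  w=3.0)  # bigger magnitude: steeper
    print(f"d={d:+d}: middle={middle:.2f}, "
          f"shifted={shifted:.2f}, steeper={steeper:.2f}")
# d=0 -> middle 0.50; d=+1 -> middle 0.73; the shifted curve needs d=+2
# to reach 0.50; the steeper curve already jumps to 0.95 at d=+1.
```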
Now we can state our logistic regression learning problem. We have some training data, which has some features h. We have this ML model that says the probability that a review is positive is given by the sigmoid of the score w transpose h, which is 1 / (1 + e^(-w transpose h)). We're going to learn a w hat that fits the data really well. So next, we'll discuss the algorithmic foundations of how we fit that w hat from data.
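As a summary sketch of the model we'll be fitting, here it is in vector form. The formula is the one from the lecture; the feature layout [constant, #awesomes, #awfuls] and the names are assumptions for illustration, and learning w hat is the subject of the next part.

```python
import numpy as np

def predict_prob_positive(w, h):
    """P(y = +1 | x, w) = sigmoid(w^T h(x)) = 1 / (1 + exp(-w^T h(x)))."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, h)))

# Features: [constant, #awesomes, #awfuls], coefficients from the running example.
w_hat = np.array([0.0, 1.0, -1.5])
h_x   = np.array([1.0, 4.0, 2.0])   # the review with 4 #awesomes and 2 #awfuls
print(predict_prob_positive(w_hat, h_x))  # prints ~0.731
```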