[MUSIC] Let's take a moment to explore the decision boundary that we played with quite a bit in this lecture. So this is the one where 1.0 times the number of #awesomes minus 1.5 times the number of #awfuls is equal to 0. In this case, all the points below the line have a score of xi greater than 0, so they're all labeled as positive. And all the ones on the other side have a score of xi less than 0, so we labeled them as negative. But let's think about the probabilities a little bit more. So if we take this point over here, it's close to the boundary, but it's on the positive side, so it should be outputting probabilities that are greater than 0.5, but not that much greater than 0.5. So let's see what the score looks like for this one. So I looked at the score of xi for this one, and we can measure it directly. It has four #awesomes, so those count as 4. And it has two #awfuls, so those contribute -1.5 times 2, which is -3. So the score is 4 - 1.5 times 2, which gives you a total of 1. And if you push that through the sigmoid, you realize the probability that y is equal to +1 given this particular xi is 0.73. And you can read that from the graph as well: if I start at a score of 1 and push over to the curve, I should get 0.73. Now, take another example that's further away from the boundary, like this one over here, where I feel more sure, more confident. The probability that y = +1 should be higher. In this case, the score of xi is 3: it has three #awesomes and no #awfuls, which implies that our prediction, p hat of y = +1 given xi, is 0.95, which is much larger than for the one that was closer to the boundary. Very good. Now let's take a final datapoint on the other side of the boundary. So if you take this datapoint over here and compute its score: it has one #awesome, so that counts as 1, but it has three #awfuls, so those count as -1.5 times 3.
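The two probabilities above can be checked with a quick sketch. The weights 1.0 and -1.5 come from the lecture's model; the helper names are just for illustration:

```python
import math

def sigmoid(score):
    """Logistic (sigmoid) link: maps a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def score(n_awesome, n_awful, w_awesome=1.0, w_awful=-1.5):
    """Score under the lecture's model: 1.0 * #awesomes - 1.5 * #awfuls."""
    return w_awesome * n_awesome + w_awful * n_awful

# Point near the boundary: 4 #awesomes, 2 #awfuls -> score = 4 - 1.5*2 = 1.0
p_near = sigmoid(score(4, 2))   # about 0.73

# Point farther from the boundary: 3 #awesomes, 0 #awfuls -> score = 3.0
p_far = sigmoid(score(3, 0))    # about 0.95

print(round(p_near, 2), round(p_far, 2))
```

The farther a point sits from the boundary, the larger the magnitude of its score, and the closer the sigmoid pushes its probability toward 0 or 1.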
So the total score is -3.5, so it's very negative. We should be very sure that it's a negative example, so the probability that y = +1 given this particular xi should be really low. We're at -3.5 here, so the probability should be really low, as you can see from the graph, and it turns out that the probability is 0.03 when you push it through the sigmoid. Extremely low probability. It makes sense, a review with three #awfuls is just awful. Now that we've explored the decision boundaries that come from a logistic regression model, and the notions of probability, let's explore a little bit more the effect that the coefficients we learn from data have, both on the boundary and also on the confidence, or the sureness, we have about a particular prediction, on the actual probabilities that we predict. So let's take a very simple model with just two features, the number of #awesomes and the number of #awfuls, and look at the coefficients of those features as well as the constant w0, and see what happens. If w0 has coefficient 0, then the probability of y = +1 given xi and w is given by the curve below. We see that if the number of #awesomes is exactly equal to the number of #awfuls, the scores of the two cancel out, and the score of xi is exactly equal to 0. That gives you a predicted probability of 0.5, just like we saw. Now if you have one more #awesome than you have #awfuls, then that difference becomes 1, and your prediction of the probability of being a positive review, just by having that extra #awesome, is 0.73. Now let's see what happens if you change the constant. If you change the constant w0 to -2 and keep everything else the same, you see that the curve has shifted to the right. So now the number of #awesomes has to be two more than the number of #awfuls for that prediction to be 0.5, that is, for the probability that y = +1 given xi and w to be 50-50.
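Both claims here, the very low probability at score -3.5 and the shift caused by the intercept, can be sketched in a few lines. The equal-magnitude weights (+1, -1) match the simple two-feature model described above; the variable names are mine:

```python
import math

def sigmoid(score):
    """Logistic (sigmoid) link: maps a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Negative-side example: 1 #awesome, 3 #awfuls under score = 1.0*a - 1.5*f
score_neg = 1.0 * 1 - 1.5 * 3           # -3.5
p_neg = sigmoid(score_neg)              # about 0.03

# Intercept shift: with w0 = -2 and weights (+1, -1), the probability
# reaches 0.5 only when #awesomes exceeds #awfuls by exactly 2.
w0 = -2.0
p_balanced = sigmoid(w0 + 1.0 * 5 - 1.0 * 5)  # 5 vs 5 -> well below 0.5
p_two_more = sigmoid(w0 + 1.0 * 5 - 1.0 * 3)  # 5 vs 3 -> exactly 0.5

print(round(p_neg, 2), round(p_balanced, 2), p_two_more)
```

With the negative intercept, a balanced review is no longer a coin flip; it takes two extra #awesomes just to get back to 50-50.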
So in this case, the #awfuls count a lot more negatively than the #awesomes, or at least the model has that extra negative constant. Now, if I keep w0 at 0 but increase the magnitude of the other coefficients, I get the curve on the right, which is similar to the curve in the middle: if the difference between the two counts is 0, I still predict 0.5. However, it grows much, much more steeply. In other words, if you have just one more #awesome than you have #awfuls, you're going to say that the probability of y = +1 given xi and w is almost 1. So the bigger you make the coefficients in magnitude, the more sure you get more quickly, and when you change the constant, you're shifting that curve to the left or to the right. Now we can see our logistic regression learning problem. We have some training data, which has some features. We have this ML model that says the probability that a review is positive is given by the sigmoid of the score, w transpose h, which is 1 / (1 + e to the -w transpose h). We're going to learn a w hat that fits the data really well. So next, we'll discuss the algorithmic foundations of how we fit the w hats from data. [MUSIC]
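The steepness effect described above can be sketched by scaling the coefficients while keeping the same feature difference. The magnitude 6.0 is just an illustrative choice, not a value from the lecture:

```python
import math

def sigmoid(score):
    """Logistic (sigmoid) link: maps a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

diff = 1  # one more #awesome than #awfuls

# Unit-magnitude weights (+1, -1): moderately confident prediction.
p_small = sigmoid(1.0 * diff)   # about 0.73

# Larger-magnitude weights, e.g. (+6, -6): nearly certain prediction
# from the exact same one-word difference.
p_large = sigmoid(6.0 * diff)

print(round(p_small, 2), p_large > 0.99)
```

Same data, same boundary (the difference at which the score crosses 0 is unchanged), but the larger coefficients make the sigmoid far steeper, so confidence saturates much more quickly.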