[MUSIC] Let's take a moment to explore the decision boundary that we played with quite a bit in this lecture. So this is the one where 1.0 times the number of #awesomes minus 1.5 times the number of #awfuls is equal to 0. In this case, all the points below the line have a score of xi greater than 0, so they're all labeled as positive. And all the ones on the other side have a score of xi less than 0, so we labeled them as negative. But let's think about the probabilities a little bit more. So if we take this point over here, it's close to the boundary, but it's on the positive side, so it should be outputting probabilities that are greater than 0.5, but not that much greater than 0.5. So let's see what the score looks like for this one. So I looked at the score of xi for this one, and we can measure it directly. It has four #awesomes, so those count as 4. And it has two #awfuls, so those contribute -1.5 times 2, which is -3. So the score is 4 - 1.5 times 2, which gives you a total of 1. And if you push that through the sigmoid, you realize the probability that y is equal to +1 given this particular xi is 0.73. And you can read that from the graph as well: if I start at a score of 1 and push over to the curve, I should get 0.73. Now, take another example that's further away from the boundary, like this one over here, where I feel more sure, more confident. The probability that y = +1 should be higher. In this case, the score of xi is 3: it has three #awesomes and no #awfuls, which implies that our prediction, p hat of y = +1 given xi, is 0.95, which is much larger than for the one that was closer to the boundary. Very good. Now let's take a final datapoint on the other side of the boundary. So if you take this datapoint over here and compute its score: it has one #awesome, so that counts as 1, but it has three #awfuls, so those count as -1.5 times 3.
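The two probabilities above can be checked with a quick sketch. The weights 1.0 and -1.5 come from the lecture's model; the helper names are just for illustration:

```python
import math

def sigmoid(score):
    """Logistic (sigmoid) link: maps a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def score(n_awesome, n_awful, w_awesome=1.0, w_awful=-1.5):
    """Score under the lecture's model: 1.0 * #awesomes - 1.5 * #awfuls."""
    return w_awesome * n_awesome + w_awful * n_awful

# Point near the boundary: 4 #awesomes, 2 #awfuls -> score = 4 - 1.5*2 = 1.0
p_near = sigmoid(score(4, 2))   # about 0.73

# Point farther from the boundary: 3 #awesomes, 0 #awfuls -> score = 3.0
p_far = sigmoid(score(3, 0))    # about 0.95

print(round(p_near, 2), round(p_far, 2))
```

The farther a point sits from the boundary, the larger the magnitude of its score, and the closer the sigmoid pushes its probability toward 0 or 1.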
So the total score is -3.5, so it's very negative. We should be very sure that it's a negative example, so the probability that y = +1 given this particular xi should be really low. We're at -3.5 here, so the probability should be really low, as you can see from the graph, and it turns out that the probability is 0.03 when you push it through the sigmoid. Extremely low probability. It makes sense, a review with three #awfuls is just awful. Now that we've explored the decision boundaries that come from a logistic regression model, and the notions of probability, let's explore a little bit more the effect that the coefficients we learn from data have, both on the boundary and also on the confidence, or the sureness, we have about a particular prediction, on the actual probabilities that we predict. So let's take a very simple model with just two features, the number of #awesomes and the number of #awfuls, and look at the coefficients of those features as well as the constant w0, and see what happens. If w0 has coefficient 0, then the probability of y = +1 given xi and w is given by the curve below. We see that if the number of #awesomes is exactly equal to the number of #awfuls, the scores of the two cancel out, and the score of xi is exactly equal to 0. That gives you a predicted probability of 0.5, just like we saw. Now if you have one more #awesome than you have #awfuls, then that difference becomes 1, and your prediction of the probability of being a positive review, just by having that extra #awesome, is 0.73. Now let's see what happens if you change the constant. If you change the constant w0 to -2 and keep everything else the same, you see that the curve has shifted to the right. So now the number of #awesomes has to be two more than the number of #awfuls for that prediction to be 0.5, that is, for the probability that y = +1 given xi and w to be 50-50.
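Both claims here, the very low probability at score -3.5 and the shift caused by the intercept, can be sketched in a few lines. The equal-magnitude weights (+1, -1) match the simple two-feature model described above; the variable names are mine:

```python
import math

def sigmoid(score):
    """Logistic (sigmoid) link: maps a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Negative-side example: 1 #awesome, 3 #awfuls under score = 1.0*a - 1.5*f
score_neg = 1.0 * 1 - 1.5 * 3           # -3.5
p_neg = sigmoid(score_neg)              # about 0.03

# Intercept shift: with w0 = -2 and weights (+1, -1), the probability
# reaches 0.5 only when #awesomes exceeds #awfuls by exactly 2.
w0 = -2.0
p_balanced = sigmoid(w0 + 1.0 * 5 - 1.0 * 5)  # 5 vs 5 -> well below 0.5
p_two_more = sigmoid(w0 + 1.0 * 5 - 1.0 * 3)  # 5 vs 3 -> exactly 0.5

print(round(p_neg, 2), round(p_balanced, 2), p_two_more)
```

With the negative intercept, a balanced review is no longer a coin flip; it takes two extra #awesomes just to get back to 50-50.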
So in this case, the #awfuls count a lot more negatively than the #awesomes, or at least the model has that extra negative constant. Now, if I keep w0 at 0 but increase the magnitude of the other coefficients, I get the curve on the right, which is similar to the curve in the middle: if the difference between the two counts is 0, I still predict 0.5. However, it grows much, much more steeply. In other words, if you have just one more #awesome than you have #awfuls, you're going to say that the probability of y = +1 given xi and w is almost 1. So the bigger you make the coefficients in magnitude, the more sure you get more quickly, and when you change the constant, you're shifting that curve to the left or to the right. Now we can see our logistic regression learning problem. We have some training data, which has some features. We have this ML model that says the probability that a review is positive is given by the sigmoid of the score, w transpose h, which is 1 / (1 + e to the -w transpose h). We're going to learn a w hat that fits the data really well. So next, we'll discuss the algorithmic foundations of how we fit the w hats from data. [MUSIC]
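The steepness effect described above can be sketched by scaling the coefficients while keeping the same feature difference. The magnitude 6.0 is just an illustrative choice, not a value from the lecture:

```python
import math

def sigmoid(score):
    """Logistic (sigmoid) link: maps a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

diff = 1  # one more #awesome than #awfuls

# Unit-magnitude weights (+1, -1): moderately confident prediction.
p_small = sigmoid(1.0 * diff)   # about 0.73

# Larger-magnitude weights, e.g. (+6, -6): nearly certain prediction
# from the exact same one-word difference.
p_large = sigmoid(6.0 * diff)

print(round(p_small, 2), p_large > 0.99)
```

Same data, same boundary (the difference at which the score crosses 0 is unchanged), but the larger coefficients make the sigmoid far steeper, so confidence saturates much more quickly.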