In the last video, we talked about the hypothesis representation for logistic regression. What I'd like to do now is tell you about something called the decision boundary, and this will give us a better sense of what the logistic regression hypothesis function is computing. To recap, this is what we wrote out last time, where we said that the hypothesis is represented as H of X equals G of theta transpose X, where G is this function called the sigmoid function, which looks like this: it slowly increases from zero to one, asymptoting at one. What I want to do now is try to understand better when this hypothesis will make predictions that Y is equal to one versus when it might make predictions that Y is equal to zero, and understand better what the hypothesis function looks like, particularly when we have more than one feature.

Concretely, this hypothesis is outputting estimates of the probability that Y is equal to one given X, parameterized by theta. So if we wanted to predict whether Y is equal to one or Y is equal to zero, here's something we might do. Whenever the hypothesis estimates that the probability of Y being one is greater than or equal to 0.5, so that it is more likely to be Y equals one than Y equals zero, let's predict Y equals one. And otherwise, if the estimated probability of Y being one is less than 0.5, then let's predict Y equals zero. I chose a greater-than-or-equal-to here and a less-than here. If H of X is equal to 0.5 exactly, we could predict either positive or negative, but by putting the greater-than-or-equal sign here we default to predicting positive when H of X is exactly 0.5. That's a detail that really doesn't matter that much.

What I want to do is understand better when exactly H of X will be greater than or equal to 0.5, so that we end up predicting Y is equal to one. If we look at this plot of the sigmoid function, we'll notice that the sigmoid function, G of Z, is greater than or equal to 0.5 whenever Z is greater than or equal to zero. So it is in this right half of the figure that G takes on values that are 0.5 and higher; this here is the 0.5 line. So when Z is positive, G of Z, the sigmoid function, is greater than or equal to 0.5. Since the hypothesis for logistic regression is H of X equals G of theta transpose X, the hypothesis is therefore going to be greater than or equal to 0.5 whenever theta transpose X is greater than or equal to zero, because here theta transpose X takes the role of Z. So what we've shown is that our hypothesis is going to predict Y equals one whenever theta transpose X is greater than or equal to zero. Let's now consider the other case, when the hypothesis will predict Y is equal to zero. By a similar argument, H of X is going to be less than 0.5 whenever G of Z is less than 0.5, because the range of values of Z that causes G of Z to take on values less than 0.5 is when Z is negative. So when G of Z is less than 0.5, our hypothesis will predict that Y is equal to zero, and by a similar argument to what we had earlier, since H of X equals G of theta transpose X, we'll predict Y equals zero whenever this quantity theta transpose X is less than zero.
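As a minimal sketch of this threshold rule (the lecture itself contains no code, so this is just an illustration in Python/NumPy, assuming the feature vector x already includes the intercept entry x0 = 1):

import numpy as np

def sigmoid(z):
    # the sigmoid function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    # g(theta^T x) >= 0.5 exactly when theta^T x >= 0, so either test gives the same answer
    z = theta @ x
    return 1 if sigmoid(z) >= 0.5 else 0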
To summarize what we just worked out, we saw that if we decide to predict whether Y is equal to one or Y is equal to zero depending on whether the estimated probability is greater than or equal to 0.5, or less than 0.5, then that's the same as saying that we will predict Y equals one whenever theta transpose X is greater than or equal to zero, and we will predict Y equals zero whenever theta transpose X is less than zero. Let's use this to better understand how the hypothesis of logistic regression makes those predictions.

Now, let's suppose we have a training set like that shown on the slide, and suppose our hypothesis is H of X equals G of theta zero plus theta one X1 plus theta two X2. We haven't talked yet about how to fit the parameters of this model; we'll talk about that in the next video. But suppose that, via a procedure to be specified, we end up choosing the following values for the parameters: let's say we choose theta zero equals minus three, theta one equals one, theta two equals one. So this means that my parameter vector is going to be theta equals minus three, one, one. Given this choice of hypothesis parameters, let's try to figure out where the hypothesis will end up predicting Y equals one and where it will end up predicting Y equals zero. Using the formulas that we worked out on the previous slide, we know that Y equals one is more likely, that is, the probability that Y equals one is greater than or equal to 0.5, whenever theta transpose X is greater than or equal to zero. And this formula that I just underlined, minus three plus X1 plus X2, is of course theta transpose X when theta is equal to the value of the parameters that we just chose. So, for any example with features X1 and X2 that satisfy this equation, that minus three plus X1 plus X2 is greater than or equal to zero, our hypothesis will think that Y equals one is more likely, or will predict that Y is equal to one. We can also take the minus three and bring it to the right, and rewrite this as X1 plus X2 is greater than or equal to three. And so, equivalently, we found that this hypothesis will predict Y equals one whenever X1 plus X2 is greater than or equal to three.

Let's see what that means on the figure. If I write down the equation X1 plus X2 equals three, this defines the equation of a straight line, and if I draw what that straight line looks like, it gives me the following line, which passes through three on the X1 axis and three on the X2 axis. So the part of the input space, the part of the X1, X2 plane, that corresponds to X1 plus X2 being greater than or equal to three is the upper half-plane, that is, everything to the upper right of this magenta line that I just drew. And so, the region where our hypothesis will predict Y equals one is this region, you know, really this huge half-plane over to the upper right. Let me just write that down: I'm going to call this the Y equals one region. In contrast, the region where X1 plus X2 is less than three is where we will predict that Y is equal to zero, and that corresponds to this region. It's really a half-plane as well, but that region on the left is the region where our hypothesis predicts Y equals zero. I want to give this magenta line that I drew a name: this line is called the decision boundary. And concretely, this straight line, X1 plus X2 equals three, corresponds to the set of points where H of X is equal to 0.5 exactly.
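To make this concrete, here is a small sketch in Python/NumPy, again not from the lecture, assuming the features are laid out as one, X1, X2 with the intercept first and using the parameter values from this example:

import numpy as np

theta = np.array([-3.0, 1.0, 1.0])    # theta0 = -3, theta1 = 1, theta2 = 1

def predicts_one(x1, x2):
    x = np.array([1.0, x1, x2])       # x0 = 1 is the intercept term
    return theta @ x >= 0             # same test as sigmoid(theta @ x) >= 0.5

print(predicts_one(1.0, 1.0))   # False: 1 + 1 < 3, below the decision boundary
print(predicts_one(2.0, 4.0))   # True:  2 + 4 >= 3, above the decision boundary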
The decision boundary, that is, this straight line, is the line that separates the region where the hypothesis predicts Y equals one from the region where the hypothesis predicts Y equals zero. And just to be clear, the decision boundary is a property of the hypothesis, including the parameters theta zero, theta one, theta two. In the figure I drew a training set, I drew a data set, in order to help the visualization. But even if we take away the data set, the decision boundary and the regions where we predict Y equals one versus Y equals zero are a property of the hypothesis and of the parameters of the hypothesis, and not a property of the data set. Later on, of course, we'll talk about how to fit the parameters, and there we'll end up using the training set, using our data, to determine the values of the parameters. But once we have particular values for the parameters theta zero, theta one, theta two, that completely defines the decision boundary, and we don't actually need to plot a training set in order to plot the decision boundary.

Let's now look at a more complex example where, as usual, I have crosses to denote my positive examples and O's to denote my negative examples. Given a training set like this, how can I get logistic regression to fit this sort of data? Earlier, when we were talking about polynomial regression or about linear regression, we talked about how we can add extra higher-order polynomial terms to the features, and we can do the same for logistic regression. Concretely, let's say my hypothesis looks like this, where I've added two extra features, X1 squared and X2 squared, to my features, so that I now have five parameters, theta zero through theta four. As before, we'll defer to the next video our discussion of how to automatically choose values for the parameters theta zero through theta four. But let's say that, via a procedure to be specified, I end up choosing theta zero equals minus one, theta one equals zero, theta two equals zero, theta three equals one, and theta four equals one. What this means is that with this particular choice of parameters, my parameter vector theta looks like minus one, zero, zero, one, one.

Following our earlier discussion, this means that my hypothesis will predict that Y is equal to one whenever minus one plus X1 squared plus X2 squared is greater than or equal to zero; this is whenever theta transpose X, my parameters transposed times my features, is greater than or equal to zero. And if I take the minus one and just bring it to the right, I'm saying that my hypothesis will predict that Y is equal to one whenever X1 squared plus X2 squared is greater than or equal to one. So, what does the decision boundary look like? Well, if you were to plot the curve for X1 squared plus X2 squared equals one, some of you will recognize that as the equation for a circle of radius one centered at the origin. So, that is my decision boundary, and everything outside the circle I'm going to predict as Y equals one. So out here is, you know, my Y equals one region; I'm going to predict Y equals one out here. And inside the circle is where I'll predict Y is equal to zero. So, by adding these more complex polynomial terms to my features, I can get more complex decision boundaries that don't just try to separate the positive and negative examples with a straight line; I can get, in this example, a decision boundary that's a circle. Once again, the decision boundary is a property not of the training set, but of the hypothesis and of the parameters.
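The same kind of sketch works for this circular boundary; again this is just an illustration in Python/NumPy, not part of the lecture, with the features laid out as one, X1, X2, X1 squared, X2 squared:

import numpy as np

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

def predicts_one(x1, x2):
    features = np.array([1.0, x1, x2, x1 ** 2, x2 ** 2])
    return theta @ features >= 0      # i.e. x1^2 + x2^2 >= 1

print(predicts_one(0.5, 0.5))   # False: inside the unit circle
print(predicts_one(1.0, 1.0))   # True:  outside the unit circle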
So long as we're given my parameter vector theta, that defines the decision boundary, which is the circle. The training set is not what we use to define the decision boundary. The training set may be used to fit the parameters theta, and we'll talk about how to do that later, but once you have the parameters theta, that is what defines the decision boundary. Let me put back the training set just for visualization.

And finally, let's look at a more complex example. Can we come up with even more complex decision boundaries than this? If I have even higher-order polynomial terms, so things like X1 squared, X1 squared X2, X1 squared X2 squared, and so on, then it's possible to show that you can get even more complex decision boundaries, and logistic regression can be used to find decision boundaries that may, for example, be an ellipse like that, or maybe, with a different setting of the parameters, a different decision boundary that may even look like, you know, some funny shape like that. Or, for even more complex examples, you can also get decision boundaries that look like more complex shapes like that, where everything in here you predict Y equals one, and everything outside you predict Y equals zero. So with these higher-order polynomial features, you can get very complex decision boundaries.

So with these visualizations, I hope that gives you a sense of the range of hypothesis functions we can represent using the representation that we have for logistic regression. Now that we know what H of X can represent, what I'd like to do next, in the following video, is talk about how to automatically choose the parameters theta, so that, given a training set, we can automatically fit the parameters to our data.
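As a final illustration of the higher-order polynomial idea, here is one possible way to build such feature vectors in Python/NumPy; the helper map_features is my own, not something defined in the lecture, and the exact set of terms is just an assumption:

import numpy as np

def map_features(x1, x2, degree=6):
    # all monomials x1^i * x2^j with i + j <= degree, constant term included
    feats = []
    for i in range(degree + 1):
        for j in range(degree + 1 - i):
            feats.append((x1 ** i) * (x2 ** j))
    return np.array(feats)

# The decision rule is unchanged: with a fitted theta of matching length,
# predict Y = 1 whenever theta @ map_features(x1, x2) >= 0.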