In this video we'll talk about
how to fit the parameters theta
for logistic regression.
In particular, I'd like to
define the optimization objective or the
cost function that we'll use to fit the parameters.
Here's to supervised learning problem
of fitting a logistic regression model.
We have a training set
of M training examples.
And as usual each of
our examples is represented by
feature vector that's N plus 1 dimensional.
And as usual we have
X 0 equals 1.
Our first feature, or our 0
feature is always equal to 1,
and because this is a
classification problem, our training
set has the property that
every label Y, is either 0 or 1.
This is a hypothesis
and the parameters of the
hypothesis is this theta over here.
And the question I want
to talk about is given this
training set how do
we choose, or how do we fit the parameters theta?
Back when we were developing the
linear regression model, we use the following cost function.
I've written this slightly differently, where
instead of 1/2m, I've
taken the 1/2 and put it inside the summation instead.
Now, I want to use
an alternative way of writing
out this cost function which is
that instead of writing out
this squared and return here,
let's write here, cost of
H of X comma
Y, and I'm going
to define that term cost
of H of X comma Y to be equal to this.
It's just equal to just one half of the square root error.
So now, we can see more
clearly that the cost
function is a sum
over my training set, or
is 1/m times the sum
over my training set of this cost term here.
And to simplify this
equation a little bit more, it's gonna
be convenient to get rid of those superscripts.
So just define cost of
H of X comma Y to
be equal to 1/2 of
this square root error  and the
interpretation of this cost function
is that this is the
cost I want my learning
algorithm to, you know,
have to pay, if it
outputs that value it
this prediction is H of
X, and the actual
label was Y. So just
cross off those superscripts. All right.
And no surprise for linear
regression the cost for you to define is that.
Well the cost for this
is, that is 1/2
times the square difference
between what are predicted and the
actual value that we observe
for Y. Now, this cost
function worked fine for linear
regression, but here we're interested in logistic regression.
If we could minimize this cost
function that is plugged into J here.
That will work okay.
But it turns out that if
we use this particular cost function
this would be a non-convex function of the parameters theta.
Here's what I mean by non-convex.
We have some cost function J
of theta, and for logistic
regression this function H here
has a non linearity, right?
It says, you know, 1 over
1 plus E to the negative theta transfers
X. So it's a pretty complicated nonlinear function.
And if you take the sigmoid
function and plug it in
here and then take
this cost function and plug
it in there, and then plot
what J of theta looks
like, you find that
J of theta can look like a function just like this.
You know with many local optima and
the formal term for this
is that this a non convex function.
And you can kind of tell.
If you were to run gradient
descent on this sort of
function, it is not guaranteed
to converge to the global minimum.
Whereas in contrast, what
we would like is to have
a cost function J of theta
that is convex, that is
a single bow-shaped function that
looks like this, so that
if you run gradient descent, we
would be guaranteed that gradient descent, you know,
would converge to the global minimum.
And the problem of using the
the square cost function is that
because of this very
non linear sigmoid function that
appears in the middle here, J of
theta ends up being
a non convex function if you
were to define it as the square cost function.
So what we'd would like to do
is to instead come up with
a different cost function that
is convex and so
that we can apply a great
algorithm like gradient descent
and be guaranteed to find a global minimum.
Here's a cost function that we're going to use for logistic regression.
We're going to say the cost
or the penalty that the algorithm
pays if it outputs
a value H of X.
So, this is some number like 0.7
where it predicts a value H
of X. And the actual
cost label turns out to
be Y. The cost is
going to be minus log
H of X if Y is equal 1.
And minus log, 1 minus
H of X if Y is equal to 0.
This looks like a pretty complicated function.
But let's plot function to
gain some intuition about what it's doing.
Let's start up with the case of Y equals 1.
If Y is equal equal
to 1, then the cost function
is -log H of X, and
if we plot that, so let's
say that the horizontal
axis is H of X.
So we know that a hypothesis
is going to output a value between
0 and 1.
Right?
So H of X that varies
between 0 and 1.
If you plot what this cost function looks like.
You find that it looks like this.
One way to see why the
plot like this it is because
if you were to plot log Z
with Z on the horizontal axis.
Then that looks like that.
And it's approach is minus infinity.
So this is what the log function looks like.
And so this is 0, this is 1.
Here Z is of
course playing the role  of
H of X.  And so
minus log Z will look like this.
Right just flipping the sign.
Minus log Z. And we're
interested only in the
range of when this function
goes between 0 and 1.
So, get rid of that.
And so, we're just left with,
you know, this part of the curve.
And that's what this curve on the left looks like.
Now this cost function
has a few interesting and desirable properties.
First you notice that if
Y is equal to 1 and H of X is equal 1, in
other words, if the hypothesis
exactly, you know, predicts
H equals 1, and Y
is exactly equal to what I predicted.
Then the cost is equal 0.
Right?
That corresponds to, the curve doesn't actually flatten out.
The curve is still going. First, notice
that if H of X
equals 1, if the hypothesis
predicts that Y is equal to 1.
And if indeed Y is
equal to 1 then the cost is equal to 0.
That corresponds to this point down here.
Right?
If H of X is equal
to 1, and we're only
concerned the case that Y equals 1 here.
But if H of X is equal to 1.
Then the cost is down here is equal to 0.
And that is what we like it to be.
Because, you know, if we
correctly predict the output Y then the cost is 0.
But now, notice also
that H of X approaches 0.
So, that's H. As the
output of the hypothesis approaches 0
the cost blows up, and it goes to infinity.
And what this does is
it captures the intuition that if
a hypothesis, you know, outputs 0.
That's like saying, our hypothesis is
saying, the chance of Y
equals 1 is equal to 0.
It's kind of like our going
to our medical patient and saying,
"The probability that you
have a malignant tumor, the
probability that Y equals 1 is zero."
So, it's like absolutely impossible that your
tumor is malignant.
But if it turns out that
the tumor, the patient's tumor, actually is malignant.
So if Y is equal to
1 even after we told them
you know, the probability of it happening is 0.
It's absolutely impossible for it to be malignant.
But if we told them
this with that level of certainty,
and we turn out to be wrong,
then we penalize the learning algorithm
by a very, very large cost,
and that's captured by having this
cost goes infinity if Y
equals 1 and H
of X approaches 0.
This might consider of
Y1, let's look at what
the cost function looks like for Y0.
If Y is equal to 0, then the cost
looks like this expression over here.
And if you plot
the function minus log 1
minus Z what you
get is the cost function actually looks like this.
So, it goes from 0 to 1.
Something like that.
And so if you plot
the cost function for the case
of y equals zero, you find that it looks
like this and what
this curve does is it
now blows up,
and it goes to plus infinity as H of X goes to 1.
Because it's saying that
if Y turns out to be
equal to 0, but we
predicted that you know, Y is
equal to 1 with almost
certainty with probability 1, then
we end up paying a very large cost.
Let's plot the cost function for
the case of Y equals 0.
So if Y equals 0 that's going to be our cost function.
If you look at this expression,
and if you plot, you know, minus
log 1 minus Z, if
you figure out what that looks like,
you get a figure that looks like this.
Where, which goes from 0
to 1 with the Z
axis on the horizontal axis.
So If you take this cost
function and plot it for
the case of Y equals 0,
what you get is
that the cost function looks like this.
And what this cost function
does is that it blows
up or it goes to a
positive infinity as each
H of X goes to one
and this captures the
intuition that if a hypothesis
predicted that, you know, H of
X is equal to 1 with
certainty, with like probability 1,
it's absolutely got to be Y equals 1.
But if Y turned out to
be equal to 0 then
it makes sense to make the
hypothesis, or make the learning algorithm pay a very large cost.
And conversely, if H
of X is equal to
0 and Y equals zero,
then the hypothesis nailed it.
The predicted Y is equal
to zero and it turns
out Y is equal to zero
so at this point the cost
function is going to be 0.
In this video, we
have defined the cost function
for a single training example.
The topic of convexity analysis is beyond the scope of this course.
But it is possible to show
that with our particular choice
of cost function this would
give us a convex optimization problem
as cost function, overall cost function
J of theta will be
convex and local optima free.
In the next video we're going
to take these ideas of the
cost function for a single
training example and develop that
further and define the
cost function for the entire
training set, and we'll also
figure out a simpler way to
write it than we have been using so far.
And based on that will
work out gradient descent, and
that will give us our logistic regression algorithm.