In this video, we'll work out a slightly simpler way to write the cost function than we have been using so far, and we'll also see how to apply gradient descent to fit the parameters of logistic regression. So by the end of this video you'll know how to implement a fully working version of logistic regression.

Here's our cost function for logistic regression. Our overall cost function is 1 over m times the sum over the training set of the cost of making different predictions on the different examples with labels y(i), and this is the cost for a single example that we worked out earlier. I just want to remind you that for classification problems, in our training sets, and in fact even for examples not in our training set, y is always equal to 0 or 1. That's part of the mathematical definition of y. Because y is either 0 or 1, we'll be able to come up with a simpler way to write this cost function. In particular, rather than writing out this cost function on two separate lines, with two separate cases for y = 1 and y = 0, I'm going to show you a way to take these two lines and compress them into one equation. This will make it more convenient to write out the cost function and to derive gradient descent.

Concretely, we can write out the cost function as follows: cost(h(x), y) = -y log(h(x)) - (1 - y) log(1 - h(x)). I'll show you in a second that this expression is an equivalent, more compact way of writing out the definition of the cost function that we had up here. Let's see why that's the case. We know that there are only two possible cases: y must be 0 or 1.

So let's suppose y = 1. Then 1 - y is 1 - 1, which is zero, so the second term gets multiplied by zero and goes away, and we're left with only the first term, -y log(h(x)). Since y is 1, that's equal to -log(h(x)), and this is exactly what we have up here for the case y = 1.

The other case is y = 0. If that's the case, then this way of writing the cost function says the first term, -y log(h(x)), is equal to zero, whereas 1 - y becomes 1 - 0, which is just 1. So the cost function simplifies to just the last term, -log(1 - h(x)), because the first term gets multiplied by zero and disappears. And you can verify that this term is exactly what we had for when y = 0. So this shows that this definition of the cost is just a more compact way of taking both of these expressions, the cases y = 1 and y = 0, and writing them in a more convenient form with just one line.

We can therefore write our overall cost function for logistic regression as follows: it is 1 over m times the sum of these cost terms, and plugging in the definition of the cost that we worked out earlier, we end up with this, where we've just brought the minus sign outside. And why do we choose this particular cost function, when it looks like there could be other cost functions we could have chosen?
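As a concrete illustration, here is a minimal sketch of this compact cost in Python, assuming the hypothesis h(x) is the sigmoid of theta transpose x; the function names and the NumPy-based setup are my own illustrative choices, not something prescribed in this lecture.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Compact logistic regression cost:
    J(theta) = -(1/m) * sum( y*log(h) + (1-y)*log(1-h) ),
    where X is an (m, n+1) design matrix and y a vector of 0/1 labels."""
    m = len(y)
    h = sigmoid(X @ theta)  # hypothesis h(x) evaluated on all m examples
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```

In practice, implementations often clip h slightly away from 0 and 1 so that the logarithms stay finite.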
Although I won't have time to go into great detail on this in this course, this cost function can be derived from statistics using the principle of maximum likelihood estimation, which is an idea in statistics for how to efficiently find parameters for different models. It also has the nice property that it is convex. So this is the cost function that essentially everyone uses when fitting logistic regression models. If you don't understand the terms I just used and you don't know what the principle of maximum likelihood estimation is, don't worry about it; there's just a deeper rationale and justification behind this particular cost function than I have time to go into in this class.

Given this cost function, in order to fit the parameters, what we're going to do is try to find the parameters theta that minimize J(theta). Minimizing this will give us some set of parameters theta. Finally, if we're given a new example with some set of features x, we can take the theta that we fit to our training set and output our prediction as this. And just to remind you, I'm going to interpret the output of my hypothesis as the probability that y is equal to 1, given the input x and parameterized by theta. So think of the hypothesis as estimating the probability that y equals 1.

All that remains to be done is to figure out how to actually minimize J(theta) as a function of theta, so we can fit the parameters to our training set. The way we're going to minimize the cost function is using gradient descent. Here's our cost function, and we want to minimize it as a function of theta. Here's our usual template for gradient descent, where we repeatedly update each parameter as itself minus a learning rate alpha times this derivative term. If you know some calculus, feel free to take this term and try to compute the derivative yourself, and see if you can simplify it to the same answer that I get. But even if you don't know calculus, don't worry about it. If you actually compute this, what you get is this equation; let me just write it out here: the sum from i = 1 through m of, essentially, the error, h(x(i)) - y(i), times the j-th feature x(i)_j.

So if you take this partial derivative term and plug it back in here, we can write out our gradient descent algorithm as follows; all I've done is take the derivative term from the previous line and plug it in there. So if you have n features, you would have a parameter vector theta, with parameters theta 0, theta 1, theta 2, down to theta n, and you would use this update to simultaneously update all of your values of theta.

Now if you take this update rule and compare it to what we were doing for linear regression, you might be surprised to realize that this equation is exactly what we had for linear regression. In fact, if you look at the earlier videos at the gradient descent rule for linear regression, it looked exactly like what I drew here inside the blue box. So are linear regression and logistic regression different algorithms or not? Well, this is resolved by observing that for logistic regression, the definition of the hypothesis has changed. Whereas for linear regression we had h(x) = theta transpose x, now the definition of h(x) is instead 1 over 1 plus e to the negative theta transpose x.
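To make the update rule concrete, here is a sketch in Python that follows this description literally, with an explicit loop over the parameters and a temporary buffer so that all values of theta are updated simultaneously. The learning rate, the iteration count, and the assumption that X carries a leading column of ones are illustrative choices of mine, not specified in the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # logistic function, as defined earlier

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent with an explicit loop over the parameters,
    mirroring the update rule
      theta_j := theta_j - alpha * (1/m) * sum_i (h(x(i)) - y(i)) * x(i)_j."""
    m, n = X.shape                # X is (m, n+1), including a leading column of ones
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)    # predictions for all m examples
        temp = theta.copy()       # buffer so every theta_j updates simultaneously
        for j in range(n):
            temp[j] = theta[j] - alpha * np.sum((h - y) * X[:, j]) / m
        theta = temp
    return theta
```

Note that the only difference from the corresponding code for linear regression would be the call to sigmoid inside the loop; with h = X @ theta instead, this becomes the linear regression update.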
So even though the update rule looks cosmetically identical, because the definition of the hypothesis has changed, this is actually not the same thing as gradient descent for linear regression. In an earlier video, when we were talking about gradient descent for linear regression, we talked about how to monitor gradient descent to make sure that it is converging. I usually apply that same method to logistic regression as well, to make sure it's converging correctly, and hopefully you can figure out how to apply that technique to logistic regression yourself.

When implementing logistic regression with gradient descent, we have all of these different parameter values, theta 0 down to theta n, that we need to update using this expression. One thing we could do is have a for loop, for j = 0 to n (or j = 1 to n + 1, depending on how you index your vectors), and update each of these parameter values in turn. But of course, rather than using a for loop, ideally we would use a vectorized implementation, so that we can update all of these n + 1 parameters in one fell swoop. To check your own understanding, you might see if you can figure out how to do the vectorized implementation of this algorithm yourself; one possible version is sketched below. So now you know how to implement gradient descent for logistic regression.

There was one last idea that we talked about earlier for linear regression, which was feature scaling. We saw how feature scaling can help gradient descent converge faster for linear regression. The idea of feature scaling also applies to gradient descent for logistic regression: if you have features that are on very different scales, then applying feature scaling can also make gradient descent run faster for logistic regression.

So, that's it. You now know how to implement logistic regression, which is a very powerful, and probably even the most widely used, classification algorithm in the world, and you now know how to get it to work for yourself.
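As a rough sketch of that vectorized implementation, here is one way it might look in Python. The single matrix-vector update replaces the explicit loop over j; the commented feature-scaling line is an illustrative mean-normalization step, and all names and hyperparameters are my own assumptions rather than anything prescribed in the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # logistic function, as before

def gradient_descent_vectorized(X, y, alpha=0.1, num_iters=1000):
    """Vectorized batch gradient descent: all n+1 parameters are updated
    in one operation, theta := theta - (alpha/m) * X^T (h - y)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        error = sigmoid(X @ theta) - y            # vector of errors h(x(i)) - y(i)
        theta = theta - (alpha / m) * (X.T @ error)
    return theta

# Illustrative feature scaling (mean normalization) before running descent,
# leaving the leading column of ones in X untouched:
# X[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)
```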