We've previously defined the cost function J. In this video I want to tell you about an algorithm called gradient descent for minimizing the cost function J. It turns out gradient descent is a more general algorithm, and is used not only in linear regression; it's actually used all over the place in machine learning. And later in the class we'll use gradient descent to minimize other functions as well, not just the cost function J for linear regression. So in this video, I'm going to talk about gradient descent for minimizing some arbitrary function J, and then in later videos we'll take this algorithm and apply it specifically to the cost function J that we defined for linear regression.

So here's the problem setup. Suppose we have some function J(theta0, theta1). Maybe it's a cost function from linear regression, maybe it's some other function we want to minimize. And we want to come up with an algorithm for minimizing J as a function of theta0 and theta1. Just as an aside, it turns out that gradient descent actually applies to more general functions. So imagine you have a function J(theta0, theta1, theta2, ..., theta n), and you want to minimize it over all of theta0 up to theta n. It turns out gradient descent is an algorithm for solving this more general problem, but for the sake of brevity, for the sake of succinctness of notation, I'm just going to use two parameters throughout the rest of this video.

Here's the idea for gradient descent. What we're going to do is start off with some initial guesses for theta0 and theta1. It doesn't really matter what they are, but a common choice is to set theta0 to 0 and theta1 to 0; just initialize them to 0. What we're going to do in gradient descent is keep changing theta0 and theta1 a little bit to try to reduce J(theta0, theta1), until hopefully we wind up at a minimum, or maybe a local minimum.

So, let's look at some pictures of what gradient descent does. Let's say I'm trying to minimize this function. Notice the axes: theta0 and theta1 are on the horizontal axes, and J is on the vertical axis, so the height of the surface shows J, and we want to minimize this function. We're going to start off with theta0 and theta1 at some point. So imagine picking some value for (theta0, theta1); that corresponds to starting at some point on the surface of this function. Here I initialized them to 0, but sometimes you initialize to other values as well.

Now, I want you to imagine that this figure shows a hill. Imagine this is the landscape of some grassy park, with two hills like so, and I want you to imagine that you are physically standing at that point on the hill, on this little red hill in your park. In gradient descent, what we're going to do is spin 360 degrees around, look all around us, and ask, "If I were to take a little baby step in some direction, and I want to go downhill as quickly as possible, what direction do I take that little baby step in?" If you want to physically walk down this hill as rapidly as possible, it turns out that standing at that point on the hill and looking all around, you'll find that the best direction to take a little step downhill is roughly that direction. Okay. And now you're at this new point on your hill.
You're going to, again, look all around and ask, "What direction should I step in order to take a little baby step downhill?" And if you do that, you take another step in that direction, and then you keep going. From this new point, you look around, decide what direction would take you downhill most quickly, take another step, another step, and so on, until you converge to this local minimum down here.

Gradient descent has an interesting property. The first time we ran gradient descent, we started at this point over here, right? Now imagine we initialize gradient descent just a couple of steps to the right, at that point on the upper right. If you were to repeat this process, starting at that point and looking all around, taking a little step in the direction of steepest descent, then looking around, taking another step, and so on, then having started just a couple of steps to the right, gradient descent will have taken you to this second local optimum over on the right. So if you had started at the first point, you would have wound up at this local optimum, but if you had started at a slightly different location, you would have wound up at a very different local optimum. This is a property of gradient descent that we'll say a little bit more about later.

So, that's the intuition in pictures. Let's look at the math. This is the definition of the gradient descent algorithm: we're going to repeatedly do this, until convergence. We update the parameter theta j by taking theta j and subtracting from it alpha times this term over here. (The full update rule is written out just below.)

So, let's see, there are a lot of details in this equation, so let me unpack some of it. First, this notation here, colon equals. We're going to use := to denote assignment; it's the assignment operator. So concretely, if I write A := B, what this means in a computer is: take the value in B and use it to overwrite whatever value A has. So this means we set A to be equal to the value of B; it's assignment. I can also write A := A + 1; this means take A and increase its value by one. Whereas in contrast, if I use the equals sign and write A = B, then this is a truth assertion: I'm asserting that the value of A equals the value of B. So A := B is a computer operation, where you set the value of A, whereas A = B is just a claim that the values of A and B are the same. And so, whereas I can write A := A + 1, meaning increment A by 1, hopefully I won't ever write A = A + 1, because that's just wrong: A and A + 1 can never be equal.

So that's the first part of the definition. This alpha here is a number called the learning rate. What alpha does is basically control how big a step we take downhill with gradient descent. So if alpha is very large, that corresponds to a very aggressive gradient descent procedure, where we're trying to take huge steps downhill; and if alpha is very small, then we're taking little, little baby steps downhill. I'll come back and say more about this later, about how to set alpha and so on.

And finally, this term here: that's the derivative term. I don't want to talk about it right now, but later I will derive this derivative term and tell you exactly what it is.
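For reference, here is the update rule just described, written out. This is a reconstruction of the equation shown on the slide, with the derivative term left symbolic since it's derived in the next video:

\[
\text{repeat until convergence: } \quad \theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \qquad \text{(simultaneously for } j = 0 \text{ and } j = 1\text{)}
\]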
Some of you will be more familiar with calculus than others, but even if you aren't familiar with calculus, don't worry about it; I'll tell you what you need to know about this term.

Now, there's one more subtlety about gradient descent, which is that we're going to update both theta0 and theta1. This update is performed for j = 0 and for j = 1, so you update theta0 and you update theta1. And the subtlety of how you implement gradient descent is that for this update equation, you want to simultaneously update theta0 and theta1. What I mean by that is that in this equation, we're going to update theta0 := theta0 - something, and update theta1 := theta1 - something. The way to implement this is: you should compute the right-hand side for both theta0 and theta1, and then simultaneously, at the same time, update theta0 and theta1.

So let me say what I mean by that. This is a correct implementation of gradient descent, meaning simultaneous updates: I set temp0 equal to that, and set temp1 equal to that, so I basically compute both right-hand sides first. And then, having computed the right-hand sides and stored them in temp0 and temp1, I update theta0 and theta1 simultaneously. That's the correct implementation.

In contrast, here's an incorrect implementation that does not do a simultaneous update. In this incorrect implementation, we compute temp0, then we update theta0, then we compute temp1, and then we update theta1. The difference between the implementation on the left and the one on the right is that if you look at this step down here, by this time you've already updated theta0, so you would be using the new value of theta0 to compute this derivative term. This gives you a different value of temp1 than the implementation on the left, because you've now plugged the new value of theta0 into this equation. And so the version on the right is not a correct implementation of gradient descent.

I don't want to go into why you need to do the simultaneous updates; it turns out that the way gradient descent is usually implemented, which we'll say more about later, it's actually more natural to implement the simultaneous update anyway. And when people talk about gradient descent, they always mean the simultaneous update. If you implement the non-simultaneous update, it will probably work anyway, but the algorithm on the right is not what people refer to as gradient descent; it's some other algorithm with different properties, and for various reasons it can behave in slightly strange ways. So what you should do is really implement the simultaneous update of gradient descent. (A short code sketch of both versions appears at the end of this section.)

So, that's the outline of the gradient descent algorithm. In the next video, we're going to go into the details of the derivative term, which I wrote out but didn't really define. If you've taken a calculus class before and you're familiar with partial derivatives and derivatives, it turns out that's exactly what that derivative term is. But in case you aren't familiar with calculus, don't worry about it: the next video will give you all the intuitions and tell you everything you need to know to compute that derivative term, even if you haven't seen calculus or partial derivatives before. And with that, with the next video, hopefully we'll be able to give you all the intuitions you need to apply gradient descent.
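As a concrete illustration of the simultaneous-update point, here is a minimal Python sketch, not from the lecture itself. The derivative functions are stand-ins (the real derivative term for linear regression is derived in the next video), and the toy cost function at the bottom is purely hypothetical so the example runs end to end:

```python
# Minimal sketch of gradient descent with a CORRECT simultaneous update.
# d_theta0 and d_theta1 are stand-ins for the partial derivatives of
# J(theta0, theta1); the lecture derives the real ones in the next video.

def gradient_descent_step(theta0, theta1, alpha, d_theta0, d_theta1):
    # Compute BOTH right-hand sides first, using the OLD parameter values...
    temp0 = theta0 - alpha * d_theta0(theta0, theta1)
    temp1 = theta1 - alpha * d_theta1(theta0, theta1)
    # ...and only then overwrite theta0 and theta1, simultaneously.
    return temp0, temp1

def non_simultaneous_step(theta0, theta1, alpha, d_theta0, d_theta1):
    # INCORRECT: theta0 is overwritten before theta1's right-hand side is
    # computed, so the second derivative evaluation sees the NEW theta0.
    # This is some other algorithm, not what people call gradient descent.
    theta0 = theta0 - alpha * d_theta0(theta0, theta1)
    theta1 = theta1 - alpha * d_theta1(theta0, theta1)  # uses updated theta0
    return theta0, theta1

if __name__ == "__main__":
    # Hypothetical toy cost J(t0, t1) = t0^2 + t1^2, whose partial
    # derivatives are 2*t0 and 2*t1; its minimum is at (0, 0).
    d0 = lambda t0, t1: 2.0 * t0
    d1 = lambda t0, t1: 2.0 * t1

    theta0, theta1 = 5.0, -3.0   # some initial guess
    alpha = 0.1                  # learning rate
    for _ in range(100):         # "repeat until convergence"
        theta0, theta1 = gradient_descent_step(theta0, theta1, alpha, d0, d1)
    print(theta0, theta1)        # both values approach 0, the minimum
```

On this toy bowl-shaped cost the non-simultaneous version also happens to reach the minimum, which matches the lecture's remark that it "will probably work anyway"; the point is that it computes a different update at each step and is not gradient descent.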