In the previous video, we gave a mathematical definition of gradient descent. In this video, let's delve deeper and get better intuition about what the algorithm is doing, and why the steps of the gradient descent algorithm might make sense. Here's the gradient descent algorithm that we saw last time. Just to remind you, this parameter alpha is called the learning rate, and it controls how big a step we take when updating the parameter theta j. And this second term here is the derivative term. What I want to do in this video is give you better intuition about what each of these two terms is doing and why, when put together, this entire update makes sense.

In order to convey these intuitions, I want to use a slightly simpler example, where we minimize a function of just one parameter. So say we have a cost function J of just one parameter, theta one, like we did a few videos back, where theta one is a real number. That way we get 1D plots, which are a little bit simpler to look at. And let's try to understand what gradient descent would do on this function.

So let's say here's my function J of theta one, where theta one is a real number. Now let's say I've initialized gradient descent with theta one at this location, so imagine that we start off at that point on the function. What gradient descent will do is update theta one as theta one minus alpha times d/d theta one of J of theta one. And just as an aside about this derivative term: if you're wondering why I changed the notation from the partial derivative symbol, or if you don't know the difference between the partial derivative symbol and d/d theta, don't worry about it. Technically in mathematics we call one a partial derivative and the other a derivative, depending on the number of parameters in the function J, but that's a mathematical technicality. For the purposes of this lecture, think of the partial symbol and d/d theta one as exactly the same thing, and don't worry about whether there are any differences. I'm going to try to use the mathematically precise notation, but for our purposes these notations are really the same thing.

So let's see what this equation will do. We're going to compute this derivative. I'm not sure if you've seen derivatives in calculus before, but what the derivative at this point does is basically this: take the tangent to the function at that point, that straight red line just touching the function, and look at the slope of that line. That's what the derivative is. It says: what's the slope of the line that is tangent to the function at that point? And the slope of that line is, of course, just the height divided by the horizontal distance. Now, this line has a positive slope, so it has a positive derivative. And so my update to theta one is going to be theta one minus alpha times some positive number. Alpha, the learning rate, is always a positive number. So I'm going to update theta one as theta one minus something, which means I end up moving theta one to the left; I'm going to decrease theta one. And we can see this is the right thing to do, because moving in that direction gets me closer to the minimum over there.
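To make this concrete, here is a minimal Python sketch of a single gradient descent step on a one-parameter cost function. The quadratic J, the starting point, and the learning rate are illustrative choices of mine, not values fixed by the lecture:

```python
# A minimal sketch of one gradient descent step on a one-parameter cost
# function. The quadratic J, the start point, and alpha are illustrative
# choices, not values from the lecture.

def J(theta1):
    return (theta1 - 2.0) ** 2        # example cost, minimized at theta1 = 2

def dJ(theta1):
    return 2.0 * (theta1 - 2.0)       # derivative: slope of the tangent line

alpha = 0.1                           # learning rate, always a positive number
theta1 = 5.0                          # initialize to the right of the minimum

slope = dJ(theta1)                    # positive slope here (6.0)
theta1 = theta1 - alpha * slope       # subtract a positive number: move left
print(theta1, J(theta1))              # 4.4, and J(4.4) = 5.76 < J(5.0) = 9.0
```

One step moves theta one to the left and lowers the cost, exactly the behavior described above.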
So gradient descent so far seems to be doing the right thing. Let's look at another example, with the same function J of theta one, and now let's say I had instead initialized my parameter over there on the left, so theta one is here, at that point on the curve. Now my derivative term, d/d theta one of J of theta one, when evaluated at this point, is the slope of the tangent line at that point. But this line is slanting down, so it has negative slope; or alternatively, the function has a negative derivative, meaning a negative slope, at that point. So the derivative is less than zero, and when I update, theta one gets updated as theta one minus alpha times a negative number. Subtracting a negative number means I'm adding something to theta one, so I'm actually going to increase theta one. We'll start here and increase theta one, which again seems like the thing I want to do to get closer to the minimum. So this hopefully explains the intuition behind what the derivative term is doing.

Let's next take a look at the learning rate alpha, and try to figure out what that's doing. So here's my gradient descent update rule, and let's look at what can happen if alpha is either too small or too large. First, what happens if alpha is too small? Here's my function J of theta; let's just start here. If alpha is too small, then I'm multiplying the derivative by some small number, so I end up taking a little baby step like that. Then from this new point I take another step, but since alpha is too small, it's another little baby step. So if my learning rate is too small, I end up taking these tiny, tiny baby steps to try to get to the minimum, and I'm going to need a lot of steps. If alpha is too small, gradient descent can be slow, because it takes these tiny baby steps and needs a lot of them before it gets anywhere close to the global minimum.

Now, what if alpha is too large? Here's my function J of theta again. It turns out that if alpha is too large, gradient descent can overshoot the minimum, and may fail to converge, or even diverge. Here's what I mean. Let's say I start gradient descent at a point that's actually close to the minimum, so the update should move theta only slightly to the right. But if alpha is too big, I'm going to take a huge step, maybe a huge step like that, so I land way over there. Now my cost function has gotten worse: it started off at this value, and now my value is higher. Now my derivative points to the left, telling me to decrease theta. But with a learning rate that's too big, I may take another huge step, going from here all the way out there. And if my learning rate stays too big, I take another huge step on the next iteration, and kind of overshoot and overshoot and so on, until you notice I'm actually getting further and further away from the minimum. So if alpha is too large, gradient descent can fail to converge, or even diverge.

Now, I have another question for you, and this is a tricky one.
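Here is a small numerical sketch of those two failure modes, using J(theta) = theta squared as an illustrative cost of my own choosing, not one from the video. Since dJ/dtheta = 2 * theta, the update shrinks theta when alpha is small enough, and blows it up when alpha is too big:

```python
# Illustrative example (not from the lecture): effect of the learning
# rate on gradient descent applied to J(theta) = theta^2, minimum at 0.

def run(alpha, theta=1.0, steps=10):
    for _ in range(steps):
        theta = theta - alpha * 2.0 * theta   # dJ/dtheta = 2 * theta
    return theta

print(run(alpha=0.01))   # too small: ~0.82 after 10 steps, painfully slow
print(run(alpha=0.5))    # well chosen: lands on the minimum at 0
print(run(alpha=1.5))    # too large: (-2)^10 = 1024, diverging
```

With alpha = 1.5, each update flips theta's sign and doubles its magnitude, which is exactly the overshoot-and-diverge behavior just described.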
And when I was first learning this stuff, it actually took me a long time to figure it out. What if your parameter theta one is already at a local minimum? What do you think one step of gradient descent will do? So let's suppose you initialize theta one at a local minimum; suppose this is your initial value of theta one over here, and it's already at a local optimum, the local minimum. It turns out that at a local optimum, your derivative is equal to zero, since the tangent line at that point is horizontal; the slope of that line, and thus this derivative term, is equal to zero. And so in your gradient descent update, theta one gets updated as theta one minus alpha times zero. What this means is that if you're already at a local optimum, the update leaves theta one unchanged: it just sets theta one equal to theta one. So if your parameter is already at a local minimum, one step of gradient descent does absolutely nothing. It doesn't change the parameter, which is what you want, because it keeps your solution at the local optimum.

This also explains why gradient descent can converge to a local minimum even with the learning rate alpha held fixed. Here's what I mean; let's look at an example. So here's a cost function J of theta that I want to minimize, and let's say I initialize my gradient descent algorithm out there at that magenta point. If I take one step of gradient descent, maybe it takes me to that green point, because my derivative is pretty steep out there. Now I'm at the green point, and if I take another step of gradient descent, you notice that my derivative, meaning the slope, is less steep at the green point than at the magenta point, because as I approach the minimum, my derivative gets closer and closer to zero. So after one step of gradient descent, my new derivative is a little bit smaller, and when I take another step, I naturally take a somewhat smaller step from the green point than I did from the magenta point. Now I'm at the new point, the red point, even closer to the global minimum, so the derivative here is even smaller than it was at the green point. When I take another step of gradient descent, my derivative term is even smaller, so the magnitude of the update to theta one is even smaller, and I take a small step like so. As gradient descent runs, you automatically take smaller and smaller steps, until eventually you're taking very small steps and you converge to the local minimum.

So, just to recap: in gradient descent, as we approach a local minimum, gradient descent will automatically take smaller steps. That's because, by definition, a local minimum is where the derivative equals zero, so as we approach the local minimum, this derivative term automatically gets smaller, and gradient descent automatically takes smaller steps. That's what gradient descent looks like, and so there's actually no need to decrease alpha over time.
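The sketch below illustrates that shrinking-step behavior numerically, again on my own illustrative example J(theta) = theta squared; nothing here beyond the fixed-alpha idea comes from the lecture itself:

```python
# Illustrative example (not from the lecture): with a FIXED learning rate,
# the steps shrink on their own, because the derivative of J(theta) = theta^2
# shrinks as theta approaches the minimum at 0.

alpha = 0.2
theta = 4.0                       # start far from the minimum

for i in range(6):
    step = alpha * 2.0 * theta    # dJ/dtheta = 2 * theta, so the step scales with theta
    theta = theta - step
    print(f"step {i}: moved {step:+.4f}, theta = {theta:.4f}")
# Printed steps: 1.6, 0.96, 0.576, ... smaller every time, though alpha never changes.
```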
So that's the gradient descent algorithm, and you can use it to try to minimize any cost function J, not just the cost function J we defined for linear regression. In the next video, we're going to take the function J and set it back to be exactly linear regression's cost function, the squared error cost function that we came up with earlier. Taking gradient descent and the squared error cost function and putting them together will give us our first learning algorithm: our linear regression algorithm.