In the previous video, we gave a mathematical definition of gradient descent. In this video, let's delve deeper and get better intuition about what the algorithm is doing, and why the steps of the gradient descent algorithm might make sense.

Here's the gradient descent algorithm that we saw last time. Just to remind you, this parameter, or this term, alpha, is called the learning rate, and it controls how big a step we take when updating the parameter theta j. The second term here is the derivative term. What I want to do in this video is give you better intuition about what each of these two terms is doing and why, when put together, the entire update makes sense.

In order to convey these intuitions, I want to use a slightly simpler example, where we minimize a function of just one parameter. Say we have a cost function J of just one parameter, theta one, like we did a few videos back, where theta one is a real number. That way we get 1D plots, which are a little simpler to look at, and we can try to understand what gradient descent does on this function.

So let's say here's my function, J of theta one, where theta one is a real number. Now let's say I've initialized gradient descent with theta one at this location, so imagine that we start off at that point on my function. What gradient descent will do is update theta one to theta one minus alpha times d/d-theta-one of J of theta one.

Just as an aside about this derivative term: if you're wondering why I changed the notation from the partial derivative symbol, or if you don't know the difference between the partial derivative symbol and d/d-theta-one, don't worry about it. Technically, in mathematics, we call one a partial derivative and the other a derivative, depending on the number of parameters of the function J, but that's a mathematical technicality. For the purpose of this lecture, think of the partial symbol and d/d-theta-one as exactly the same thing, and don't worry about whether there are any differences. I'm going to try to use the mathematically precise notation, but for our purposes these notations are really the same thing.
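To have it in front of us, here is the one-parameter update rule just described, written out. This is simply the rule from the previous video specialized to a single parameter, with alpha the learning rate and the derivative of J as the second term:

```latex
\theta_1 := \theta_1 - \alpha \, \frac{d}{d\theta_1} J(\theta_1)
```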
So let's see what this equation will do. We're going to compute this derivative term. I'm not sure whether you've seen derivatives in calculus before, but what the derivative at this point does is, essentially, this: take the line tangent to the function at that point (that straight red line just touching the function) and look at its slope. That's what the derivative is: the slope of the line that is tangent to the function at that point. And the slope of the line is, of course, just the vertical rise divided by the horizontal run.

Now, this line has a positive slope, so it has a positive derivative. So my update is: theta one gets updated to theta one minus alpha times some positive number. Alpha, the learning rate, is always a positive number. So I'm updating theta one to theta one minus something, which means I end up moving theta one to the left: I decrease theta one. And we can see that this is the right thing to do, because moving in that direction gets me closer to the minimum over there. So far, gradient descent seems to be doing the right thing.

Let's look at another example, with the same function J of theta one, and now let's say I had instead initialized my parameter over there on the left, so theta one is here; I'll mark that point on the curve. Now my derivative term, d/d-theta-one of J of theta one, evaluated at this point, is again the slope of the tangent line there. But this line is slanting down, so it has a negative slope; alternatively, we say the function has a negative derivative at that point, which just means a negative slope. So this derivative term is less than zero. When I update theta one, it gets updated to theta one minus alpha times a negative number. And theta one minus a negative number means I'm adding something to theta one, so I'm actually going to increase theta one. So we start here and increase theta one, which again seems like the right thing to do to get closer to the minimum.

So this hopefully explains the intuition behind what the derivative term is doing.
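As a concrete illustration of those two cases, here is a minimal Python sketch, not from the lecture: the quadratic cost j, its derivative dj, and the value of alpha are made-up stand-ins. It shows a single one-parameter gradient descent update: a positive derivative moves theta one to the left, and a negative derivative moves it to the right.

```python
# Minimal sketch: one gradient descent update on a one-parameter cost J(theta1).
# The quadratic cost and derivative below are illustrative stand-ins, not the
# specific function drawn in the lecture.

def j(theta1):
    return (theta1 - 3.0) ** 2          # convex cost with its minimum at theta1 = 3

def dj(theta1):
    return 2.0 * (theta1 - 3.0)         # d/dtheta1 of J(theta1)

def gradient_step(theta1, alpha):
    return theta1 - alpha * dj(theta1)  # theta1 := theta1 - alpha * dJ/dtheta1

alpha = 0.1
print(gradient_step(5.0, alpha))  # derivative is positive -> theta1 decreases (4.6)
print(gradient_step(1.0, alpha))  # derivative is negative -> theta1 increases (1.4)
```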
Let's next take a look at the learning rate term, alpha, and try to figure out what that's doing. So here's my gradient descent update rule again, and let's look at what can happen if alpha is either too small or too large.

First, what happens if alpha is too small? Here's my function J of theta one, and let's start here. If alpha is too small, then I'm multiplying the derivative by some tiny number, so I end up taking a little baby step like that. That's one step; then from this new point I take another step, and it's another little baby step. So if my learning rate is too small, I end up taking these tiny, tiny baby steps toward the minimum, and I need a lot of steps to get there. So if alpha is too small, gradient descent can be slow, because it takes tiny steps and needs many of them before it gets anywhere close to the minimum.

Now, how about if alpha is too large? Here's my function J of theta one. It turns out that if alpha is too large, gradient descent can overshoot the minimum, and may fail to converge, or even diverge. Here's what I mean. Let's say we start at a point that's already close to the minimum, so the update points me to the right, toward the minimum. But if alpha is too big, I take a huge step, maybe a huge step like that, right past it. Now my cost function has gotten worse: it started at this value, and now the value is worse. From this new point my derivative tells me to decrease theta one, but if my learning rate is too big, I may take another huge step, going from here all the way out there. And with a learning rate that's too big, I can take yet another huge step on the next iteration, overshooting and overshooting, until, as you notice, I'm actually getting further and further away from the minimum. So if alpha is too large, gradient descent can fail to converge, or even diverge.
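To see both failure modes numerically, here is a small sketch, again with a made-up quadratic cost and made-up learning rates rather than the lecture's plot. It runs ten iterations with a learning rate that is too small and with one that is too large: the small rate barely moves, while the large one overshoots by more on every iteration.

```python
# Sketch: effect of the learning rate on J(theta1) = theta1**2 (minimum at 0).
# alpha = 0.01 converges very slowly; alpha = 1.1 overshoots more each iteration.

def dj(theta1):
    return 2.0 * theta1

def run(alpha, theta1=1.0, steps=10):
    for _ in range(steps):
        theta1 = theta1 - alpha * dj(theta1)
    return theta1

print(run(alpha=0.01))  # ~0.82: barely moved after 10 baby steps
print(run(alpha=1.1))   # ~6.19: each step lands further from the minimum (diverging)
```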
Now, I have another question for you, and this is a tricky one; when I was first learning this material, it actually took me a long time to figure it out. What if your parameter theta one is already at a local minimum? What do you think one step of gradient descent will do? So suppose you initialize theta one at a local minimum; suppose this is your initial value of theta one over here, and it's already at a local optimum, a local minimum.

It turns out that at a local optimum your derivative is equal to zero: the slope of the tangent line at that point, the slope of this line, is equal to zero, and so the derivative term is equal to zero. So in your gradient descent update, theta one gets updated to theta one minus alpha times zero. What this means is that if you're already at a local optimum, the update leaves theta one unchanged, because it just sets theta one equal to theta one. So if your parameter is already at a local minimum, one step of gradient descent does absolutely nothing. It doesn't change the parameter, which is exactly what you want, because it keeps your solution at the local optimum.

This also explains why gradient descent can converge to a local minimum even with the learning rate alpha held fixed. Here's what I mean; let's look at an example. Here's a cost function J of theta one that I want to minimize, and let's say I initialize my gradient descent algorithm out there at that magenta point. If I take one step of gradient descent, maybe it takes me to this point, because my derivative is pretty steep out there. Now I'm at the green point, and if I take another step of gradient descent, you'll notice that my derivative, meaning the slope, is less steep at the green point than it was at the magenta point, because as I approach the minimum, my derivative gets closer and closer to zero. So after one step of gradient descent, my new derivative is a little bit smaller, and when I take another step, I naturally take a somewhat smaller step from the green point than I did from the magenta point. Now I'm at a new point, the red point, even closer to the global minimum, so the derivative here is even smaller than it was at the green point. When I take another step of gradient descent, my derivative term is even smaller, so the magnitude of the update to theta one is even smaller, and I take a small step like so. As gradient descent runs, you automatically take smaller and smaller steps, until eventually you're taking very small steps and you converge to the local minimum.
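That shrinking of the steps, and the no-op at the minimum itself, can both be seen numerically. In this sketch (the same made-up quadratic as above, with a fixed, illustrative alpha), the learning rate never changes, yet each update is smaller than the last because the derivative itself shrinks:

```python
# Sketch: with a fixed alpha, the update magnitude |alpha * dJ/dtheta1| shrinks
# on its own as theta1 approaches the minimum of J(theta1) = theta1**2 at 0.

def dj(theta1):
    return 2.0 * theta1

alpha, theta1 = 0.3, 1.0
for i in range(5):
    step = alpha * dj(theta1)
    theta1 = theta1 - step
    print(f"iteration {i}: step size = {abs(step):.4f}, theta1 = {theta1:.4f}")
# Step sizes come out as 0.6000, 0.2400, 0.0960, ...: each is 40% of the
# previous one, even though alpha never changes.

print(0.0 - alpha * dj(0.0))  # starting exactly at the minimum: theta1 stays at 0.0
```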
So, just to recap: in gradient descent, as we approach a local minimum, gradient descent automatically takes smaller steps. That's because, by definition, a local minimum is where the derivative is equal to zero, so as we approach the local minimum, the derivative term automatically gets smaller, and gradient descent automatically takes smaller steps. That's how gradient descent behaves, and so there's actually no need to decrease alpha over time.

So that's the gradient descent algorithm, and you can use it to try to minimize any cost function J, not only the cost function J that we defined for linear regression. In the next video, we're going to take the function J and set it back to be exactly linear regression's cost function, the squared cost function that we came up with earlier. Taking gradient descent and the squared cost function and putting them together will give us our first learning algorithm: the linear regression algorithm.
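For reference, the squared-error cost from the earlier videos (the J that the next video plugs back into gradient descent) has this form, written here from the earlier definition, with the hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ and $m$ training examples:

```latex
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2
```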