We've previously defined the cost function J. In this video I want to tell you about an algorithm called gradient descent for minimizing the cost function J. It turns out gradient descent is a more general algorithm, and is used not only in linear regression; it's actually used all over the place in machine learning. And later in the class we'll use gradient descent to minimize other functions as well, not just the cost function J for linear regression. So in this video, I'm going to talk about gradient descent for minimizing some arbitrary function J, and then in later videos we'll take this algorithm and apply it specifically to the cost function J that we had defined for linear regression.

So here's the problem setup. We're going to assume that we have some function J of (theta0, theta1). Maybe it's a cost function from linear regression, maybe it's some other function we want to minimize, and we want to come up with an algorithm for minimizing it as a function of theta0 and theta1. Just as an aside, it turns out that gradient descent actually applies to more general functions. So imagine you have a function J of (theta0, theta1, theta2, up to some theta n), and you want to minimize over (theta0 up to theta n) this J of (theta0 up to theta n). It turns out gradient descent is an algorithm for solving this more general problem, but for the sake of brevity, for the sake of succinctness of notation, I'm just going to use two parameters throughout the rest of this video.

Here's the idea for gradient descent. What we're going to do is start off with some initial guesses for theta0 and theta1. It doesn't really matter what they are, but a common choice is to set theta0 to 0 and theta1 to 0, just initialize them to 0. What we're going to do in gradient descent is keep changing theta0 and theta1 a little bit to try to reduce J of (theta0, theta1), until hopefully we wind up at a minimum, or maybe a local minimum.
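To make "some function J" a bit more concrete, here is a small sketch of the squared-error cost from the earlier linear regression videos. This is my own illustration rather than anything shown in this video, and the toy data values are made up.

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) for linear regression:
    J = 1/(2m) * sum((theta0 + theta1*x_i - y_i)^2)."""
    m = len(y)
    predictions = theta0 + theta1 * x          # hypothesis h(x) for every training example
    return np.sum((predictions - y) ** 2) / (2 * m)

# Hypothetical toy data, just for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 2.5, 3.5, 4.5])
print(compute_cost(0.0, 0.0, x, y))   # cost at the common initialization theta0 = theta1 = 0
```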
So let's see some pictures of what gradient descent does. Let's say I'm trying to minimize this function. Notice the axes: theta0 and theta1 are on the horizontal axes, and J is on the vertical axis, so the height of the surface shows J, and we want to minimize this function. So we're going to start off with (theta0, theta1) at some point. Imagine picking some value for (theta0, theta1); that corresponds to starting at some point on the surface of this function. So whatever value of (theta0, theta1) you pick gives you some point here. I initialized them to 0, but sometimes you initialize them to other values as well.

Now, I want you to imagine that this figure shows a hill. Imagine this is the landscape of some grassy park with two hills like so, and I want you to imagine that you are physically standing at that point on the hill, on this little red hill in your park. In gradient descent, what we're going to do is spin 360 degrees around, look all around us, and ask, "If I were to take a little baby step in some direction, and I want to go downhill as quickly as possible, what direction do I take that little baby step in if I want to go down, if I sort of want to physically walk down this hill as rapidly as possible?" It turns out that if you're standing at that point on the hill and you look all around, you find that the best direction to take a little step downhill is roughly that direction. Okay. And now you're at this new point on your hill. You're going to, again, look all around and ask, "What direction should I step in to take a little baby step downhill?" You take a step in that direction, and then you keep going. From this new point, you look around, decide what direction will take you downhill most quickly, take another step, another step, and so on, until you converge to this local minimum down here.

Gradient descent has an interesting property. The first time we ran gradient descent, we started at this point over here. Now imagine we had initialized gradient descent just a couple of steps to the right, at that point on the upper right. If you were to repeat this process, standing at that point, looking all around, taking a little step in the direction of steepest descent, then looking around, taking another step, and so on, then having started just a couple of steps to the right, gradient descent would have taken you to this second local optimum over on the right. So if you had started at the first point, you would have wound up at this local optimum, but if you had started at a slightly different location, you would have wound up at a very different local optimum. And this is a property of gradient descent that we'll say a little bit more about later.
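As a rough numerical illustration of this sensitivity to the starting point, here is a sketch of my own, not from the lecture: a one-dimensional function with two valleys, where repeatedly stepping a little bit downhill from two different starting values ends up in two different local minima.

```python
def f(t):
    # A made-up one-dimensional "landscape" with two local minima.
    return t ** 4 - 3 * t ** 2 + t

def df(t):
    # Its derivative, i.e. the local slope of the landscape.
    return 4 * t ** 3 - 6 * t + 1

def descend(t, alpha=0.01, steps=2000):
    # Keep taking small steps in the downhill direction.
    for _ in range(steps):
        t = t - alpha * df(t)
    return t

print(descend(-2.0))  # ends up near one local minimum (around t = -1.3)
print(descend(+2.0))  # ends up near the other local minimum (around t = 1.1)
```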
So that's the intuition in pictures. Let's look at the math. This is the definition of the gradient descent algorithm: we're going to repeatedly do this, until convergence. We update the parameter theta j by taking theta j and subtracting from it alpha times this derivative term, that is, theta j := theta j - alpha * (d/d theta j) J(theta0, theta1).

So let's see, there are a lot of details in this equation, so let me unpack some of it. First, this notation here, colon equals: we're going to use := to denote assignment, so it's the assignment operator. Concretely, if I write A := B, what this means on a computer is: take the value in B and use it to overwrite whatever value A has. So this means we set A to be equal to the value of B; it's assignment. I can also write A := A + 1; this means take A and increase its value by one. Whereas in contrast, if I use the equals sign and write A = B, then this is a truth assertion. So if I write A = B, I'm asserting that the value of A equals the value of B. The former is a computer operation, where you set the value of A; the latter is an assertion, I'm just making a claim that the values of A and B are the same. And so, whereas I can write A := A + 1, meaning increment A by 1, hopefully I won't ever write A = A + 1, because that's just wrong: A and A + 1 can never be equal.

So that's the first part of the definition. This alpha here is a number called the learning rate, and what alpha does is basically control how big a step we take downhill with gradient descent. If alpha is very large, that corresponds to a very aggressive gradient descent procedure, where we're trying to take huge steps downhill; and if alpha is very small, then we're taking little, little baby steps downhill. I'll come back and say more about this later, about how to set alpha and so on.

And finally, this term here, that's the derivative term. I don't want to talk about it right now, but later I will derive this derivative term and tell you exactly what it is. Some of you will be more familiar with calculus than others, but even if you aren't familiar with calculus, don't worry about it; I'll tell you what you need to know about this term.
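As a quick note on how that notation maps onto code: in Python, for example, a plain equals sign is the assignment operator playing the role of the lecture's :=, and == is the truth assertion. A tiny sketch of my own, not from the lecture:

```python
a = 5              # assignment: overwrite whatever value a had with 5 (the lecture's a := 5)
a = a + 1          # also assignment: take a and increase its value by one
print(a == 6)      # '==' is the truth assertion; this prints True
print(a == a + 1)  # a can never be equal to a + 1, so this prints False
```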
Now there's one more subtlety about gradient descent, which is that in gradient descent we're going to update theta0 and theta1. This update is performed for j = 0 and for j = 1, so you're going to update theta0 and update theta1. And the subtlety of how you implement gradient descent is that, for this update equation, you want to simultaneously update theta0 and theta1. What I mean by that is that in this equation, we're going to update theta0 := theta0 - something, and update theta1 := theta1 - something. The way to implement this is: compute the right hand side for both theta0 and theta1, and then simultaneously, at the same time, update theta0 and theta1.

So let me say what I mean by that. This is a correct implementation of gradient descent, meaning simultaneous updates: I set temp0 equal to that and temp1 equal to that, so basically compute the right hand sides; and then, having computed the right hand sides and stored them in temp0 and temp1, I update theta0 and theta1 simultaneously, because that's the correct implementation.

In contrast, here's an incorrect implementation that does not do a simultaneous update. In this incorrect implementation, we compute temp0 and then update theta0, and only then do we compute temp1 and update theta1. The difference between this and the correct implementation is that, if you look at this step down here, by this time you've already updated theta0, so you would be using the new value of theta0 to compute the derivative term, and that gives you a different value of temp1 than the correct version, because you've now plugged the new value of theta0 into this equation. And so this second version is not a correct implementation of gradient descent.

I'm not going to say too much about why you need to do the simultaneous updates; it turns out that the way gradient descent is usually implemented, which we'll say more about later, it's actually more natural to implement the simultaneous update. And when people talk about gradient descent, they always mean the simultaneous update. If you implement the non-simultaneous update, it turns out it will probably work anyway, but that algorithm is not what people refer to as gradient descent; it's some other algorithm with different properties, and for various reasons it can behave in slightly stranger ways. So what you should do is really implement the simultaneous update of gradient descent.
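Here is a rough sketch, my own rather than the lecture's slide, of the correct simultaneous update next to the incorrect sequential one. The derivative term is approximated numerically just so the example runs on its own; the actual derivative for the linear regression cost is derived in a later video, and the simple bowl-shaped J below is made up for illustration.

```python
alpha = 0.1  # learning rate

def J(theta0, theta1):
    # A made-up, bowl-shaped cost so the example is self-contained.
    return (theta0 - 1.0) ** 2 + (theta1 - 2.0) ** 2

def dJ_dtheta0(theta0, theta1, eps=1e-6):
    # Numerical stand-in for the partial derivative of J with respect to theta0.
    return (J(theta0 + eps, theta1) - J(theta0 - eps, theta1)) / (2 * eps)

def dJ_dtheta1(theta0, theta1, eps=1e-6):
    # Numerical stand-in for the partial derivative of J with respect to theta1.
    return (J(theta0, theta1 + eps) - J(theta0, theta1 - eps)) / (2 * eps)

# Correct: simultaneous update. Both right hand sides use the OLD theta0 and theta1.
theta0, theta1 = 0.0, 0.0
for _ in range(200):
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    theta0, theta1 = temp0, temp1
print(theta0, theta1)   # converges close to the minimum at (1, 2)

# Incorrect (not gradient descent): theta0 is overwritten before the second
# derivative is computed, so that derivative sees the NEW theta0.
theta0, theta1 = 0.0, 0.0
for _ in range(200):
    theta0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    theta1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
```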
So that's the outline of the gradient descent algorithm. In the next video, we're going to go into the details of the derivative term, which I wrote out but didn't really define. If you've taken a calculus class before and you're familiar with partial derivatives and derivatives, it turns out that's exactly what this derivative term is. But in case you aren't familiar with calculus, don't worry about it; the next video will give you the intuitions and tell you everything you need to know to compute that derivative term, even if you haven't seen calculus, or even if you haven't seen partial derivatives before. And with that, with the next video, hopefully we'll be able to give you all the intuitions you need to apply gradient descent.