Okay, well, like we discussed, the other approach we can take is gradient descent, where we're walking down this surface of residual sum of squares, trying to get to the minimum. Of course, we might overshoot it and go back and forth, but that's the general idea: we're doing an iterative procedure.

And in this case it's useful to reinterpret the gradient of the residual sum of squares that we computed previously. So this is what we've been working with. But what I want to point out here is that this term, y_i, is our actual house sales observation. And what is this term here? Well, it's the predicted value if we use w0 and w1 to form that prediction. So I'll call it the predicted value, ŷ_i, but I'm going to write it as a function of w0 and w1, to make it clear that it's the prediction I'm forming when using w0 and w1.

Okay, so what we can do is rewrite this residual sum of squares in terms of these predicted values. Then, when we go to write our gradient descent algorithm, what does the algorithm say? Well, we have: while not converged, we're going to take our previous vector of w0 at iteration t and w1 at iteration t, and what are we going to do? We're going to subtract. I'm going to write it up here: we're going to subtract eta times the gradient. Maybe I'll write it in two steps. So we're subtracting eta times the gradient, and what's the gradient? The gradient is minus two times the sum over i = 1 to N of y_i minus ŷ_i(w0 at iteration t, w1 at iteration t), those being the values I'm using to predict my observation i. And likewise for the second component, for w1, but in this case I'm going to have to multiply by x_i, because the gradient is a little bit different there: it's y_i minus ŷ_i(w0 at iteration t, w1 at iteration t), multiplied by x_i. And that is my update to form my next estimate of w0 and w1.

Okay, let me just quickly rewrite this in the way I was going to before: in both components of this gradient vector I have this −2 term, and out front I have a minus eta. So I'm going to bring that −2 out; I'll just erase it here.
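Written out after that step (a sketch of the on-board update, under the assumption that the prediction is ŷ_i(w0, w1) = w0 + w1·x_i, as set up earlier in the course), the update is:

w0^(t+1) = w0^(t) + 2·eta · Σ_{i=1..N} ( y_i − ŷ_i(w0^(t), w1^(t)) )
w1^(t+1) = w1^(t) + 2·eta · Σ_{i=1..N} ( y_i − ŷ_i(w0^(t), w1^(t)) ) · x_i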
And rewrite this: the minus two times the minus eta is going to turn into a plus sign. So I'll just write that explicitly; this plus came from minus two times minus eta. Okay. So we are doing gradient descent. Even though you see a plus sign, we're still doing gradient descent; it's just that our gradient had a negative sign, and that made it become a positive sign here, okay?

But I want it in this form to provide a little bit of intuition. Because what happens if, overall, we just tend to be underestimating our values y? So if, overall, our predictions ŷ_i are under-predicting, then the sum of y_i minus ŷ_i is going to be positive, because we're saying that ŷ_i is, in general, below the true value y_i. So this sum is going to be positive. And what's going to happen? Well, this term here is positive, we're multiplying it by a positive step size, and adding that to our w0. So w0 is going to increase. And that makes sense, because we have some current estimate of our regression fit, but if generally we're under-predicting our observations, that probably means the line is too low, so we want to shift it up. And what does that mean? That means increasing w0.

So there's a lot of intuition in this formula for what's going on in this gradient descent algorithm. And that's just talking about the first term, w0; then there's this second term, w1, which is the slope of the line. And in this case there's a similar intuition, so I'll say: similar intuition for w1, but we need to multiply by this x_i, accounting for the fact that this is a slope term.

Okay. So that's our gradient descent algorithm for minimizing our residual sum of squares, where, when we assess convergence, what we're going to output is ŵ0 and ŵ1; that's going to be our fitted regression line. And this is an alternative approach to setting the gradient equal to zero and solving for ŵ0 and ŵ1 in that way.
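To make the whole loop concrete, here is a minimal sketch in Python of the procedure described above, assuming the simple model ŷ_i = w0 + w1·x_i and a gradient-magnitude check as the "while not converged" test. The function name, starting point, step size, and tolerance are illustrative choices, not taken from the course.

import numpy as np

def simple_regression_gd(x, y, eta=1e-4, tol=1e-3, max_iter=100000):
    # Minimal sketch of the gradient descent loop from the lecture (names and
    # defaults are made up for illustration, not the course's own code).
    # Model: y_hat_i = w0 + w1 * x_i, cost = residual sum of squares.
    w0, w1 = 0.0, 0.0                      # arbitrary starting point
    for _ in range(max_iter):
        y_hat = w0 + w1 * x                # predictions from current w0, w1
        residuals = y - y_hat              # y_i - y_hat_i
        # Gradient of RSS is [-2 * sum(residuals), -2 * sum(residuals * x)];
        # subtracting eta times it is where the "+ 2 * eta" updates come from.
        grad_w0 = -2.0 * residuals.sum()
        grad_w1 = -2.0 * (residuals * x).sum()
        # "While not converged": stop once the gradient magnitude is small.
        if np.sqrt(grad_w0**2 + grad_w1**2) < tol:
            break
        w0 = w0 - eta * grad_w0
        w1 = w1 - eta * grad_w1
    return w0, w1                          # w_hat_0, w_hat_1: the fitted line

# Toy usage on a noisy line y = 2 + 3x, just to see it recover the coefficients.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)
w0_hat, w1_hat = simple_regression_gd(x, y)
print(w0_hat, w1_hat)                      # should come out close to 2 and 3

Note that eta has to be small relative to the scale of the sums over all N data points; if it is too large, the updates don't just overshoot and bounce back and forth, they diverge.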