Okay, so now that all of you are optimization experts, we can think about applying the optimization notions and optimization algorithms that we described to our specific scenario of interest, which is searching over all possible lines and finding the line that best fits our data.

So the first thing that's important to mention is the fact that our objective is convex. You can go and show that this is a convex function, and what this implies is that the solution to this minimization problem is unique: we know there's a unique minimum to this function. And likewise, we know that our gradient descent algorithm will converge to that minimum.

Okay, so this is good news for trying to find the line that best fits our data, because this sounds like a very complicated problem: we have to search over all possible lines, and of course we couldn't possibly go and test each one of them. So it's very, very helpful that this property holds, because it means we can use the very straightforward algorithms that we described to solve this problem.

Okay, so let's return to the definition of our cost, which is the residual sum of squares of our two parameters, (w0, w1), which I've written again right here. But before we get to taking the gradient, which is going to be so crucial for our gradient descent algorithm, let's just note the following fact about derivatives. If we take the derivative of a sum of functions with respect to some variable w, where the sigma notation is just shorthand for the full sum over the N terms, so N different functions g1 to gN, then the derivative distributes across the sum, and we can rewrite the derivative of the sum as the sum of the derivatives. So the derivative of a sum of functions is the same as the sum of the derivatives of the individual functions. In our case, we have that gi(w), the gi that I'm writing here, is equal to (yi - [w0 + w1 xi]) squared, and we see that the residual sum of squares is indeed a sum over N different functions of w0 and w1.
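To make this concrete, here is a minimal Python sketch of the residual sum of squares written explicitly as a sum of per-point functions gi; the array names x and y and the toy data are illustrative assumptions, not from the course materials:

```python
import numpy as np

def g_i(w0, w1, xi, yi):
    # One term of the sum: g_i(w0, w1) = (y_i - [w0 + w1 * x_i])^2.
    return (yi - (w0 + w1 * xi)) ** 2

def rss(w0, w1, x, y):
    # The residual sum of squares is just the sum of the N per-point
    # terms, which is why its derivative distributes term by term.
    return sum(g_i(w0, w1, xi, yi) for xi, yi in zip(x, y))

# Toy data, made up purely for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.9, 5.1, 7.0])
print(rss(0.0, 1.0, x, y))
```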
And so in our case, when we're thinking about taking the partial of the residual sum of squares with respect to, for example, w0, this is going to be equal to the sum, i equals 1 to N, of the partial with respect to w0 of (yi - [w0 + w1 xi]) squared. And the same holds for w1.

Okay, so now let's go ahead and actually compute this gradient. The first thing we're going to do is take the derivative, or the partial, with respect to w0. I'm going to use the fact that I showed on the previous slide to take the sum to the outside, and then take the partial of what's inside. So I have a function raised to a power, and I'm going to bring that power down: I get a 2 here, rewrite the function, yi - [w0 + w1 xi], and now the power here is just going to be 1. But then I have to take the derivative of the inside. And what's the derivative of the inside when I'm differentiating with respect to w0? Well, what I have is a -1 multiplying w0, and everything else I'm treating as a constant, so I need to multiply by -1. So I'll just rewrite this: I'm going to multiply the 2 by this -1 and pull it outside the sum, so I get -2 times the sum, i equals 1 to N, of (yi - [w0 + w1 xi]). I promise I'll try to minimize how many derivations we do in slides like this, but I think this one is really instructive.

So let's go ahead now and take the derivative, or the partial, with respect to w1. In this case I'm again pulling the sum out, and the same thing happens: I bring the 2 down, rewrite the inside part of the function, and raise it just to the 1 power. Then, when I take the derivative of this inside part with respect to w1, what do I have? Well, all of these things are constants with respect to w1, but what's multiplying w1? I have a -xi, so I'm going to need to multiply by -xi. Okay, so again rewriting this, I'm going to take the -2 to the outside of the sum, and I'm going to have (yi - [w0 + w1 xi]).
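Translating that derivation directly into code, here is a sketch of the two partials using the same illustrative x and y arrays as before; the finite-difference comparison at the end is my own sanity check, not part of the lecture:

```python
import numpy as np

def gradient_rss(w0, w1, x, y):
    # Residuals y_i - (w0 + w1 * x_i), computed once and reused.
    residuals = y - (w0 + w1 * x)
    d_w0 = -2.0 * np.sum(residuals)      # partial wrt w0
    d_w1 = -2.0 * np.sum(residuals * x)  # partial wrt w1 picks up the extra x_i
    return np.array([d_w0, d_w1])

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.9, 5.1, 7.0])

def rss(w0, w1):
    return np.sum((y - (w0 + w1 * x)) ** 2)

# Centered finite difference as a sanity check on the algebra.
eps = 1e-6
w0, w1 = 0.5, 1.5
numeric = np.array([
    (rss(w0 + eps, w1) - rss(w0 - eps, w1)) / (2 * eps),
    (rss(w0, w1 + eps) - rss(w0, w1 - eps)) / (2 * eps),
])
print(gradient_rss(w0, w1, x, y), numeric)  # the two should agree closely
```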
It's a good thing you guys can put me on 2x speed and just get through all of this very quickly. But here we multiply by xi, and that's the extra term in the derivative with respect to w1. Okay, well, let's put this all together. And look, I went ahead and typed it out, so I don't need to write it here; that's good. So this is the gradient of our residual sum of squares, and it's a vector of two dimensions because we have two variables, w0 and w1.

Now what can we think about doing? Well, of course we can think about running the gradient descent algorithm. But let's hold off on that, because what do we know is another way to solve for the minimum of this function? Well, just like we talked about in 1D, we can take the derivative and set it equal to zero; that was the first approach for solving for the minimum. Here, we can take the gradient and set it equal to zero.
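Setting that two-dimensional gradient to zero gives two linear equations in w0 and w1, which can be solved in closed form. A minimal sketch, under the same illustrative toy-data assumptions as the earlier snippets:

```python
import numpy as np

def closed_form_fit(x, y):
    # Setting dRSS/dw0 = 0 and dRSS/dw1 = 0 and solving the resulting
    # pair of linear equations yields the usual least-squares formulas.
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.9, 5.1, 7.0])
w0, w1 = closed_form_fit(x, y)
print(w0, w1)
# Because the objective is convex, gradient descent would converge to
# this same unique minimizer.
```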