[MUSIC] Okay, so now we're onto the final important step of the derivation, which is taking the gradient. Because as we saw in the simple regression case, the gradient was important both for our closed-form solution as well as, of course, for the gradient descent algorithm.

So what's the gradient of our residual sum of squares in this multiple regression case? Well, it's the gradient of this matrix notation that we use for representing the residual sum of squares. And if you know gradients of vectors and matrices, which we're not assuming you do, so please don't think that you need to know this, the result is -2H^T(y - Hw): taking that big H matrix and turning it on its side, times y - Hw, which again is that vector of residuals.

And why is this the result? Well, I'm not gonna give a complete proof of this, I'm just gonna give some motivation. I'm going to walk through an analogy to the 1D case, and we'll see some patterns, and maybe you'll believe that that's the result in the matrix case. So, in particular, let's think about taking the derivative with respect to w of a function (y - hw)(y - hw), where these things are all scalars.
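As a quick sanity check on the matrix result (this is not from the lecture, just a numerical sketch with made-up data), we can compare -2H^T(y - Hw) against a finite-difference approximation of the gradient of the residual sum of squares:

```python
import numpy as np

# Check that the gradient of RSS(w) = (y - Hw)^T (y - Hw)
# matches the claimed closed form -2 H^T (y - Hw).
rng = np.random.default_rng(0)
H = rng.normal(size=(20, 3))   # feature matrix: 20 observations, 3 features
y = rng.normal(size=20)        # observed outputs
w = rng.normal(size=3)         # an arbitrary coefficient vector

def rss(w):
    r = y - H @ w              # vector of residuals
    return r @ r               # residual sum of squares

analytic = -2 * H.T @ (y - H @ w)

# Central finite differences, one coordinate of w at a time.
eps = 1e-6
numeric = np.array([
    (rss(w + eps * e) - rss(w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

assert np.allclose(analytic, numeric, atol=1e-4)
```

The two gradients agree to numerical precision, which is the pattern the 1D analogy below is meant to make plausible.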
So this is the 1D analog to this equation here, where the gradient is just the derivative with respect to this one parameter w. (That arrow is not quite pointing to w.)

Well, what's the derivative of this? It's equivalent to the derivative with respect to w of (y - hw) squared. And, like we've done multiple times in this course now, when I take the derivative with respect to w of some function raised to a power, by the chain rule, I bring that power down. Then I'm gonna multiply by the function (y - hw) raised to the power 2 minus 1. And then I'm gonna take the derivative of the inside. And what's the derivative of this function with respect to w? It's minus h. And so the result here is -2h(y - hw).

So we have the -2 in both cases, this little scalar h is this big matrix H in our case, and y - hw in the scalar case corresponds to this big vector y - Hw in matrix notation here. Okay, so just believe that this is the gradient. We didn't wanna bog you down in too much linear algebra, or too much in terms of derivatives. But if we have this notation, then we can derive everything we need for our two different solutions to fitting this model. [MUSIC]
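The two solutions mentioned here can be sketched from this one gradient. Setting -2H^T(y - Hw) to zero gives the normal equations H^T H w = H^T y for the closed form, and stepping opposite the gradient gives gradient descent. A minimal sketch with synthetic data (the step size and iteration count below are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.normal(size=(50, 4))            # feature matrix
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = H @ w_true + 0.01 * rng.normal(size=50)  # outputs with small noise

# Solution 1, closed form: set the gradient -2 H^T (y - Hw) to zero,
# which gives the normal equations H^T H w = H^T y.
w_closed = np.linalg.solve(H.T @ H, H.T @ y)

# Solution 2, gradient descent: repeatedly step opposite the gradient.
w = np.zeros(4)
eta = 1e-3                               # step size (illustrative)
for _ in range(20000):
    grad = -2 * H.T @ (y - H @ w)
    w -= eta * grad

assert np.allclose(w_closed, w_true, atol=0.1)   # near the true coefficients
assert np.allclose(w, w_closed, atol=1e-3)       # both solutions agree
```

Both routes land on essentially the same coefficients, which is why deriving this single gradient expression is the key step.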