For now, let's assume that somebody gives you the lambda value that we wanna use in our ridge regression objective, and let's discuss algorithmically how we fit this model. So in particular, in this part, we're describing this gray box, our machine learning algorithm.

And to start with, just like we've done many times before, we're first gonna rewrite our objective in matrix notation, where we assume we have some N observations and we wanna write jointly what the objective is for all N of these observations.

So let's just recall what we did for our residual sum of squares term, where, for our model, we thought about stacking up all N observations and all the features associated with those N observations in this green matrix. And that matrix got multiplied by a vector of our regression coefficients, and then we had this additive noise per observation. So we wrote this in matrix notation as y = Hw + epsilon. Then, when we went to form our residual sum of squares, we showed that it was equivalent to the following, where we have (y - Hw)^T (y - Hw).

Okay, so this is our matrix notation for our residual sum of squares term, but now we wanna do a similar thing for our model complexity penalty, the one we added to our original objective to get our resulting ridge regression objective. So in particular, we want to write this two norm of w squared in vector notation. So this two norm of our w vector squared, we said, was equal to w0 squared + w1 squared + w2 squared + all the way up to our Dth feature weight squared. And this is equivalent to taking our w vector, transposed, meaning putting it as a row, and multiplying by the w vector itself. Because if we think about doing this multiplication, we're gonna get w0 * w0 + w1 * w1 + w2 * w2, and so on, and that's exactly equivalent. And so we can write this as w transpose w (here I'm trying to draw a thick w for the vector).

Okay, so this is our vector notation for this model complexity term. So putting it all together, our ridge regression total cost, for some N observations, can be written as follows, where we have (y - Hw)^T (y - Hw) + lambda * w^T w.
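To make this matrix notation concrete, here is a minimal NumPy sketch of the total cost above. The function name ridge_cost and the toy values for H, y, w, and lambda are just illustrative assumptions, not anything from the course materials.

```python
import numpy as np

def ridge_cost(H, y, w, lam):
    """Ridge regression total cost: (y - Hw)^T (y - Hw) + lambda * w^T w.
    H: (N, D+1) feature matrix, y: (N,) outputs,
    w: (D+1,) coefficients, lam: penalty strength lambda."""
    residual = y - H @ w        # one residual per observation
    rss = residual @ residual   # (y - Hw)^T (y - Hw)
    penalty = w @ w             # w^T w, the squared two norm of w
    return rss + lam * penalty

# Toy usage with made-up numbers (constant feature plus one input feature).
H = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5]])
y = np.array([1.0, 2.0, 2.5])
w = np.array([0.3, 0.9])
print(ridge_cost(H, y, w, lam=0.1))
```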
Okay, now that we have this, we can start doing what we've done in the past, which is take the gradient, and we can think about either setting the gradient to zero to get a closed-form solution, or doing our gradient descent algorithm. And we're gonna walk through both of these approaches now.

So the first step is computing the gradient of this objective. Here I'm just writing exactly what we had on the previous slide, but now with these gradient signs in front. And as we've seen before, the gradient distributes across a sum, so we have the gradient of this first term plus the gradient of our model complexity term, where the first term is our measure of fit, that residual sum of squares term. And we know that the gradient of the residual sum of squares has the following form: -2 H^T (y - Hw). The question is, what's the gradient of this model complexity term?

What we see is that the gradient of this term is 2 * w. Why is it 2w? Well, instead of deriving it, I'll leave that as a little mini challenge for you guys. It's fairly straightforward to derive, just taking partials with respect to each w component. Just write w^T w as w0 squared + w1 squared + and so on, all the way up to wD squared, and then take a derivative just with respect to one of the ws. But for now, what I'm gonna do is just draw an analogy to the 1D case, where w^T w is analogous to just w squared, if w weren't a vector and were instead just a scalar. And what's the derivative of w squared? It's 2w. Okay, so proof by analogy here.
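As a rough sanity check on the gradient we just wrote down, here's a small NumPy sketch that compares the analytic expression -2 H^T (y - Hw) + 2 * lambda * w against a finite-difference approximation of the cost. The ridge_cost function is restated so the block runs on its own, and all names and toy inputs are illustrative assumptions rather than part of the lecture.

```python
import numpy as np

def ridge_cost(H, y, w, lam):
    """Ridge total cost: (y - Hw)^T (y - Hw) + lambda * w^T w."""
    residual = y - H @ w
    return residual @ residual + lam * (w @ w)

def ridge_gradient(H, y, w, lam):
    """Gradient of the ridge total cost:
    -2 H^T (y - Hw)   (measure of fit term)
    + 2 * lambda * w  (model complexity term)."""
    return -2.0 * H.T @ (y - H @ w) + 2.0 * lam * w

# Toy data and a central finite-difference check, one coordinate of w at a time.
H = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5]])
y = np.array([1.0, 2.0, 2.5])
w = np.array([0.3, 0.9])
lam, eps = 0.1, 1e-6

numeric = np.array([
    (ridge_cost(H, y, w + eps * np.eye(len(w))[j], lam) -
     ridge_cost(H, y, w - eps * np.eye(len(w))[j], lam)) / (2 * eps)
    for j in range(len(w))
])
print(ridge_gradient(H, y, w, lam))  # analytic gradient
print(numeric)                       # should agree to roughly 1e-6
```

The same gradient expression is what would feed both approaches mentioned above: setting it to zero for a closed-form solution, or stepping in its negative direction for gradient descent.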