[MUSIC] Well, we discussed ridge regression and cross-validation, but we kinda brushed under the rug what can be a fairly important issue with our ridge regression objective, which is how to deal with the intercept term that's commonly included in most models.

So in particular, let's recall our multiple regression model, which is shown here. So far we've just treated it generically: there's some h0 of x, that's our first feature, with coefficient w0. But as we mentioned two modules ago, typically that first feature is taken to be what's called the constant feature, so that w0 just represents the intercept of the model. So if you're thinking of some hyperplane, where is it sitting along that y-axis? And then all the other features are some arbitrary set of other terms that you might be interested in.

Okay. Well, if we have this constant feature in our model, then the model that I wrote on the previous slide simplifies to the following. In this case, when we think of our matrix notation for having N different observations, when we're forming our H matrix, the first column of that matrix is the one multiplying the w0 term, the w0 coefficient. So in this special case, that entire first column is filled entirely with ones, so that we get w0 as the contribution of the first feature for every observation. Okay, so this is the specific form that our H matrix is gonna take in this case where we have an intercept term in the model.

Now let's return to our standard ridge regression objective, where we said we have RSS(w) + lambda ||w||_2 squared, and where that w vector included w0 for the intercept term, in models where that's what it represents. So a question is, does this really make sense to do? Because what this is doing is encouraging that intercept term to be small; that's what the ridge regression penalty does. And do we want a small intercept? It's useful to think about ridge regression when you're adding lots and lots of features, but regardless of how many features you add to your model, does that really matter to how we think about the magnitude of the intercept? Not really.
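To see this concretely, here's a minimal numpy sketch (my own illustration, not code from the lecture) of the H matrix with the constant feature, and the standard ridge closed-form solution in which the plain identity matrix means w0 gets shrunk along with everything else. The data and variable names are made up for the example:

```python
import numpy as np

# Toy data: N observations, 3 arbitrary features, true intercept 2.5 (all made up).
np.random.seed(0)
N = 100
X = np.random.randn(N, 3)
y = 2.5 + X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(N)

# H matrix with the constant feature: first column all ones, so w[0] is the intercept.
H = np.column_stack([np.ones(N), X])

# Standard ridge closed form: w_hat = (H^T H + lambda * I)^(-1) H^T y.
# The identity matrix has a 1 in its (0, 0) entry, so this penalty shrinks w0 too.
lam = 1.0
w_hat = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)
print(w_hat)  # w_hat[0] is pulled toward 0 relative to the true intercept
```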
So it probably doesn't make a lot of sense intuitively to think about shrinking the intercept just because we have this very flexible model with lots of other features. So let's think about how to address this.

Okay, the first option we have is to not penalize the intercept term. The way we can do that is to separate out that w0 coefficient from all the other w's, w1, w2, all the way up to wD, when we're thinking about the penalty term. So we have the residual sum of squares of w0 and what I'll call w_rest, all those other w's. And when we add our ridge regression penalty, the 2-norm is only taken of that w_rest vector, all those w's not including our intercept.

So a question is, how do we implement this in practice? How is this gonna modify the closed-form solution or the gradient descent algorithm that we showed previously, when we weren't handling this special case?

The very simple modification we can make is to define something I'm calling Imod, a modified identity matrix. It has a 0 in the first entry, the (1,1) entry, and all the other elements are exactly the same as in the identity matrix before. So to be explicit, our H-transpose-H term is gonna look just as it did before, but now the lambda Imod matrix has a 0 in the entry corresponding to the w0 index, lambdas as before everywhere else on the diagonal, and of course still 0s off the diagonal.

Okay, now let's look at our gradient descent algorithm. Here it's gonna be very simple: we just add in a special case that if we're updating our intercept term, so if we're looking at that zeroth feature, we're just gonna use our old least squares update, with no shrinkage to w0. But otherwise, for all other features, we're gonna do the ridge update.

Okay, so we see algorithmically it's very straightforward to make this modification where we don't want to penalize that intercept term.
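Here's a minimal sketch of both versions of that fix, again my own illustration rather than code from the lecture: the closed form with the modified identity matrix, and gradient descent with the special case for the intercept. The step size and iteration count are arbitrary placeholders that would need tuning for a given dataset:

```python
import numpy as np

def ridge_closed_form_no_intercept_penalty(H, y, lam):
    """Closed form with Imod: a 0 in the (0, 0) entry, so w0 is not penalized."""
    I_mod = np.eye(H.shape[1])
    I_mod[0, 0] = 0.0                    # no shrinkage on the intercept
    return np.linalg.solve(H.T @ H + lam * I_mod, H.T @ y)

def ridge_gradient_descent(H, y, lam, step_size=1e-3, n_iters=10000):
    """Gradient descent where feature 0 gets the plain least squares update."""
    w = np.zeros(H.shape[1])
    for _ in range(n_iters):
        residuals = y - H @ w
        grad = -2.0 * (H.T @ residuals)  # least squares part, for every coefficient
        grad[1:] += 2.0 * lam * w[1:]    # ridge term for all features except the intercept
        w -= step_size * grad            # so w[0] sees only the least squares update
    return w
```

On the same data, the two should agree up to the convergence tolerance of the gradient descent, with w[0] no longer shrunk toward 0.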
But there's another option we have, which is to transform the data. So in particular, if we center the data about 0 as a pre-processing step, then it doesn't matter so much that we're shrinking the intercept towards 0 and not correcting for that, because when we have data centered about 0, in general we tend to believe that the intercept will be pretty small.

So here what I'm saying is, step one, first we transform all our y observations to have mean 0. And then as a second step, we just run exactly the ridge regression we described at the beginning of this module, where we don't account for the fact that there's this intercept term at all. So, that's another perfectly reasonable solution to this problem.
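And a sketch of this second option, following the lecture's two steps: center the y observations, then run the unmodified ridge regression from the start of the module. Adding the mean back at prediction time is my own addition, implied but not spelled out above:

```python
import numpy as np

def ridge_with_centered_y(H, y, lam):
    """Step 1: center the y observations. Step 2: plain ridge, penalizing all of w."""
    y_mean = y.mean()
    y_centered = y - y_mean              # y now has mean 0, so the intercept should be small
    D = H.shape[1]
    w = np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y_centered)
    return w, y_mean                     # predict on new data with: H_new @ w + y_mean
```

[MUSIC]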