[MUSIC]
Okay, well, maybe we can just take our ridge regression solution, take all the little coefficients, and just say they're 0, just get rid of those. We're gonna call that thresholding them away. We're gonna choose some value where, if the magnitude of the coefficient is below the threshold that we choose, we're just going to say it's not in the model. So, let's explore this idea a little bit.

So here I'm just showing an illustration, a little cartoon, of what the weights might look like on a set of features in our housing application. And I'm choosing some threshold, which is this dashed black line. And if the magnitude exceeds that threshold, then I'm gonna say that feature is in my model. So here, in pink, or fuchsia. Carlos, what color is this?
>> Fuchsia.
>> Fuchsia. This is Carlos's color scheme. He's very attached to it, fuchsia. So in fuchsia, I'm showing the features that have been selected to be in my model after doing this thresholding of my ridge regression coefficients.

It might seem like a reasonable approach, but let's dig into this a little bit more. And in particular, let's look at two very related features. So if you look at this list of features, you see, in green, I've highlighted number of bathrooms and number of showers. These numbers tend to be very, very close to one another, because lots of bathrooms have showers, and as the number of showers grows, the number of bathrooms grows, because you're very unlikely to have a shower that's not in a bathroom.

But what's happened here? Well, our model has included nothing having to do with bathrooms, or showers, or anything of that concept. So that doesn't really make a lot of sense. To me, it seems like something having to do with how many bathrooms are in the house should be a valuable feature to include when I'm assessing the value of the house.

So what's going wrong? Well, what if I hadn't included number of showers? Let's, just for simplicity's sake, treat the number of showers as exactly equivalent to the number of bathrooms. It might not be exactly equivalent, but they're very strongly related. But like I said, for simplicity, let's say they're exactly the same.
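[To make the thresholding idea concrete, here is a minimal sketch in Python using scikit-learn's Ridge. The toy data, the feature count, the alpha, and the threshold value are all made up for illustration and are not part of the course materials.]

```python
# Minimal sketch: fit ridge regression, then threshold small coefficients away.
# The data, alpha, and threshold below are illustrative assumptions only.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))                    # 5 made-up housing features
true_w = np.array([3.0, 0.0, 1.5, 0.0, 0.2])   # only some features matter
y = X @ true_w + rng.normal(scale=0.5, size=n)

model = Ridge(alpha=1.0).fit(X, y)

threshold = 0.5                                 # chosen by eye, like the dashed line in the slide
selected = np.abs(model.coef_) > threshold      # keep features whose weight magnitude exceeds it
print("coefficients:", np.round(model.coef_, 3))
print("kept feature indices:", np.where(selected)[0])
```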
So if I hadn't included number of showers in my model to begin with, in the full model, then when I did my ridge regression, it would have placed the weight that had been on number of showers onto number of bathrooms. Because remember, it's a linear model: we're summing weight times number of bathrooms plus weight times number of showers. So if number of bathrooms equals number of showers, that's equivalent to the sum of those two weights just times number of bathrooms, excluding number of showers from the model.

Okay, so the point here is that if I hadn't included this redundant feature, number of showers, what I see now visually is that number of bathrooms would have been included in my selected model when doing the thresholding.

So, the issue that I'm getting at here is not specific to number of bathrooms and number of showers. It's an issue that arises whenever you have a whole collection, maybe not just two, but a whole set of strongly related features. More formally, statistically, I'll call these strongly correlated features. Then ridge regression is gonna prefer a solution that places a bunch of smaller weights on all the features, rather than one large weight on one of the features. Because remember, the cost under the ridge regression model is the size of each weight squared. And so if you have one really big weight, that's really gonna blow up that cost, that L2 penalty term: a single weight w costs w squared, while splitting it into two weights of w/2 costs only w squared over 2. Whereas the fit of the model is gonna be basically the same whether I distribute the weights over redundant features or put a big weight on just one of them and zeros elsewhere.

So what's gonna happen is I'm gonna get a bunch of these small weights over the redundant features. And if I think about simply thresholding, I'm gonna discard all of these redundant features, whereas one of them, or potentially the whole set, really was relevant to my prediction task.

So hopefully it's clear from this illustration that just taking ridge regression and thresholding out these small weights is not a solution to our feature selection problem. So instead we're left with this question: can we use regularization to directly optimize for sparsity?
[MUSIC]
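[As a rough check on this argument, here is a small sketch, again with made-up data and an arbitrary alpha and threshold rather than the course's own code. It duplicates the bathrooms feature as a stand-in for showers and shows ridge splitting the weight across the two copies, so both copies fall under a threshold that the single feature on its own would have cleared.]

```python
# Sketch of the redundant-feature problem: duplicate one feature (showers == bathrooms)
# and watch ridge split the weight across the copies, pushing both under the threshold.
# Data, alpha, and the threshold are assumptions made up for illustration.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n = 500
bathrooms = rng.integers(1, 5, size=n).astype(float)
showers = bathrooms.copy()             # treat showers as exactly equal to bathrooms
price = 10.0 * bathrooms + rng.normal(scale=1.0, size=n)

# Without the redundant feature: all the weight lands on bathrooms.
w_single = Ridge(alpha=10.0).fit(bathrooms[:, None], price).coef_

# With both copies: ridge prefers two half-sized weights (smaller L2 penalty, same fit).
w_both = Ridge(alpha=10.0).fit(np.column_stack([bathrooms, showers]), price).coef_

print("bathrooms alone:", np.round(w_single, 2))       # roughly [10]
print("bathrooms + showers:", np.round(w_both, 2))      # roughly [5, 5]
print("kept after threshold 7:", np.abs(w_both) > 7)    # both copies get discarded
```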