Well, for our third option for feature selection, we're gonna explore a completely different approach, which is using regularized regression to implicitly perform feature selection for us. And the algorithm we're gonna explore is called lasso. It's really fundamentally changed the fields of machine learning, statistics, and engineering. It's had a lot of impact in a number of applications, and it's a really interesting approach.

Let's recall regularized regression in the context of ridge regression first. Remember, we were balancing between the fit of our model on our training data and a measure of the magnitude of our coefficients, where we said that smaller coefficient magnitudes indicated that the model was not as overfit as one with crazy, large magnitudes. And we introduced a tuning parameter, lambda, which balanced between these two competing objectives.

So for our measure of fit, we looked at the residual sum of squares. And in the case of ridge regression, for our measure of the magnitude of the coefficients, we used what's called the L2 norm, in this case the 2-norm squared, which is the sum of each of our feature weights squared.

Okay, this ridge regression penalty, we said, encourages our weights to be small. But one thing I want to emphasize is that it encourages them to be small, but not exactly 0. We can see this if we look at the coefficient path that we described for ridge regression, where we see the magnitudes of our coefficients shrinking and shrinking towards 0 as we increase our lambda value. And we said that in the limit as lambda goes to infinity, the coefficients become exactly 0. But for any finite value of lambda, even a really, really large value of lambda, we're still just going to have very, very small coefficients, but they won't be exactly 0.

So why does it matter that they're not exactly 0? Why am I emphasizing this concept of the coefficients being 0 so much?
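To make that shrinking-but-never-exactly-zero behavior concrete, here is a minimal sketch, not from the lecture itself, that traces a ridge coefficient path on made-up synthetic data using scikit-learn's Ridge, whose alpha parameter plays the role of the lambda tuning parameter described above. The feature count, data, and lambda grid are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Made-up synthetic data: 8 features, only the first 3 actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
true_w = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=200)

# Ridge objective: RSS(w) + lambda * ||w||_2^2  (the L2 norm squared).
# Trace the coefficient path: the weights shrink toward 0 as lambda grows,
# but for any finite lambda none of them becomes exactly 0.
for lam in [1e-2, 1.0, 10.0, 100.0, 1e4, 1e6]:
    model = Ridge(alpha=lam).fit(X, y)  # alpha is sklearn's name for lambda
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"lambda={lam:>8g}  exactly-zero coefficients: {n_zero}  "
          f"max |w|: {np.max(np.abs(model.coef_)):.4f}")
```

Even at the largest lambda in this grid, the count of coefficients that are exactly zero stays at zero; the weights just keep getting smaller.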
Well, this is the concept of sparsity that we talked about before. If we have coefficients that are exactly 0, that's really important for the efficiency of our predictions, because we can just completely remove all the features whose coefficients are 0 from our prediction operation and use only the other coefficients and the other features. And likewise, for interpretability, if we say that one of the coefficients is exactly 0, what we're saying is that that feature is not in our model. So that is doing our feature selection.

So a question, though, is whether we can use regularization to get at this idea of doing feature selection, instead of what we talked about before. Before, when we were talking about all subsets or greedy algorithms, we were searching over a discrete set of possible solutions: the solution that included the first and the fifth feature, or the second and the seventh, or any of that entire collection of discrete solutions.

But what we'd like to ask here is whether we can start with, for example, our full model, and then shrink some coefficients not just towards 0, but exactly to 0. Because if we shrink them exactly to 0, then we're knocking out those coefficients, we're knocking those features out of our model. And instead, the non-zero coefficients are going to indicate our selected features.
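As a preview of where this is headed, here is a minimal sketch, again on made-up synthetic data rather than anything from the lecture, using scikit-learn's Lasso, which replaces the L2 penalty with an L1 penalty on the weights. With the illustrative alpha chosen here, the irrelevant features come out with coefficients that are exactly 0, and the non-zero coefficients mark the selected features. Note that scikit-learn scales its fit term by 1/(2n), so its alpha is not numerically identical to the lambda in the objective written above, though it plays the same balancing role.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Same kind of made-up synthetic data: 8 features, only the first 3 relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
true_w = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=200)

# Lasso uses an L1 penalty on the weights instead of ridge's L2 penalty.
# Unlike ridge, this penalty can drive some coefficients exactly to 0,
# so the non-zero coefficients indicate the selected features.
lasso = Lasso(alpha=0.5).fit(X, y)  # alpha plays the role of lambda
selected = np.flatnonzero(lasso.coef_ != 0)
print("coefficients:            ", np.round(lasso.coef_, 3))
print("selected feature indices:", selected)
```

Running a sketch like this, the five irrelevant features get coefficients that are exactly 0 and drop out of the model, which is precisely the implicit feature selection the lecture is about to develop.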