So we've gone through the coordinate descent algorithm for solving our lasso objective for a specific value of lambda, and that begs the question: how do we choose the lambda tuning parameter value?

Well, it's exactly the same as in ridge regression. If we have enough data, we can think about holding out a validation set and using that to choose amongst these different model complexities lambda. Or, if we don't have enough data, we talked about doing cross validation. So these are two very reasonable options for choosing this tuning parameter lambda.

But in the case of lasso, I just want to mention that using these types of procedures, assessing the error on a validation set or doing cross validation, means choosing the lambda that provides the best predictive accuracy. What that tends to do is choose a lambda value that's a bit smaller than might be optimal for doing model selection, because for predictive accuracy, having a slightly less sparse solution can actually lead to a little bit better predictions on any finite data set than the true model with the sparsest possible set of features. So instead, there are other ways you can choose this tuning parameter lambda, and I'll refer you to other texts, like the textbook by Kevin Murphy, Machine Learning: A Probabilistic Perspective, for further discussion of this issue.

So let's conclude by discussing a few practical issues with lasso. The first is the fact that, as we've seen in multiple ways throughout this module, lasso shrinks the coefficients relative to the least squares solution. What it's doing is increasing the bias of the solution in exchange for lower variance. So it performs this automatic bias-variance tradeoff, but we might still want a low-bias solution, and we can reduce the bias of our solution in the following way. This is called debiasing the lasso solution: we run our lasso solver and get out a set of selected features, those being the features whose weights were not set exactly to zero, and then we take that reduced model, the model with just these selected features, and run standard least squares regression on that reduced model.
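To make these two steps concrete, here is a minimal sketch, not part of the course materials, of choosing lambda by cross validation and then debiasing. It assumes scikit-learn and a made-up feature matrix X and target y; LassoCV (where lambda is called alpha) picks the penalty by cross validation, and the debiasing step refits plain least squares on only the features the lasso kept.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

# Made-up data: X is an (n_samples, n_features) matrix, y the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
true_w = np.zeros(50)
true_w[:5] = [3.0, -2.0, 1.5, 2.5, 4.0]   # only a few truly relevant features
y = X @ true_w + rng.normal(scale=0.5, size=200)

# Step 1: choose lambda (called alpha in scikit-learn) by cross validation.
lasso = LassoCV(cv=5).fit(X, y)
print("chosen lambda:", lasso.alpha_)

# Step 2 (debiasing): keep only the features whose lasso weights are nonzero ...
selected = np.flatnonzero(lasso.coef_ != 0.0)

# ... and rerun plain least squares on that reduced model.
ols_debiased = LinearRegression().fit(X[:, selected], y)
print("selected features:", selected)
print("debiased weights:", ols_debiased.coef_)
```

The weights the final least squares fit reports for the selected features are no longer shrunk toward zero by the L1 penalty.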
And in this case, what happens is that the features that were deemed relevant to our task have weights, after this debiasing procedure, that are not shrunk relative to the weights of the least squares solution we would have gotten had we started with exactly that reduced model. But, of course, that was the whole point: we didn't know which model to use, so the lasso is allowing us to choose that model, and then we just run least squares on it.

These plots show a little illustration of the benefits of debiasing. The top figure shows the true coefficients for the data, which were generated with 4,096 different coefficients, or different features, in the model, but only 160 of these had nonzero coefficients associated with them. So it's a very sparse setup. If you look at the L1 reconstruction, the second row of the plot, you see that it has discovered 1,024 features with nonzero weights and has a mean squared error of 0.0072. But if you take those 1,024 nonzero-weight features and just run least squares regression on them, you get the third row, which has a significantly lower mean squared error. In contrast, if you were to run least squares on the full model with all 4,096 features, you would get a really poor estimate of what's going on and a very large mean squared error. So this shows the importance of doing both lasso and, possibly, this debiasing on top of it.

Another issue with lasso is that, if you have a collection of strongly correlated features, lasso will tend to select amongst them pretty much arbitrarily. What I mean is that a small tweak in the data might lead to one variable being included, whereas a different tweak of the data would have a different one of these variables included. So in our housing application, you could imagine that square feet and lot size are very correlated, and we might just arbitrarily choose between them, but in a lot of cases you actually want to include the whole set of correlated variables. And another issue is the fact that it's been shown empirically that, in many cases, ridge regression actually outperforms lasso in terms of predictive performance.
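To see that arbitrariness with correlated features in a small, hypothetical example (again just a sketch with scikit-learn and synthetic data): two nearly identical copies of the same underlying feature are generated, and refitting the lasso on different bootstrap resamples of the same data can move the nonzero weight from one copy to the other.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n = 100
base = rng.normal(size=n)
# Two strongly correlated features (think: square feet and lot size).
x1 = base + 0.01 * rng.normal(size=n)
x2 = base + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 2.0 * base + rng.normal(scale=0.3, size=n)

# Refit the lasso on several bootstrap resamples and watch which copy it keeps.
for seed in range(3):
    idx = np.random.default_rng(seed).integers(0, n, size=n)  # bootstrap indices
    w = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    print(f"resample {seed}: weights = {np.round(w, 3)}")
# Often only one of the two weights is nonzero, and which one it is
# can change from resample to resample.
```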
So there are other variants of lasso, such as the elastic net, that try to address this set of issues. What the elastic net does is fuse the objectives of ridge and lasso, including both an L1 and an L2 penalty. And you can see this paper for further discussion of these and other issues with the original lasso objective, and how the elastic net addresses them.
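For reference, one common way to write the elastic net objective, in notation consistent with the lasso and ridge costs used in this course and with a separate tuning parameter for each penalty, is:

```latex
\hat{w}^{\text{EN}}
  = \arg\min_{w} \; \mathrm{RSS}(w)
    + \lambda_1 \lVert w \rVert_1
    + \lambda_2 \lVert w \rVert_2^2
```

Setting lambda_2 = 0 recovers the lasso and setting lambda_1 = 0 recovers ridge regression; with both penalties active, strongly correlated features tend to be kept or dropped together rather than arbitrarily singled out.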