[MUSIC] Well, for our third option for feature selection, we're going to explore a completely different approach, which is using regularized regression to implicitly perform feature selection for us. The algorithm we're going to explore is called lasso, and it has really fundamentally changed the fields of machine learning, statistics, and engineering. It's had a lot of impact across a number of applications, and it's a really interesting approach.

Let's recall regularized regression in the context of ridge regression first. There, remember, we were balancing between the fit of our model on our training data and a measure of the magnitude of our coefficients, where we said that smaller coefficient magnitudes indicated a model that was not as overfit as one with crazy, large magnitudes. And we introduced a tuning parameter, lambda, which balanced between these two competing objectives. For our measure of fit, we looked at the residual sum of squares. And in the case of ridge regression, for our measure of the magnitude of the coefficients, we used what's called the L2 norm, in this case the two norm squared, which is the sum of each of our feature weights squared.

Okay, we said this ridge regression penalty encouraged our weights to be small. But one thing I want to emphasize is that it encourages them to be small, but not exactly 0. We can see this if we look at the coefficient path that we described for ridge regression, where we see the magnitudes of our coefficients shrinking and shrinking towards 0 as we increase our lambda value. And we said that in the limit, as lambda goes to infinity, the coefficients become exactly 0. But for any finite value of lambda, even a really, really large one, we're still just going to have very, very small coefficients that won't be exactly 0.

So why does it matter that they're not exactly 0? Why am I emphasizing so much this concept of the coefficients being 0? Well, this is the concept of sparsity that we talked about before. If we have coefficients that are exactly 0, then for efficiency of our predictions, that's really important, because we can completely remove all the features whose coefficients are 0 from our prediction operation and just use the other coefficients and the other features. And likewise, for interpretability, if we say that one of the coefficients is exactly 0, what we're saying is that that feature is not in our model. So that is doing our feature selection.

A question, though, is whether we can use regularization to get at this idea of doing feature selection, instead of what we talked about before. Before, when we were talking about all subsets or greedy algorithms, we were searching over a discrete set of possible solutions: the solution that included the first and the fifth feature, or the second and the seventh, or this entire collection of discrete solutions. What we'd like to ask here is whether we can instead start with, for example, our full model and then shrink some coefficients not just towards 0, but exactly to 0. Because if we shrink them exactly to 0, then we're knocking out those coefficients, we're knocking those features out of our model, and instead the non-zero coefficients are going to indicate our selected features. [MUSIC]
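To make the contrast concrete, here is a minimal sketch (not from the lecture) that compares ridge and lasso fits using scikit-learn; the synthetic data, the feature counts, and the specific penalty values are illustrative assumptions, and scikit-learn's alpha plays the role of the lambda tuning parameter discussed above.

```python
# Sketch: ridge shrinks coefficients toward 0, lasso drives some exactly to 0.
# Assumptions: synthetic data with 10 features, only the first 3 relevant;
# alpha values chosen just for illustration.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)

n, d = 200, 10
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [4.0, -2.0, 3.0]          # only 3 features actually matter
y = X @ true_w + 0.5 * rng.normal(size=n)

ridge = Ridge(alpha=10.0).fit(X, y)    # L2 penalty: small but non-zero weights
lasso = Lasso(alpha=0.5).fit(X, y)     # L1 penalty: some weights exactly 0

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print("features selected by lasso:", np.flatnonzero(lasso.coef_ != 0))
```

Under these assumptions, the ridge coefficients for the irrelevant features come out small but non-zero, whereas the lasso zeroes several of them out, so the non-zero lasso coefficients act as the selected feature set.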