[MUSIC] Well, for our third option for feature selection, we're going to explore a completely different approach, which is using regularized regression to implicitly perform feature selection for us. The algorithm we're going to explore is called lasso, and it has really fundamentally changed the fields of machine learning, statistics, and engineering. It's had a lot of impact across a number of applications, and it's a really interesting approach.

Let's recall regularized regression in the context of ridge regression first. There, remember, we were balancing between the fit of our model on our training data and a measure of the magnitude of our coefficients, where we said that smaller coefficient magnitudes indicated a model that was not as overfit as one with crazy, large magnitudes. And we introduced a tuning parameter, lambda, which balanced between these two competing objectives. For our measure of fit, we looked at the residual sum of squares. And in the case of ridge regression, for our measure of the magnitude of the coefficients, we used what's called the L2 norm, in this case the two norm squared, which is the sum of each of our feature weights squared.

Okay, we said this ridge regression penalty encouraged our weights to be small. But one thing I want to emphasize is that it encourages them to be small, but not exactly 0. We can see this if we look at the coefficient path that we described for ridge regression, where we see the magnitudes of our coefficients shrinking and shrinking towards 0 as we increase our lambda value. And we said that in the limit, as lambda goes to infinity, the coefficients become exactly 0. But for any finite value of lambda, even a really, really large one, we're still just going to have very, very small coefficients that won't be exactly 0.

So why does it matter that they're not exactly 0? Why am I emphasizing so much this concept of the coefficients being 0? Well, this is the concept of sparsity that we talked about before. If we have coefficients that are exactly 0, then for efficiency of our predictions, that's really important, because we can completely remove all the features whose coefficients are 0 from our prediction operation and just use the other coefficients and the other features. And likewise, for interpretability, if we say that one of the coefficients is exactly 0, what we're saying is that that feature is not in our model. So that is doing our feature selection.

A question, though, is whether we can use regularization to get at this idea of doing feature selection, instead of what we talked about before. Before, when we were talking about all subsets or greedy algorithms, we were searching over a discrete set of possible solutions: the solution that included the first and the fifth feature, or the second and the seventh, or this entire collection of discrete solutions. What we'd like to ask here is whether we can instead start with, for example, our full model and then shrink some coefficients not just towards 0, but exactly to 0. Because if we shrink them exactly to 0, then we're knocking out those coefficients, we're knocking those features out of our model, and instead the non-zero coefficients are going to indicate our selected features. [MUSIC]
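To make the contrast concrete, here is a minimal sketch (not from the lecture) that compares ridge and lasso fits using scikit-learn; the synthetic data, the feature counts, and the specific penalty values are illustrative assumptions, and scikit-learn's alpha plays the role of the lambda tuning parameter discussed above.

```python
# Sketch: ridge shrinks coefficients toward 0, lasso drives some exactly to 0.
# Assumptions: synthetic data with 10 features, only the first 3 relevant;
# alpha values chosen just for illustration.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)

n, d = 200, 10
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [4.0, -2.0, 3.0]          # only 3 features actually matter
y = X @ true_w + 0.5 * rng.normal(size=n)

ridge = Ridge(alpha=10.0).fit(X, y)    # L2 penalty: small but non-zero weights
lasso = Lasso(alpha=0.5).fit(X, y)     # L1 penalty: some weights exactly 0

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print("features selected by lasso:", np.flatnonzero(lasso.coef_ != 0))
```

Under these assumptions, the ridge coefficients for the irrelevant features come out small but non-zero, whereas the lasso zeroes several of them out, so the non-zero lasso coefficients act as the selected feature set.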