[MUSIC] So we've gone through the coordinate descent algorithm for solving our lasso objective at a specific value of lambda, and that raises the question: how do we choose the lambda tuning parameter? Well, it's exactly the same as in ridge regression. If we have enough data, we can hold out a validation set and use it to choose amongst these different model complexities lambda. Or, if we don't have enough data, we talked about doing cross validation. So these are two very reasonable options for choosing this tuning parameter lambda. But in the case of lasso, I just want to mention that these types of procedures, assessing the error on a validation set or doing cross validation, choose the lambda that provides the best predictive accuracy. What that tends to do is choose a lambda value that's a bit smaller than what would be optimal for model selection, because for predictive accuracy a slightly less sparse solution can actually give slightly better predictions on any finite data set than the true model with the sparsest possible set of features. So there are other ways to choose this tuning parameter lambda, and I'll just refer you to other texts, like the textbook by Kevin Murphy, Machine Learning: A Probabilistic Perspective, for further discussion of this issue.

So let's conclude by discussing a few practical issues with lasso. The first is the fact that, as we've seen in multiple different ways throughout this module, lasso shrinks the coefficients relative to the least squares solution. So what it's doing is increasing the bias of the solution in exchange for lower variance. It's doing this automatic bias-variance tradeoff, but we might still want a low-bias solution, and we can reduce the bias of our solution in the following way. This is called debiasing the lasso solution: we run our lasso solver and get out a set of selected features, those are the features whose weights were not set exactly to zero, and then we take that reduced model, the model with just these selected features, and run standard least squares regression on it. In this case, the features that were deemed relevant to our task have weights that, after this debiasing procedure, are not shrunk relative to the weights of the least squares solution we would have gotten had we started exactly with that reduced model. But, of course, that was the whole point: we didn't know which model to use, so the lasso lets us choose the model, and then we just run least squares on that model.

These plots show a little illustration of the benefits of debiasing. The top figure shows the true coefficients for the data: it's generated with 4,096 different coefficients, or different features in the model, but only 160 of these have nonzero coefficients associated with them. So it's a very sparse setup. If you look at the L1 reconstruction, that's the second row of the plot, you see that it discovered 1,024 features with nonzero weights and has a mean squared error of 0.0072. But if you take those 1,024 nonzero-weight features and just run least squares regression on them, you get the third row, which has a significantly lower mean squared error. In contrast, if you were to run least squares on the full model with all 4,096 features, you would get a really poor estimate of what's going on and a very large mean squared error.
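To make the "cross-validate lambda, then debias" recipe concrete, here is a minimal sketch in Python. It is not from the lecture: it assumes NumPy and scikit-learn (where lambda is called `alpha`) and uses a small synthetic sparse problem. The idea is simply to let a cross-validated lasso pick the features, then refit plain least squares on only those features.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

# Synthetic sparse problem: many features, only a few with nonzero true weights.
rng = np.random.default_rng(0)
n, d, k = 200, 50, 5
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:k] = 5.0 * rng.normal(size=k)
y = X @ true_w + rng.normal(scale=0.5, size=n)

# Step 1: choose the penalty strength (lambda, `alpha` in scikit-learn) by cross validation.
lasso = LassoCV(cv=5).fit(X, y)

# Step 2: the selected features are those whose lasso weights are not exactly zero.
selected = np.flatnonzero(lasso.coef_)

# Step 3: debias by running ordinary least squares on the reduced model only.
debiased = LinearRegression().fit(X[:, selected], y)

print("selected features:      ", selected)
print("lasso weights (shrunk): ", lasso.coef_[selected])
print("debiased OLS weights:   ", debiased.coef_)
print("true weights:           ", true_w[selected])
```

On data like this, the debiased weights typically sit closer to the true values than the shrunken lasso weights, which is exactly the effect illustrated in the plots above.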
So this shows the importance of doing lasso and possibly this debiasing on top of it. Another issue with lasso is that if you have a collection of strongly correlated features, lasso tends to select amongst them pretty much arbitrarily. What I mean is that a small tweak of the data might lead to one variable being included, whereas a different tweak of the data would lead to a different one of these variables being included. So in our housing application, you could imagine that square feet and lot size are very correlated, and lasso might just arbitrarily choose between them, but in a lot of cases you actually want to include the whole set of correlated variables. And another issue is the fact that it's been shown empirically that, in many cases, ridge regression actually outperforms lasso in terms of predictive performance. So there are other variants of lasso, in particular something called the elastic net, that try to address this set of issues. What the elastic net does is fuse the objectives of ridge and lasso, including both an L1 and an L2 penalty. You can see this paper for further discussion of these and other issues with the original lasso objective, and how the elastic net addresses them.
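As a rough illustration, and again assuming scikit-learn rather than anything shown in the lecture, here is a small sketch of the elastic net on a pair of strongly correlated features. The `l1_ratio` parameter controls the mix of the two penalties: 1.0 recovers the lasso, and values below 1.0 add ridge-style shrinkage, which tends to spread weight across correlated features rather than arbitrarily picking one of them.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(1)
n, d = 200, 50
X = rng.normal(size=(n, d))
# Make feature 1 an almost exact copy of feature 0 (strong correlation).
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

# Cross-validate over both the penalty strength and the L1/L2 mix.
enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 1.0], cv=5).fit(X, y)

print("chosen l1_ratio:", enet.l1_ratio_)
print("weights on the correlated pair:", enet.coef_[:2])
```

With an L2 component in the penalty, the weight on the correlated pair is typically shared between the two copies, whereas a pure lasso fit would usually put nearly all of it on whichever copy the data happens to favor. [MUSIC]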