[MUSIC] Okay, well, in place of our ridge regression objective, what if we took our measure of the magnitude of our coefficients to be what's called the L1 norm, where we sum over the absolute value of each one of our coefficients? We actually described this as a reasonable measure of the magnitude of the coefficients when we were discussing ridge regression last module. Well, the result is an objective that leads to sparse solutions, for reasons that we're gonna go through in the remainder of this module. This objective is referred to as lasso regression, or L1 regularized regression. Just like in ridge regression, lasso is governed by a tuning parameter, lambda, that controls how much we're favoring sparsity of our solutions relative to the fit on our training data.

And just to be clear, when we're doing our feature selection task here, we're searching over a continuous space, this space of lambda values, with lambda governing the sparsity of the solution. That's in contrast to, for example, the all subsets or greedy approaches, which we described as searching over a discrete set of possible solutions. So, it's really a fundamentally different approach to doing feature selection.

Okay, but let's talk about what happens to our solution as we vary lambda. And again, just to emphasize, this lambda is a tuning parameter that in this case is balancing fit and sparsity. So, if lambda is equal to zero, what's gonna happen? Well, the penalty term completely disappears, and our objective is simply to minimize the residual sum of squares; that was our old least squares objective. So, we're gonna get w hat lasso, the solution to our lasso problem, exactly equal to w hat least squares, our unregularized solution. In contrast, if we set lambda equal to infinity, we're completely favoring the magnitude penalty and completely ignoring the residual sum of squares fit. In this case, what value of our regression coefficients has the smallest sum of absolute values? Well, just like in ridge, when lambda is equal to infinity we're gonna get w hat lasso equal to the zero vector. And if lambda is in between, the 1-norm of our lasso solution is gonna be less than or equal to the 1-norm of our least squares solution, and greater than or equal to zero. Here, zero is just a number, not the zero vector, because once we've taken the norm we're left with a scalar.
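To make those three regimes concrete, here is a minimal sketch, assuming Python with scikit-learn and some synthetic data (the dataset and the particular lambda values are made up for illustration, and scikit-learn's `alpha` plays the role of our lambda, up to the 1/(2N) scaling it puts on the residual sum of squares term):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 8 features, but only 3 of them actually drive the target.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

# lambda = 0: the penalty vanishes, so lasso reduces to least squares,
# and (generically) every feature gets a nonzero weight.
w_ls = LinearRegression().fit(X, y).coef_
print("least squares: nonzeros =", np.sum(w_ls != 0),
      " ||w||_1 =", round(np.abs(w_ls).sum(), 1))

# Increasing lambda: more and more coefficients are driven exactly to zero,
# and the L1 norm of the solution shrinks from ||w_ls||_1 toward 0.
for lam in [0.1, 1.0, 10.0, 100.0]:
    w = Lasso(alpha=lam, max_iter=10_000).fit(X, y).coef_
    print(f"lambda={lam:>6}: nonzeros = {np.sum(w != 0)}, "
          f"||w||_1 = {np.abs(w).sum():.1f}")
```

The printed count of nonzero coefficients drops as lambda grows, which is exactly the sparsity behavior the coefficient path is about to show.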
Okay, so as of yet, it's not clear why this L1 norm leads to sparsity. We're gonna get to that, but let's first just explore it visually, and one way we can see it is from the coefficient path. First, let's remember the coefficient path for ridge regression, where we saw that even for a large value of lambda, everything was in our model, just with small coefficients. So, every feature has w hat j different from zero, but all the w hat j are small for large values of our tuning parameter lambda. In contrast, when we look at the coefficient path for lasso, we see a very different pattern. What we see is that at certain critical values of this tuning parameter lambda, certain features jump out of our model. So, for example, here square feet of the lot disappears from the model; here number of bedrooms, almost simultaneously with number of floors and number of bathrooms; followed by the year the house was built. And let me just be clear: for, let's say, a value of lambda like this, we have a sparse set of features included in our model. The ones I've circled are the only features in our model, and all the other ones have dropped exactly to zero.

One thing that we see is that when lambda is very large, like the large value I showed on the previous plot, the only thing left in our model is square feet living. And note that square feet living still has a significantly large weight on it. So, I'll say there's a large weight on square feet living when everything else is out of the model, meaning not included in the model. Square feet living is still very valuable to our predictions, and it would take quite a large lambda value to say that even square feet living is not relevant. Eventually, square feet living would be shrunk exactly to zero, but only for a much larger value of lambda. But if I go back to my ridge regression solution, I see that I had a much smaller value on square feet living, because I was distributing weight across many other features in the model, so the individual impact of square feet living wasn't as clear. [MUSIC]
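To see that ridge versus lasso contrast in code, here is a rough sketch on synthetic data; the variable names are just hypothetical stand-ins for the lecture's housing features, and the penalty strengths are picked only to make the qualitative behavior visible (scikit-learn's Ridge and Lasso objectives are scaled differently, so the two alpha values are not directly comparable):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical stand-ins for the housing features: one "strong" feature
# (think square feet living) plus two features correlated with it
# (think square feet above, number of bathrooms). All are roughly unit
# variance, so the penalties treat them comparably.
rng = np.random.default_rng(0)
n = 1000
strong = rng.normal(size=n)
corr1 = 0.6 * strong + 0.8 * rng.normal(size=n)
corr2 = 0.6 * strong + 0.8 * rng.normal(size=n)
X = np.column_stack([strong, corr1, corr2])
y = 3.0 * strong + 0.5 * rng.normal(size=n)   # target really only depends on "strong"

ridge = Ridge(alpha=1000.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge:", np.round(ridge.coef_, 2))  # weight spread across all three features
print("lasso:", np.round(lasso.coef_, 2))  # typically: large weight on "strong", the others exactly 0
```

Ridge keeps every feature with a moderate weight, while lasso concentrates a larger weight on the strong feature and zeroes out the correlated ones, which is the pattern we just saw with square feet living.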