[MUSIC] Okay, well, in place of our ridge regression objective, what if we took our measure of the magnitude of our coefficients to be what's called the L1 norm, where we sum over the absolute value of each one of our coefficients? We actually described this as a reasonable measure of the magnitude of the coefficients when we were discussing ridge regression last module. Well, the result is an objective that leads to sparse solutions, for reasons that we're gonna go through in the remainder of this module. This objective is referred to as lasso regression, or L1 regularized regression. Just like in ridge regression, lasso is governed by a tuning parameter, lambda, that controls how much we're favoring sparsity of our solutions relative to the fit on our training data.

And just to be clear, when we're doing our feature selection task here, we're searching over a continuous space, this space of lambda values, with lambda governing the sparsity of the solution. That's in contrast to, for example, the all subsets or greedy approaches, which we described as searching over a discrete set of possible solutions. So, it's really a fundamentally different approach to doing feature selection.

Okay, but let's talk about what happens to our solution as we vary lambda. And again, just to emphasize, this lambda is a tuning parameter that in this case is balancing fit and sparsity. So, if lambda is equal to zero, what's gonna happen? Well, the penalty term completely disappears, and our objective is simply to minimize the residual sum of squares; that was our old least squares objective. So, we're gonna get w hat lasso, the solution to our lasso problem, exactly equal to w hat least squares, our unregularized solution. In contrast, if we set lambda equal to infinity, we're completely favoring the magnitude penalty and completely ignoring the residual sum of squares fit. In this case, what value of our regression coefficients has the smallest sum of absolute values? Well, just like in ridge, when lambda is equal to infinity we're gonna get w hat lasso equal to the zero vector. And if lambda is in between, the 1-norm of our lasso solution is gonna be less than or equal to the 1-norm of our least squares solution, and greater than or equal to zero. Here, zero is just a number, not the zero vector, because once we've taken the norm we're left with a scalar.
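To make those three regimes concrete, here is a minimal sketch, assuming Python with scikit-learn and some synthetic data (the dataset and the particular lambda values are made up for illustration, and scikit-learn's `alpha` plays the role of our lambda, up to the 1/(2N) scaling it puts on the residual sum of squares term):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 8 features, but only 3 of them actually drive the target.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

# lambda = 0: the penalty vanishes, so lasso reduces to least squares,
# and (generically) every feature gets a nonzero weight.
w_ls = LinearRegression().fit(X, y).coef_
print("least squares: nonzeros =", np.sum(w_ls != 0),
      " ||w||_1 =", round(np.abs(w_ls).sum(), 1))

# Increasing lambda: more and more coefficients are driven exactly to zero,
# and the L1 norm of the solution shrinks from ||w_ls||_1 toward 0.
for lam in [0.1, 1.0, 10.0, 100.0]:
    w = Lasso(alpha=lam, max_iter=10_000).fit(X, y).coef_
    print(f"lambda={lam:>6}: nonzeros = {np.sum(w != 0)}, "
          f"||w||_1 = {np.abs(w).sum():.1f}")
```

The printed count of nonzero coefficients drops as lambda grows, which is exactly the sparsity behavior the coefficient path is about to show.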
Okay, so as of yet, it's not clear why this L1 norm leads to sparsity. We're gonna get to that, but let's first just explore it visually, and one way we can see it is from the coefficient path. First, let's remember the coefficient path for ridge regression, where we saw that even for a large value of lambda, everything was in our model, just with small coefficients. So, every feature has w hat j different from zero, but all the w hat j are small for large values of our tuning parameter lambda. In contrast, when we look at the coefficient path for lasso, we see a very different pattern. What we see is that at certain critical values of this tuning parameter lambda, certain features jump out of our model. So, for example, here square feet of the lot disappears from the model; here number of bedrooms, almost simultaneously with number of floors and number of bathrooms; followed by the year the house was built. And let me just be clear: for, let's say, a value of lambda like this, we have a sparse set of features included in our model. The ones I've circled are the only features in our model, and all the other ones have dropped exactly to zero.

One thing that we see is that when lambda is very large, like the large value I showed on the previous plot, the only thing left in our model is square feet living. And note that square feet living still has a significantly large weight on it. So, I'll say there's a large weight on square feet living when everything else is out of the model, meaning not included in the model. Square feet living is still very valuable to our predictions, and it would take quite a large lambda value to say that even square feet living is not relevant. Eventually, square feet living would be shrunk exactly to zero, but only for a much larger value of lambda. But if I go back to my ridge regression solution, I see that I had a much smaller value on square feet living, because I was distributing weight across many other features in the model, so the individual impact of square feet living wasn't as clear. [MUSIC]
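To see that ridge versus lasso contrast in code, here is a rough sketch on synthetic data; the variable names are just hypothetical stand-ins for the lecture's housing features, and the penalty strengths are picked only to make the qualitative behavior visible (scikit-learn's Ridge and Lasso objectives are scaled differently, so the two alpha values are not directly comparable):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical stand-ins for the housing features: one "strong" feature
# (think square feet living) plus two features correlated with it
# (think square feet above, number of bathrooms). All are roughly unit
# variance, so the penalties treat them comparably.
rng = np.random.default_rng(0)
n = 1000
strong = rng.normal(size=n)
corr1 = 0.6 * strong + 0.8 * rng.normal(size=n)
corr2 = 0.6 * strong + 0.8 * rng.normal(size=n)
X = np.column_stack([strong, corr1, corr2])
y = 3.0 * strong + 0.5 * rng.normal(size=n)   # target really only depends on "strong"

ridge = Ridge(alpha=1000.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge:", np.round(ridge.coef_, 2))  # weight spread across all three features
print("lasso:", np.round(lasso.coef_, 2))  # typically: large weight on "strong", the others exactly 0
```

Ridge keeps every feature with a moderate weight, while lasso concentrates a larger weight on the strong feature and zeroes out the correlated ones, which is the pattern we just saw with square feet living.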