[MUSIC] Okay, well maybe we can just take our ridge regression solution, and just take all the little coefficients and just say they're 0, just get rid of those. We're gonna call that thresholding them away. We're gonna choose some value, and if the magnitude of a coefficient is below that threshold, we're just going to say that feature is not in the model. So, let's explore this idea a little bit. So, here I'm just showing an illustration, a little cartoon, of what the weights might look like on a set of features in our housing application. And I'm choosing some threshold, which is this dashed black line. And if the magnitude exceeds that threshold, then I'm gonna say that feature is in my model. So here, in pink or fuchsia. Carlos, what color is this? >> Fuchsia. >> Fuchsia. This is Carlos's color scheme. He's very attached to it, fuchsia. So in fuchsia, I'm showing the features that have been selected to be in my model after doing this thresholding of my ridge regression coefficients.

Might seem like a reasonable approach, but let's dig into this a little bit more. And in particular, let's look at two very related features. So if you look at this list of features, you see, in green, I've highlighted number of bathrooms and number of showers. These numbers tend to be very, very close to one another, because lots of bathrooms have showers, and as the number of showers grows, clearly the number of bathrooms grows, because you're very unlikely to have a shower not in a bathroom. But what's happened here? Well, our model has included nothing having to do with bathrooms, or showers, or anything of this concept. So that doesn't really make a lot of sense. To me it seems like something having to do with how many bathrooms are in the house should be a valuable feature to include when I'm assessing the value of the house.

So what's going wrong? Well, what if I hadn't included number of showers? Let's just, for simplicity's sake, treat the number of showers as exactly equivalent to the number of bathrooms. It might not be exactly equivalent, but they're very strongly related. But like I said, for simplicity, let's say they're exactly the same. So if I hadn't included number of showers in my model to begin with, in the full model, then when I did my ridge regression fit, it would've placed the weight that had been on number of showers onto number of bathrooms. Because remember, it's a linear model, we're summing over weight times number of bathrooms plus weight times number of showers. So if number of bathrooms equals number of showers, that's equivalent to the sum of those two weights times number of bathrooms, excluding number of showers from the model. Okay so, the point here is that if I hadn't included this redundant feature, number of showers, what I see now visually is that number of bathrooms would have been included in my selected model after doing the thresholding.

So, the issue that I'm getting at here is not specific to number of bathrooms and number of showers. It's an issue that arises whenever you have a whole collection, maybe not just two, maybe a whole set of strongly related features. More formally, statistically, I will call these strongly correlated features. Then ridge regression is gonna prefer a solution that places a bunch of smaller weights on all the features, rather than one large weight on one of the features. Because remember, the cost under the ridge regression model is the size of each coefficient squared.
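To make this concrete, here's a minimal sketch, not from the lecture itself, using scikit-learn's Ridge on made-up housing data. The feature names, true weights, and regularization strength are all hypothetical; the point is just to see the weight on bathrooms get split once an exact copy of it (showers) is added as a second feature.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 200

# Hypothetical housing features: showers is an exact copy of bathrooms,
# mirroring the lecture's simplifying assumption.
bathrooms = rng.integers(1, 4, size=n).astype(float)
showers = bathrooms.copy()
bedrooms = rng.integers(1, 6, size=n).astype(float)

# Made-up "true" price: bathrooms matter a lot (weight 10), bedrooms somewhat (weight 7).
price = 10.0 * bathrooms + 7.0 * bedrooms + rng.normal(0.0, 1.0, size=n)

X_full = np.column_stack([bathrooms, showers, bedrooms])
X_no_showers = np.column_stack([bathrooms, bedrooms])

ridge_full = Ridge(alpha=10.0).fit(X_full, price)
ridge_reduced = Ridge(alpha=10.0).fit(X_no_showers, price)

print("with showers:   ", ridge_full.coef_)     # the ~10 on bathrooms gets split (~5 and ~5)
print("without showers:", ridge_reduced.coef_)  # the full ~10 lands on bathrooms alone
```

Because the two columns are identical, the ridge objective is symmetric in their weights, so the solution splits the bathroom weight evenly between them; that's exactly the behavior the lecture describes.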
And so if you have one really big weight, that's really gonna blow up that cost, that L2 penalty term. Whereas the fit of the model is gonna be basically about the same, whether I distribute the weight over the redundant features or put one big weight on just one of them and zeros elsewhere. So what's gonna happen is I'm going to get a bunch of these small weights over the redundant features. And if I think about simply thresholding, I'm gonna discard all of these redundant features, whereas one of them, or potentially the whole set, really was relevant to my prediction task. So hopefully it's clear from this illustration that just taking ridge regression and thresholding out these small weights is not a solution to our feature selection problem. So instead we're left with this question of, can we use regularization to directly optimize for sparsity? [MUSIC]
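Continuing the same hypothetical sketch from above, naively thresholding those ridge coefficients shows the failure mode: both bathroom-related features fall below the cutoff and get thrown out, while the same cutoff would have kept bathrooms if the redundant copy had never been included. The threshold value here is an arbitrary illustrative choice.

```python
# Naive thresholding of the ridge coefficients from the sketch above.
names_full = ["bathrooms", "showers", "bedrooms"]
names_reduced = ["bathrooms", "bedrooms"]
threshold = 6.0  # arbitrary cutoff, chosen only to illustrate the failure

selected_full = [f for f, w in zip(names_full, ridge_full.coef_) if abs(w) > threshold]
print(selected_full)     # ['bedrooms']: both bathroom-related features are discarded,
                         # even though the bathroom concept clearly matters for price

selected_reduced = [f for f, w in zip(names_reduced, ridge_reduced.coef_) if abs(w) > threshold]
print(selected_reduced)  # ['bathrooms', 'bedrooms']: without the redundant copy,
                         # bathrooms survives the very same threshold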