[MUSIC] The fifth module was then all about feature selection. So, to motivate this, we talked about the fact that every house might have a really long list of attributes associated with it, and for reasons of both interpretability as well as efficiency in forming our predictions, we want to select a sparse subset of these features to include in our model. So to perform this feature selection, the first thing that we talked about was a set of methods that explicitly search over models with different numbers of features, and the exhaustive approach was something that's called all subsets selection. But then we also talked about greedy procedures like forward selection, and saw that these gave perhaps suboptimal solutions, but were much more efficient than the all subsets procedure. But instead of explicitly searching over models with different sets of features, we talked about how to use lasso regression to implicitly do this feature selection, where the objective looks just like the ridge objective, but instead of using the L2 norm, we're using the L1 norm of our coefficients. And we showed how that led to sparse solutions. So in particular, if we look at the coefficient path associated with lasso, we saw that for any value of lambda we ended up, typically, with a sparse solution, getting sparser and sparser as we increase lambda. And this was in contrast to what we saw for ridge, where the coefficients just got smaller and smaller; here we actually end up with the sparse solutions that lead to this idea of feature selection. Then, to optimize this lasso objective, we talked about a coordinate descent algorithm, where we solved a collection of 1D optimization problems, iterating through the different dimensions of our objective, in particular the different features of our regression model. 
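The coordinate descent idea can be sketched in a few lines. This is a minimal illustration, not the course's exact code: it assumes the feature columns are normalized to unit norm (so each 1D update has a closed form) and uses the lambda/2 soft-threshold convention for the objective RSS(w) + lambda * ||w||_1.

```python
import numpy as np

def soft_threshold(rho, lam):
    # Shrink rho toward zero; set it exactly to zero inside [-lam/2, lam/2].
    if rho < -lam / 2:
        return rho + lam / 2
    elif rho > lam / 2:
        return rho - lam / 2
    return 0.0

def lasso_coordinate_descent(X, y, lam, num_iters=100):
    """Cycle through coordinates, solving each 1D lasso problem in turn.

    Assumes the columns of X are normalized to unit norm.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        for j in range(d):
            # Residual with feature j's contribution removed.
            r_j = y - X @ w + X[:, j] * w[j]
            # Correlation of feature j with that residual.
            rho_j = X[:, j] @ r_j
            w[j] = soft_threshold(rho_j, lam)
    return w
```

With lam = 0 this recovers the least squares solution; as lam grows, more coefficients are set exactly to zero, which is the sparse-solution behavior described above.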
And what we saw was that for lasso we ended up setting our coefficients according to something that we called soft thresholding, where within a certain range of the correlation coefficient that we described in this module, we're going to set our coefficient exactly to zero. And outside that range, relative to our least squares solution, we're going to shrink the value of the estimated coefficient. So lasso can lead to these sparse solutions, and has shown impact in a really, really large set of different applied domains. In our last module, we talked about a set of nonparametric techniques called nearest neighbor and kernel regression. And one nearest neighbor was a really, really simple procedure, the most basic procedure that you could imagine doing. But we showed that it actually could perform really well, especially when you have lots of data. And what this method does is, if you're going to estimate the value of your house, you just look for the most similar house, look at its value, and predict your value to be exactly the same. Then we talked about making this a little bit more robust by looking at a set of k nearest neighbors, and then said, well, you can also think about weighting these k nearest neighbors, when you're going to compute your predicted value, by how similar they are to you, and then averaging across these weighted values to form your estimated prediction. And this led directly to an idea of kernel regression, where instead of just weighting a collection of neighbors, you actually weight every observation in your data set. But a lot of the kernels that we specify actually set those weights to zero outside a certain range, and decay them within a given range. And so what this leads to is an idea of these very local fits, and we talked about how kernel regression was equivalent to forming these locally constant fits, which was in contrast to our parametric models that formed these global fits. 
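A small sketch of this kernel regression idea, under the assumptions that we are in one dimension and use the Epanechnikov kernel, one of the kernels with exactly the property described: weights decay within the bandwidth and are set to zero outside it. The prediction at each query point is a weighted (locally constant) average, in the Nadaraya-Watson style:

```python
import numpy as np

def epanechnikov(dist, lam):
    # Weight decays within |dist| <= lam and is exactly zero outside.
    u = dist / lam
    return np.where(np.abs(u) <= 1, 1 - u**2, 0.0)

def kernel_regression(x_train, y_train, x_query, lam):
    """Locally constant fit: weighted average of all observations,
    where weights vanish outside the kernel bandwidth lam."""
    preds = []
    for xq in np.atleast_1d(x_query):
        w = epanechnikov(x_train - xq, lam)
        if w.sum() == 0:
            preds.append(np.nan)  # no training points within the bandwidth
        else:
            preds.append(np.dot(w, y_train) / w.sum())
    return np.array(preds)
```

Weighted k-nearest-neighbors falls out as a variant of the same computation, where only the k closest points receive nonzero weight.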
So here's a visualization of the kernel regression fit that we saw in this module, and we see how it leads to these really nice, smooth fits. And these fits are very adaptive to the complexity of the data that we see, and can increase in complexity as we get more and more data. [MUSIC]