The fifth module was then all about feature selection. So, to motivate this, we talked about the fact that every house might have a really long list of attributes associated with it, and for reasons of both interpretability as well as efficiency in forming our predictions, we want to select a sparse subset of these features to include in our model.

So to perform this feature selection, the first thing that we talked about was a set of methods that explicitly searched over models with different numbers of features, and the exhaustive approach was something that's called all subsets selection. But then we also talked about greedy procedures like forward selection, and saw that these gave perhaps suboptimal solutions, but were much more efficient than the all subsets procedure.

But instead of explicitly searching over models with different sets of features, we talked about how to use lasso regression to implicitly do this feature selection, where the objective looks just like ridge, but instead of using the L2 norm, we're using the L1 norm of our coefficients. And we showed how that led to sparse solutions. So in particular, if we look at the coefficient path associated with lasso, we saw that for any value of lambda we ended up, typically, with a sparse solution, getting sparser and sparser as we increase lambda. And this was in contrast to what we saw for ridge, where the coefficients just got smaller and smaller; here we actually end up with the sparse solutions that lead to this idea of feature selection.

Then, to optimize this lasso objective, we talked about a coordinate descent algorithm, where we solved a collection of one-dimensional optimization problems, iterating through the different dimensions of our objective, so in particular the different features of our regression model. And what we saw was that for lasso we ended up setting our coefficients according to something that we called soft thresholding, where, in a certain range of the correlation term that we described in this module, we're gonna set our coefficient exactly to zero. And outside that range, relative to our least squares solution, we're gonna shrink the value of the estimated coefficient.

So lasso can lead to these sparse solutions, and has shown impact in a really, really large set of different applied domains.
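To make that coordinate descent update concrete, here is a minimal sketch in plain NumPy. It is not the course's reference implementation: the function names (soft_threshold, lasso_coordinate_descent), the tolerance, and the iteration cap are illustrative choices, the intercept is omitted, and the features are assumed to be normalized (unit-norm columns) so that the per-coordinate update reduces to the simple soft-thresholding rule described above.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Soft thresholding: exactly zero inside [-lam/2, lam/2], shrunk toward zero outside."""
    if rho < -lam / 2:
        return rho + lam / 2
    elif rho > lam / 2:
        return rho - lam / 2
    return 0.0

def lasso_coordinate_descent(X, y, lam, tol=1e-6, max_iter=1000):
    """Cyclic coordinate descent for the lasso, assuming columns of X are normalized to unit norm."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iter):
        max_step = 0.0
        for j in range(d):
            # Residual with feature j's current contribution removed.
            r_j = y - X @ w + X[:, j] * w[j]
            # Correlation between feature j and that partial residual (the "rho_j" of the module).
            rho_j = X[:, j] @ r_j
            w_new = soft_threshold(rho_j, lam)
            max_step = max(max_step, abs(w_new - w[j]))
            w[j] = w_new
        if max_step < tol:
            break
    return w
```

Increasing lam in this sketch drives more of the coefficients exactly to zero, which is the behavior shown in the lasso coefficient path and the source of the implicit feature selection.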
In our last module, we talked about a set of nonparametric techniques called nearest neighbor and kernel regression. And one nearest neighbor was a really, really simple procedure, the most basic procedure that you would imagine doing. But we showed that it actually could perform really well, especially when you have lots of data. And what this method does is, if you're going to estimate the value of your house, you just look for the most similar house, look at its value, and predict your value to be exactly the same.

Then we talked about making this a little bit more robust by looking at a set of k-nearest neighbors, and then said, well, you can also think about weighting these k-nearest neighbors, when you're going to compute your predicted value, by how similar they are to you, and then averaging across these values to form your estimated prediction.

And this led directly to an idea of kernel regression, where instead of just weighting a collection of neighbors, you actually weight every observation in your data set, but a lot of the kernels that we specify actually set those weights to zero outside a certain range and decay them within a given range. And so what this leads to is an idea of these very local fits, and we talked about how kernel regression was equivalent to forming these locally constant fits, which was in contrast to our parametric models that formed these global fits.

So here's a visualization of our kernel regression that we saw in this module, and we see how it leads to these really nice, smooth fits. And these fits are very adaptive to the complexity of the data that we see, and can increase in complexity as we get more and more data.
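As a concrete illustration of that kernel-weighted prediction, here is a minimal sketch of a locally constant (Nadaraya-Watson style) kernel regression fit in NumPy. The Epanechnikov kernel and the bandwidth value are illustrative assumptions rather than the course's exact choices, and the function names are hypothetical; weighted k-nearest neighbors can be seen as the special case where only the k closest points receive nonzero weight.

```python
import numpy as np

def epanechnikov_kernel(distances, bandwidth):
    """Weights decay within the bandwidth and are exactly zero outside it."""
    scaled = np.abs(distances) / bandwidth
    return np.where(scaled <= 1.0, 0.75 * (1.0 - scaled ** 2), 0.0)

def kernel_regression_predict(x_query, x_train, y_train, bandwidth):
    """Locally constant fit: a kernel-weighted average of all training targets."""
    weights = epanechnikov_kernel(x_train - x_query, bandwidth)
    if weights.sum() == 0.0:
        # No training point falls within the bandwidth of this query point.
        return np.nan
    return np.sum(weights * y_train) / np.sum(weights)

# Usage sketch: predict on a grid of query points from noisy 1-D data.
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 10, size=200))
y_train = np.sin(x_train) + rng.normal(scale=0.3, size=200)
x_grid = np.linspace(0, 10, 50)
y_hat = np.array([kernel_regression_predict(x, x_train, y_train, bandwidth=0.5)
                  for x in x_grid])
```

Shrinking the bandwidth makes the fit more local and more wiggly, while growing it smooths the fit out, which is what makes these local fits so adaptive to the amount and complexity of the data.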