Okay. So, in cases where you can't do all subsets, what can we do? Well, one thing we can do, which is going to be option number two in our series of feature selection algorithms, is a suboptimal but tractable approach to selecting among features. And that's the set of greedy algorithms that we're going to consider now.

So, one simple algorithm is what's called the forward stepwise algorithm, where you start with some set of possible features, and then you just greedily walk through them, select the next best feature, and keep iterating.

Let's be a little bit more explicit about this. First, we're going to start with the empty set, no features. Or you could start with a list of features that you know you want included in the model, for example the constant feature, which might be your feature h0. Then you're going to fit your model with whatever your current set of features is to get an estimate of the weights on those features, your estimate of your regression coefficients. And then you're going to search over all the possible features that you might add, and choose the one that best improves your fit. One measure of this might be training error, so you choose the feature that most reduces your training error. Then you include that feature in your model, and you recurse this procedure: now you have a model with some number of features, you search over every possible feature you might want to add, choose the next one that drops your training error the most, and keep going.

So, let's visualize this procedure. For this visualization, we're going to start with the picture that we had in our all subsets search, and we're going to look at how the greedy search differs from the search we would have done in all subsets. At the first step, we start with no features in our model, so we're actually at exactly the same point as we were in all subsets. That is, we start at the same training error as all subsets, the point corresponding to no features in the model.
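To make that loop concrete, here is a minimal sketch of forward stepwise selection using training RSS as the measure of fit, as described above. This is not the course's own code: it assumes NumPy, a feature matrix X whose columns already include anything you want forced into the model (such as the constant feature), and illustrative names like forward_stepwise and feature_names.

```python
import numpy as np

def rss(X, y, w):
    """Residual sum of squares for a linear model y ≈ X w."""
    resid = y - X @ w
    return float(resid @ resid)

def forward_stepwise(X, y, feature_names, max_features=None):
    """Greedy forward selection: at each step, add the single feature
    that most reduces training RSS, keeping all previously chosen features."""
    n, d = X.shape
    max_features = d if max_features is None else max_features
    chosen = []                      # indices of selected features, in order
    remaining = list(range(d))
    path = []                        # (selected feature names, training RSS) per step
    while remaining and len(chosen) < max_features:
        best_err, best_j = None, None
        for j in remaining:
            cols = chosen + [j]
            # Least-squares fit using only the candidate feature set
            w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            err = rss(X[:, cols], y, w)
            if best_err is None or err < best_err:
                best_err, best_j = err, j
        chosen.append(best_j)
        remaining.remove(best_j)
        path.append(([feature_names[i] for i in chosen], best_err))
    return path
```

Each entry of the returned path records the greedy model of that size and its training error, which is exactly the sequence the visualization below walks through.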
Then we're going to search over all our features and choose the one that reduces training error the most. And which one is that going to be? It's actually going to be the same one we got for all subsets, because at that point, when we were doing all subsets and looking over all models with one feature, we had the same choice of the best single feature, even though we're doing this greedy procedure. When we start with an empty set, our first best feature is going to be the same best feature that you would get doing all subsets. So again, we stay at the all subsets solution for the best model with one feature. And in that case, remember, just like in all subsets, we choose square feet as the best feature when we have just one feature.

But then things can start to diverge, because when we go to our second step in the greedy algorithm, we're forced to keep square feet in our model, and we're just greedily going to search over the next feature. That's why it's called greedy; I should have mentioned this previously. We're just doing the quote unquote greedy thing of asking what the next best feature is, without thinking globally about what might be best.

Okay. So in this case, maybe the next best thing we could add to the model, forcing ourselves to keep square feet, is year renovated. So, when we look at what greedy finds after adding two features, we have square feet living and year renovated. But if we compare to what our all subsets search had, it was number of bedrooms and number of bathrooms. This is where we see that things can be different. Once we're forced to keep square feet, maybe the next thing that's most relevant to value is something else: you can have very large old houses, but those might not be worth as much, so it might be the year the house was renovated that's really relevant to the value. But as we described before, it might be reasonable to think that the best model with just two features is number of bedrooms and number of bathrooms, because those are really key attributes that people use to assess whether a house is a good fit for them.
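To see this divergence yourself, a small all subsets helper (feasible only when the number of features is small) can be run alongside the greedy sketch above. The function name best_subset_of_size and its arguments are illustrative, not from the course.

```python
from itertools import combinations
import numpy as np

def best_subset_of_size(X, y, feature_names, k):
    """Exhaustively search all k-feature subsets and return the one
    with the lowest training RSS (only tractable for small d)."""
    d = X.shape[1]
    best_err, best_cols = None, None
    for combo in combinations(range(d), k):
        cols = list(combo)
        w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        resid = y - X[:, cols] @ w
        err = float(resid @ resid)
        if best_err is None or err < best_err:
            best_err, best_cols = err, cols
    return [feature_names[i] for i in best_cols], best_err
```

Comparing best_subset_of_size(X, y, feature_names, k=2) with the second step of the greedy path can reproduce the situation described here, where the greedy pair (square feet living, year renovated) differs from the exhaustive best pair (bedrooms, bathrooms), though which sets actually win depends on the data.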
Well, then we continue with this procedure. We already had square feet living and year renovated, and now, in this case, the next feature we're going to add is waterfront, and we keep going with this process. And when is the next time that our greedy procedure is going to reach the same solution as all subsets? We started at the same point, and after the first iteration we were at the same point. Well, we'll be at the same point again once we get all the way to including all features, obviously, because there's only one such model with all features.

So, there are a couple of points I want to make about this greedy procedure. One is the fact that from iteration to iteration, the training error can never increase. The reason is that even if there's nothing else you want to add to your model, even if everything else is junk, or perhaps only making things a lot worse in terms of predictions, you can always just set the weight on that additional feature equal to zero. Then you'd have exactly the same model you had with one fewer feature, and the training error would be exactly the same. So, we know the training error will never increase. The other thing we mentioned is that eventually the training error will be equivalent to that of all subsets, once you're using all of your features, and perhaps before that as well.

Well, when do we stop this greedy procedure? Just to really hammer this home, I'll ask the question again. Do we stop when the training error is low enough? No. Remember that training error will decrease as we keep adding model complexity, so as we keep adding more and more features, our training error will go down. What about when the test error is low enough? Again, no. Remember that we never touch our test data when we're searching over model complexity or choosing any kind of tuning parameter. Instead, what we're going to do is use our validation set, or cross validation, to choose when to stop this greedy procedure.
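As a rough sketch of that stopping rule, the following reuses the forward_stepwise sketch from earlier and picks the step whose model has the lowest error on a held-out validation set. The helper name and the simple train/validation split are assumptions for illustration; cross validation could be substituted in the same way.

```python
import numpy as np

def select_step_by_validation(X_train, y_train, X_valid, y_valid, feature_names):
    """Run forward stepwise on the training set, then choose the step
    (number of features) whose fitted model has the lowest validation RSS."""
    path = forward_stepwise(X_train, y_train, feature_names)  # sketch defined above
    name_to_idx = {name: i for i, name in enumerate(feature_names)}
    best = None
    for names, _train_rss in path:
        cols = [name_to_idx[n] for n in names]
        # Refit on the training data with this feature set, score on validation data
        w, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
        resid = y_valid - X_valid[:, cols] @ w
        valid_rss = float(resid @ resid)
        if best is None or valid_rss < best[0]:
            best = (valid_rss, names)
    return best  # (validation RSS, selected feature names)
```

The key point matches the discussion above: training error only tells you that bigger models fit the training data better, so the decision of where to stop has to come from validation data (or cross validation), never from the test set.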