[MUSIC]

Okay, so how are we going to go about this feature selection task? Well, one option we have is the obvious choice, which is to search over every possible combination of features we might want to include in our model and look at the performance of each of those models. And that's exactly what the all subsets algorithm does, and we're going to describe this now.

Okay, well, the all subsets algorithm starts by considering a model with absolutely no features in it. Okay, removing all these features we might have in our house and saying, what's the performance of that model? So just to be clear, we start with no features, and there's still a model for no features. The model for no features, remember, is just that our observation is simply noise.

Okay, so we can assess the performance of this model on our training data, and there is some training error associated with a model with no features. So we're going to plot that point. And then the next thing we're going to do is search over every possible model with just one feature. So let's say, to start with, we consider a model which is number of bedrooms. And here we're going to plot the training error of a model fit just with number of bedrooms as the feature.
Then we're going to say, well, what's the training error of a model fit just with number of bathrooms, square feet, square feet of the lot, and cycle through each one of our possible features. And at the end of this we can say, out of all models with just one feature, which one fit the training data the best? And in this case it happened to be the model that included square feet living. So we've seen that before, that that's a very relevant feature.

Okay, so we're going to highlight that this is the best-fitting model with only one feature. And we're going to keep track of this model, and we can discard all the other ones. Then we're going to go and search over all models with 2 features, so we search over all combinations of 2 features. And we're going to figure out which one of these has the lowest training error and keep track of that model. And that happens to be a model that has number of bedrooms and number of bathrooms. And that might make sense, because when we're going to search for a property, often someone will say, I want a three-bedroom house with two bathrooms. So that might be a reasonable choice for the best model with two features.
And I want to emphasize that the best model with two features doesn't have to contain any of the features that were contained in the best model with one feature. So here, our best model with two features has number of bedrooms and number of bathrooms, whereas in contrast, our best model with just one feature has square feet living. Okay, so these aren't necessarily nested. So maybe I'll write this explicitly: the best model of size k need not contain the features of the best model of size k minus one.

Okay, so hopefully that's clear. And we're going to continue our procedure, searching over all models with 3 features, all models with 4 features, 5 features, and at some point what we're going to get to is a model that has capital D features. That's the model where we include all of the features, and there is only one such model, so it's just one point here.

Then what we can do is draw a line connecting the points which represent the set of all best possible models, each with a given number of features. Then the question is, which of these best models with k features do we want to use for our predictions?
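The search just described can be sketched in a few lines. This is a minimal, illustrative implementation using plain NumPy least squares; the feature names and the synthetic data are assumptions for the sketch, not the lecture's housing dataset:

```python
# All subsets search (sketch): for each size k, try every combination of
# k features and keep the one with the lowest training error (RSS).
import itertools
import numpy as np

rng = np.random.default_rng(0)
feature_names = ["bedrooms", "bathrooms", "sqft_living", "sqft_lot"]  # illustrative
X = rng.normal(size=(100, 4))
# Synthetic response: sqft_living is the strongest single predictor here.
y = 3.0 * X[:, 2] + 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)

def training_error(cols):
    """Residual sum of squares of a least-squares fit (with intercept)."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

best_per_size = {}
for k in range(len(feature_names) + 1):  # k = 0 (noise-only model) up to D
    # Search every combination of k features; keep the lowest training error.
    best = min(itertools.combinations(range(len(feature_names)), k),
               key=training_error)
    best_per_size[k] = (best, training_error(best))

for k, (cols, rss) in best_per_size.items():
    print(k, [feature_names[c] for c in cols], round(rss, 1))
```

Note the cost: with D features this evaluates all 2^D subsets, which is why all subsets is only feasible for small D.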
Well, hopefully it's clear from this course, as well as from this slide, that we don't just want to choose the model with the lowest training error, because as we know at this point, as we increase model complexity, our training error is going to go down, and that's what we're seeing in this plot. So instead, we face the same type of choices that we've had previously in this course for choosing between models of various complexity. One choice, if you have enough data, is to assess performance on a validation set that's separate from your training and test sets. We also talked about doing cross validation. And in this case there are many other metrics we can look at for how to think about penalizing model complexity. There are things called BIC and a long list of other methods that people have for choosing amongst these different models. But we're not going to go through the details of those; for our course we're going to focus on this notion of error on the validation set.

[MUSIC]
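That final selection step can be sketched as follows: given the best feature subset of each size (the output of the all subsets search), fit each on training data and pick the size whose model has the lowest error on a held-out validation set. The data, split, and `best_subsets` dictionary below are illustrative assumptions, not the lecture's dataset:

```python
# Choosing the model size k by validation error, not training error.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 2] + 1.0 * X[:, 0] + rng.normal(size=200)

# Hypothetical output of the all subsets search: best subset for each size k.
best_subsets = {0: (), 1: (2,), 2: (0, 2), 3: (0, 1, 2), 4: (0, 1, 2, 3)}

train, valid = slice(0, 150), slice(150, 200)  # simple holdout split

def rss(cols, fit_rows, eval_rows):
    """Fit least squares (with intercept) on fit_rows; return RSS on eval_rows."""
    def design(rows):
        return np.column_stack([np.ones(len(y[rows]))] + [X[rows, c] for c in cols])
    coef, *_ = np.linalg.lstsq(design(fit_rows), y[fit_rows], rcond=None)
    resid = y[eval_rows] - design(eval_rows) @ coef
    return float(resid @ resid)

# Training error always shrinks as k grows; validation error need not,
# so minimizing it picks a size that should generalize better.
best_k = min(best_subsets, key=lambda k: rss(best_subsets[k], train, valid))
print("chosen size:", best_k, "features:", best_subsets[best_k])
```

With more data one could replace the single holdout split with cross validation, as mentioned above, but the selection logic stays the same.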