[MUSIC]

In this module, we're gonna discuss how to select amongst a set of features to include in our model. To do this, we're first gonna start by describing a way to explicitly search over all possible models. And then what we're gonna do is describe a way to implicitly do feature selection using regularized regression, akin to the types of ideas we were talking about when we discussed ridge regression.

So let's start by motivating this feature selection task. The question is, why might you want to select amongst your set of features? One reason is efficiency. Let's say you have a problem with 100 billion features. That might sound like a lot. It is actually a lot, but there are many applications out there these days where we're faced with this many features. Well, every time we go to do prediction with 100 billion features, the multiplication we have to do between our feature vector and the weights on all these features is really computationally intensive. In contrast, if we assume that the weights on our features are sparse, and what I mean by that is that many of them are zero, then things can be done much more efficiently. Because when we go to form our prediction, all we need to do is sum over the features whose weights are not zero. (A short code sketch of this idea follows the transcript.)

But another reason for wanting to do this feature selection, and the one that perhaps might be more common, at least classically, is interpretability, where we wanna understand which features are relevant for, for example, a prediction task.

So, for example, in our housing application, we might have a really long list of possible features associated with every house. This is actually an example of the set of features that were listed for a house on Zillow. And there are lots and lots of detailed things, including what roof type the house has, and whether it includes a microwave or not when it's getting sold. And the question is, are all these features really relevant to assessing the value of the house, or at least how much somebody will pay for the house? If somebody is spending a couple hundred thousand dollars on a house, whether or not it comes with a microwave is probably not factoring very significantly into their decision about the value of the house. So when we're faced with all these features, we might wanna select a subset that are really representative of, or relevant to, our task of predicting the value of a house. So here I've shown perhaps a reasonable subset of features that we might use for assessing the value. And one question we're gonna go through in this module is, how do we think about choosing this subset?

And another application we talked about a couple modules ago was this reading-your-mind task, where we get a scan of your brain, and for our sake, we can just think of this as an image. And then we'd like to predict whether you are happy or sad in response to whatever you were shown. So when we went to take the scan of your brain, you were shown either a word, or an image, or something like that. And we wanna be able to predict how you felt about that just from your brain scan. We talked about the fact that we treat our inputs, our features, as just the voxels. We can think of them as just pixels in this image, and we want to relate those pixel intensities to this output, this response of happiness. And in many cases, maybe what we'd like to do is find a small subset of regions in the brain that are relevant to this prediction task. And so here, again, interpretability is a reason we might wanna do this feature selection task.

[MUSIC]
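To make the efficiency point above concrete, here is a minimal sketch (in Python, not part of the lecture) of forming a prediction when the weight vector is sparse: only the features with nonzero weights contribute to the prediction, so we can drop the zero-weight entries entirely. The feature names and numbers are made up for illustration.

# A minimal sketch (not from the lecture) of prediction with a sparse weight vector.
# Feature names and values below are hypothetical, chosen only for illustration.

def predict(weights, features):
    """Sum weight * feature value over only the features that appear in `weights`."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())

# A "dense" model: most weights are exactly zero.
dense_weights = {"sq_ft": 150.0, "num_bathrooms": 10000.0, "has_microwave": 0.0}

# Keep only the nonzero weights; this compact dict is all we need for prediction.
sparse_weights = {name: w for name, w in dense_weights.items() if w != 0.0}

house = {"sq_ft": 2000.0, "num_bathrooms": 2.0, "has_microwave": 1.0}

# Both calls give the same predicted value (320000.0), but the sparse version never
# touches the zero-weight features, a big savings when there are billions of them.
print(predict(dense_weights, house))
print(predict(sparse_weights, house))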