[MUSIC]

In this module, we're gonna discuss how to select amongst a set of features to include in our model. To do this, we're first gonna start by describing a way to explicitly search over all possible models. And then what we're gonna do is describe a way to implicitly do feature selection using regularized regression, akin to the types of ideas we were talking about when we discussed ridge regression.

So let's start by motivating this feature selection task. The question is, why might you want to select amongst your set of features? One reason is efficiency. Let's say you have a problem with 100 billion features. That might sound like a lot. It is actually a lot, but there are many applications out there these days where we're faced with this many features. Well, every time we go to do prediction with 100 billion features, the multiplication we have to do between our feature vector and the weights on all these features is really computationally intensive. In contrast, if we assume that the weights on our features are sparse, and what I mean by that is that many of them are zero, then things can be done much more efficiently. Because when we go to form our prediction, all we need to do is sum over the features whose weights are not zero. (A short code sketch of this idea follows the transcript.)

But another reason for wanting to do this feature selection, and the one that perhaps might be more common, at least classically, is interpretability, where we wanna understand which features are relevant for, for example, a prediction task.

So, for example, in our housing application, we might have a really long list of possible features associated with every house. This is actually an example of the set of features that were listed for a house on Zillow. And there are lots and lots of detailed things, including what roof type the house has, and whether it includes a microwave or not when it's getting sold. And the question is, are all these features really relevant to assessing the value of the house, or at least how much somebody will pay for the house? If somebody is spending a couple hundred thousand dollars on a house, whether or not it comes with a microwave is probably not factoring very significantly into their decision about the value of the house. So when we're faced with all these features, we might wanna select a subset that are really representative of, or relevant to, our task of predicting the value of a house. So here I've shown perhaps a reasonable subset of features that we might use for assessing the value. And one question we're gonna go through in this module is, how do we think about choosing this subset?

And another application we talked about a couple modules ago was this reading-your-mind task, where we get a scan of your brain, and for our sake, we can just think of this as an image. And then we'd like to predict whether you are happy or sad in response to whatever you were shown. So when we went to take the scan of your brain, you were shown either a word, or an image, or something like that. And we wanna be able to predict how you felt about that just from your brain scan. We talked about the fact that we treat our inputs, our features, as just the voxels. We can think of them as just pixels in this image, and we want to relate those pixel intensities to this output, this response of happiness. And in many cases, maybe what we'd like to do is find a small subset of regions in the brain that are relevant to this prediction task. And so here, again, interpretability is a reason we might wanna do this feature selection task.

[MUSIC]
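To make the efficiency point above concrete, here is a minimal sketch (in Python, not part of the lecture) of forming a prediction when the weight vector is sparse: only the features with nonzero weights contribute to the prediction, so we can drop the zero-weight entries entirely. The feature names and numbers are made up for illustration.

# A minimal sketch (not from the lecture) of prediction with a sparse weight vector.
# Feature names and values below are hypothetical, chosen only for illustration.

def predict(weights, features):
    """Sum weight * feature value over only the features that appear in `weights`."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())

# A "dense" model: most weights are exactly zero.
dense_weights = {"sq_ft": 150.0, "num_bathrooms": 10000.0, "has_microwave": 0.0}

# Keep only the nonzero weights; this compact dict is all we need for prediction.
sparse_weights = {name: w for name, w in dense_weights.items() if w != 0.0}

house = {"sq_ft": 2000.0, "num_bathrooms": 2.0, "has_microwave": 1.0}

# Both calls give the same predicted value (320000.0), but the sparse version never
# touches the zero-weight features, a big savings when there are billions of them.
print(predict(dense_weights, house))
print(predict(sparse_weights, house))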