[MUSIC]

Okay, so how are we going to go about this feature selection task? Well, one option we have is the obvious choice, which is to search over every possible combination of features we might want to include in our model and look at the performance of each of those models. And that's exactly what the all subsets algorithm does, and we're going to describe this now.

Okay, well, the all subsets algorithm starts by considering a model with absolutely no features in it. Okay, removing all these features we might have in our house and saying, what's the performance of that model? So just to be clear, we start with no features, and there's still a model for no features. The model for no features, remember, is just that our observation is simply noise.

Okay, so we can assess the performance of this model on our training data, and there is some training error associated with a model with no features. So we're going to plot that point. And then the next thing we're going to do is search over every possible model with just one feature. So let's say, to start with, we consider a model which is number of bedrooms. And here we're going to plot the training error of a model fit just with number of bedrooms as the feature.
Then we're going to say, well, what's the training error of a model fit just with number of bathrooms, square feet, square feet of the lot, and cycle through each one of our possible features. And at the end of this we can say, out of all models with just one feature, which one fit the training data the best? And in this case it happened to be the model that included square feet living. So we've seen that before, that that's a very relevant feature.

Okay, so we're going to highlight that this is the best-fitting model with only one feature. And we're going to keep track of this model, and we can discard all the other ones. Then we're going to go and search over all models with 2 features, so we search over all combinations of 2 features. And we're going to figure out which one of these has the lowest training error and keep track of that model. And that happens to be a model that has number of bedrooms and number of bathrooms. And that might make sense, because when we're going to search for a property, often someone will say, I want a three-bedroom house with two bathrooms. So that might be a reasonable choice for the best model with two features.
And I want to emphasize that the best model with two features doesn't have to contain any of the features that were contained in the best model with one feature. So here, our best model with two features has number of bedrooms and number of bathrooms, whereas in contrast, our best model with just one feature has square feet living. Okay, so these aren't necessarily nested. So maybe I'll write this explicitly: the best model of size k need not contain the features of the best model of size k minus one.

Okay, so hopefully that's clear. And we're going to continue our procedure, searching over all models with 3 features, all models with 4 features, 5 features, and at some point what we're going to get to is a model that has capital D features. That's the model where we include all of the features, and there is only one such model, so it's just one point here.

Then what we can do is draw a line connecting the points which represent the set of all best possible models, each with a given number of features. Then the question is, which of these best models with k features do we want to use for our predictions?
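The search just described can be sketched in a few lines. This is a minimal, illustrative implementation using plain NumPy least squares; the feature names and the synthetic data are assumptions for the sketch, not the lecture's housing dataset:

```python
# All subsets search (sketch): for each size k, try every combination of
# k features and keep the one with the lowest training error (RSS).
import itertools
import numpy as np

rng = np.random.default_rng(0)
feature_names = ["bedrooms", "bathrooms", "sqft_living", "sqft_lot"]  # illustrative
X = rng.normal(size=(100, 4))
# Synthetic response: sqft_living is the strongest single predictor here.
y = 3.0 * X[:, 2] + 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)

def training_error(cols):
    """Residual sum of squares of a least-squares fit (with intercept)."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

best_per_size = {}
for k in range(len(feature_names) + 1):  # k = 0 (noise-only model) up to D
    # Search every combination of k features; keep the lowest training error.
    best = min(itertools.combinations(range(len(feature_names)), k),
               key=training_error)
    best_per_size[k] = (best, training_error(best))

for k, (cols, rss) in best_per_size.items():
    print(k, [feature_names[c] for c in cols], round(rss, 1))
```

Note the cost: with D features this evaluates all 2^D subsets, which is why all subsets is only feasible for small D.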
Well, hopefully it's clear from this course, as well as from this slide, that we don't just want to choose the model with the lowest training error, because as we know at this point, as we increase model complexity, our training error is going to go down, and that's what we're seeing in this plot. So instead, we face the same type of choices that we've had previously in this course for choosing between models of various complexity. One choice, if you have enough data, is to assess performance on a validation set that's separate from your training and test sets. We also talked about doing cross validation. And in this case there are many other metrics we can look at for how to think about penalizing model complexity. There are things called BIC and a long list of other methods that people have for choosing amongst these different models. But we're not going to go through the details of those; for our course we're going to focus on this notion of error on the validation set.

[MUSIC]
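That final selection step can be sketched as follows: given the best feature subset of each size (the output of the all subsets search), fit each on training data and pick the size whose model has the lowest error on a held-out validation set. The data, split, and `best_subsets` dictionary below are illustrative assumptions, not the lecture's dataset:

```python
# Choosing the model size k by validation error, not training error.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 2] + 1.0 * X[:, 0] + rng.normal(size=200)

# Hypothetical output of the all subsets search: best subset for each size k.
best_subsets = {0: (), 1: (2,), 2: (0, 2), 3: (0, 1, 2), 4: (0, 1, 2, 3)}

train, valid = slice(0, 150), slice(150, 200)  # simple holdout split

def rss(cols, fit_rows, eval_rows):
    """Fit least squares (with intercept) on fit_rows; return RSS on eval_rows."""
    def design(rows):
        return np.column_stack([np.ones(len(y[rows]))] + [X[rows, c] for c in cols])
    coef, *_ = np.linalg.lstsq(design(fit_rows), y[fit_rows], rcond=None)
    resid = y[eval_rows] - design(eval_rows) @ coef
    return float(resid @ resid)

# Training error always shrinks as k grows; validation error need not,
# so minimizing it picks a size that should generalize better.
best_k = min(best_subsets, key=lambda k: rss(best_subsets[k], train, valid))
print("chosen size:", best_k, "features:", best_subsets[best_k])
```

With more data one could replace the single holdout split with cross validation, as mentioned above, but the selection logic stays the same.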