Okay. So, in cases where you can't do all subsets, what can we do? Well, one thing we can do, which is going to be option number two in our series of feature selection algorithms, is a suboptimal but tractable approach to selecting among features. And that's the set of greedy algorithms that we're going to consider now.

So, one simple algorithm is what's called the forward stepwise algorithm, where you start with some set of possible features, and then you just greedily walk through them, select the next best feature, and keep iterating.

Let's be a little bit more explicit about this. First, we're going to start with the empty set, no features. Or you could start with a list of features that you know you want included in the model, for example the constant feature, which might be your feature h0. Then you're going to fit your model with whatever your current set of features is to get an estimate of the weights on those features, your estimate of your regression coefficients. And then you're going to search over all the possible features that you might add, and choose the one that best improves your fit. One measure of this might be training error, so you choose the feature that most reduces your training error. Then you include that feature in your model, and you recurse this procedure: now you have a model with some number of features, you search over every possible feature you might want to add, choose the next one that drops your training error the most, and keep going.

So, let's visualize this procedure. For this visualization, we're going to start with the picture that we had in our all subsets search, and we're going to look at how the greedy search differs from the search we would have done in all subsets. At the first step, we start with no features in our model, so we're actually at exactly the same point as we were in all subsets. That is, we start at the same training error as all subsets, the point corresponding to no features in the model.
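To make that loop concrete, here is a minimal sketch of forward stepwise selection using training RSS as the measure of fit, as described above. This is not the course's own code: it assumes NumPy, a feature matrix X whose columns already include anything you want forced into the model (such as the constant feature), and illustrative names like forward_stepwise and feature_names.

```python
import numpy as np

def rss(X, y, w):
    """Residual sum of squares for a linear model y ≈ X w."""
    resid = y - X @ w
    return float(resid @ resid)

def forward_stepwise(X, y, feature_names, max_features=None):
    """Greedy forward selection: at each step, add the single feature
    that most reduces training RSS, keeping all previously chosen features."""
    n, d = X.shape
    max_features = d if max_features is None else max_features
    chosen = []                      # indices of selected features, in order
    remaining = list(range(d))
    path = []                        # (selected feature names, training RSS) per step
    while remaining and len(chosen) < max_features:
        best_err, best_j = None, None
        for j in remaining:
            cols = chosen + [j]
            # Least-squares fit using only the candidate feature set
            w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            err = rss(X[:, cols], y, w)
            if best_err is None or err < best_err:
                best_err, best_j = err, j
        chosen.append(best_j)
        remaining.remove(best_j)
        path.append(([feature_names[i] for i in chosen], best_err))
    return path
```

Each entry of the returned path records the greedy model of that size and its training error, which is exactly the sequence the visualization below walks through.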
Then we're going to search over all our features and choose the one that reduces training error the most. And which one is that going to be? It's actually going to be the same one we got for all subsets, because at that point, when we were doing all subsets and looking over all models with one feature, we had the same choice of the best single feature, even though we're doing this greedy procedure. When we start with an empty set, our first best feature is going to be the same best feature that you would get doing all subsets. So again, we stay at the all subsets solution for the best model with one feature. And in that case, remember, just like in all subsets, we choose square feet as the best feature when we have just one feature.

But then things can start to diverge, because when we go to our second step in the greedy algorithm, we're forced to keep square feet in our model, and we're just greedily going to search over the next feature. That's why it's called greedy; I should have mentioned this previously. We're just doing the quote unquote greedy thing of asking what the next best feature is, without thinking globally about what might be best.

Okay. So in this case, maybe the next best thing we could add to the model, forcing ourselves to keep square feet, is year renovated. So, when we look at what greedy finds after adding two features, we have square feet living and year renovated. But if we compare to what our all subsets search had, it was number of bedrooms and number of bathrooms. This is where we see that things can be different. Once we're forced to keep square feet, maybe the next thing that's most relevant to value is something else: you can have very large old houses, but those might not be worth as much, so it might be the year the house was renovated that's really relevant to the value. But as we described before, it might be reasonable to think that the best model with just two features is number of bedrooms and number of bathrooms, because those are really key attributes that people use to assess whether a house is a good fit for them.
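To see this divergence yourself, a small all subsets helper (feasible only when the number of features is small) can be run alongside the greedy sketch above. The function name best_subset_of_size and its arguments are illustrative, not from the course.

```python
from itertools import combinations
import numpy as np

def best_subset_of_size(X, y, feature_names, k):
    """Exhaustively search all k-feature subsets and return the one
    with the lowest training RSS (only tractable for small d)."""
    d = X.shape[1]
    best_err, best_cols = None, None
    for combo in combinations(range(d), k):
        cols = list(combo)
        w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        resid = y - X[:, cols] @ w
        err = float(resid @ resid)
        if best_err is None or err < best_err:
            best_err, best_cols = err, cols
    return [feature_names[i] for i in best_cols], best_err
```

Comparing best_subset_of_size(X, y, feature_names, k=2) with the second step of the greedy path can reproduce the situation described here, where the greedy pair (square feet living, year renovated) differs from the exhaustive best pair (bedrooms, bathrooms), though which sets actually win depends on the data.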
Well, then we continue with this procedure. We already had square feet living and year renovated, and now, in this case, the next feature we're going to add is waterfront, and we keep going with this process. And when is the next time that our greedy procedure is going to reach the same solution as all subsets? We started at the same point, and after the first iteration we were at the same point. Well, we'll be at the same point again once we get all the way to including all features, obviously, because there's only one such model with all features.

So, there are a couple of points I want to make about this greedy procedure. One is the fact that from iteration to iteration, the training error can never increase. The reason is that even if there's nothing else you want to add to your model, even if everything else is junk, or perhaps only making things a lot worse in terms of predictions, you can always just set the weight on that additional feature equal to zero. Then you'd have exactly the same model you had with one fewer feature, and the training error would be exactly the same. So, we know the training error will never increase. The other thing we mentioned is that eventually the training error will be equivalent to that of all subsets, once you're using all of your features, and perhaps before that as well.

Well, when do we stop this greedy procedure? Just to really hammer this home, I'll ask the question again. Do we stop when the training error is low enough? No. Remember that training error will decrease as we keep adding model complexity, so as we keep adding more and more features, our training error will go down. What about when the test error is low enough? Again, no. Remember that we never touch our test data when we're searching over model complexity or choosing any kind of tuning parameter. Instead, what we're going to do is use our validation set, or cross validation, to choose when to stop this greedy procedure.
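As a rough sketch of that stopping rule, the following reuses the forward_stepwise sketch from earlier and picks the step whose model has the lowest error on a held-out validation set. The helper name and the simple train/validation split are assumptions for illustration; cross validation could be substituted in the same way.

```python
import numpy as np

def select_step_by_validation(X_train, y_train, X_valid, y_valid, feature_names):
    """Run forward stepwise on the training set, then choose the step
    (number of features) whose fitted model has the lowest validation RSS."""
    path = forward_stepwise(X_train, y_train, feature_names)  # sketch defined above
    name_to_idx = {name: i for i, name in enumerate(feature_names)}
    best = None
    for names, _train_rss in path:
        cols = [name_to_idx[n] for n in names]
        # Refit on the training data with this feature set, score on validation data
        w, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
        resid = y_valid - X_valid[:, cols] @ w
        valid_rss = float(resid @ resid)
        if best is None or valid_rss < best[0]:
            best = (valid_rss, names)
    return best  # (validation RSS, selected feature names)
```

The key point matches the discussion above: training error only tells you that bigger models fit the training data better, so the decision of where to stop has to come from validation data (or cross validation), never from the test set.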