[MUSIC] Okay. So, in cases where you can't do all subsets, what can we do? Well, one thing we can do, which is gonna be option number two in our series of feature selection algorithms, is a suboptimal but tractable approach to selecting amongst features. And that's the set of greedy algorithms we're gonna consider now. So, one simple algorithm is what's called the forward stepwise algorithm, where you start with some set of possible features, and then you just greedily walk through them, selecting the next best feature at each step and iterating. So, let's be a little bit more explicit about this. First, we're gonna start with the empty set, no features. Or you could start with a list of features that you know you want included in the model, for example the constant feature, h_0. Then what you're gonna do is fit your model with whatever your current set of features is to get an estimate of the weights on those features, your estimate of your regression coefficients. Next, you're gonna search over all the possible features you might add and choose the one that best improves your fit. And one measure of this might be training error, so you choose the feature that most reduces your training error. Then you just include that feature in your model and recurse this procedure: now you have a model with some number of features, you search over every possible feature you might wanna add, choose the next one that drops your training error the most, and keep going. So, let's visualize this procedure. For this visualization, we're gonna start with the picture that we had in our all subsets search, and we're gonna look at how the greedy search differs from the search we would have done in all subsets. So at the first step, we start with no features in our model.
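The greedy loop just described can be sketched in a few lines of Python with NumPy. This is a toy illustration, not the course's code: the function names (`forward_stepwise`, `train_error`) and the synthetic housing data are my own, and training error is measured as residual sum of squares from a least squares fit.

```python
import numpy as np

def train_error(A, y):
    # Least squares fit on feature matrix A; return residual sum of squares.
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ w
    return r @ r

def forward_stepwise(X, y, names):
    # Start from the constant feature h_0 only, then greedily add the
    # single remaining feature that most reduces training error.
    n, d = X.shape
    chosen = [np.ones(n)]        # columns currently in the model
    order = []                   # feature names, in the order added
    remaining = list(range(d))
    while remaining:
        errs = [train_error(np.column_stack(chosen + [X[:, j]]), y)
                for j in remaining]
        best = remaining[int(np.argmin(errs))]
        chosen.append(X[:, best])
        remaining.remove(best)
        order.append(names[best])
    return order

# Tiny synthetic example: price driven mostly by square feet,
# so square feet should be the first feature the greedy search picks.
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=100)
bedrooms = rng.integers(1, 6, size=100).astype(float)
price = 200.0 * sqft + 10000.0 * bedrooms + rng.normal(0, 1000, size=100)
X = np.column_stack([sqft, bedrooms])
print(forward_stepwise(X, price, ["sq_ft", "bedrooms"]))  # ['sq_ft', 'bedrooms']
```

Note that each iteration only refits models that extend the current greedy set, which is what makes this tractable compared to enumerating all 2^d subsets.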
So, we're actually gonna be at exactly the same point as we were in all subsets. That is, we start at the same training error as all subsets, which is equivalent to no features in the model. And then what we're gonna do is search over all our features and choose the one that reduces training error the most. And which one is that gonna be? It's actually gonna be the same one we got for all subsets. Because at that point, when we were doing all subsets and looking over all models with one feature, we had the same choice of choosing the best single feature, even though here we're doing this greedy procedure. When we start with an empty set, our first best feature is gonna be the same as the best feature you would get doing all subsets. So again, we stay at the all subsets solution for the best model with one feature. And in that case, remember, just like in all subsets, we choose square feet as the best feature when we have just one feature. But then things can start to diverge. Because when we go to our second step in the greedy algorithm, we're forced to keep square feet in our model, and we're just greedily gonna search over the next feature. That's why it's called greedy. I should've mentioned this previously: we're just doing the, quote unquote, greedy thing of asking what's the next best thing, and not thinking globally about what might be best. Okay. So in this case, maybe the next best thing we could add to the model, forcing ourselves to keep square feet in our model, is year renovated. So, when we look at what greedy finds at two features, we have square feet living and year renovated. But if we compare to what all subsets had, it was number of bedrooms and number of bathrooms. So, this is where we see that things can be different. Because once we're forced to keep square feet, maybe the next thing that's most relevant to value changes: you can have these very large old houses, but those might not be worth as much.
It might be the year that the house was renovated that's really relevant to the value. But as we described before, it might be reasonable to think that the best model with just two features was number of bedrooms and number of bathrooms, because those are really key attributes that people use to judge whether a house is a good fit for them. Well, then we continue with this procedure. We already had square feet living and year renovated, and now, in this case, the next feature we're gonna add is waterfront, and we're gonna keep going on this process. And when's the next time that our greedy procedure is gonna meet the same solution as all subsets? So, we started at the same point, and we said after the first iteration, we were at the same point. Well, again, we'll be at the same point once we get all the way, obviously, to including all features, because there's only one such model with all features. So, there are a couple points that I wanna make about this greedy procedure. One is the fact that from iteration to iteration, the training error can never increase. And the reason for that is, even if there's nothing else that you want to add to your model, even if everything else is junk, or perhaps would only make things a lot worse in terms of predictions, you can always just set the weight on that additional feature equal to zero. And then you'd have exactly the same model you had with one fewer feature, and so the training error would be exactly the same. So, we know the training error will never increase. And the other thing that we mentioned is that eventually, the training error will be equivalent to that of all subsets when you're using all of your features, and perhaps before that as well. Well, when do we stop this greedy procedure? Just to really hammer this home, I'll just ask the question again. Do we stop when the training error is low enough? No. Remember that training error will decrease as we keep adding model complexity.
So, as we keep adding more and more features, our training error will go down. What about when test error's low enough? Again, no. Remember that we never touch our test data when we're searching over model complexity, or choosing any kind of tuning parameter. Again, what we're gonna do is use our validation set or cross validation to choose when to stop this greedy procedure. [MUSIC]
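One way to sketch this validation-based stopping rule is to walk along the greedy path, fit each model on the training set, score it on a held-out validation set, and keep the number of features with the lowest validation error. This is a toy illustration under my own assumptions (a simple train/validation split, least squares fits, made-up helper names), not the course's code:

```python
import numpy as np

def fit(A, y):
    # Least squares weights for feature matrix A.
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def rss(A, y, w):
    r = y - A @ w
    return r @ r

def choose_stopping_step(X_tr, y_tr, X_va, y_va, greedy_order):
    # Step k uses the constant plus the first k features chosen by the
    # greedy search. Each model is fit on training data but scored on
    # the validation set; we keep the step with lowest validation error.
    cols_tr = [np.ones(len(y_tr))]
    cols_va = [np.ones(len(y_va))]
    errors = [rss(np.column_stack(cols_va), y_va,
                  fit(np.column_stack(cols_tr), y_tr))]
    for j in greedy_order:
        cols_tr.append(X_tr[:, j])
        cols_va.append(X_va[:, j])
        w = fit(np.column_stack(cols_tr), y_tr)
        errors.append(rss(np.column_stack(cols_va), y_va, w))
    return int(np.argmin(errors))   # number of features to keep

# Example: feature 0 is informative, feature 1 is pure noise, so
# validation error typically tells us to stop after one feature.
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(size=200), rng.normal(size=200)])
y = 3.0 * X[:, 0] + rng.normal(0, 0.5, size=200)
k = choose_stopping_step(X[:150], y[:150], X[150:], y[150:], [0, 1])
```

The key design point is that the validation set only ever scores models, it never influences which feature gets added or what the weights are, which is why it gives an honest signal for when to stop.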