[MUSIC] Okay, so how are we going to go about this feature selection task? Well, one option is the obvious choice: search over every possible combination of features we might want to include in our model and look at the performance of each of those models. That's exactly what the all subsets algorithm does, and we're going to describe it now.

The all subsets algorithm starts by considering a model with absolutely no features in it. So we remove all the features we might have for our house and ask: what's the performance of that model? Just to be clear, we start with no features, and there's still a model for no features. The model with no features, remember, says that our observation is simply noise. We can assess the performance of this model on our training data, and there is some training error associated with the no-feature model, so we plot that point.

The next thing we do is search over every possible model with just one feature. Say we start with a model whose only feature is number of bedrooms, and we plot the training error of the model fit just with number of bedrooms. Then we ask: what's the training error of a model fit just with number of bathrooms, square feet, square feet of the lot, and so on, cycling through each of our possible features. At the end of this, we can say: out of all models with just one feature, which one fit the training data best? In this case it happened to be the model with square feet living, and we've seen before that that's a very relevant feature. So we highlight that this is the best-fitting model with only one feature, keep track of it, and discard all the other ones.

Then we search over all models with two features, that is, over all combinations of two features, figure out which one has the lowest training error, and keep track of that model. That happens to be a model with number of bedrooms and number of bathrooms. And that might make sense, because when searching for a property, someone will often say, I want a three-bedroom house with two bathrooms. So that might be a reasonable choice for the best model with two features.

I want to emphasize that the best model with two features doesn't have to contain any of the features in the best model with one feature. Here, our best model with two features has number of bedrooms and number of bathrooms, whereas our best model with just one feature has square feet living. So these models aren't necessarily nested. Let me write this explicitly: the best model of size k need not contain the features of the best model of size k - 1. Hopefully that's clear.

We continue the procedure, searching over all models with three features, all models with four features, five features, and so on, until at some point we get to a model with capital D features. That's the model including all of the features, and there is only one such model, so it's just one point on the plot. Then we can draw a line connecting the points that represent the best possible model for each number of features.
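To make the search concrete, here is a minimal sketch of all subsets selection in Python. It assumes the data is a NumPy feature matrix X (one column per feature) and a target vector y, and it uses scikit-learn's LinearRegression and training RSS as the error measure; the function name and these representation choices are illustrative, not from the lecture.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def all_subsets(X, y):
    """For each size k = 0..D, return (best column subset, training RSS)."""
    n, D = X.shape
    best_per_size = {}

    # k = 0: the no-feature model predicts a constant (the mean of y),
    # so its training error is the total sum of squares around the mean.
    best_per_size[0] = ((), np.sum((y - y.mean()) ** 2))

    for k in range(1, D + 1):
        best_rss, best_subset = np.inf, None
        # Exhaustively fit a model for every combination of k features
        # and keep the one with the lowest training error (RSS).
        for subset in combinations(range(D), k):
            cols = list(subset)
            model = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - model.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_subset = rss, subset
        best_per_size[k] = (best_subset, best_rss)
    return best_per_size
```

Note that this loop fits 2^D models in total, one for every subset of the D features, which is why exhaustive all subsets search is only practical when D is small.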
Then the question is: which of these best models with k features do we want to use for our predictions? Well, hopefully it's clear from this course, as well as from this slide, that we don't just want to choose the model with the lowest training error, because, as we know at this point, as we increase model complexity, our training error goes down, and that's what we're seeing in this plot. So instead, we face the same type of choices we've had previously in this course for choosing between models of various complexity. One choice, if you have enough data, is to assess performance on a validation set that's separate from your training and test sets. We also talked about doing cross validation. And in this case there are many other metrics for penalizing model complexity, such as BIC (the Bayesian information criterion) and a long list of other methods people have for choosing among these different models. But we're not going to go through the details of those; for our course, we're going to focus on this notion of error on the validation set. [MUSIC]
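To make that selection step concrete, here is a minimal sketch continuing the all_subsets function above. It assumes the data has already been split into separate training and validation sets, and it picks the subset size whose best-of-size-k model has the lowest validation error; the function name and the use of mean squared error are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def select_by_validation(X_train, y_train, X_valid, y_valid, best_per_size):
    """Pick k whose best-of-size-k model minimizes validation error.

    best_per_size maps k -> (column subset, training RSS), as returned
    by the all_subsets sketch above.
    """
    best_k, best_err = None, np.inf
    for k, (subset, _) in best_per_size.items():
        if k == 0:
            # The no-feature model predicts the training mean everywhere.
            preds = np.full(len(y_valid), y_train.mean())
        else:
            cols = list(subset)
            # Refit on training data, then score on held-out data.
            model = LinearRegression().fit(X_train[:, cols], y_train)
            preds = model.predict(X_valid[:, cols])
        err = np.mean((y_valid - preds) ** 2)  # validation MSE
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err
```

Because each candidate is chosen using data the models never trained on, validation error does not automatically decrease as k grows the way training error does, which is exactly what makes it a usable selection criterion here.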