[MUSIC] As one example of a way to handle this bias-variance trade-off, we're gonna talk about something called ridge regression, which not only includes a term that measures the fit of the function to the data, which is what we talked about before, but also incorporates a term that encodes the model complexity. Not quite directly, pretty indirectly, as we're gonna describe in this module. But a key question then is how we're gonna define the balance between how much we emphasize the fit to the data versus this model complexity term. For this, in ridge regression there's a parameter that balances between these two terms. And to define this parameter, we're gonna discuss choosing it using something called cross validation. And this again is a tool that's much more general than just for regression; it's an idea for how to choose these tuning parameters in any machine learning model that we might look at.

Next we're gonna discuss a feature selection task. So for example, I have my house that I wanna list for sale, and I might have a really, really long list of house attributes associated with this house. And I wanna figure out which subset of those attributes is really informative for assessing the value of my house. So, for example, maybe the fact that my house has a microwave doesn't really matter when I'm going to predict the value of the house. So, for reasons of interpretability, it can be really useful to do this feature selection task. And in addition, we're gonna show that if we have just a small set of features in our model, after we've done this feature selection, then that can lead to significant gains in efficiency when forming our predictions.

And so, to do this feature selection task, the first thing we're gonna talk about is ways to explicitly search between models that include different sets of features. But then we're gonna turn to a method that's really, really similar in spirit to ridge regression and allows us to do this feature selection task implicitly. In particular, again we're gonna have this measure of fit of our function to our data and a measure of the model complexity, but it's gonna be a different measure than what we use for ridge. And this measure in particular is what's gonna lead to these, what are called, sparse solutions, where only a few of the features are actually present in our estimated model.

And we're gonna use this lasso regression task as an opportunity to teach about another optimization method that's called coordinate descent. So we talked about gradient descent earlier, and this is another one of these really important optimization methods that we're gonna see again later in this specialization. And what coordinate descent does is, instead of solving a big, high-dimensional optimization objective all at once, it goes coordinate by coordinate, so variable by variable, optimizing each in turn. So we're gonna end up making these axis-aligned moves as we iterate in this algorithm. So again, just like gradient descent, it's an iterative procedure, but it's fundamentally a different formulation for how these iterates are defined.
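To make the contrast between these two approaches a bit more concrete before we get there, here's a rough sketch of the shape of the ridge and lasso objectives. The notation is just illustrative; the exact definitions, including how the fit term RSS(w) and the tuning parameter λ are set up, come later in the course:

```latex
% Illustrative only: both objectives trade off a measure of fit against a
% penalty on the coefficients w, weighted by a tuning parameter lambda.
\hat{w}_{\text{ridge}} = \arg\min_{w}\ \mathrm{RSS}(w) + \lambda \sum_{j} w_j^{2}
\qquad
\hat{w}_{\text{lasso}} = \arg\min_{w}\ \mathrm{RSS}(w) + \lambda \sum_{j} \lvert w_j \rvert
```

The only difference is the complexity measure: squared coefficients for ridge versus absolute values for lasso, and it's that absolute-value penalty that tends to push some coefficients exactly to zero, giving the sparse solutions just mentioned.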
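And to make the coordinate-by-coordinate idea concrete, here's a minimal Python sketch of cyclic coordinate descent applied to a lasso-style objective. To be clear, this is an illustrative sketch rather than the course's implementation: the helper names, the fixed number of passes, and the assumptions of no intercept and sensibly scaled features are all simplifications.

```python
import numpy as np

def soft_threshold(rho, threshold):
    """Shrink rho toward zero by `threshold`; return 0 if it lands inside the band."""
    if rho < -threshold:
        return rho + threshold
    elif rho > threshold:
        return rho - threshold
    return 0.0

def lasso_coordinate_descent(X, y, lam, num_passes=100):
    """Cyclic coordinate descent for min_w ||y - Xw||^2 + lam * ||w||_1.

    Bare-bones sketch: no intercept, a fixed number of passes instead of a
    convergence check, and columns of X assumed to be sensibly scaled.
    """
    n_features = X.shape[1]
    w = np.zeros(n_features)
    z = (X ** 2).sum(axis=0)              # per-coordinate curvature, sum_i x_ij^2
    for _ in range(num_passes):
        for j in range(n_features):
            # Partial residual: the error if we ignore feature j's current contribution.
            r_j = y - X @ w + X[:, j] * w[j]
            rho_j = X[:, j] @ r_j
            # Axis-aligned move: exactly minimize the objective in coordinate j.
            # The absolute-value penalty turns this into soft-thresholding.
            w[j] = soft_threshold(rho_j, lam / 2.0) / z[j]
    return w

# Tiny usage example on synthetic data: only the first two features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)
print(lasso_coordinate_descent(X, y, lam=10.0))  # trailing coefficients end up at (or very near) zero
```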
Finally, we're gonna conclude by discussing something called nearest neighbor regression, which is a really simple but very, very powerful technique. So in the simplest case that we're gonna describe, if I'm interested in predicting the value of my house, what I'm gonna do is go through my data set and find the most similar house to mine. Then, I'm simply gonna look at how much that house sold for, and I'm gonna say that's what I'm predicting my house's sale price to be. Well, you can generalize this idea of just looking at the most similar house to looking at a set of similar houses and then taking the average value of those houses as your prediction. But what you can also do is something that's called kernel regression, where you actually include every observation in your data set in forming your predicted value. But when you go to compute this average, you're gonna weight the houses by how close they are to you. So houses that are very similar, which are quote, unquote, nearby to you in the space of similarity, are gonna be weighted very heavily in this weighted average, and houses that are very dissimilar are gonna be down-weighted a lot. And this leads to these really nice fits for regression, and they're very adaptive. As you get more data, you can describe more and more complicated relationships. So these methods are useful when you have lots of data, and we're gonna discuss this data versus complexity trade-off in this module.

So in summary, we're gonna cover a lot of ground in this course. We're gonna talk about all different kinds of models for regression, but we're also gonna talk about very general-purpose optimization algorithms like gradient descent and coordinate descent, and a whole bunch of concepts that are really foundational to machine learning, including things like the bias-variance trade-off, cross validation for selecting tuning parameters, ideas of sparsity and overfitting, and how to do model selection and feature selection. So, this is gonna be a really, really important course in our specialization. [MUSIC]
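As a rough companion to the nearest neighbor and kernel regression ideas described above, here's a minimal Python sketch of the three predictors: using the single most similar point, averaging over the k most similar points, and taking a weighted average over every point with nearby observations weighted heavily. The Euclidean distance, the Gaussian kernel, and the bandwidth value are illustrative choices, not the course's specific definitions.

```python
import numpy as np

def one_nn_predict(x_query, X_train, y_train):
    """Predict with the single most similar training point (e.g. the most similar house)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    return y_train[np.argmin(dists)]

def knn_predict(x_query, X_train, y_train, k=5):
    """Average the targets of the k most similar training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

def kernel_regression_predict(x_query, X_train, y_train, bandwidth=1.0):
    """Weighted average over *all* training points: nearby points get large
    weights, distant points are down-weighted by a Gaussian kernel
    (the bandwidth is a tuning choice)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    weights = np.exp(-(dists ** 2) / (2.0 * bandwidth ** 2))
    return np.sum(weights * y_train) / np.sum(weights)

# Tiny usage example on synthetic 1-D data.
rng = np.random.default_rng(1)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=200)
x_query = np.array([3.0])
print(one_nn_predict(x_query, X_train, y_train))
print(knn_predict(x_query, X_train, y_train, k=10))
print(kernel_regression_predict(x_query, X_train, y_train, bandwidth=0.5))
```

Kernel regression uses every observation but lets the weights fall off with distance, which is why these fits are adaptive: as the data set grows there are more nearby points, and the local averages can trace out more and more complicated relationships.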