Well, we discussed ridge regression and cross-validation, but we kind of brushed under the rug what can be a fairly important issue when we discussed our ridge regression objective, which is how to deal with the intercept term that's commonly included in most models. So in particular, let's recall our multiple regression model, which is shown here. So far we've just treated generically that there's some h0 of x, our first feature, with coefficient w0. But as we mentioned two modules ago, typically that first feature is taken to be what's called the constant feature, so that w0 simply represents the intercept of the model. So if you're thinking of some hyperplane, where is it sitting along that y-axis? And then all the other features are some arbitrary set of other terms that you might be interested in.

Okay, well if we have this constant feature in our model, then the model that I wrote on the previous slide simplifies to the following. In this case, when we think of our matrix notation for having N different observations and we form our H matrix, the first column of that matrix corresponds to the constant feature that multiplies the w0 coefficient. So in this special case, that entire first column is filled with ones, so that w0 appears, multiplied by 1, for every observation. Okay, so this is the specific form our H matrix is going to take when we have an intercept term in the model.

Now let's return to our standard ridge regression objective, RSS(w) + lambda ||w||_2 squared, where that w vector being penalized included w0, the coefficient representing the intercept of the model. So a question is, does this really make sense to do? What this is doing is encouraging that intercept term to be small; that's what the ridge regression penalty does. And do we want a small intercept? It's useful to think about doing ridge regression when you're adding lots and lots of features, but regardless of how many features you add to your model, does that really matter in how we think about the magnitude of the intercept? Not really. So it probably doesn't make a lot of sense, intuitively, to shrink the intercept just because we have this very flexible model with lots of other features. So let's think about how to address this.

Okay, the first option we have is not to penalize the intercept term. The way we can do that is to separate out that w0 coefficient from all the other w's, w1, w2, all the way up to wD, when we're thinking about that penalty term. So we have the residual sum of squares of w0 and what I'll call w_rest, all those other w's, and when we add our ridge regression penalty, the 2-norm is only taken of that w_rest vector, all those w's not including our intercept. So a question is, how do we implement this in practice? How is this going to modify the closed-form solution or the gradient descent algorithm that we showed previously, when we weren't handling this special case?

The very simple modification we can make is to define something that I'm calling Imod, a modified identity matrix that has a 0 in its first entry, that is, in the (1,1) entry, and whose other elements are exactly the same as the identity matrix from before. So, to be explicit, our H-transpose-H term is going to look just as it did before, but now this lambda Imod has a 0 in the entry corresponding to the w0 index, lambdas as before everywhere else on the diagonal, and of course still 0s off the diagonal.
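To make that closed-form modification concrete, here is a minimal numpy sketch. The function name ridge_closed_form, the variables H, y, and lam, and the toy data are my own illustration rather than anything from the lecture; the only assumption is that the first column of H is the constant (all-ones) feature.

```python
import numpy as np

def ridge_closed_form(H, y, lam, penalize_intercept=False):
    """Closed-form ridge solution w = (H^T H + lam * I_mod)^(-1) H^T y.

    Assumes the first column of H is the constant (all-ones) feature.
    When penalize_intercept is False, I_mod gets a 0 in its (0, 0) entry,
    so the intercept w0 is not shrunk toward zero.
    """
    num_features = H.shape[1]
    I_mod = np.eye(num_features)
    if not penalize_intercept:
        I_mod[0, 0] = 0.0          # do not penalize the intercept term
    return np.linalg.solve(H.T @ H + lam * I_mod, H.T @ y)

# Tiny usage example with the constant feature in the first column
H = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
y = np.array([1.5, 3.1, 4.4, 6.2])
w_hat = ridge_closed_form(H, y, lam=0.5)   # w_hat[0] is the intercept
```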
Okay, now let's look at our gradient descent algorithm. Here it's going to be very simple: we just add a special case so that if we're updating our intercept term, that zeroth feature, we use our old least squares update with no shrinkage of w0; otherwise, for all other features, we do the ridge update. So we see that, algorithmically, it's very straightforward to make this modification where we don't want to penalize the intercept term.

But there's another option, which is to transform the data. In particular, if we center the data about 0 as a pre-processing step, then it doesn't matter so much that we're shrinking the intercept towards 0 and not correcting for that, because when the data are centered about 0, we generally expect the intercept to be pretty small. So here what I'm saying is, step one, first transform all of our y observations to have mean 0, and then as a second step, run exactly the ridge regression we described at the beginning of this module, where we don't account for the intercept term at all. So that's another perfectly reasonable solution to this problem. Both options are sketched below.
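Here is a rough numpy sketch of both options. The names ridge_gradient_descent, step_size, and num_iters, along with the toy data and the step size itself, are my own placeholders and would need tuning in practice; the sketch assumes, as above, that the first column of H is the constant feature.

```python
import numpy as np

def ridge_gradient_descent(H, y, lam, step_size=1e-3, num_iters=1000):
    """Ridge gradient descent that does not shrink the intercept w0.

    For feature j:  w_j <- w_j - step * ( -2 H_j^T (y - H w) + 2 lam w_j ),
    except j = 0 (the intercept), where the 2 * lam * w_j shrinkage term is dropped.
    """
    w = np.zeros(H.shape[1])
    for _ in range(num_iters):
        errors = y - H @ w                        # residuals under current w
        gradient = -2 * H.T @ errors + 2 * lam * w
        gradient[0] = -2 * H[:, 0] @ errors       # intercept: plain least squares update
        w = w - step_size * gradient
    return w

# Option 1: gradient descent with no penalty on the intercept
H = np.column_stack([np.ones(5), np.arange(5, dtype=float)])
y = np.array([0.9, 2.1, 3.2, 3.9, 5.1])
w_hat = ridge_gradient_descent(H, y, lam=0.1)

# Option 2 instead: center y, then run the standard ridge that penalizes everything
y_centered = y - y.mean()   # with centered data, the intercept tends to be near 0
```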