[MUSIC] So, these ideas lead directly to a related, but slightly different, approach called kernel regression. Let's start by just recalling what we did for weighted k-NN: we took our set of k nearest neighbors, and we applied a weight to each nearest neighbor based on how close that neighbor was to our query point. Well, for kernel regression, instead of just weighting some set of k nearest neighbors, we're gonna apply weights to every observation in our training set. So in particular, our predicted value is gonna sum over every observation, with a weight c_qi on each one of these data points, and then normalize by the sum of those weights, so it's a weighted average. And these weights are gonna be defined according to the kernels that we talked about for weighted k-NN. In statistics, this is often called the Nadaraya-Watson kernel-weighted average.

So let's see what effect this kernel regression has in practice. What we're showing is this yellow, curved region, which represents the kernel that we're choosing. In this case, it's called the Epanechnikov kernel, and what we see is some set of red, highlighted observations for our given target point, in this case called x0. And I want to emphasize that we didn't set a fixed number of observations to highlight as red like we did in k-NN. Here what we did is we chose a kernel with a given bandwidth, that's the lambda parameter we discussed earlier, and that defines a region in which our observations are gonna be included in our weighting. Because, in this case, when we're looking at this Epanechnikov kernel, this kernel has bounded support, and that's in contrast to, for example, the Gaussian kernel. What that means is that there's some region in which points will get these decaying weights, and then outside this region the observations are completely discarded from our fit at this one point.

So, what we do is, at every target point x0 in our input space, we weight our observations by this kernel, we compute this weighted average, and we say that is our predicted value. We do that at each point in our input space to carve out this green line as our predicted fit. So the result of this kernel regression isn't very different from what the fit would look like from weighted k-NN. It's just that in this case, instead of specifying k, we're specifying a region, based on this kernel bandwidth, for weighting the observations to form this fit. But what we see is that this kernel regression fit, which, like I've alluded to, should look fairly similar to our weighted k-NN, looks a lot smoother than our standard k-NN fit.

So there are two important questions when doing kernel regression. One is which kernel to use, and the other is, for a given kernel, what bandwidth should I use? But typically the choice of the bandwidth matters much more than the choice of kernel. So to motivate this, let's just look again at this Epanechnikov kernel with a couple of different choices of bandwidth. And what we see is that the fit dramatically changes. Here, for example, with a very small bandwidth, we get much wilder overfitting. Here, for this lambda value, things look pretty reasonable. But when we choose a bandwidth that's too large, we start to get oversmoothing. So this is an oversmoothed fit to the data that's not making very good predictions. So we can think of this in terms of the bias-variance trade-off. This lambda parameter is controlling how closely we're fitting our observations: a small bandwidth has low bias but high variance, very sensitive to the observations.
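To make that concrete, here's a minimal sketch of the Nadaraya-Watson computation in Python. This isn't the course's own code; the toy data, function names, and bandwidth values are made up for illustration. It weights every training point with the Epanechnikov kernel, which has bounded support, and evaluates the weighted average at a grid of query points for a small and a large bandwidth.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel: nonzero only for |t| <= 1 (bounded support)."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def kernel_regression(x_train, y_train, x_query, lam, kernel=epanechnikov):
    """Nadaraya-Watson prediction: a kernel-weighted average over ALL training points."""
    preds = []
    for x0 in np.atleast_1d(x_query):
        weights = kernel((x_train - x0) / lam)   # c_qi for each training point
        if weights.sum() == 0:                   # no observation falls inside the bandwidth
            preds.append(np.nan)
        else:
            preds.append(np.sum(weights * y_train) / np.sum(weights))
    return np.array(preds)

# Toy data: a noisy sine curve, purely for illustration.
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 10, 100))
y_train = np.sin(x_train) + rng.normal(scale=0.3, size=x_train.size)

x_grid = np.linspace(0, 10, 200)
fit_wiggly = kernel_regression(x_train, y_train, x_grid, lam=0.2)  # small bandwidth
fit_smooth = kernel_regression(x_train, y_train, x_grid, lam=3.0)  # large bandwidth
```

The small-bandwidth fit chases the noise in the individual training points, while the large-bandwidth fit barely reacts to any one observation, which is exactly the bias-variance trade-off described above.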
With that small bandwidth, if I change the observations, I get a dramatically different fit for kernel regression, versus over here, for a very large bandwidth, it's the opposite. I have very low variance: as I change my data, I'm not gonna change the fit very much. But I have high bias, with significant differences relative to the true function, which is shown in blue.

But just to show how insensitive we are to the choice of kernel, here in the middle plot I'm just gonna change the kernel from this Epanechnikov to our boxcar kernel. The boxcar kernel, instead of having weights that decay with distance, has a fixed set of weights, but we're only gonna include observations within some fixed window, just like the Epanechnikov kernel. So this boxcar kernel starts to look very, very similar to our standard k-NN, except that instead of fixing what k is, you're fixing a neighborhood about your query point. And what you see is that although this boxcar window has the types of discontinuities we saw with k-NN, because observations are either in or they're out, the fit looks fairly similar to our Epanechnikov kernel. So this is why we're saying that the choice of bandwidth has a much larger impact than the choice of kernel.

So this leads to the next important question: how are we gonna choose our bandwidth parameter lambda that we said matters so much? Or, when we're talking about k-NN, the equivalent parameter we have to choose is k, the number of nearest neighbors we're gonna look at. And in k-NN, I just wanna mention this now, we saw a similar kind of bias-variance trade-off, where for 1 nearest neighbor we saw these crazy, wildly overfit functions, but for some larger k value we had much more reasonable and well-behaved fits. So again, we have the same type of bias-variance trade-off for that parameter as well. And so the question is, how are we choosing these tuning parameters for the methods that we're looking at? Well, it's the same story as before, so we don't have to go into the lengthy conversations that we've had in past modules, and hopefully you know the answer is cross validation, or using some validation set, assuming you have enough data to do that. [MUSIC]
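To tie those last two points together, here's a short continuation of the earlier sketch; it's again just an illustration, reusing the hypothetical kernel_regression helper and toy data defined above rather than any code from the course. It swaps in a boxcar kernel at the same bandwidth to show how little the kernel choice moves the fit, and then picks lambda from a small candidate grid by minimizing the residual sum of squares on a held-out validation set.

```python
import numpy as np

def boxcar(t):
    """Boxcar kernel: constant weight inside the window, zero outside,
    so it behaves like k-NN except the neighborhood is set by lambda, not k."""
    return np.where(np.abs(t) <= 1, 0.5, 0.0)

# Kernel choice matters little at a fixed bandwidth: these two fits are close.
fit_epan = kernel_regression(x_train, y_train, x_grid, lam=1.0)
fit_box = kernel_regression(x_train, y_train, x_grid, lam=1.0, kernel=boxcar)

# Bandwidth choice matters a lot, so select lambda on a held-out validation set.
rng = np.random.default_rng(1)
idx = rng.permutation(x_train.size)
valid_idx, fit_idx = idx[:30], idx[30:]

def validation_rss(lam):
    preds = kernel_regression(x_train[fit_idx], y_train[fit_idx],
                              x_train[valid_idx], lam)
    ok = ~np.isnan(preds)                        # skip queries with no point in range
    return np.sum((y_train[valid_idx][ok] - preds[ok]) ** 2)

best_lam = min([0.1, 0.3, 1.0, 3.0, 10.0], key=validation_rss)
print("bandwidth selected on the validation set:", best_lam)
```

With enough data, full cross validation would just repeat this validation-set search over multiple splits and average the errors before picking lambda.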