[MUSIC] So, these ideas lead directly to a related, but slightly different, approach called kernel regression. Let's start by just recalling what we did for weighted k-NN: we took our set of k nearest neighbors, and we applied a weight to each nearest neighbor based on how close that neighbor was to our query point. Well, for kernel regression, instead of just weighting some set of k nearest neighbors, we're gonna apply weights to every observation in our training set. So in particular, our predicted value is gonna sum over every observation, with a weight c_qi on each one of these data points, and then normalize by the sum of those weights, so it's a weighted average. And these weights are gonna be defined according to the kernels that we talked about for weighted k-NN. In statistics, this is often called the Nadaraya-Watson kernel-weighted average.

So let's see what effect this kernel regression has in practice. What we're showing is this yellow, curved region, which represents the kernel that we're choosing. In this case, it's called the Epanechnikov kernel, and what we see is some set of red, highlighted observations for our given target point, in this case called x0. And I want to emphasize that we didn't set a fixed number of observations to highlight as red like we did in k-NN. Here what we did is we chose a kernel with a given bandwidth, that's the lambda parameter we discussed earlier, and that defines a region in which our observations are gonna be included in our weighting. Because, in this case, when we're looking at this Epanechnikov kernel, this kernel has bounded support, and that's in contrast to, for example, the Gaussian kernel. What that means is that there's some region in which points will get these decaying weights, and then outside this region the observations are completely discarded from our fit at this one point.

So, what we do is, at every target point x0 in our input space, we weight our observations by this kernel, we compute this weighted average, and we say that is our predicted value. We do that at each point in our input space to carve out this green line as our predicted fit. So the result of this kernel regression isn't very different from what the fit would look like from weighted k-NN. It's just that in this case, instead of specifying k, we're specifying a region, based on this kernel bandwidth, for weighting the observations to form this fit. But what we see is that this kernel regression fit, which, like I've alluded to, should look fairly similar to our weighted k-NN, looks a lot smoother than our standard k-NN fit.

So there are two important questions when doing kernel regression. One is which kernel to use, and the other is, for a given kernel, what bandwidth should I use? But typically the choice of the bandwidth matters much more than the choice of kernel. So to motivate this, let's just look again at this Epanechnikov kernel with a couple of different choices of bandwidth. And what we see is that the fit dramatically changes. Here, for example, with a very small bandwidth, we get much wilder overfitting. Here, for this lambda value, things look pretty reasonable. But when we choose a bandwidth that's too large, we start to get oversmoothing. So this is an oversmoothed fit to the data that's not making very good predictions. So we can think of this in terms of the bias-variance trade-off. This lambda parameter is controlling how closely we're fitting our observations: a small bandwidth has low bias but high variance, very sensitive to the observations.
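To make that concrete, here's a minimal sketch of the Nadaraya-Watson computation in Python. This isn't the course's own code; the toy data, function names, and bandwidth values are made up for illustration. It weights every training point with the Epanechnikov kernel, which has bounded support, and evaluates the weighted average at a grid of query points for a small and a large bandwidth.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel: nonzero only for |t| <= 1 (bounded support)."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def kernel_regression(x_train, y_train, x_query, lam, kernel=epanechnikov):
    """Nadaraya-Watson prediction: a kernel-weighted average over ALL training points."""
    preds = []
    for x0 in np.atleast_1d(x_query):
        weights = kernel((x_train - x0) / lam)   # c_qi for each training point
        if weights.sum() == 0:                   # no observation falls inside the bandwidth
            preds.append(np.nan)
        else:
            preds.append(np.sum(weights * y_train) / np.sum(weights))
    return np.array(preds)

# Toy data: a noisy sine curve, purely for illustration.
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 10, 100))
y_train = np.sin(x_train) + rng.normal(scale=0.3, size=x_train.size)

x_grid = np.linspace(0, 10, 200)
fit_wiggly = kernel_regression(x_train, y_train, x_grid, lam=0.2)  # small bandwidth
fit_smooth = kernel_regression(x_train, y_train, x_grid, lam=3.0)  # large bandwidth
```

The small-bandwidth fit chases the noise in the individual training points, while the large-bandwidth fit barely reacts to any one observation, which is exactly the bias-variance trade-off described above.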
With that small bandwidth, if I change the observations, I get a dramatically different fit for kernel regression, versus over here, for a very large bandwidth, it's the opposite. I have very low variance: as I change my data, I'm not gonna change the fit very much. But I have high bias, with significant differences relative to the true function, which is shown in blue.

But just to show how insensitive we are to the choice of kernel, here in the middle plot I'm just gonna change the kernel from this Epanechnikov to our boxcar kernel. The boxcar kernel, instead of having weights that decay with distance, has a fixed set of weights, but we're only gonna include observations within some fixed window, just like the Epanechnikov kernel. So this boxcar kernel starts to look very, very similar to our standard k-NN, except that instead of fixing what k is, you're fixing a neighborhood about your query point. And what you see is that although this boxcar window has the types of discontinuities we saw with k-NN, because observations are either in or they're out, the fit looks fairly similar to our Epanechnikov kernel. So this is why we're saying that the choice of bandwidth has a much larger impact than the choice of kernel.

So this leads to the next important question: how are we gonna choose our bandwidth parameter lambda that we said matters so much? Or, when we're talking about k-NN, the equivalent parameter we have to choose is k, the number of nearest neighbors we're gonna look at. And in k-NN, I just wanna mention this now, we saw a similar kind of bias-variance trade-off, where for 1 nearest neighbor we saw these crazy, wildly overfit functions, but for some larger k value we had much more reasonable and well-behaved fits. So again, we have the same type of bias-variance trade-off for that parameter as well. And so the question is, how are we choosing these tuning parameters for the methods that we're looking at? Well, it's the same story as before, so we don't have to go into the lengthy conversations that we've had in past modules, and hopefully you know the answer is cross validation, or using some validation set, assuming you have enough data to do that. [MUSIC]
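To tie those last two points together, here's a short continuation of the earlier sketch; it's again just an illustration, reusing the hypothetical kernel_regression helper and toy data defined above rather than any code from the course. It swaps in a boxcar kernel at the same bandwidth to show how little the kernel choice moves the fit, and then picks lambda from a small candidate grid by minimizing the residual sum of squares on a held-out validation set.

```python
import numpy as np

def boxcar(t):
    """Boxcar kernel: constant weight inside the window, zero outside,
    so it behaves like k-NN except the neighborhood is set by lambda, not k."""
    return np.where(np.abs(t) <= 1, 0.5, 0.0)

# Kernel choice matters little at a fixed bandwidth: these two fits are close.
fit_epan = kernel_regression(x_train, y_train, x_grid, lam=1.0)
fit_box = kernel_regression(x_train, y_train, x_grid, lam=1.0, kernel=boxcar)

# Bandwidth choice matters a lot, so select lambda on a held-out validation set.
rng = np.random.default_rng(1)
idx = rng.permutation(x_train.size)
valid_idx, fit_idx = idx[:30], idx[30:]

def validation_rss(lam):
    preds = kernel_regression(x_train[fit_idx], y_train[fit_idx],
                              x_train[valid_idx], lam)
    ok = ~np.isnan(preds)                        # skip queries with no point in range
    return np.sum((y_train[valid_idx][ok] - preds[ok]) ** 2)

best_lam = min([0.1, 0.3, 1.0, 3.0, 10.0], key=validation_rss)
print("bandwidth selected on the validation set:", best_lam)
```

With enough data, full cross validation would just repeat this validation-set search over multiple splits and average the errors before picking lambda.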