At the beginning of this module, we talked about this idea of fitting globally versus fitting locally. Now that we've seen k nearest neighbors and kernel regression, I want to formalize this idea. In particular, let's look at what happens when we just fit a constant function to our data.

In that case, we're computing what's called a global average: we take all of our observations, add them together, and divide by the total number of observations. That's exactly equivalent to summing over a weighted set of our observations, where the weight is exactly the same on each data point, and then dividing by the total sum of those weights.

Now that we've put our global average in this form, things start to look very similar to the kernel regression ideas we've looked at. It's almost like kernel regression, except that we're including every observation in our fit and placing exactly the same weight on every observation. That's like using the boxcar kernel, which puts the same weight on all observations, together with a massively large bandwidth parameter, so that for every point in our input space all the other observations are included in the fit.

But now let's contrast that with the more standard version of kernel regression, which leads to what we're going to think of as locally constant fits. If we look at the kernel regression equation, it's exactly what we had for the global average, except that each observation is now weighted by the kernel. In many cases, what that kernel is doing is putting a hard limit: observations outside of a window around whatever target point we're looking at are left out of the calculation. The simplest case is the boxcar kernel, which puts equal weight on all observations, but only those local to our target point x_0. So we get a constant fit, but just at that one target point, and then a different constant fit at the next target point, and the next one, and the next one. I want to be clear, though, that the resulting output isn't a staircase kind of function.
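In equation form the connection is explicit. The notation below is standard kernel-regression notation rather than a reproduction of the slides: y_i are the observed responses, N the number of observations, c an arbitrary constant weight, K_lambda the kernel with bandwidth lambda, and x_0 a target point.

$$\hat{y} \;=\; \frac{\sum_{i=1}^{N} c\, y_i}{\sum_{i=1}^{N} c} \;=\; \frac{1}{N}\sum_{i=1}^{N} y_i \qquad \text{(global average: equal weight on every observation)}$$

$$\hat{y}(x_0) \;=\; \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)} \qquad \text{(kernel regression: locally constant fit at the target point } x_0\text{)}$$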
It's not a collection of constant segments: at each target point x_0, we keep only the single value of that constant fit at x_0, and as we sweep over all of our different inputs, that's what traces out this green curve.

Okay, but let's look at another kernel, like our Epanechnikov kernel, whose weights decay over a fixed region. It's still doing a constant fit, but how is it figuring out what the level of that line should be at our target point? What it's doing is down-weighting observations that are further from our target point and more heavily emphasizing the observations that are closer to it. So this is still a weighted average, but it's no longer global; it's local, because we're only looking at observations within this defined window. We're doing this weighted average locally at each one of our input points and tracing out this green curve.

This hopefully makes very clear how, in the types of linear regression models we were talking about before, we were doing global fits, and in the simplest case that was just a constant model: the most basic model we could consider, having just the constant feature. Now what we're talking about is doing exactly the same thing, but locally, and so locally that it happens at every single point in our input space.

So this kernel regression method that we've described so far, we've now motivated as fitting a constant function locally at each observation, or really more than each observation, at each point in our input space. This is referred to as locally weighted averages. But instead of fitting a constant at each point in our input space, we could likewise have fit a line or a polynomial, and that leads to something called locally weighted linear regression. We're not going to go through the details of locally weighted linear regression in this module; it's fairly straightforward, the same idea as these local constant fits, but now plugging in a line or polynomial at each target point.
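As a rough illustration of these locally weighted averages, here is a minimal NumPy sketch of the local constant fit with both kernels. The data, bandwidth value, and function names are made up for illustration and are not from the course materials.

```python
import numpy as np

def boxcar(dist, lam):
    # Equal weight on every observation inside the window, zero outside.
    return (np.abs(dist) <= lam).astype(float)

def epanechnikov(dist, lam):
    # Weights decay smoothly to zero at the edge of the window.
    u = np.abs(dist) / lam
    return np.where(u <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_constant_fit(x_train, y_train, x_targets, kernel, lam):
    """At each target point, predict with a kernel-weighted average of y."""
    preds = []
    for x0 in x_targets:
        w = kernel(x_train - x0, lam)
        preds.append(np.sum(w * y_train) / np.sum(w) if w.sum() > 0 else np.nan)
    return np.array(preds)

# Toy data, purely for illustration.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 30))
y = np.sin(4.0 * x) + rng.normal(0.0, 0.2, size=30)

grid = np.linspace(0.0, 1.0, 200)
fit_box = local_constant_fit(x, y, grid, boxcar, lam=0.2)
fit_epa = local_constant_fit(x, y, grid, epanechnikov, lam=0.2)
```

Evaluating the fit on a dense grid of target points is what traces out a smooth curve, even though each individual prediction is just a weighted average computed at one point.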
I did want to leave you with a couple of rules of thumb for choosing among these local polynomial fits, though. One thing that fitting a local line instead of a local constant helps with is the boundary effects we talked about before: the fact that you get large biases at the boundary of the input space. You can show quite formally that local linear fits reduce that boundary bias, and local quadratic fits help with the bias you get at points of curvature in the interior of the input space. For example, think about that blue curve we've been trying to fit; it may be worth quickly jumping back to what our fit looks like. Towards the boundary we get large biases, and right at the point of curvature we also have a bias, where we're under-fitting the true curvature of that blue function. The local quadratic fit helps with fitting that curvature, but it actually leads to a larger variance, and that can be unattractive. So in general, the basic recommendation is to use standard local linear regression: fitting lines at every point in the input space.
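Following the same pattern, a possible sketch of a locally weighted linear fit is below. It reuses the epanechnikov kernel, toy data, and grid from the previous sketch, and again the names and bandwidth are only illustrative, not the course's implementation.

```python
def local_linear_fit(x_train, y_train, x_targets, kernel, lam):
    """At each target point, solve a kernel-weighted least-squares problem
    for an intercept and slope, and keep only the fitted value there."""
    preds = []
    for x0 in x_targets:
        w = kernel(x_train - x0, lam)
        if w.sum() == 0:
            preds.append(np.nan)
            continue
        sqrt_w = np.sqrt(w)
        # Constant feature plus the input centered at the target point,
        # so the intercept is exactly the fitted value at x0.
        A = np.column_stack([np.ones_like(x_train), x_train - x0])
        coef, *_ = np.linalg.lstsq(A * sqrt_w[:, None], y_train * sqrt_w, rcond=None)
        preds.append(coef[0])
    return np.array(preds)

# Reuses epanechnikov, x, y, and grid defined in the earlier sketch.
fit_lin = local_linear_fit(x, y, grid, epanechnikov, lam=0.2)
```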