So now let's step back and discuss some important theoretical and practical aspects of k-nearest neighbors and kernel regression.

If you remember, the title of this module was Going Nonparametric, and we've yet to mention what that means. What is a nonparametric approach? Well, k-nearest neighbors and kernel regression are examples of nonparametric approaches. The general goal of a nonparametric approach is to be really flexible in how you're defining f(x), and in general you want to make as few assumptions as possible. The real key that defines a nonparametric method is that the complexity of the fit can grow as you get more data points. We've definitely seen that with k-nearest neighbors and kernel regression; in particular, the fit is a function of how many observations you have. But these are just two examples of nonparametric methods you might use for regression. There are lots of other choices: things like splines, and trees, which we'll talk about in the classification course, and locally weighted structured versions of the types of regression models we've talked about. So nonparametrics is all about this idea of having the complexity grow with the number of observations.

So now let's talk about the limiting behavior of nearest neighbor regression as you get more and more data. To start with, let's just assume that we get completely noiseless data, so every observation we get lies exactly on the true function. Well, in this case, the mean squared error of 1-nearest neighbor regression goes to zero as you get more and more data. But let's just remember what mean squared error is. If you remember from a couple of modules ago, we talked about the bias-variance tradeoff, and mean squared error is the sum of bias squared plus variance. So having mean squared error go to zero means that both bias and variance are going to zero.
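To make that claim concrete, here is a minimal numerical sketch (not the instructor's code) of 1-nearest neighbor regression on noiseless data. The true function f_true, the interval [0, 1], and the helper knn_predict are all illustrative assumptions standing in for the curve used in the lecture's movies; the point is just that the empirical true error shrinks toward zero as the number of observations grows.

```python
import numpy as np

def f_true(x):
    # Hypothetical stand-in for the "true curve" shown in the lecture slides.
    return np.sin(4 * x)

def knn_predict(x_train, y_train, x_query, k=1):
    # For each query point, average the targets of its k nearest training points.
    preds = np.empty_like(x_query)
    for i, xq in enumerate(x_query):
        idx = np.argsort(np.abs(x_train - xq))[:k]
        preds[i] = y_train[idx].mean()
    return preds

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 500)

for n in [10, 100, 1000, 10000]:
    x_train = rng.uniform(0, 1, n)
    y_train = f_true(x_train)              # noiseless: every point lies exactly on f
    y_hat = knn_predict(x_train, y_train, x_test, k=1)
    mse = np.mean((y_hat - f_true(x_test)) ** 2)
    print(f"n = {n:6d}   1-NN true MSE ~ {mse:.6f}")   # shrinks toward 0 as n grows
```

Running this, the printed mean squared error keeps dropping as n increases, which is exactly the "bias and variance both go to zero" behavior described above.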
To motivate this visually, let's just look at a couple of movies. In this movie, I'm showing what the 1-nearest neighbor fit looks like as we're getting more and more data. Remember, the blue line is our true curve, and the green line is our current nearest neighbor fit, based on some set of observations that lie exactly on that blue curve. Okay, so here's our fit changing as we get more and more data, and what you see is that it's getting closer, and closer, and closer to the true function. Hopefully you can believe that, in the limit of getting an infinite number of observations spread over our input space, this nearest neighbor fit is going to lie exactly on top of the true function. And that's true of all possible data sets with an infinite number of observations that we might get.

In contrast, if we look at just doing a standard quadratic fit, the standard least squares fit we talked about before, then no matter how much data we get, there's always going to be some bias. We can see this here, where, especially at this point of curvature, the green fit never matches up to that true blue curve, even as we get lots and lots of observations. And that's because the true blue curve is actually part of a sinusoid; we've just zoomed in on a certain region of that sinusoid. So this quadratic fit is never going to exactly describe what a sinusoid is describing. This is what we talked about before: the bias that's inherent in our model, even if we have no noise. So even if we eliminate the noise, our true error, as we get more and more data, is never going to go exactly to zero, unless of course the data were generated from exactly the model that we're using to fit the data.

But in most cases, maybe you have data that looks like the following. This is our noiseless data, and remember, this was all for fixed model complexity: if we constrain our model to, say, be a quadratic, then maybe this will be our best quadratic fit. And no matter how many observations I give you from this more complicated function, this quadratic is never going to have zero bias. In contrast, let's switch colors here so that we can draw our 1-nearest neighbor fit. Our 1-nearest neighbor fit, as we get more and more data, is fitting these constants locally to each observation.
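Here is a rough companion sketch, under the same illustrative assumptions as above (a hypothetical sinusoidal f_true on [0, 1]), contrasting the two fits the movie compares: a quadratic least squares fit, whose error stalls at its bias, versus a 1-nearest neighbor fit, whose error keeps shrinking with more noiseless data.

```python
import numpy as np

def f_true(x):
    # Same hypothetical stand-in "true curve" as in the sketch above.
    return np.sin(4 * x)

rng = np.random.default_rng(1)
x_test = np.linspace(0, 1, 500)

n = 20000                                   # plenty of noiseless observations
x_train = rng.uniform(0, 1, n)
y_train = f_true(x_train)

# Quadratic least squares fit: its bias never vanishes for a sinusoidal truth.
coefs = np.polyfit(x_train, y_train, deg=2)
quad_mse = np.mean((np.polyval(coefs, x_test) - f_true(x_test)) ** 2)

# 1-nearest neighbor fit: local constants, so error keeps shrinking as n grows.
nn_idx = np.abs(x_train[None, :] - x_test[:, None]).argmin(axis=1)
knn_mse = np.mean((y_train[nn_idx] - f_true(x_test)) ** 2)

print(f"quadratic true MSE ~ {quad_mse:.4f}   (stuck at its bias)")
print(f"1-NN      true MSE ~ {knn_mse:.6f}  (shrinks toward 0 with n)")
```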
And as we get more and more data, that nearest neighbor fit is going to look exactly like the true curve. So when we talk about our true error with an increasing number of observations, this plot of the true error of 1-nearest neighbor regression goes to zero for noiseless data.

But now let's talk about the noisy case. This is the case that we're typically faced with in most applications. In this case, what you can say is that the mean squared error of nearest neighbor regression goes to zero if you allow the number of neighbors, the k in our k-nearest neighbor regression, to increase with the number of observations as well. Because if you think about getting tons and tons of observations and you keep k fixed, you're just going to be looking at a very, very local region of your input space, and you're going to have a lot of noise introduced from that. But if you allow k to grow, it's going to smooth over the noise that's being introduced.

So let's look at a visualization of this. Here what we're showing is the same true function we've shown throughout this module, but now with tons of observations, all these dots. They're noisy observations; they no longer lie exactly on that blue curve. That's why they form this cloud of blue points. And we see that our 1-nearest neighbor fit is very, very noisy. It has this wild behavior because, like we discussed before, 1-nearest neighbor is very sensitive to noise in the data. But in contrast, if we look at a large k, here k equals 200, our 200-nearest-neighbor fit looks much, much better. So you can imagine that as you get more and more observations, if you allow k to grow, you can smooth over the noise being introduced by each one of these observations and have the mean squared error go to zero. But in contrast, again, if we look at just our standard least squares regression, here in the case of a quadratic fit, we're always going to have bias. So nothing is different for the quadratic fit once we've introduced noise; if anything, the noise just makes things worse.
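Here is one last hedged sketch of the noisy case, again with an assumed sinusoidal f_true and an assumed noise level sigma. The rule k = sqrt(n) is just one common heuristic for letting k grow with n, not a prescription from the lecture. With k fixed at 1, the error plateaus roughly at the noise level, while with k growing it keeps decreasing.

```python
import numpy as np

def f_true(x):
    # Same hypothetical sinusoidal "true curve" as in the earlier sketches.
    return np.sin(4 * x)

def knn_predict(x_train, y_train, x_query, k):
    # Average the targets of the k nearest training points for each query point.
    preds = np.empty_like(x_query)
    for i, xq in enumerate(x_query):
        idx = np.argsort(np.abs(x_train - xq))[:k]
        preds[i] = y_train[idx].mean()
    return preds

rng = np.random.default_rng(2)
x_test = np.linspace(0, 1, 300)
sigma = 0.3                                  # illustrative noise standard deviation

for n in [200, 2000, 20000]:
    x_train = rng.uniform(0, 1, n)
    y_train = f_true(x_train) + rng.normal(0, sigma, n)   # noisy observations
    k_grow = max(1, int(np.sqrt(n)))         # one heuristic for growing k with n
    mse_fixed = np.mean((knn_predict(x_train, y_train, x_test, 1) - f_true(x_test)) ** 2)
    mse_grow = np.mean((knn_predict(x_train, y_train, x_test, k_grow) - f_true(x_test)) ** 2)
    print(f"n={n:6d}   k=1: MSE ~ {mse_fixed:.3f}   k={k_grow}: MSE ~ {mse_grow:.4f}")
```

As in the lecture's visualization, the fixed k = 1 fit stays wild no matter how much data arrives, while the fit with a growing k smooths over the noise and its true error keeps heading toward zero.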