So now let's step back and discuss some important theoretical and practical aspects of k-nearest neighbors and kernel regression.

If you remember, the title of this module was Going Nonparametric, and we've yet to mention what that means. What is a nonparametric approach? Well, k-nearest neighbors and kernel regression are examples of nonparametric approaches. The general goal of a nonparametric approach is to be really flexible in how you're defining f(x), and in general you want to make as few assumptions as possible. The real key that defines a nonparametric method is that the complexity of the fit can grow as you get more data points. We've definitely seen that with k-nearest neighbors and kernel regression; in particular, the fit is a function of how many observations you have. But these are just two examples of nonparametric methods you might use for regression. There are lots of other choices: things like splines, and trees, which we'll talk about in the classification course, and locally weighted structured versions of the types of regression models we've talked about. So nonparametrics is all about this idea of having the complexity grow with the number of observations.

So now let's talk about the limiting behavior of nearest neighbor regression as you get more and more data. To start with, let's just assume that we get completely noiseless data, so every observation we get lies exactly on the true function. Well, in this case, the mean squared error of 1-nearest neighbor regression goes to zero as you get more and more data. But let's just remember what mean squared error is. If you remember from a couple of modules ago, we talked about the bias-variance tradeoff, and mean squared error is the sum of bias squared plus variance. So having mean squared error go to zero means that both bias and variance are going to zero.
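To make that claim concrete, here is a minimal numerical sketch (not the instructor's code) of 1-nearest neighbor regression on noiseless data. The true function f_true, the interval [0, 1], and the helper knn_predict are all illustrative assumptions standing in for the curve used in the lecture's movies; the point is just that the empirical true error shrinks toward zero as the number of observations grows.

```python
import numpy as np

def f_true(x):
    # Hypothetical stand-in for the "true curve" shown in the lecture slides.
    return np.sin(4 * x)

def knn_predict(x_train, y_train, x_query, k=1):
    # For each query point, average the targets of its k nearest training points.
    preds = np.empty_like(x_query)
    for i, xq in enumerate(x_query):
        idx = np.argsort(np.abs(x_train - xq))[:k]
        preds[i] = y_train[idx].mean()
    return preds

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 500)

for n in [10, 100, 1000, 10000]:
    x_train = rng.uniform(0, 1, n)
    y_train = f_true(x_train)              # noiseless: every point lies exactly on f
    y_hat = knn_predict(x_train, y_train, x_test, k=1)
    mse = np.mean((y_hat - f_true(x_test)) ** 2)
    print(f"n = {n:6d}   1-NN true MSE ~ {mse:.6f}")   # shrinks toward 0 as n grows
```

Running this, the printed mean squared error keeps dropping as n increases, which is exactly the "bias and variance both go to zero" behavior described above.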
To motivate this visually, let's just look at a couple of movies. In this movie, I'm showing what the 1-nearest neighbor fit looks like as we're getting more and more data. Remember, the blue line is our true curve, and the green line is our current nearest neighbor fit, based on some set of observations that lie exactly on that blue curve. Okay, so here's our fit changing as we get more and more data, and what you see is that it's getting closer, and closer, and closer to the true function. Hopefully you can believe that, in the limit of getting an infinite number of observations spread over our input space, this nearest neighbor fit is going to lie exactly on top of the true function. And that's true of all possible data sets with an infinite number of observations that we might get.

In contrast, if we look at just doing a standard quadratic fit, the standard least squares fit we talked about before, then no matter how much data we get, there's always going to be some bias. We can see this here, where, especially at this point of curvature, the green fit never matches up to that true blue curve, even as we get lots and lots of observations. And that's because the true blue curve is actually part of a sinusoid; we've just zoomed in on a certain region of that sinusoid. So this quadratic fit is never going to exactly describe what a sinusoid is describing. This is what we talked about before: the bias that's inherent in our model, even if we have no noise. So even if we eliminate the noise, our true error, as we get more and more data, is never going to go exactly to zero, unless of course the data were generated from exactly the model that we're using to fit the data.

But in most cases, maybe you have data that looks like the following. This is our noiseless data, and remember, this was all for fixed model complexity: if we constrain our model to, say, be a quadratic, then maybe this will be our best quadratic fit. And no matter how many observations I give you from this more complicated function, this quadratic is never going to have zero bias. In contrast, let's switch colors here so that we can draw our 1-nearest neighbor fit. Our 1-nearest neighbor fit, as we get more and more data, is fitting these constants locally to each observation.
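Here is a rough companion sketch, under the same illustrative assumptions as above (a hypothetical sinusoidal f_true on [0, 1]), contrasting the two fits the movie compares: a quadratic least squares fit, whose error stalls at its bias, versus a 1-nearest neighbor fit, whose error keeps shrinking with more noiseless data.

```python
import numpy as np

def f_true(x):
    # Same hypothetical stand-in "true curve" as in the sketch above.
    return np.sin(4 * x)

rng = np.random.default_rng(1)
x_test = np.linspace(0, 1, 500)

n = 20000                                   # plenty of noiseless observations
x_train = rng.uniform(0, 1, n)
y_train = f_true(x_train)

# Quadratic least squares fit: its bias never vanishes for a sinusoidal truth.
coefs = np.polyfit(x_train, y_train, deg=2)
quad_mse = np.mean((np.polyval(coefs, x_test) - f_true(x_test)) ** 2)

# 1-nearest neighbor fit: local constants, so error keeps shrinking as n grows.
nn_idx = np.abs(x_train[None, :] - x_test[:, None]).argmin(axis=1)
knn_mse = np.mean((y_train[nn_idx] - f_true(x_test)) ** 2)

print(f"quadratic true MSE ~ {quad_mse:.4f}   (stuck at its bias)")
print(f"1-NN      true MSE ~ {knn_mse:.6f}  (shrinks toward 0 with n)")
```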
And as we get more and more data, that nearest neighbor fit is going to look exactly like the true curve. So when we talk about our true error with an increasing number of observations, this plot of the true error of 1-nearest neighbor regression goes to zero for noiseless data.

But now let's talk about the noisy case. This is the case that we're typically faced with in most applications. In this case, what you can say is that the mean squared error of nearest neighbor regression goes to zero if you allow the number of neighbors, the k in our k-nearest neighbor regression, to increase with the number of observations as well. Because if you think about getting tons and tons of observations and you keep k fixed, you're just going to be looking at a very, very local region of your input space, and you're going to have a lot of noise introduced from that. But if you allow k to grow, it's going to smooth over the noise that's being introduced.

So let's look at a visualization of this. Here what we're showing is the same true function we've shown throughout this module, but now with tons of observations, all these dots. They're noisy observations; they no longer lie exactly on that blue curve. That's why they form this cloud of blue points. And we see that our 1-nearest neighbor fit is very, very noisy. It has this wild behavior because, like we discussed before, 1-nearest neighbor is very sensitive to noise in the data. But in contrast, if we look at a large k, here k equals 200, our 200-nearest-neighbor fit looks much, much better. So you can imagine that as you get more and more observations, if you allow k to grow, you can smooth over the noise being introduced by each one of these observations and have the mean squared error go to zero. But in contrast, again, if we look at just our standard least squares regression, here in the case of a quadratic fit, we're always going to have bias. So nothing is different for the quadratic fit once we've introduced noise; if anything, the noise just makes things worse.
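Here is one last hedged sketch of the noisy case, again with an assumed sinusoidal f_true and an assumed noise level sigma. The rule k = sqrt(n) is just one common heuristic for letting k grow with n, not a prescription from the lecture. With k fixed at 1, the error plateaus roughly at the noise level, while with k growing it keeps decreasing.

```python
import numpy as np

def f_true(x):
    # Same hypothetical sinusoidal "true curve" as in the earlier sketches.
    return np.sin(4 * x)

def knn_predict(x_train, y_train, x_query, k):
    # Average the targets of the k nearest training points for each query point.
    preds = np.empty_like(x_query)
    for i, xq in enumerate(x_query):
        idx = np.argsort(np.abs(x_train - xq))[:k]
        preds[i] = y_train[idx].mean()
    return preds

rng = np.random.default_rng(2)
x_test = np.linspace(0, 1, 300)
sigma = 0.3                                  # illustrative noise standard deviation

for n in [200, 2000, 20000]:
    x_train = rng.uniform(0, 1, n)
    y_train = f_true(x_train) + rng.normal(0, sigma, n)   # noisy observations
    k_grow = max(1, int(np.sqrt(n)))         # one heuristic for growing k with n
    mse_fixed = np.mean((knn_predict(x_train, y_train, x_test, 1) - f_true(x_test)) ** 2)
    mse_grow = np.mean((knn_predict(x_train, y_train, x_test, k_grow) - f_true(x_test)) ** 2)
    print(f"n={n:6d}   k=1: MSE ~ {mse_fixed:.3f}   k={k_grow}: MSE ~ {mse_grow:.4f}")
```

As in the lecture's visualization, the fixed k = 1 fit stays wild no matter how much data arrives, while the fit with a growing k smooths over the noise and its true error keeps heading toward zero.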