At the beginning of this module, we talked about this idea of fitting globally versus fitting locally. Now that we've seen k nearest neighbors and kernel regression, I want to formalize this idea. In particular, let's look at what happens when we just fit a constant function to our data.

In that case, we're computing what's called a global average: we take all of our observations, add them together, and divide by the total number of observations. That's exactly equivalent to summing over a weighted set of our observations, where the weight is exactly the same on each data point, and then dividing by the total sum of those weights.

Now that we've put our global average in this form, things start to look very similar to the kernel regression ideas we've looked at. It's almost like kernel regression, except that we're including every observation in our fit and placing exactly the same weight on every observation. That's like using the boxcar kernel, which puts the same weight on all observations, together with a massively large bandwidth parameter, so that for every point in our input space all the other observations are included in the fit.

But now let's contrast that with the more standard version of kernel regression, which leads to what we're going to think of as locally constant fits. If we look at the kernel regression equation, it's exactly what we had for the global average, except that each observation is now weighted by the kernel. In many cases, what that kernel is doing is putting a hard limit: observations outside of a window around whatever target point we're looking at are left out of the calculation. The simplest case is the boxcar kernel, which puts equal weight on all observations, but only those local to our target point x_0. So we get a constant fit, but just at that one target point, and then a different constant fit at the next target point, and the next one, and the next one. I want to be clear, though, that the resulting output isn't a staircase kind of function.
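In equation form the connection is explicit. The notation below is standard kernel-regression notation rather than a reproduction of the slides: y_i are the observed responses, N the number of observations, c an arbitrary constant weight, K_lambda the kernel with bandwidth lambda, and x_0 a target point.

$$\hat{y} \;=\; \frac{\sum_{i=1}^{N} c\, y_i}{\sum_{i=1}^{N} c} \;=\; \frac{1}{N}\sum_{i=1}^{N} y_i \qquad \text{(global average: equal weight on every observation)}$$

$$\hat{y}(x_0) \;=\; \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)} \qquad \text{(kernel regression: locally constant fit at the target point } x_0\text{)}$$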
It's not a collection of constant segments: at each target point x_0, we keep only the single value of that constant fit at x_0, and as we sweep over all of our different inputs, that's what traces out this green curve.

Okay, but let's look at another kernel, like our Epanechnikov kernel, whose weights decay over a fixed region. It's still doing a constant fit, but how is it figuring out what the level of that line should be at our target point? What it's doing is down-weighting observations that are further from our target point and more heavily emphasizing the observations that are closer to it. So this is still a weighted average, but it's no longer global; it's local, because we're only looking at observations within this defined window. We're doing this weighted average locally at each one of our input points and tracing out this green curve.

This hopefully makes very clear how, in the types of linear regression models we were talking about before, we were doing global fits, and in the simplest case that was just a constant model: the most basic model we could consider, having just the constant feature. Now what we're talking about is doing exactly the same thing, but locally, and so locally that it happens at every single point in our input space.

So this kernel regression method that we've described so far, we've now motivated as fitting a constant function locally at each observation, or really more than each observation, at each point in our input space. This is referred to as locally weighted averages. But instead of fitting a constant at each point in our input space, we could likewise have fit a line or a polynomial, and that leads to something called locally weighted linear regression. We're not going to go through the details of locally weighted linear regression in this module; it's fairly straightforward, the same idea as these local constant fits, but now plugging in a line or polynomial at each target point.
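As a rough illustration of these locally weighted averages, here is a minimal NumPy sketch of the local constant fit with both kernels. The data, bandwidth value, and function names are made up for illustration and are not from the course materials.

```python
import numpy as np

def boxcar(dist, lam):
    # Equal weight on every observation inside the window, zero outside.
    return (np.abs(dist) <= lam).astype(float)

def epanechnikov(dist, lam):
    # Weights decay smoothly to zero at the edge of the window.
    u = np.abs(dist) / lam
    return np.where(u <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_constant_fit(x_train, y_train, x_targets, kernel, lam):
    """At each target point, predict with a kernel-weighted average of y."""
    preds = []
    for x0 in x_targets:
        w = kernel(x_train - x0, lam)
        preds.append(np.sum(w * y_train) / np.sum(w) if w.sum() > 0 else np.nan)
    return np.array(preds)

# Toy data, purely for illustration.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 30))
y = np.sin(4.0 * x) + rng.normal(0.0, 0.2, size=30)

grid = np.linspace(0.0, 1.0, 200)
fit_box = local_constant_fit(x, y, grid, boxcar, lam=0.2)
fit_epa = local_constant_fit(x, y, grid, epanechnikov, lam=0.2)
```

Evaluating the fit on a dense grid of target points is what traces out a smooth curve, even though each individual prediction is just a weighted average computed at one point.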
I did want to leave you with a couple of rules of thumb for choosing among these local polynomial fits, though. One thing that fitting a local line instead of a local constant helps with is the boundary effects we talked about before: the fact that you get large biases at the boundary of the input space. You can show quite formally that local linear fits reduce that boundary bias, and local quadratic fits help with the bias you get at points of curvature in the interior of the input space. For example, think about that blue curve we've been trying to fit; it may be worth quickly jumping back to what our fit looks like. Towards the boundary we get large biases, and right at the point of curvature we also have a bias, where we're under-fitting the true curvature of that blue function. The local quadratic fit helps with fitting that curvature, but it actually leads to a larger variance, and that can be unattractive. So in general, the basic recommendation is to use standard local linear regression: fitting lines at every point in the input space.
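Following the same pattern, a possible sketch of a locally weighted linear fit is below. It reuses the epanechnikov kernel, toy data, and grid from the previous sketch, and again the names and bandwidth are only illustrative, not the course's implementation.

```python
def local_linear_fit(x_train, y_train, x_targets, kernel, lam):
    """At each target point, solve a kernel-weighted least-squares problem
    for an intercept and slope, and keep only the fitted value there."""
    preds = []
    for x0 in x_targets:
        w = kernel(x_train - x0, lam)
        if w.sum() == 0:
            preds.append(np.nan)
            continue
        sqrt_w = np.sqrt(w)
        # Constant feature plus the input centered at the target point,
        # so the intercept is exactly the fitted value at x0.
        A = np.column_stack([np.ones_like(x_train), x_train - x0])
        coef, *_ = np.linalg.lstsq(A * sqrt_w[:, None], y_train * sqrt_w, rcond=None)
        preds.append(coef[0])
    return np.array(preds)

# Reuses epanechnikov, x, y, and grid defined in the earlier sketch.
fit_lin = local_linear_fit(x, y, grid, epanechnikov, lam=0.2)
```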