So, we've talked about the performance of nearest neighbor regression as you get tons and tons of observations, and things look pretty good in that case. Actually, amazingly good. But the picture isn't quite as nice when we start increasing the dimension of our input space, or if we just have a limited set of observations, because the performance of nearest neighbor regression, as we talked about before, is very sensitive to how well the data covers the input space.

If you remember, we had a plot early on where we removed observations from some region of our input space, and we got a pretty bad fit in that region. Well, now imagine you have a really, really high-dimensional space that you're looking at, and you need your observations to cover all points within this space. You're going to need a number of observations that is exponential in the dimension of the space in order to cover it and have good nearest neighbor performance. And that's typically really hard to do, to collect as many observations as you would need in these high-dimensional spaces. So, the cases where you have limited data relative to the dimension of the input space are where the parametric models we've talked about throughout the rest of this course become so important.

Another thing that's important to talk about is the complexity of our nearest neighbor search. The naive algorithm we've discussed so far, which just scans through the entire data set, can be quite computationally intensive. For example, if you have just one query point, x_q, and you're doing a 1-nearest-neighbor search, you have to scan through every single observation in your data set and compute the distance to each one. So, going from our query house to each of the different houses in our data set, computing the similarity between the houses, in order to find the nearest neighbor. That has complexity that's linear in the number of observations we have. Or, if we want to do our k-nearest-neighbors approach, we have to maintain a list of the k nearest neighbors, and if you do this with the sorted queue we've talked about, then you can do the search in O(N log k).
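Here's a minimal Python sketch of the naive search just described: a linear scan for 1-NN, and a bounded priority queue (Python's heapq) standing in for the sorted queue for k-NN. The function names and the choice of Euclidean distance are illustrative assumptions, not the course's implementation.

```python
import heapq
import numpy as np

def nearest_neighbor(X, query):
    """Naive 1-NN: scan all N observations, one distance each -- O(N)."""
    best_idx, best_dist = -1, np.inf
    for i, x in enumerate(X):
        dist = np.linalg.norm(x - query)   # Euclidean distance (assumed metric)
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx, best_dist

def k_nearest_neighbors(X, query, k):
    """Naive k-NN with a size-k priority queue -- O(N log k).

    heapq is a min-heap, so distances are negated: the heap root is
    always the *farthest* of the current k nearest neighbors.
    """
    heap = []                              # entries are (-distance, index)
    for i, x in enumerate(X):
        dist = np.linalg.norm(x - query)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, i))
        elif dist < -heap[0][0]:           # closer than the current k-th nearest
            heapq.heapreplace(heap, (-dist, i))
    return sorted((-d, i) for d, i in heap)  # (distance, index), nearest first
```

Either function touches every one of the N observations exactly once per query, which is the linear cost discussed next.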
But both of these, whether 1-NN or k-NN, have complexity linear in the number of observations we have. And what if N is huge? Because that's exactly the situation where we've said these k-NN approaches work so well. Well, the search can be really, really intensive, especially if you have to do many queries. If you want to predict the values at many points in your input space, this becomes a very expensive procedure. So, instead, we're going to talk about more efficient methods for doing this type of nearest neighbor search in our Clustering and Retrieval course later in this specialization.
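To make those two scaling problems concrete, here's a back-of-envelope sketch; all of the numbers are hypothetical, chosen only for illustration.

```python
# 1. Curse of dimensionality: covering the unit hypercube [0, 1]^d at a
#    grid spacing of 0.1 along each axis takes 10^d observations --
#    exponential in the dimension d.
for d in (1, 2, 10, 100):
    print(f"d = {d:>3}: ~{10.0 ** d:.0e} observations to cover the space")

# 2. Cost of many queries: brute-force search rescans all N observations
#    for every query, so M queries cost about M * N distance computations.
N = 1_000_000   # observations (e.g., houses) in the data set
M = 10_000      # query points where we want predictions
print(f"brute-force 1-NN over all queries: ~{M * N:.0e} distance computations")
```

Those roughly 10^10 distance computations per batch of queries are what the more efficient search methods in the Clustering and Retrieval course are designed to avoid.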