KD-trees are one approach for efficient nearest neighbor search, but as we mentioned, there are other approaches we could consider. So let's look at one of these, called locality sensitive hashing.

Why might we want to consider an approach other than KD-trees? Well, KD-trees are really cool. They're a very intuitive way to think about storing data, and as we saw, they can lead to some significant efficiencies in performing nearest neighbor search. But there are a few issues. One is that KD-trees are not the simplest things to implement: you have to build out this tree structure, and that can be pretty challenging. The other, as we saw, is that as the dimensionality of our data increases, we can get really poor performance in terms of the efficiency of nearest neighbor search using KD-trees.

So let's talk about this a little bit more. What happens in high dimensions is that, basically, every point is far away from every other point. This is something we mentioned before. If we think about our KD-tree partitioning, remember that in a high-dimensional space the partitions form hypercubes. Given a query point, it's unlikely there will be any point within its hypercube, or even close by. So we search around, find some point, and that becomes our nearest neighbor so far in the algorithm we specified. And remember, we then draw a radius to that nearest neighbor so far, which in high dimensions defines a big hypersphere, and we can only prune a bin if the distance to that bin is greater than the distance to the nearest neighbor so far. But what happens in high dimensions is that this big sphere, this radius to our nearest neighbor so far, intersects lots and lots of these hypercubes in at least one dimension. So we can't prune them, and we end up having to search many, many partitions. In fact, you can show under some conditions that you actually need to search over 2^d different hypercubes, or little partitions, which completely defeats the purpose of the efficiencies gained by KD-trees, unless n, the number of observations, is enormously large, much, much greater than 2^d.
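To make this "every point is far away" effect concrete, here is a minimal sketch, not from the lecture, assuming NumPy; the dataset size n = 1000, the uniform data, and the dimensions tested are arbitrary illustrative choices. It shows that as d grows, the distance to the nearest neighbor becomes nearly as large as the distance to a typical point, which is why the pruning radius around the nearest neighbor so far ends up intersecting almost every partition.

```python
# Sketch: why KD-tree pruning breaks down in high dimensions.
# As d grows, the nearest point is barely closer than an average point,
# so the pruning radius overlaps almost every bin.
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # hypothetical dataset size

for d in [2, 10, 50, 200]:
    data = rng.random((n, d))      # points uniform in the unit hypercube
    query = rng.random(d)          # a random query point
    dists = np.linalg.norm(data - query, axis=1)
    nearest = dists.min()          # radius to the nearest neighbor
    typical = np.median(dists)     # distance to a "typical" point
    print(f"d={d:4d}  nearest={nearest:6.3f}  "
          f"median={typical:6.3f}  ratio={nearest / typical:.2f}")
# The ratio approaches 1 as d increases: the pruning radius is nearly as
# large as the distance to everything, so few partitions can be ruled out.
```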
So what can we do? Well, one thing we can think about doing is moving away from performing exact nearest neighbor search. This is something we discussed when we talked about KD-trees: we showed how we could use KD-trees to perform approximate nearest neighbor search. And that's going to be the focus of this section, because we're going to assume that we don't fully trust our distance metric as being exactly the way we want to characterize distances. And often, in a lot of applications, you just want to find a neighbor that's pretty close to the given query point. It doesn't necessarily have to be exactly the nearest neighbor, and for a lot of retrieval tasks that will actually provide enough performance for the given task. But here, in contrast to what we did for approximate nearest neighbor search with KD-trees, we're going to look at methods where we can provide a probabilistic guarantee on the approximation that we're making.
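As a rough illustration of what "pretty close" means, here is a toy sketch, again assuming NumPy and not any particular method from this course: it stands in for an approximate search by examining only a random 5% of the points, then compares the distance of the neighbor it finds against the true nearest distance from brute force.

```python
# Sketch: measuring the quality of an approximate nearest neighbor by the
# ratio of its distance to the exact nearest neighbor's distance.
import numpy as np

rng = np.random.default_rng(1)
n, d = 5000, 20                    # illustrative dataset size and dimension
data = rng.random((n, d))
query = rng.random(d)

dists = np.linalg.norm(data - query, axis=1)
exact_nn_dist = dists.min()        # exact nearest neighbor via brute force

# Stand-in for an approximate method: only examine a random 5% of the points.
candidates = rng.choice(n, size=n // 20, replace=False)
approx_nn_dist = dists[candidates].min()

print(f"exact distance : {exact_nn_dist:.3f}")
print(f"approx distance: {approx_nn_dist:.3f}")
print(f"ratio          : {approx_nn_dist / exact_nn_dist:.2f}")
# A ratio close to 1 means the approximate neighbor is "pretty close" to the
# true nearest neighbor, which is often all a retrieval task needs.
```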