Okay, but let's go back to this idea that, depending on the structure, or rather the distribution of the data in the space, you can actually get things looking closer to log N. So here's an example where we have a fairly uniform distribution of points across the space, and for a given query we've pruned a huge amount of the search space and only had to search two different boxes. Whereas in that more contrived example where our data lay on a circle, based on the center heuristic we used to create the splits, we end up having to search almost every node in the tree. So that looks a lot closer to order N, so not much savings in that case.

Okay, so like I said, to really offset the cost of creating this data structure, it's useful to use kd-trees in cases where you're doing lots and lots of queries. So let's imagine asking for the nearest neighbor of every article in our corpus, which means we're doing N queries, and let's look at the complexity of this. Under a brute-force search, remember, each query was order N complexity, so for N queries we're going to have something that's order N squared. For kd-trees, on the other hand, it's going to be somewhere in the range of order N log N if we get lucky, and on the order of N squared if we get unlucky, just like the brute-force search. But just to note, this represents potentially very large savings for large N.

Okay, so let's look at a little empirical study of the number of inspections we have to do, first as a function of the number of observations and then as a function of the dimension of the space we're working in. So here, as we increase the number of observations N, we see roughly a log N trend. But on the other hand, when we go to higher and higher dimensions (just to be clear, we're saying our observations x_i live in some R^d space, so that's what d represents), we see an exponential increase in the number of inspections we have to do: the number of distance calculations, or really the number of nodes in the tree we visit, to find our nearest neighbor. So large d is really a regime where using kd-trees becomes pretty pointless. A small sketch of both experiments follows below.
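To make this concrete, here's a minimal sketch of the two experiments just described, using SciPy's kd-tree implementation (scipy.spatial.cKDTree). The corpus size N, the dimensions tried, and the uniform random data are illustrative assumptions, not the lecture's exact setup.

```python
import time
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
N = 2000                               # corpus size: an arbitrary choice for the demo

def brute_force_all_nn(X):
    # One O(N) scan per query point, so O(N^2) work for all N queries.
    nearest = np.empty(len(X), dtype=int)
    for i, q in enumerate(X):
        dists = np.linalg.norm(X - q, axis=1)
        dists[i] = np.inf              # a point is not its own neighbor
        nearest[i] = np.argmin(dists)
    return nearest

for d in (2, 5, 20, 100):              # pruning weakens as d grows
    X = rng.random((N, d))             # roughly uniform points in R^d: the "lucky" case
    t0 = time.perf_counter()
    brute = brute_force_all_nn(X)
    t1 = time.perf_counter()
    tree = cKDTree(X)                  # build once, reuse for all N queries
    _, idx = tree.query(X, k=2)        # k=2: nearest point other than the query itself
    t2 = time.perf_counter()
    assert np.array_equal(brute, idx[:, 1])
    print(f"d={d:3d}  brute force {t1 - t0:.2f}s  kd-tree {t2 - t1:.2f}s")
```

On roughly uniform data the tree answers all N queries far faster than the quadratic scan, and the gap narrows as d grows, because the bounding-box pruning becomes less and less effective in high dimensions.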
And we'll return to this to talk about it a little bit more, and also to think about an alternative approach that will be more helpful to us in high-dimensional scenarios.

But before that, I just want to mention how we can use kd-trees to do k nearest neighbor search instead of one nearest neighbor search, and the extension is very intuitive and straightforward. Instead of just maintaining the distance to the nearest neighbor, what we're going to do is maintain the distance to the kth nearest neighbor found so far. So in this example, we're looking at a two nearest neighbor task, and this is the distance to our second nearest neighbor found so far after the first search we've done, the search of this first box here. And then, just as before, based on the distances found so far, we backtrack and go search other nodes, and we prune a node if the distance to that kth nearest neighbor is smaller than the distance to the bounding box of the node we're looking at. So the algorithm is almost identical; the only thing you would need to change in code is, instead of maintaining the distance to the nearest neighbor, maintain the distance to the kth nearest neighbor found so far. A sketch of that change follows below.
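Here's a minimal sketch of that modification, assuming a toy kd-tree in which every node stores the bounding box of its subtree; the Node layout, the median-split build, and the leaf size are illustrative assumptions, not the lecture's exact data structure. The search keeps a bounded max-heap of the k best candidates, and the pruning test compares the kth-nearest distance found so far against the distance from the query to a node's bounding box.

```python
import heapq
import numpy as np

class Node:
    # Stores every point in its subtree so the bounding box is easy to compute.
    def __init__(self, points, indices):
        self.points, self.indices = points, indices
        self.left = self.right = None
        self.lo = points.min(axis=0)    # bounding-box corners of this subtree
        self.hi = points.max(axis=0)

def build(points, indices, leaf_size=10, depth=0):
    node = Node(points, indices)
    if len(points) > leaf_size:
        axis = depth % points.shape[1]  # cycle through split dimensions
        order = np.argsort(points[:, axis])
        mid = len(points) // 2          # median split
        node.left = build(points[order[:mid]], indices[order[:mid]], leaf_size, depth + 1)
        node.right = build(points[order[mid:]], indices[order[mid:]], leaf_size, depth + 1)
    return node

def dist_to_box(q, node):
    # Euclidean distance from query q to the node's bounding box (0 if q is inside).
    return np.linalg.norm(np.maximum(0.0, np.maximum(node.lo - q, q - node.hi)))

def knn_search(node, q, k, heap):
    # heap holds (-distance, index), so -heap[0][0] is the kth-nearest distance so far.
    if len(heap) == k and -heap[0][0] < dist_to_box(q, node):
        return                          # prune: this box can't beat our kth-best
    if node.left is None:               # leaf: scan its points
        for p, i in zip(node.points, node.indices):
            d = np.linalg.norm(p - q)
            if len(heap) < k:
                heapq.heappush(heap, (-d, i))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, i))
        return
    near, far = sorted((node.left, node.right), key=lambda c: dist_to_box(q, c))
    knn_search(near, q, k, heap)        # search the closer box first...
    knn_search(far, q, k, heap)         # ...then backtrack into the other

rng = np.random.default_rng(0)
X = rng.random((500, 2))
root = build(X, np.arange(len(X)))
heap = []
knn_search(root, rng.random(2), k=2, heap=heap)
print(sorted((-nd, i) for nd, i in heap))  # (distance, index) pairs, nearest first
```

Note that the prune test is exactly the rule from the lecture: a box is skipped only when even its nearest point is farther away than the kth nearest neighbor found so far, and until we have k candidates, nothing is pruned.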