Let's now discuss how KD-trees can aid in efficiently performing nearest neighbor search.

To start with, we have our query point, shown with this green circle here. So this is our query point. The first thing we're going to do is traverse the tree all the way down to the leaf node that contains that query point. What that means is, we're going to check the coordinates of this point against the condition at each one of these internal nodes until we get to a leaf node.

Okay, so to begin with, we check the first condition. Remember, it was a split on x1: whether it was greater than or less than 0.5. In this case, x1 for this point is greater than 0.5. So we have that the x1 coordinate of our query point is greater than 0.5. So we say yes, and we traverse down this side of the tree. Then we check the next condition, and this was a condition on whether x2 was greater than some value; I'll just call it v. So let's call this point here v. And here we see that, yes, x2 is greater than v. So, again, we have a yes, and we traverse down this way. And we keep doing that until eventually we get to a leaf node.

Okay, now once we're at this leaf node, the first thing we do is start searching for nearest neighbors, because our hope is that this box, local to our point, might be a likely place to find the nearest neighbor. So we compute the distances to each point contained in this box. This node has a list of points: the query point falls in this box along with these two other points here, and we compute those distances. That's what's stated right here on this slide.

Now that we've computed these distances, what we do is record the distance to the nearest neighbor found so far. So let's call this r, the distance to the nearest neighbor found so far.

Okay, then a natural question is: is that it? Are we done? Does the nearest neighbor live within the blue box that we just searched? No, not necessarily, right?
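To make that traversal step concrete, here is a minimal Python sketch of descending a KD-tree to the leaf containing a query point and recording r within that leaf. The node layout (split_dim, split_value, left, right, points) is a hypothetical one chosen for illustration, not the implementation used in this course.

```python
import math

# Hypothetical KD-tree node layout, for illustration only.
class Node:
    def __init__(self, split_dim=None, split_value=None,
                 left=None, right=None, points=None):
        self.split_dim = split_dim      # e.g. 0 for x1, 1 for x2
        self.split_value = split_value  # e.g. 0.5 at the root
        self.left = left                # subtree with coordinate <= split_value
        self.right = right              # subtree with coordinate > split_value
        self.points = points            # data points stored at a leaf, else None

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def descend_to_leaf(node, query):
    """Check the split condition at each internal node until we hit a leaf."""
    while node.points is None:                        # still at an internal node
        if query[node.split_dim] > node.split_value:  # "yes" branch
            node = node.right
        else:                                         # "no" branch
            node = node.left
    return node

def search_leaf(leaf, query):
    """Compute distances to every point in the leaf's box and record r,
    the distance to the nearest neighbor found so far."""
    best = min(leaf.points, key=lambda p: euclidean(p, query))
    return best, euclidean(best, query)
```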
That should be obvious, because now that we've drawn this fuchsia circle, we see that there are other points closer to this green point that live in other neighboring boxes. So what we're going to do is continue searching. We're going to go from this node that we already examined, so that one's already done, and we're going to backtrack: we go back up the tree and start checking out other branches.

Okay, so then we go down to another leaf node. So we checked this one, and now we've traversed here to this leaf. Now we compute distances to all of these points and compare them to the distance to our nearest neighbor found so far. And in this case, we have that the distance to a point in this leaf is smaller than what we had before. So this becomes our new nearest neighbor, and this becomes our new distance to the nearest neighbor. And then we continue searching. So, let me just write this: this becomes our new distance to the nearest neighbor.

And when we continue searching, we've now checked this node, we've checked this node, and we're going to go and check this other node. But what this blue box here shows is the tight bounding box for that node. Remember, it's the box that contains all the points within that node. And what we see is that the distance to that box is greater than the distance to our nearest neighbor found so far. So, the distance to the bounding box is greater than the distance to the nearest neighbor so far. And what this implies is that we're going to prune this node. We're not even going to compute the distance to any point in this leaf node, because we know that no point in there can be a nearest neighbor closer than the nearest neighbor that we found so far.

Okay, now, what have we done? We've checked this node. We've checked this node. We've pruned this node. We go up and we start going down this other branch. What happens here?
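Putting the backtracking and pruning together, a rough sketch of the full search might look like the following. It reuses the hypothetical Node layout from the sketch above and additionally assumes each node stores its tight bounding box as node.box = (mins, maxs); none of these names come from the lecture.

```python
import math

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def dist_to_box(query, box):
    """Smallest distance from the query to an axis-aligned bounding box."""
    mins, maxs = box
    return math.sqrt(sum(max(lo - q, 0.0, q - hi) ** 2
                         for q, lo, hi in zip(query, mins, maxs)))

def nn_search(node, query, best=None, best_dist=float("inf")):
    # Prune: no point inside this box can beat the nearest neighbor so far.
    if dist_to_box(query, node.box) >= best_dist:
        return best, best_dist
    if node.points is not None:                 # leaf: check every point in it
        for p in node.points:
            d = dist(p, query)
            if d < best_dist:
                best, best_dist = p, d
        return best, best_dist
    # Internal node: search the side containing the query first, then
    # backtrack into the other side (which the check above may prune).
    if query[node.split_dim] > node.split_value:
        near, far = node.right, node.left
    else:
        near, far = node.left, node.right
    best, best_dist = nn_search(near, query, best, best_dist)
    best, best_dist = nn_search(far, query, best, best_dist)
    return best, best_dist
```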
Again, the distance to this box exceeds the distance to our nearest neighbor, so we prune. Then what we're going to do is go up and start searching the other branch, but all the other boxes that we have exceed the distance to our nearest neighbor. So, what we see is that we've been able to prune quite a lot of the search space. Let's just go back and look at all these points. In the brute force method, we would have had to compute distances to all of those points. But here we only had to compute distances to a small number of data points in this example to find our nearest neighbor.

So, this is why KD-trees are so useful in performing efficient nearest neighbor search: when you have lots and lots of observations, and especially if you're going to be performing lots and lots of queries. Because you do have to build the tree, and there's some computational cost to building this KD-tree. But if you're going to be using it again and again, it can actually provide significant savings.
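As an illustration of the build-once, query-many-times point, here is a small example using SciPy's KD-tree implementation (scipy.spatial.cKDTree); the random data is purely illustrative and not from the lecture.

```python
import numpy as np
from scipy.spatial import cKDTree

# Pay the construction cost once, over the full dataset.
data = np.random.rand(100_000, 2)   # 100k points in 2-D, made up for this example
tree = cKDTree(data)

# Then answer many queries against the same tree.
queries = np.random.rand(1_000, 2)
distances, indices = tree.query(queries, k=1)   # nearest neighbor for each query
print(distances[:3], indices[:3])
```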