1
00:00:00,000 --> 00:00:03,975
[MUSIC]

2
00:00:03,975 --> 00:00:08,540
Well, luckily the answer is yes,
or often yes.

3
00:00:08,540 --> 00:00:11,640
And we're going to discuss a few
different ways of doing this.

4
00:00:11,640 --> 00:00:14,485
And the first thing that we're
going to talk about are KD-trees.

5
00:00:14,485 --> 00:00:20,386
And KD-trees are a specific data structure
for efficiently representing our data.

6
00:00:20,386 --> 00:00:25,230
In particular,
KD-trees provide an organization of

7
00:00:25,230 --> 00:00:28,690
our documents in terms of this
partitioning of our space.

8
00:00:28,690 --> 00:00:31,360
We're going to be making
these access aligned cuts,

9
00:00:31,360 --> 00:00:36,610
and maintaining lists of points that fall
into each one of these different bins.

10
00:00:36,610 --> 00:00:39,610
And what this structure allows
us to do as we're going to show,

11
00:00:39,610 --> 00:00:41,800
is efficiently prune our search space so

12
00:00:41,800 --> 00:00:46,410
that we don't have to visit every single
data point, For every query, necessarily,

13
00:00:46,410 --> 00:00:50,480
sometimes we'll have to but
hopefully in many cases we won't have to.

14
00:00:52,140 --> 00:00:58,230
And these methods, these KD-trees work
really well in low to medium dimensions

15
00:00:59,930 --> 00:01:04,980
meaning how many features we have and
we'll return to this idea a bit later.

16
00:01:06,550 --> 00:01:10,360
To start with,
let's talk about the KD-tree construction.

17
00:01:10,360 --> 00:01:13,620
So what we're going to do is we're
going to take our data table.

18
00:01:13,620 --> 00:01:17,230
So we have in this example
just two different features.

19
00:01:17,230 --> 00:01:24,900
So we have feature one,
and this is feature two.

20
00:01:24,900 --> 00:01:29,450
So maybe this is just the first
word in our vocabulary and

21
00:01:29,450 --> 00:01:36,240
this is the second word, and then here
are the indices of our data points.

22
00:01:37,310 --> 00:01:39,298
So these are our observation indices.

23
00:01:42,954 --> 00:01:47,210
And these points are displayed
just in r2 here.

24
00:01:47,210 --> 00:01:52,432
So this is, Feature 1 and

25
00:01:52,432 --> 00:01:57,510
this is, Feature 2.

26
00:01:57,510 --> 00:02:02,600
Okay, so the first thing that we
do is we're going to split our

27
00:02:02,600 --> 00:02:05,220
data table into two different groups.

28
00:02:05,220 --> 00:02:08,740
And the way we are going to do that
is by choosing a split dimension.

29
00:02:08,740 --> 00:02:12,540
So which feature are we splitting on and
the split value so

30
00:02:12,540 --> 00:02:14,950
what's the threshold for the split.

31
00:02:14,950 --> 00:02:20,380
So here in this example we
chose split value as 0.5 and

32
00:02:20,380 --> 00:02:23,260
we are splitting on the first feature.

33
00:02:23,260 --> 00:02:28,620
So everything to the right
are cases where this first feature

34
00:02:28,620 --> 00:02:35,380
takes values greater than 0.5, and
everything to the left are cases where

35
00:02:35,380 --> 00:02:40,430
the first feature takes values
less than or equal to 0.5.

36
00:02:40,430 --> 00:02:46,440
Okay, so if we look at datapoint,
and the first feature has a value

37
00:02:46,440 --> 00:02:51,500
greater than 0.5, we're going to place
it in the yes table, the table to right,

38
00:02:51,500 --> 00:02:54,430
otherwise it's going to go into
the no table, the table to the left.

39
00:02:56,450 --> 00:02:59,350
Then, what we're going to do is
we're simply going to recurse

40
00:02:59,350 --> 00:03:01,890
on each of these different
groups separately.

41
00:03:01,890 --> 00:03:06,530
So we're going to continue to split
into two groups on each side.

42
00:03:06,530 --> 00:03:11,800
So remember this was 0.5
looking at our split.

43
00:03:11,800 --> 00:03:15,284
Now what we're going to do is
we're going to choose a new split.

44
00:03:19,128 --> 00:03:23,005
Where first we choose a split dimension
where in this case we're choosing

45
00:03:23,005 --> 00:03:23,761
feature two.

46
00:03:23,761 --> 00:03:27,834
And a split value in this case
we're choosing the value 0.1.

47
00:03:27,834 --> 00:03:33,216
So here's the value 0.1 and
here, these are cases

48
00:03:33,216 --> 00:03:38,239
where the second feature
is greater than 0.1,

49
00:03:38,239 --> 00:03:46,626
and here, these are cases where the second
feature is less than or equal to 0.1.

50
00:03:46,626 --> 00:03:55,050
And so to be clear, all of these points
here will appear on this list here.

51
00:03:55,050 --> 00:03:58,746
They're the ones were x1
is less than 0.5 they

52
00:03:58,746 --> 00:04:02,804
fall on the left hand side
of this first vertical line.

53
00:04:02,804 --> 00:04:07,332
And they also points where x2
at second feature dimension,

54
00:04:07,332 --> 00:04:12,860
is less than 0.1, so they fall below
that horizontal cut that we made.

55
00:04:14,000 --> 00:04:18,690
So the answers to these questions were no,
no, and so they fall into this table here.

56
00:04:20,430 --> 00:04:23,970
Okay, so we continue doing this process,
splitting, splitting,

57
00:04:23,970 --> 00:04:29,270
splitting, splitting, and
this creates a binary structured tree.

58
00:04:29,270 --> 00:04:33,840
So, at the leaf

59
00:04:33,840 --> 00:04:38,590
nodes of this tree we're going to have a
list of points that fall within that bin.

60
00:04:38,590 --> 00:04:43,970
Okay so, there's going to be list of
points in each of this leaf nodes

61
00:04:46,150 --> 00:04:49,030
representing the different
observation within each bin.

62
00:04:50,860 --> 00:04:56,290
So to be clear,
if we trace our way down the tree to

63
00:04:56,290 --> 00:05:01,990
any given leaf node the points that
appear here will satisfy all conditions.

64
00:05:03,300 --> 00:05:08,734
That were specified at each one

65
00:05:08,734 --> 00:05:15,004
of the splits above along the path so

66
00:05:15,004 --> 00:05:21,274
points here satisfy all conditions

67
00:05:21,274 --> 00:05:26,290
down the tree to this point.

68
00:05:31,137 --> 00:05:34,824
So we are going to keep one extra piece
of information, that's going to be very

69
00:05:34,824 --> 00:05:38,380
helpful when we are going to do
our nearest neighbor search.

70
00:05:38,380 --> 00:05:42,680
So just to be clear at any given

71
00:05:42,680 --> 00:05:46,740
node in a tree we are going to
store the following information.

72
00:05:46,740 --> 00:05:51,780
First we're going to store
which dimension we split on.

73
00:05:53,490 --> 00:05:56,410
We're also going to store the split value.

74
00:06:00,830 --> 00:06:04,300
So, where did we set that split threshold.

75
00:06:04,300 --> 00:06:10,920
And the third thing we're going to store
is the Titus bounding box, so the box

76
00:06:12,640 --> 00:06:19,240
that's as small as possible that contains
all observations within that node

77
00:06:21,080 --> 00:06:24,640
or that satisfy
the conditions to that node.

78
00:06:24,640 --> 00:06:30,670
So this will be bounding

79
00:06:30,670 --> 00:06:35,493
box that is as small

80
00:06:35,493 --> 00:06:40,015
as possible while

81
00:06:40,015 --> 00:06:45,449
containing points.

82
00:06:48,440 --> 00:06:49,890
Okay, so that's this.

83
00:06:49,890 --> 00:06:51,920
This is number three, above.

84
00:06:54,810 --> 00:06:57,280
Okay, so hopefully that's clear.

85
00:06:57,280 --> 00:07:02,860
It's a very, very intuitive type
of structure for storing the data.

86
00:07:04,650 --> 00:07:06,970
But there are lots of important choices,

87
00:07:06,970 --> 00:07:09,360
it's kind of been
the theme of this module.

88
00:07:09,360 --> 00:07:12,900
And it continues to be true even when
we're talking about these data structures

89
00:07:12,900 --> 00:07:14,809
like KD-trees.

90
00:07:14,809 --> 00:07:21,130
So examples include how are we going to
figure out which dimension to split on,

91
00:07:21,130 --> 00:07:24,450
what value to split on, and
when are we going to stop splitting?

92
00:07:26,340 --> 00:07:29,490
And in practice people just
use heuristics for this.

93
00:07:29,490 --> 00:07:30,910
So one example for

94
00:07:30,910 --> 00:07:35,560
which dimension to split on we can split
on the one that's the widest dimension.

95
00:07:36,610 --> 00:07:40,830
So if we think about any box and we're
looking at it let's say it's just a box in

96
00:07:40,830 --> 00:07:47,030
r2 We look at how big x2 is versus x1,

97
00:07:47,030 --> 00:07:51,060
and if x1 is bigger than x2,
then maybe we choose to split on x1.

98
00:07:52,160 --> 00:07:54,878
Or, you can think about
alternating dimensions.

99
00:07:58,391 --> 00:08:04,227
Then, when we think about what value
we split at, One option is to split

100
00:08:04,227 --> 00:08:10,090
at the median value of the observations
that are contained in that box.

101
00:08:12,530 --> 00:08:15,081
Or you could split at
the center point of the box,

102
00:08:15,081 --> 00:08:17,394
ignoring the spread of
data within the box.

103
00:08:26,034 --> 00:08:29,210
Then a question is when do you stop?

104
00:08:29,210 --> 00:08:32,200
And here, there are a couple
of choices that we have.

105
00:08:32,200 --> 00:08:38,830
One is you can stop if there are fewer
than a given number of points in the box.

106
00:08:41,640 --> 00:08:46,150
Let's say m data points left, or

107
00:08:46,150 --> 00:08:53,770
alternatively, if a minimum width
to the box has been achieved.

108
00:08:56,000 --> 00:09:00,515
So again, this second criteria would
ignore the actual data in the box.

109
00:09:00,515 --> 00:09:06,160
Whereas the first one uses facts about
the data to drive the stocking criterion.

110
00:09:07,620 --> 00:09:11,098
So as an example of how
important theses choices can be,

111
00:09:11,098 --> 00:09:15,760
here's an example where depending on
the distribution of data in the space

112
00:09:15,760 --> 00:09:20,496
whether we look at this median heuristic,
for where to put this weak point or

113
00:09:20,496 --> 00:09:25,765
we just use a center-of-range can give
new and really different data structures.

114
00:09:25,765 --> 00:09:29,227
And we're going to show how this can
have an implications for the number of

115
00:09:29,227 --> 00:09:32,875
searches that we have to do when we're
doing our nearest neighbor search.

116
00:09:32,875 --> 00:09:38,479
[MUSIC]