1 00:00:00,000 --> 00:00:03,975 [MUSIC] 2 00:00:03,975 --> 00:00:08,540 Well, luckily the answer is yes, or often yes. 3 00:00:08,540 --> 00:00:11,640 And we're going to discuss a few different ways of doing this. 4 00:00:11,640 --> 00:00:14,485 And the first thing that we're going to talk about are KD-trees. 5 00:00:14,485 --> 00:00:20,386 And KD-trees are a specific data structure for efficiently representing our data. 6 00:00:20,386 --> 00:00:25,230 In particular, KD-trees provide an organization of 7 00:00:25,230 --> 00:00:28,690 our documents in terms of this partitioning of our space. 8 00:00:28,690 --> 00:00:31,360 We're going to be making these access aligned cuts, 9 00:00:31,360 --> 00:00:36,610 and maintaining lists of points that fall into each one of these different bins. 10 00:00:36,610 --> 00:00:39,610 And what this structure allows us to do as we're going to show, 11 00:00:39,610 --> 00:00:41,800 is efficiently prune our search space so 12 00:00:41,800 --> 00:00:46,410 that we don't have to visit every single data point, For every query, necessarily, 13 00:00:46,410 --> 00:00:50,480 sometimes we'll have to but hopefully in many cases we won't have to. 14 00:00:52,140 --> 00:00:58,230 And these methods, these KD-trees work really well in low to medium dimensions 15 00:00:59,930 --> 00:01:04,980 meaning how many features we have and we'll return to this idea a bit later. 16 00:01:06,550 --> 00:01:10,360 To start with, let's talk about the KD-tree construction. 17 00:01:10,360 --> 00:01:13,620 So what we're going to do is we're going to take our data table. 18 00:01:13,620 --> 00:01:17,230 So we have in this example just two different features. 19 00:01:17,230 --> 00:01:24,900 So we have feature one, and this is feature two. 20 00:01:24,900 --> 00:01:29,450 So maybe this is just the first word in our vocabulary and 21 00:01:29,450 --> 00:01:36,240 this is the second word, and then here are the indices of our data points. 22 00:01:37,310 --> 00:01:39,298 So these are our observation indices. 23 00:01:42,954 --> 00:01:47,210 And these points are displayed just in r2 here. 24 00:01:47,210 --> 00:01:52,432 So this is, Feature 1 and 25 00:01:52,432 --> 00:01:57,510 this is, Feature 2. 26 00:01:57,510 --> 00:02:02,600 Okay, so the first thing that we do is we're going to split our 27 00:02:02,600 --> 00:02:05,220 data table into two different groups. 28 00:02:05,220 --> 00:02:08,740 And the way we are going to do that is by choosing a split dimension. 29 00:02:08,740 --> 00:02:12,540 So which feature are we splitting on and the split value so 30 00:02:12,540 --> 00:02:14,950 what's the threshold for the split. 31 00:02:14,950 --> 00:02:20,380 So here in this example we chose split value as 0.5 and 32 00:02:20,380 --> 00:02:23,260 we are splitting on the first feature. 33 00:02:23,260 --> 00:02:28,620 So everything to the right are cases where this first feature 34 00:02:28,620 --> 00:02:35,380 takes values greater than 0.5, and everything to the left are cases where 35 00:02:35,380 --> 00:02:40,430 the first feature takes values less than or equal to 0.5. 36 00:02:40,430 --> 00:02:46,440 Okay, so if we look at datapoint, and the first feature has a value 37 00:02:46,440 --> 00:02:51,500 greater than 0.5, we're going to place it in the yes table, the table to right, 38 00:02:51,500 --> 00:02:54,430 otherwise it's going to go into the no table, the table to the left. 39 00:02:56,450 --> 00:02:59,350 Then, what we're going to do is we're simply going to recurse 40 00:02:59,350 --> 00:03:01,890 on each of these different groups separately. 41 00:03:01,890 --> 00:03:06,530 So we're going to continue to split into two groups on each side. 42 00:03:06,530 --> 00:03:11,800 So remember this was 0.5 looking at our split. 43 00:03:11,800 --> 00:03:15,284 Now what we're going to do is we're going to choose a new split. 44 00:03:19,128 --> 00:03:23,005 Where first we choose a split dimension where in this case we're choosing 45 00:03:23,005 --> 00:03:23,761 feature two. 46 00:03:23,761 --> 00:03:27,834 And a split value in this case we're choosing the value 0.1. 47 00:03:27,834 --> 00:03:33,216 So here's the value 0.1 and here, these are cases 48 00:03:33,216 --> 00:03:38,239 where the second feature is greater than 0.1, 49 00:03:38,239 --> 00:03:46,626 and here, these are cases where the second feature is less than or equal to 0.1. 50 00:03:46,626 --> 00:03:55,050 And so to be clear, all of these points here will appear on this list here. 51 00:03:55,050 --> 00:03:58,746 They're the ones were x1 is less than 0.5 they 52 00:03:58,746 --> 00:04:02,804 fall on the left hand side of this first vertical line. 53 00:04:02,804 --> 00:04:07,332 And they also points where x2 at second feature dimension, 54 00:04:07,332 --> 00:04:12,860 is less than 0.1, so they fall below that horizontal cut that we made. 55 00:04:14,000 --> 00:04:18,690 So the answers to these questions were no, no, and so they fall into this table here. 56 00:04:20,430 --> 00:04:23,970 Okay, so we continue doing this process, splitting, splitting, 57 00:04:23,970 --> 00:04:29,270 splitting, splitting, and this creates a binary structured tree. 58 00:04:29,270 --> 00:04:33,840 So, at the leaf 59 00:04:33,840 --> 00:04:38,590 nodes of this tree we're going to have a list of points that fall within that bin. 60 00:04:38,590 --> 00:04:43,970 Okay so, there's going to be list of points in each of this leaf nodes 61 00:04:46,150 --> 00:04:49,030 representing the different observation within each bin. 62 00:04:50,860 --> 00:04:56,290 So to be clear, if we trace our way down the tree to 63 00:04:56,290 --> 00:05:01,990 any given leaf node the points that appear here will satisfy all conditions. 64 00:05:03,300 --> 00:05:08,734 That were specified at each one 65 00:05:08,734 --> 00:05:15,004 of the splits above along the path so 66 00:05:15,004 --> 00:05:21,274 points here satisfy all conditions 67 00:05:21,274 --> 00:05:26,290 down the tree to this point. 68 00:05:31,137 --> 00:05:34,824 So we are going to keep one extra piece of information, that's going to be very 69 00:05:34,824 --> 00:05:38,380 helpful when we are going to do our nearest neighbor search. 70 00:05:38,380 --> 00:05:42,680 So just to be clear at any given 71 00:05:42,680 --> 00:05:46,740 node in a tree we are going to store the following information. 72 00:05:46,740 --> 00:05:51,780 First we're going to store which dimension we split on. 73 00:05:53,490 --> 00:05:56,410 We're also going to store the split value. 74 00:06:00,830 --> 00:06:04,300 So, where did we set that split threshold. 75 00:06:04,300 --> 00:06:10,920 And the third thing we're going to store is the Titus bounding box, so the box 76 00:06:12,640 --> 00:06:19,240 that's as small as possible that contains all observations within that node 77 00:06:21,080 --> 00:06:24,640 or that satisfy the conditions to that node. 78 00:06:24,640 --> 00:06:30,670 So this will be bounding 79 00:06:30,670 --> 00:06:35,493 box that is as small 80 00:06:35,493 --> 00:06:40,015 as possible while 81 00:06:40,015 --> 00:06:45,449 containing points. 82 00:06:48,440 --> 00:06:49,890 Okay, so that's this. 83 00:06:49,890 --> 00:06:51,920 This is number three, above. 84 00:06:54,810 --> 00:06:57,280 Okay, so hopefully that's clear. 85 00:06:57,280 --> 00:07:02,860 It's a very, very intuitive type of structure for storing the data. 86 00:07:04,650 --> 00:07:06,970 But there are lots of important choices, 87 00:07:06,970 --> 00:07:09,360 it's kind of been the theme of this module. 88 00:07:09,360 --> 00:07:12,900 And it continues to be true even when we're talking about these data structures 89 00:07:12,900 --> 00:07:14,809 like KD-trees. 90 00:07:14,809 --> 00:07:21,130 So examples include how are we going to figure out which dimension to split on, 91 00:07:21,130 --> 00:07:24,450 what value to split on, and when are we going to stop splitting? 92 00:07:26,340 --> 00:07:29,490 And in practice people just use heuristics for this. 93 00:07:29,490 --> 00:07:30,910 So one example for 94 00:07:30,910 --> 00:07:35,560 which dimension to split on we can split on the one that's the widest dimension. 95 00:07:36,610 --> 00:07:40,830 So if we think about any box and we're looking at it let's say it's just a box in 96 00:07:40,830 --> 00:07:47,030 r2 We look at how big x2 is versus x1, 97 00:07:47,030 --> 00:07:51,060 and if x1 is bigger than x2, then maybe we choose to split on x1. 98 00:07:52,160 --> 00:07:54,878 Or, you can think about alternating dimensions. 99 00:07:58,391 --> 00:08:04,227 Then, when we think about what value we split at, One option is to split 100 00:08:04,227 --> 00:08:10,090 at the median value of the observations that are contained in that box. 101 00:08:12,530 --> 00:08:15,081 Or you could split at the center point of the box, 102 00:08:15,081 --> 00:08:17,394 ignoring the spread of data within the box. 103 00:08:26,034 --> 00:08:29,210 Then a question is when do you stop? 104 00:08:29,210 --> 00:08:32,200 And here, there are a couple of choices that we have. 105 00:08:32,200 --> 00:08:38,830 One is you can stop if there are fewer than a given number of points in the box. 106 00:08:41,640 --> 00:08:46,150 Let's say m data points left, or 107 00:08:46,150 --> 00:08:53,770 alternatively, if a minimum width to the box has been achieved. 108 00:08:56,000 --> 00:09:00,515 So again, this second criteria would ignore the actual data in the box. 109 00:09:00,515 --> 00:09:06,160 Whereas the first one uses facts about the data to drive the stocking criterion. 110 00:09:07,620 --> 00:09:11,098 So as an example of how important theses choices can be, 111 00:09:11,098 --> 00:09:15,760 here's an example where depending on the distribution of data in the space 112 00:09:15,760 --> 00:09:20,496 whether we look at this median heuristic, for where to put this weak point or 113 00:09:20,496 --> 00:09:25,765 we just use a center-of-range can give new and really different data structures. 114 00:09:25,765 --> 00:09:29,227 And we're going to show how this can have an implications for the number of 115 00:09:29,227 --> 00:09:32,875 searches that we have to do when we're doing our nearest neighbor search. 116 00:09:32,875 --> 00:09:38,479 [MUSIC]