So the last way in which we're going to use KD-trees for nearest neighbor search is for approximate nearest neighbor search. This is something we're going to spend a lot of time on in the next section, but the idea is really important, because so far everything we've talked about assumes that, out of all the articles out there, you want to find the nearest neighbor. That assumption relies heavily on believing that it's important to find the exact nearest neighbor and not just a good alternative, and also on really trusting your representation of the data, as well as the distance measure you're using to compute these similarities. Because if you don't have complete faith in those to begin with, what's the difference between finding the nearest neighbor and finding one that's pretty close to it?

So the question is: can we do things even more efficiently if we don't insist on finding, according to our setup, the exact nearest neighbor, but just a pretty good approximation?

Within the context of KD-trees, the way we can do approximate nearest neighbor search is as follows. Remember that before, we maintained the distance to the nearest neighbor found so far; call that distance r. Now, instead of pruning based on r, we're going to prune more aggressively. That's what gives us the added efficiency, at the cost of an approximation. To prune more aggressively, instead of pruning a bounding box when its distance to the query exceeds r, we prune it when that distance exceeds r / alpha, where alpha is some number greater than 1. Since r / alpha is a smaller radius, it's more likely that the distance to a bounding box exceeds r / alpha than that it exceeds r, so more boxes get pruned.

Okay, so what are the properties of performing nearest neighbor search with this more aggressive pruning? Well, we know that if we return a neighbor at some distance r, then there is no data point within distance r / alpha of the query.
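To make the pruning rule concrete, here is a minimal sketch of what this could look like in Python. This is not the course's code; it assumes Euclidean distance, median splits that cycle through the dimensions, and it uses the distance from the query to the splitting plane as the lower bound on the distance to the far subtree. All of the names here (Node, build, approx_nn, alpha, leaf_size) are just illustrative.

```python
import numpy as np

class Node:
    """A KD-tree node: either an internal split or a leaf holding points."""
    def __init__(self, points=None, dim=None, value=None, left=None, right=None):
        self.points = points   # leaf only: the data points stored here
        self.dim = dim         # internal only: splitting dimension
        self.value = value     # internal only: splitting value
        self.left = left       # subtree with x[dim] <= value
        self.right = right     # subtree with x[dim] >  value

def build(points, leaf_size=5, depth=0):
    """Build a KD-tree by cycling through dimensions and splitting at the median."""
    if len(points) <= leaf_size:
        return Node(points=points)
    dim = depth % points.shape[1]
    value = np.median(points[:, dim])
    mask = points[:, dim] <= value
    if mask.all() or not mask.any():   # degenerate split: stop and keep a leaf
        return Node(points=points)
    return Node(dim=dim, value=value,
                left=build(points[mask], leaf_size, depth + 1),
                right=build(points[~mask], leaf_size, depth + 1))

def approx_nn(node, query, alpha=2.0, best=(None, np.inf)):
    """Approximate 1-NN: prune a subtree when the query's distance to its
    splitting plane is at least r / alpha, where r is the distance to the
    best neighbor found so far."""
    best_point, r = best
    if node.points is not None:                       # leaf: scan its points
        dists = np.linalg.norm(node.points - query, axis=1)
        i = np.argmin(dists)
        return (node.points[i], dists[i]) if dists[i] < r else (best_point, r)
    # Descend first into the side of the split that contains the query.
    near, far = ((node.left, node.right) if query[node.dim] <= node.value
                 else (node.right, node.left))
    best_point, r = approx_nn(near, query, alpha, (best_point, r))
    # Exact search visits the far side whenever gap < r; pruning with the
    # tighter threshold r / alpha skips more of the tree.
    gap = abs(query[node.dim] - node.value)
    if gap < r / alpha:
        best_point, r = approx_nn(far, query, alpha, (best_point, r))
    return best_point, r

# Example usage on random 3-dimensional data.
data = np.random.rand(10000, 3)
tree = build(data)
neighbor, dist = approx_nn(tree, np.random.rand(3), alpha=2.0)
```

With alpha = 1 the test gap < r / alpha is the usual bounding test, so the search is exact; larger values of alpha skip more of the tree and return a neighbor that is at most alpha times farther away than the true nearest neighbor.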
To picture this, let's say this is our query point xq, this is the distance r to the nearest neighbor we're going to return, and, drawn in a different color to make it clear, this smaller circle is r / alpha. There could be points in the ring between r / alpha and r; those would all be data points that are actually closer than the nearest neighbor we returned. But we know that there are no points inside the r / alpha circle, because otherwise we would have found them and that would have been our distance to the nearest neighbor. And we're pruning everything that falls farther away than that.

Okay, so by doing this we get a fairly good approximation. In fact, the bound we're stating here is rather loose, and in practice we're often much closer to the optimal solution. And importantly, it saves a ton of searching, provided of course that alpha is reasonably large. If alpha is 1, it defaults back to our exact nearest neighbor search.

Okay, so in a lot of cases, doing approximate nearest neighbor search is pretty much just as good as what you would get with brute force, because there's so much noise in how you're computing distances to begin with. And think about whether exactness is really relevant: if I'm sitting here reading an article, whatever you present me as a pretty good approximation to the nearest neighbor is probably going to satisfy my needs for exploring other literature.

Okay, so we're going to talk a lot more about alternative ways to do approximate nearest neighbor search, ways that also scale better with higher dimensions. But within the context of KD-trees, this is a very effective way of performing approximate nearest neighbor search.

So, to wrap up on KD-trees, first I want to say that they really are a very cool and efficient data structure. But I also want to discuss a few other things. One is the fact that there are lots and lots of variants. We talked about some of these in terms of heuristics for how to choose the splitting dimension and the splitting value, and all sorts of other things. In addition, instead of making these axis-aligned cuts, there are other ways of partitioning the space, such as ball trees and other shapes you can plop down.
And what we're going to talk about next can be thought of as one of these alternatives to axis-aligned cuts. The other thing we've seen so far is that the performance of nearest neighbor search relies very, very heavily on how you represent your data and how you compute the distance between two data points.

And for both of these, KD-trees and nearest neighbor search, things become much, much more challenging in high-dimensional spaces. For KD-trees, we talked about the fact that the number of inspections you have to do can be exponential in the dimensionality. So, as a rule of thumb, KD-trees are typically only useful if the number of observations you have is much, much larger than 2^d, 2 to the number of dimensions; with d = 20, for example, that's already about a million points. So in high-dimensional cases, you often don't have enough observations to make a KD-tree worthwhile.

And in terms of nearest neighbor search itself, setting aside the challenges specific to KD-trees, things are really hard in high-dimensional spaces, because in a really high-dimensional feature vector a lot of the features are going to be basically just noise, and that noise can swamp your distance calculations. What happens is that in high-dimensional spaces, basically everything ends up far away from everything else. So you really need techniques to learn which features are important, or how to weight the different features. And as we mentioned before, that's a really, really challenging task that requires a lot of domain knowledge, or a lot of thought and exploration.

And lastly, I really want to thank Andrew Moore for the outline of these slides. The slides on KD-trees were built very heavily from a really nice presentation that he put together on KD-trees. So thanks for the permission to use that.