KD-trees are one approach for efficient nearest neighbor search, but as we mentioned, there are other approaches we could consider. So let's look at one of these, called locality sensitive hashing.

Why might we want to consider an approach other than KD-trees? Well, KD-trees are really cool. They're a very intuitive way to think about storing data, and as we saw, they can lead to some significant efficiencies in performing nearest neighbor search. But there are a few issues. One is that KD-trees are not the simplest things to implement: you have to build out this tree structure, and that can be pretty challenging. The other, as we saw, is that as the dimensionality of our data increases, we can get really poor performance in terms of the efficiency of nearest neighbor search using KD-trees.

So let's talk about this a little bit more. What happens in high dimensions is that, basically, every point is far away from every other point. This is something we mentioned before. If we think about our KD-tree partitioning, remember that in a high-dimensional space the partitions form hypercubes. Given a query point, it's unlikely there will be any point within its hypercube, or even close by. So we search around, find some point, and that becomes our nearest neighbor so far in the algorithm we specified. And remember, we then draw a radius to that nearest neighbor so far, which in high dimensions defines a big hypersphere, and we can only prune a bin if the distance to that bin is greater than the distance to the nearest neighbor so far. But what happens in high dimensions is that this big sphere, this radius to our nearest neighbor so far, intersects lots and lots of these hypercubes in at least one dimension. So we can't prune them, and we end up having to search many, many partitions. In fact, you can show under some conditions that you actually need to search over 2^d different hypercubes, or little partitions, which completely defeats the purpose of the efficiencies gained by KD-trees, unless n, the number of observations, is enormously large, much, much greater than 2^d.
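To make this "every point is far away" effect concrete, here is a minimal sketch, not from the lecture, assuming NumPy; the dataset size n = 1000, the uniform data, and the dimensions tested are arbitrary illustrative choices. It shows that as d grows, the distance to the nearest neighbor becomes nearly as large as the distance to a typical point, which is why the pruning radius around the nearest neighbor so far ends up intersecting almost every partition.

```python
# Sketch: why KD-tree pruning breaks down in high dimensions.
# As d grows, the nearest point is barely closer than an average point,
# so the pruning radius overlaps almost every bin.
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # hypothetical dataset size

for d in [2, 10, 50, 200]:
    data = rng.random((n, d))      # points uniform in the unit hypercube
    query = rng.random(d)          # a random query point
    dists = np.linalg.norm(data - query, axis=1)
    nearest = dists.min()          # radius to the nearest neighbor
    typical = np.median(dists)     # distance to a "typical" point
    print(f"d={d:4d}  nearest={nearest:6.3f}  "
          f"median={typical:6.3f}  ratio={nearest / typical:.2f}")
# The ratio approaches 1 as d increases: the pruning radius is nearly as
# large as the distance to everything, so few partitions can be ruled out.
```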
So what can we do? Well, one thing we can think about doing is moving away from performing exact nearest neighbor search. This is something we discussed when we talked about KD-trees: we showed how we could use KD-trees to perform approximate nearest neighbor search. And that's going to be the focus of this section, because we're going to assume that we don't fully trust our distance metric as being exactly the way we want to characterize distances. And often, in a lot of applications, you just want to find a neighbor that's pretty close to the given query point. It doesn't necessarily have to be exactly the nearest neighbor, and for a lot of retrieval tasks that will actually provide enough performance for the given task. But here, in contrast to what we did for approximate nearest neighbor search with KD-trees, we're going to look at methods where we can provide a probabilistic guarantee on the approximation that we're making.
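As a rough illustration of what "pretty close" means, here is a toy sketch, again assuming NumPy and not any particular method from this course: it stands in for an approximate search by examining only a random 5% of the points, then compares the distance of the neighbor it finds against the true nearest distance from brute force.

```python
# Sketch: measuring the quality of an approximate nearest neighbor by the
# ratio of its distance to the exact nearest neighbor's distance.
import numpy as np

rng = np.random.default_rng(1)
n, d = 5000, 20                    # illustrative dataset size and dimension
data = rng.random((n, d))
query = rng.random(d)

dists = np.linalg.norm(data - query, axis=1)
exact_nn_dist = dists.min()        # exact nearest neighbor via brute force

# Stand-in for an approximate method: only examine a random 5% of the points.
candidates = rng.choice(n, size=n // 20, replace=False)
approx_nn_dist = dists[candidates].min()

print(f"exact distance : {exact_nn_dist:.3f}")
print(f"approx distance: {approx_nn_dist:.3f}")
print(f"ratio          : {approx_nn_dist / exact_nn_dist:.2f}")
# A ratio close to 1 means the approximate neighbor is "pretty close" to the
# true nearest neighbor, which is often all a retrieval task needs.
```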