Okay, but let's go back to this idea that, depending on the structure, or rather the distribution of the data in the space, you can actually get things looking closer to log N. So here's an example where we have a fairly uniform distribution of points across the space, and for a given query we've pruned a huge amount of the search space and only had to search two different boxes. Whereas in that more contrived example where our data lay on a circle, based on the center heuristic we used to create the splits, we end up having to search almost every node in the tree. So that looks a lot closer to order N, so not much savings in that case.

Okay, so like I said, to really offset the cost of creating this data structure, it's useful to use kd-trees in cases where you're doing lots and lots of queries. So let's imagine asking for the nearest neighbor of every article in our corpus, which means we're doing N queries, and let's look at the complexity of this. Under a brute-force search, remember, each query was order N complexity, so for N queries we're going to have something that's order N squared. For kd-trees, on the other hand, it's going to be somewhere in the range of order N log N if we get lucky, and on the order of N squared if we get unlucky, just like the brute-force search. But just to note, this represents potentially very large savings for large N.

Okay, so let's look at a little empirical study of the number of inspections we have to do, first as a function of the number of observations and then as a function of the dimension of the space we're working in. So here, as we increase the number of observations N, we see roughly a log N trend. But on the other hand, when we go to higher and higher dimensions (just to be clear, we're saying our observations x_i live in some R^d space, so that's what d represents), we see an exponential increase in the number of inspections we have to do: the number of distance calculations, or really the number of nodes in the tree we visit, to find our nearest neighbor. So large d is really a regime where using kd-trees becomes pretty pointless. A small sketch of both experiments follows below.
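To make this concrete, here's a minimal sketch of the two experiments just described, using SciPy's kd-tree implementation (scipy.spatial.cKDTree). The corpus size N, the dimensions tried, and the uniform random data are illustrative assumptions, not the lecture's exact setup.

```python
import time
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
N = 2000                               # corpus size: an arbitrary choice for the demo

def brute_force_all_nn(X):
    # One O(N) scan per query point, so O(N^2) work for all N queries.
    nearest = np.empty(len(X), dtype=int)
    for i, q in enumerate(X):
        dists = np.linalg.norm(X - q, axis=1)
        dists[i] = np.inf              # a point is not its own neighbor
        nearest[i] = np.argmin(dists)
    return nearest

for d in (2, 5, 20, 100):              # pruning weakens as d grows
    X = rng.random((N, d))             # roughly uniform points in R^d: the "lucky" case
    t0 = time.perf_counter()
    brute = brute_force_all_nn(X)
    t1 = time.perf_counter()
    tree = cKDTree(X)                  # build once, reuse for all N queries
    _, idx = tree.query(X, k=2)        # k=2: nearest point other than the query itself
    t2 = time.perf_counter()
    assert np.array_equal(brute, idx[:, 1])
    print(f"d={d:3d}  brute force {t1 - t0:.2f}s  kd-tree {t2 - t1:.2f}s")
```

On roughly uniform data the tree answers all N queries far faster than the quadratic scan, and the gap narrows as d grows, because the bounding-box pruning becomes less and less effective in high dimensions.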
And we'll return to this to talk about it a little bit more, and also to think about an alternative approach that will be more helpful to us in high-dimensional scenarios.

But before that, I just want to mention how we can use kd-trees to do k nearest neighbor search instead of one nearest neighbor search, and the extension is very intuitive and straightforward. Instead of just maintaining the distance to the nearest neighbor, what we're going to do is maintain the distance to the kth nearest neighbor found so far. So in this example, we're looking at a two nearest neighbor task, and this is the distance to our second nearest neighbor found so far after the first search we've done, the search of this first box here. And then, just as before, based on the distances found so far, we backtrack and go search other nodes, and we prune a node if the distance to that kth nearest neighbor is smaller than the distance to the bounding box of the node we're looking at. So the algorithm is almost identical; the only thing you would need to change in code is, instead of maintaining the distance to the nearest neighbor, maintain the distance to the kth nearest neighbor found so far. A sketch of that change follows below.
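Here's a minimal sketch of that modification, assuming a toy kd-tree in which every node stores the bounding box of its subtree; the Node layout, the median-split build, and the leaf size are illustrative assumptions, not the lecture's exact data structure. The search keeps a bounded max-heap of the k best candidates, and the pruning test compares the kth-nearest distance found so far against the distance from the query to a node's bounding box.

```python
import heapq
import numpy as np

class Node:
    # Stores every point in its subtree so the bounding box is easy to compute.
    def __init__(self, points, indices):
        self.points, self.indices = points, indices
        self.left = self.right = None
        self.lo = points.min(axis=0)    # bounding-box corners of this subtree
        self.hi = points.max(axis=0)

def build(points, indices, leaf_size=10, depth=0):
    node = Node(points, indices)
    if len(points) > leaf_size:
        axis = depth % points.shape[1]  # cycle through split dimensions
        order = np.argsort(points[:, axis])
        mid = len(points) // 2          # median split
        node.left = build(points[order[:mid]], indices[order[:mid]], leaf_size, depth + 1)
        node.right = build(points[order[mid:]], indices[order[mid:]], leaf_size, depth + 1)
    return node

def dist_to_box(q, node):
    # Euclidean distance from query q to the node's bounding box (0 if q is inside).
    return np.linalg.norm(np.maximum(0.0, np.maximum(node.lo - q, q - node.hi)))

def knn_search(node, q, k, heap):
    # heap holds (-distance, index), so -heap[0][0] is the kth-nearest distance so far.
    if len(heap) == k and -heap[0][0] < dist_to_box(q, node):
        return                          # prune: this box can't beat our kth-best
    if node.left is None:               # leaf: scan its points
        for p, i in zip(node.points, node.indices):
            d = np.linalg.norm(p - q)
            if len(heap) < k:
                heapq.heappush(heap, (-d, i))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, i))
        return
    near, far = sorted((node.left, node.right), key=lambda c: dist_to_box(q, c))
    knn_search(near, q, k, heap)        # search the closer box first...
    knn_search(far, q, k, heap)         # ...then backtrack into the other

rng = np.random.default_rng(0)
X = rng.random((500, 2))
root = build(X, np.arange(len(X)))
heap = []
knn_search(root, rng.random(2), k=2, heap=heap)
print(sorted((-nd, i) for nd, i in heap))  # (distance, index) pairs, nearest first
```

Note that the prune test is exactly the rule from the lecture: a box is skipped only when even its nearest point is farther away than the kth nearest neighbor found so far, and until we have k candidates, nothing is pruned.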