1 00:00:00,000 --> 00:00:04,157 [MUSIC] 2 00:00:04,157 --> 00:00:07,660 So if we end up with a nearly balanced binary tree. 3 00:00:07,660 --> 00:00:13,460 So a nearly balanced binary tree is something that looks mostly. 4 00:00:14,590 --> 00:00:16,380 So this is a binary tree. 5 00:00:16,380 --> 00:00:17,600 A balanced binary tree. 6 00:00:17,600 --> 00:00:18,890 This would be a balanced binary tree. 7 00:00:18,890 --> 00:00:24,190 There are the same number of leaves on the left and right-hand side. 8 00:00:24,190 --> 00:00:28,630 So we can start talking about what is the size and the depth of this tree. 9 00:00:28,630 --> 00:00:33,160 So if there are N total data points, and let's assume that we end up with just one 10 00:00:33,160 --> 00:00:38,350 observation at each leaf node, then how many nodes are there total in this tree? 11 00:00:40,260 --> 00:00:45,090 We can show that there are 2N- 1 nodes. 12 00:00:46,090 --> 00:00:47,369 And so let's see. 13 00:00:47,369 --> 00:00:54,110 If we had four data points, we have one, two, three, four, five, six, seven nodes. 14 00:00:54,110 --> 00:00:56,524 And that's exactly 2 times 4 minus 1 is 7. 15 00:00:56,524 --> 00:00:59,818 Okay, so at least that little example checks out there. 16 00:00:59,818 --> 00:01:04,922 So this is 2N- 1 nodes if 17 00:01:04,922 --> 00:01:10,490 one data point at each leaf. 18 00:01:12,220 --> 00:01:14,510 And assuming that it's a balanced binary tree. 19 00:01:14,510 --> 00:01:21,660 But the point is that there's on the order of N nodes in our tree. 20 00:01:21,660 --> 00:01:22,820 And how deep is the tree? 21 00:01:23,860 --> 00:01:29,190 Well, the tree has depth that's on the order of log N. 22 00:01:30,510 --> 00:01:32,910 And that's also pretty straightforward to show. 23 00:01:33,990 --> 00:01:38,659 So when N is 4, we end up with a depth 2 tree. 24 00:01:40,623 --> 00:01:47,570 Okay, so now let's talk a little bit about the complexity of constructing this tree. 25 00:01:47,570 --> 00:01:48,852 And so, what do we have to do? 26 00:01:48,852 --> 00:01:52,355 Well, if we use our median heuristic for 27 00:01:52,355 --> 00:01:57,505 how to set the split value, then at every level of our tree, 28 00:01:57,505 --> 00:02:04,115 we have N observations that we either have to send to the right or to the left. 29 00:02:04,115 --> 00:02:07,480 So in the first case, it's just one split left and right. 30 00:02:07,480 --> 00:02:10,660 At the second level you have two splits. 31 00:02:10,660 --> 00:02:15,402 But they're splits of smaller numbers of points and this goes on and on and on. 32 00:02:15,402 --> 00:02:21,179 But what you can show is that if you do an efficient implementation of this, 33 00:02:21,179 --> 00:02:26,315 again, using a priority queue like we've talked about earlier, 34 00:02:26,315 --> 00:02:34,771 then you end up with complexity that's order N, At every level of the tree. 35 00:02:37,931 --> 00:02:43,710 So together, the construction time would be order N log N, 36 00:02:43,710 --> 00:02:48,217 because I have an order N operation at each of, 37 00:02:48,217 --> 00:02:52,055 on the order of log N levels of the tree. 38 00:02:52,055 --> 00:02:57,160 Okay, so that's our construction time. 39 00:02:57,160 --> 00:03:01,840 But now we have to think about the complexity of doing our nearest neighbor 40 00:03:01,840 --> 00:03:07,010 query and see if we get gains that offset the cost of constructing this tree. 41 00:03:08,430 --> 00:03:12,420 Okay, so to begin with, the first thing we have to do when we have a query point, 42 00:03:12,420 --> 00:03:15,080 remember, is we traverse all the way down to a leaf node. 43 00:03:15,080 --> 00:03:19,408 So we're going to go all the way down the depth of the tree. 44 00:03:19,408 --> 00:03:26,160 So that's order log N, because remember, the depth is on the order of log N. 45 00:03:26,160 --> 00:03:30,820 And then we have to do this backtrack and traverse and backtrack and traverse. 46 00:03:32,250 --> 00:03:37,700 And in the worst case, how many different points do we have to consider? 47 00:03:37,700 --> 00:03:40,050 How many different nodes do we have to visit? 48 00:03:41,120 --> 00:03:43,875 Well, how many nodes are there in total? 49 00:03:43,875 --> 00:03:45,554 They're on the order of N nodes. 50 00:03:45,554 --> 00:03:49,120 So, N nodes in the worst case. 51 00:03:51,860 --> 00:03:53,421 So the worst case being when, 52 00:03:53,421 --> 00:03:57,680 because of the structure of the problem we're not able to prune anything at all. 53 00:03:58,890 --> 00:04:05,040 So the complexity range is anywhere from on the order of log N, 54 00:04:05,040 --> 00:04:09,550 if we can do lots of pruning, everything gets pruned, 55 00:04:09,550 --> 00:04:15,990 all the way to on the order of N, if we can't do any pruning. 56 00:04:15,990 --> 00:04:22,524 Remember that order N was the same complexity as our brute force search. 57 00:04:22,524 --> 00:04:26,980 So, if we're just going to do a single nearest neighbor query and 58 00:04:26,980 --> 00:04:31,183 if we got really unlucky about the structure of the data, or 59 00:04:31,183 --> 00:04:34,080 how we chose to construct the kd-tree. 60 00:04:34,080 --> 00:04:38,872 We might actually take a penalty over a brute force search. 61 00:04:38,872 --> 00:04:41,585 But in some cases, as we're going to show, 62 00:04:41,585 --> 00:04:45,322 you can actually have significant gains in efficiency. 63 00:04:45,322 --> 00:04:49,850 Okay, so under some assumptions about the distribution of points. 64 00:04:49,850 --> 00:04:55,330 So things that don't look like that really elegant circle that we had before. 65 00:04:55,330 --> 00:05:01,144 We can get things that have on the order of log N number of queries that 66 00:05:01,144 --> 00:05:07,581 we have to do, or distance computations per query that we're looking at. 67 00:05:07,581 --> 00:05:13,634 But the number of inspections we have to do can be exponential in d, 68 00:05:13,634 --> 00:05:16,290 the number of dimensions. 69 00:05:16,290 --> 00:05:18,770 So for all the examples, all the pictures we're showing 70 00:05:18,770 --> 00:05:23,060 are just two-dimensional examples I think of two words in our vocabulary. 71 00:05:23,060 --> 00:05:25,953 But, we might be interested in much higher dimensions, and 72 00:05:25,953 --> 00:05:30,410 k-d trees truthfully don't work very well in high dimensions. 73 00:05:30,410 --> 00:05:35,146 And we're going to return to this later in this course. 74 00:05:35,146 --> 00:05:39,409 [MUSIC]