Let's now discuss how KD-trees can aid in efficiently performing nearest neighbor search.

To start with, we have our query point, shown with this green circle here. So this is our query point. The first thing we're going to do is traverse the tree all the way down to the leaf node that contains that query point. What that means is, we're going to check the coordinates of this point against the condition at each one of these internal nodes until we get to a leaf node.

Okay, so to begin with, we check the first condition. Remember, it was a split on x1: whether it was greater than or less than 0.5. In this case, x1 for this point is greater than 0.5. So we have that the x1 coordinate of our query point is greater than 0.5. So we say yes, and we traverse down this side of the tree. Then we check the next condition, and this was a condition on whether x2 was greater than some value; I'll just call it v. So let's call this point here v. And here we see that, yes, x2 is greater than v. So, again, we have a yes, and we traverse down this way. And we keep doing that until eventually we get to a leaf node.

Okay, now once we're at this leaf node, the first thing we do is start searching for nearest neighbors, because our hope is that this box, local to our point, might be a likely place to find the nearest neighbor. So we compute the distances to each point contained in this box. This node has a list of points: the query point falls in this box along with these two other points here, and we compute those distances. That's what's stated right here on this slide.

Now that we've computed these distances, what we do is record the distance to the nearest neighbor found so far. So let's call this r, the distance to the nearest neighbor found so far.

Okay, then a natural question is: is that it? Are we done? Does the nearest neighbor live within the blue box that we just searched? No, not necessarily, right?
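To make that traversal step concrete, here is a minimal Python sketch of descending a KD-tree to the leaf containing a query point and recording r within that leaf. The node layout (split_dim, split_value, left, right, points) is a hypothetical one chosen for illustration, not the implementation used in this course.

```python
import math

# Hypothetical KD-tree node layout, for illustration only.
class Node:
    def __init__(self, split_dim=None, split_value=None,
                 left=None, right=None, points=None):
        self.split_dim = split_dim      # e.g. 0 for x1, 1 for x2
        self.split_value = split_value  # e.g. 0.5 at the root
        self.left = left                # subtree with coordinate <= split_value
        self.right = right              # subtree with coordinate > split_value
        self.points = points            # data points stored at a leaf, else None

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def descend_to_leaf(node, query):
    """Check the split condition at each internal node until we hit a leaf."""
    while node.points is None:                        # still at an internal node
        if query[node.split_dim] > node.split_value:  # "yes" branch
            node = node.right
        else:                                         # "no" branch
            node = node.left
    return node

def search_leaf(leaf, query):
    """Compute distances to every point in the leaf's box and record r,
    the distance to the nearest neighbor found so far."""
    best = min(leaf.points, key=lambda p: euclidean(p, query))
    return best, euclidean(best, query)
```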
That should be obvious, because now that we've drawn this fuchsia circle, we see that there are other points closer to this green point that live in other neighboring boxes. So what we're going to do is continue searching. We're going to go from this node that we already examined, so that one's already done, and we're going to backtrack: we go back up the tree and start checking out other branches.

Okay, so then we go down to another leaf node. So we checked this one, and now we've traversed here to this leaf. Now we compute distances to all of these points and compare them to the distance to our nearest neighbor found so far. And in this case, we have that the distance to a point in this leaf is smaller than what we had before. So this becomes our new nearest neighbor, and this becomes our new distance to the nearest neighbor. And then we continue searching. So, let me just write this: this becomes our new distance to the nearest neighbor.

And when we continue searching, we've now checked this node, we've checked this node, and we're going to go and check this other node. But what this blue box here shows is the tight bounding box for that node. Remember, it's the box that contains all the points within that node. And what we see is that the distance to that box is greater than the distance to our nearest neighbor found so far. So, the distance to the bounding box is greater than the distance to the nearest neighbor so far. And what this implies is that we're going to prune this node. We're not even going to compute the distance to any point in this leaf node, because we know that no point in there can be a nearest neighbor closer than the nearest neighbor that we found so far.

Okay, now, what have we done? We've checked this node. We've checked this node. We've pruned this node. We go up and we start going down this other branch. What happens here?
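Putting the backtracking and pruning together, a rough sketch of the full search might look like the following. It reuses the hypothetical Node layout from the sketch above and additionally assumes each node stores its tight bounding box as node.box = (mins, maxs); none of these names come from the lecture.

```python
import math

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def dist_to_box(query, box):
    """Smallest distance from the query to an axis-aligned bounding box."""
    mins, maxs = box
    return math.sqrt(sum(max(lo - q, 0.0, q - hi) ** 2
                         for q, lo, hi in zip(query, mins, maxs)))

def nn_search(node, query, best=None, best_dist=float("inf")):
    # Prune: no point inside this box can beat the nearest neighbor so far.
    if dist_to_box(query, node.box) >= best_dist:
        return best, best_dist
    if node.points is not None:                 # leaf: check every point in it
        for p in node.points:
            d = dist(p, query)
            if d < best_dist:
                best, best_dist = p, d
        return best, best_dist
    # Internal node: search the side containing the query first, then
    # backtrack into the other side (which the check above may prune).
    if query[node.split_dim] > node.split_value:
        near, far = node.right, node.left
    else:
        near, far = node.left, node.right
    best, best_dist = nn_search(near, query, best, best_dist)
    best, best_dist = nn_search(far, query, best, best_dist)
    return best, best_dist
```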
Again, the distance to this box exceeds the distance to our nearest neighbor, so we prune. Then what we're going to do is go up and start searching the other branch, but all the other boxes that we have exceed the distance to our nearest neighbor. So, what we see is that we've been able to prune quite a lot of the search space. Let's just go back and look at all these points. In the brute force method, we would have had to compute distances to all of those points. But here we only had to compute distances to a small number of data points in this example to find our nearest neighbor.

So, this is why KD-trees are so useful in performing efficient nearest neighbor search: when you have lots and lots of observations, and especially if you're going to be performing lots and lots of queries. Because you do have to build the tree, and there's some computational cost to building this KD-tree. But if you're going to be using it again and again, it can actually provide significant savings.
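As an illustration of the build-once, query-many-times point, here is a small example using SciPy's KD-tree implementation (scipy.spatial.cKDTree); the random data is purely illustrative and not from the lecture.

```python
import numpy as np
from scipy.spatial import cKDTree

# Pay the construction cost once, over the full dataset.
data = np.random.rand(100_000, 2)   # 100k points in 2-D, made up for this example
tree = cKDTree(data)

# Then answer many queries against the same tree.
queries = np.random.rand(1_000, 2)
distances, indices = tree.query(queries, k=1)   # nearest neighbor for each query
print(distances[:3], indices[:3])
```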