Okay, but let's go back to this idea that, depending on the structure, or rather the distribution, of the data in the space, you can actually get query times looking closer to log N. So here's an example where we have a fairly uniform distribution of points across the space, and for a given query we've pruned a huge amount of the search space and only had to search two different boxes. Whereas in that more contrived example where our data lay on a circle, and based on that center-splitting heuristic we had for how to create the splits, we end up having to search almost every node in the tree. So that looks a lot closer to order N, not much savings in this case.

Okay, so like I said, to really offset the cost of creating this data structure, it's useful to use these things in cases where you're doing lots and lots of queries. So let's imagine asking for the nearest neighbor of every article in our corpus; that is, we're doing N queries. Let's look at the complexity of this. Under a brute force search, remember, it was order N per query, so for N queries we get something that's order N squared. On the other hand, for kd-trees it's going to be somewhere in the range of order N log N if we get lucky, and if we get unlucky it's going to be on the order of N squared, just like the brute force search. But just to note, this represents potentially very large savings for large N.

Okay, so let's look at a little empirical study of the number of inspections we have to do as a function of the number of observations, and then as a function of the dimension of the space that we're working in. Here we're increasing the number of observations, N, and in this case we see roughly a log N trend. But on the other hand, when we're looking at going to higher and higher dimensions, and just to be clear, we're saying our observations x_i live in R^d, so that's what d represents, we see an exponential increase in the number of inspections we have to do, the number of distance computations, or really the number of nodes in the tree we visit, to find our nearest neighbor. So for large d, this is really a regime where using kd-trees becomes pretty pointless. We'll return to this to talk about it a little bit more, and also to think about an alternative approach that will be more helpful to us in high-dimensional scenarios.

But before that, I just want to mention how we can think about using kd-trees to do k nearest neighbor search, instead of one nearest neighbor search. The extension is very intuitive and straightforward. Instead of just maintaining the distance to the nearest neighbor, we're going to maintain the distance to the kth nearest neighbor found so far. So in this example, we're looking at a two nearest neighbor task, and this is our distance to the second nearest neighbor found so far in the search of this first box. Then, just as before, based on the distances we've found so far, we backtrack and go search the other nodes, and we prune a node if the distance to that kth nearest neighbor is smaller than the distance to the bounding box of the node we're looking at. So the algorithm is almost identical; the only thing you would need to change in code is, instead of maintaining the distance to the nearest neighbor, maintain the distance to the kth nearest neighbor found so far.
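To make that last point concrete, here's a rough Python sketch of what that k nearest neighbor search could look like. This is just an illustrative sketch, not the course's implementation: the node attributes (mins, maxs, is_leaf, points, split_dim, split_value, left, right) are an assumed kd-tree layout, and the only real change from the single nearest neighbor version is that a small heap tracks the kth nearest distance found so far, and that distance drives the bounding-box pruning test.

```python
import heapq
import numpy as np

def dist_to_box(query, mins, maxs):
    """Distance from the query to a node's axis-aligned bounding box
    (zero if the query lies inside the box)."""
    closest = np.clip(query, mins, maxs)
    return np.linalg.norm(query - closest)

def knn_search(node, query, k, heap=None):
    """Collect the k nearest neighbors of `query` in a kd-tree.

    `heap` holds (-distance, point) pairs, so heap[0] always corresponds to
    the kth (i.e., worst) nearest neighbor found so far.
    Assumed node layout: mins, maxs, is_leaf, points, split_dim,
    split_value, left, right.
    """
    if heap is None:
        heap = []
    if node is None:
        return heap

    # Pruning step: once we have k candidates, if even the closest point of
    # this node's bounding box is farther than our current kth nearest
    # neighbor, nothing inside the box can improve the answer.
    if len(heap) == k and -heap[0][0] < dist_to_box(query, node.mins, node.maxs):
        return heap

    if node.is_leaf:
        for p in node.points:
            d = np.linalg.norm(query - p)
            if len(heap) < k:
                heapq.heappush(heap, (-d, tuple(p)))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, tuple(p)))
        return heap

    # Descend into the child containing the query first, then backtrack into
    # the other child (which the test above may prune entirely).
    if query[node.split_dim] <= node.split_value:
        near, far = node.left, node.right
    else:
        near, far = node.right, node.left
    knn_search(near, query, k, heap)
    knn_search(far, query, k, heap)
    return heap
```

Storing negated distances in the heap means the kth nearest distance found so far is always available in constant time, which is exactly the quantity the pruning test compares against the distance to a node's bounding box; with k = 1 this reduces to the single nearest neighbor search described earlier.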