So we mentioned that we're going to look at locality sensitive hashing, or LSH, as an alternative to KD-trees. Now let's dig into the details of what locality sensitive hashing does. Just like KD-trees, we're going to think of partitioning our data into bins, but the way in which locality sensitive hashing does this is fundamentally different from the axis-aligned cuts made by KD-trees. In this section we're going to return to an example that we used in the classification course: imagine we have reviews of restaurants, we have the text of those reviews, and we're just going to count how many times we see the word "awesome" versus the word "awful". When we're thinking about retrieving similar types of reviews, we're just going to look at these counts.

Just like in classification, we can think of dividing this set of points using a simple line. This is like the decision boundary we saw in the classification course. Here I'm writing the equation of this line, and we're defining a term which we call the score: for any input, in terms of the counts of the number of "awesome"s and the number of "awful"s, we're going to score that review. Everything that falls below this line has a negative score, and everything above this line has a positive score. But we're going to use these scores in a different way than we did in the classification course. Here, we're going to say that if the sign of the score is negative, then all of those points fall into what we're calling bin 0, and if the sign is positive, then all of those points are assigned to bin 1. So we're translating the sign of the score into a binary index.

Then we can use this for nearest neighbor search in the following way. If we get this green query point here, we're only going to search for nearest neighbors in its bin, so all these other orange points. Here we're only searching for nearest neighbors among the points whose score is less than zero; in terms of bin indices, that corresponds to all the points that were labeled as having bin index 0. Alternatively, if the query had fallen into the other region, the blue region, we would have looked over all the points that had bin index 1, because all of their scores were greater than 0.

In practice, the way we're going to do this search over all data points that have the same bin index as the query point is using something called a hash table. For each of our bin values, bin 0 or bin 1, we call that value the key, and we associate with that key a set of values: a list of the indices of the data points that have that bin value. Said another way, we take all of the data points that have bin value 0 and list those data indices in the first element of the table, and we take all the data points that have bin value 1 and list those indices in the second element of the table. Then, when we get a given query point, we look at its bin value, based on the sign of its score, and we just search over the list of points that are in that bin.

So this provides a really, really simple way of doing nearest neighbor search: a very simple idea of throwing down a line and then just searching the points that fall on the same side of that line as the query. A minimal sketch of this whole pipeline in code follows below.
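To make that concrete, here is a minimal Python sketch of the pipeline, assuming a toy dataset and an illustrative score line. The review counts, the coefficients 1.0 and -1.5, and all variable names here are made up for this example, not taken from the lecture:

```python
import numpy as np
from collections import defaultdict

# Toy data: each row is (#awesome, #awful) counts for one review.
data = np.array([[5.0, 1.0], [0.0, 4.0], [3.0, 0.0], [1.0, 3.0], [4.0, 2.0]])

# An illustrative line: Score(x) = 1.0 * #awesome - 1.5 * #awful.
w = np.array([1.0, -1.5])

def bin_index(x):
    # Translate the sign of the score into a binary bin index: 0 or 1.
    return int(np.dot(x, w) >= 0)

# Hash table: key = bin index, value = list of data point indices in that bin.
table = defaultdict(list)
for i, x in enumerate(data):
    table[bin_index(x)].append(i)

# Query: only search the points that share the query's bin, then do a
# brute-force nearest neighbor search within that short candidate list.
query = np.array([2.0, 2.0])
candidates = table[bin_index(query)]
nearest = min(candidates, key=lambda i: np.linalg.norm(data[i] - query))
print("bin:", bin_index(query), "candidates:", candidates, "nearest:", nearest)
```

For this toy data the query hashes to bin 0, so only the two reviews in that bin are ever compared against it, rather than all five.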
But a question is: does this return the exact nearest neighbor? And the answer is no. For example, here we have this green query point, and let's say that this orange point, which is just above the line and very close to the green point, is actually the true nearest neighbor. Well, if we take the green point and only search the blue points, which have the same bin value as the green point, then we're never going to find that orange point.
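Continuing the same toy sketch, here is that failure mode in code, again with made-up points: the query and its true nearest neighbor are only 0.1 apart, but they land on opposite sides of the line and therefore in different bins.

```python
import numpy as np

w = np.array([1.0, -1.5])                     # same illustrative line as above
bin_index = lambda x: int(np.dot(x, w) >= 0)

query   = np.array([2.0, 1.3])                # score =  0.05 -> bin 1
true_nn = np.array([2.0, 1.4])                # score = -0.10 -> bin 0

# The true nearest neighbor sits only 0.1 away, but it hashes to the other
# bin, so a search restricted to the query's bin never examines it.
print(bin_index(query), bin_index(true_nn))   # prints: 1 0
```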