Let's consider some other kinds of search now. So far, we have been indexing objects, such as documents, by the features they contain, which are words; or, in the case of records in a database, the contents of different columns. Our assumption has been that when we search for a document or a record, our query is a bag of words. In other words, features are what we search with.

But suppose our query is itself an object: for example, an image, as in Google Goggles or Google Image Search, or a biometric, such as a fingerprint or iris scan, as in the unique identification project of the Government of India. For those of you not familiar with this ambitious, and equally controversial, project: it aims to provide every resident of India with a unique biometric ID. We will return to this in a minute.

But coming back to the question of searching for an object: is an inverted index of features, whether they are words, image features, or, in the case of fingerprints and irises, other features, the best way to search for objects? What do you think? I would say that no, it probably is not the best way to search for objects. As to why, I encourage you all to think about this, and to discuss it on the discussion forum.

Right now, we are going to discuss another very powerful method, called locality sensitive hashing, invented by Indyk and Motwani in the late '90s, and described in fair detail in the Anand Rajaraman book, which is a reference for this course. This technique is particularly important in all kinds of big data applications, because it somehow manages to compare, at least approximately, n objects pairwise in linear, that is, O(n), time. There are n-squared pairs, but it manages to tell us which pairs are actually close together without actually comparing all pairs. Sounds like magic; let's see how.

The basic idea of locality sensitive hashing, LSH as we shall refer to it going forward, is the following. It is hashing, so an object x is hashed to a value h(x), in such a manner that if x and y are two objects, and x is close to y, then the two hash values are equal, h(x) = h(y), with high probability. In this respect it is similar to normal hashing, where if x is equal to y, then x and y hash to the same value. But LSH deals with distances and closeness, not just equality, and that is what makes it interesting: it is hashing which is sensitive to things being close together, or in the same locality. Conversely, if x is not close to y, rather x is far from y, then the two hash values will not be equal, with high probability.

Of course, constructing the hash functions themselves is quite tricky. Here we have to do interesting things, like taking random functions from a locality-sensitive family and combining them in particular ways. Rajaraman and Ullman's chapter three describes this process fairly well, and in considerable detail.

An example is biometric matching. In the case of the UID project, for example, more than a billion people are going to get biometric IDs; of these, more than 200 million have already been enrolled. At the same time, let me make the following disclaimer: what the UID actually uses internally is proprietary, and I am not intending to suggest that they use locality sensitive hashing. This is merely a motivating example.
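To make the hashing idea concrete, here is a minimal Python sketch of one well-known locality-sensitive family: random-hyperplane hashing for cosine distance, amplified with the banding trick from Rajaraman and Ullman's chapter three (an AND over several bits within a band, an OR across bands). The parameter choices and names here are purely illustrative, and, to repeat the disclaimer above, this says nothing about what the UID actually does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64
n_bands, bits_per_band = 8, 10  # OR across 8 bands, AND across 10 bits within a band

# One set of random hyperplanes per band; the sign of a vector's projection
# onto each hyperplane's normal contributes one bit of that band's signature.
planes = rng.standard_normal((n_bands, bits_per_band, dim))

def band_signatures(x):
    """One short bit-signature per band; nearby vectors tend to agree on
    every bit of at least one band, distant vectors rarely do."""
    return [tuple(bool(b) for b in (planes[i] @ x >= 0)) for i in range(n_bands)]

def lsh_candidate_pairs(vectors):
    """Bucket each vector by its signature in every band; any two vectors
    sharing a bucket in at least one band become a candidate close pair."""
    tables = [{} for _ in range(n_bands)]
    for idx, v in enumerate(vectors):
        for band, sig in enumerate(band_signatures(v)):
            tables[band].setdefault(sig, []).append(idx)
    pairs = set()
    for table in tables:
        for bucket in table.values():
            pairs.update((i, j) for i in bucket for j in bucket if i < j)
    return pairs

# Demo: vectors 0 and 1 are nearly parallel (close in cosine distance),
# while vector 2 points in an unrelated random direction (far).
x = rng.standard_normal(dim)
vectors = [x, x + 0.05 * rng.standard_normal(dim), rng.standard_normal(dim)]
print(lsh_candidate_pairs(vectors))  # almost always {(0, 1)}: only the close pair collides
```

The point to notice is the cost: each object is hashed into a handful of buckets in time independent of n, so collecting the candidate close pairs takes roughly linear time, even though there are n-squared pairs in principle; only objects that land in a shared bucket are ever compared directly.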