The power of locality sensitive hashing is
closely related to the behavior of one
particular function: one minus something to
the power k, the whole of which is then
raised to the power b and again subtracted
from one. Writing x for the "something",
that is 1 - (1 - x^k)^b.
This particular function, as you can see,
is a very steep one.
The result is that the value pq, which is
nothing but the probability of a match in
one of the locality sensitive functions f,
is amplified.
So even if this value pq is fairly
moderate, say something close to 0.1 or
0.15, it gets amplified to a very large
value, giving a very large probability of
a match.
At the same time, the false match
probability is driven to zero as long as
it is a little smaller, say 0.07.
That is fairly close to something like
0.15, but it gets driven to a very small
value as long as it is reasonably smaller
than pq.
The place where this steep rise happens
depends on the choices of k and b.
By suitably adjusting these parameters,
depending on our values of p and pq, we
can get the behaviour that we want to
achieve.
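
As a rough illustration of the amplification just described, the sketch below evaluates 1 - (1 - x^k)^b for a few match probabilities. It assumes this standard form is the function meant here. The name `match_probability` and the values of k and b are illustrative choices, picked only so that the steep rise falls between 0.07 and 0.15; they are not taken from the lecture.

```python
def match_probability(x: float, k: int, b: int) -> float:
    """Amplified match probability, assuming the form 1 - (1 - x**k)**b."""
    return 1.0 - (1.0 - x ** k) ** b


# Illustrative parameters only (an assumption, not from the lecture).
K, B = 5, 50_000

for x in (0.03, 0.07, 0.10, 0.15, 0.20):
    # Small x stays near zero; somewhat larger x is pushed towards one.
    print(f"x = {x:.2f} -> {match_probability(x, K, B):.4f}")
```

Changing k and b moves the point where the curve rises, which is the adjustment referred to above.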
LSH has many important applications
which fall under the big data category
these days.
For example, when you have hundreds of
millions of tweets, how do we group
similar tweets without having to compare
all the pairs?
Since there are n squared pairs, and n is
in the hundreds of millions, comparing
them all is simply impossible.
So LSH is one way out.
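
To make the point concrete: with n = 100 million items there are n(n-1)/2, roughly 5 x 10^15, pairs. The sketch below shows one way to avoid comparing all of them, by comparing only items that land in the same bucket. It assumes MinHash over character shingles as the locality sensitive family, which is a common choice but not something this passage specifies. The names `shingles`, `minhash_signature`, and `candidate_pairs`, and the default k and b, are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, size=3):
    """Set of character shingles; assumes len(text) >= size."""
    text = text.lower()
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def minhash_signature(items, num_hashes):
    """One min-hash value per seed; the seed is mixed into the hashed string."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.md5(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items))
    return signature


def candidate_pairs(texts, k=2, b=20):
    """Index pairs that share a bucket in at least one of b bands of k hashes."""
    buckets = defaultdict(set)
    for idx, text in enumerate(texts):
        sig = minhash_signature(shingles(text), k * b)
        for band in range(b):
            # Items agree on a band only if all k hashes in it agree.
            key = (band, tuple(sig[band * k:(band + 1) * k]))
            buckets[key].add(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


tweets = [
    "the quick brown fox jumps",
    "the quick brown fox jumps",
    "zzzzzzzzzzzz",
]
print(candidate_pairs(tweets))
```

Only pairs that share a bucket are ever examined, so the work depends on bucket sizes rather than on the total number of pairs.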
In enterprise search, we saw the problem
of finding near duplicates or versions of
the same root document.
This is another problem which can be
addressed using LSH.
Another is finding patterns in time series
from sensors, where you have multiple
sensors capturing many different
parameters, such as from a car, a piece of
machinery, or a ship.
If you want to find patterns in those time
series, LSH-like concepts turn out to be
useful there as well.
Another example, which is also discussed
in the Anand Rajaraman book, is resolving
identities of people from different
databases or multiple inputs.
One example of such a problem, described
in the book, is resolving databases from
different systems.
Another example is figuring out which
Twitter IDs match which Facebook IDs,
which LinkedIn IDs, and which e-mail IDs.
Now, this is private information which
people want to keep separate, but there
are many companies engaged in figuring
out, using concepts like LSH, which
identities should be clumped together
because they are very likely to be from
the same person.