The power of locality sensitive hashing is closely related to the behaviour of this particular function: 1 - (1 - pq^k)^b. That is, pq raised to the power k, subtracted from one, the result raised to the power b, and then again subtracted from one. This function, as you can see, is a very steep one, with the result that the value pq, which is nothing but the probability of a match under one of the locality sensitive functions f, is amplified. So even if pq is fairly moderate, say something close to 0.1 or 0.15, it gets amplified to a very large value, giving a very high probability of a match. At the same time, the false-match probability is driven towards zero as long as it is a little smaller, say 0.07: fairly close to something like 0.15, yet it gets driven to a very small value as long as it is reasonably smaller than pq. The place where this steep rise happens depends on the choices of k and b, and by suitably adjusting these parameters, depending on our values of pq and the false-match probability, we can get whatever behaviour we want to achieve.
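To make this concrete, here is a minimal sketch in Python of the amplification curve; pq is written p in the code, and the particular values of k and b are assumptions chosen for illustration, not values from the lecture.

```python
# Sketch of the LSH amplification function 1 - (1 - p^k)^b.

def s_curve(p: float, k: int, b: int) -> float:
    """Probability that at least one of b bands of k hash
    functions matches, if a single function matches with
    probability p."""
    return 1.0 - (1.0 - p ** k) ** b

# k and b are illustrative assumptions, picked so that the steep
# rise of the curve falls between 0.07 and 0.15.
k, b = 6, 100_000

for p in (0.15, 0.07):
    print(f"p = {p:.2f} -> amplified probability = {s_curve(p, k, b):.3f}")

# Output:
#   p = 0.15 -> amplified probability = 0.680
#   p = 0.07 -> amplified probability = 0.012
```

A moderate gap between 0.15 and 0.07 becomes a wide gap after amplification; increasing k pushes the steep rise to the right, while increasing b pulls it back to the left.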
LSH has many important applications which fall under the big data category these days. For example, when you have hundreds of millions of tweets, how do we group similar tweets without having to compare all the pairs? Comparing all pairs is an impossible task: there are n² pairs, and if n is in the hundreds of millions, it just becomes impossible. LSH is one way out; a small sketch at the end of this section illustrates the idea.

In enterprise search, we saw the problem of finding near duplicates or versions of the same root document. This is another problem which can be addressed using LSH.

Finding patterns in time series from sensors is another: you have multiple sensors capturing many different parameters, say from a car, a piece of machinery, or a ship, and you want to find patterns in that time series. LSH-like concepts turn out to be useful there as well.

Another example, which is also discussed in the Anand Rajaraman book, is resolving the identities of people from different databases or multiple inputs. One such problem, described in the book, is reconciling records from different systems. Another is figuring out which Twitter IDs match which Facebook IDs, which LinkedIn IDs, and which e-mail IDs. Now, this is private information which people want to keep separate, but there are many companies engaged in figuring out, using concepts like LSH, which identities should be clumped together because they are very likely to belong to the same person.
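As mentioned above for the tweets example, here is a minimal sketch of how LSH groups candidates without comparing all pairs. The MinHash signatures, the word-shingle tokenizer, Python's built-in hash as the hash family, and the parameters K and B are all assumptions for illustration, not details from the lecture.

```python
# Minimal LSH-by-banding sketch: tweets that agree on all K hash
# values in at least one of B bands land in the same bucket and
# become candidate pairs; nothing else is ever compared.
import random
from collections import defaultdict

K, B = 2, 50                  # rows per band, number of bands (assumed)
SEEDS = [random.randrange(1 << 30) for _ in range(K * B)]

def shingles(text: str, n: int = 3):
    """Set of n-word shingles; falls back to the whole text if short."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)} or {text}

def minhash(text: str):
    """One MinHash value per seed, taken over the tweet's shingle set."""
    sh = shingles(text)
    return [min(hash((seed, s)) for s in sh) for seed in SEEDS]

buckets = defaultdict(list)   # (band index, band signature) -> tweet ids

def index(tweet_id: int, text: str) -> None:
    sig = minhash(text)
    for band in range(B):
        buckets[(band, tuple(sig[band * K:(band + 1) * K]))].append(tweet_id)

index(1, "the quick brown fox jumps over the lazy dog")
index(2, "the quick brown fox jumped over the lazy dog")
index(3, "completely unrelated text about something else entirely")

# Only tweets sharing a bucket become candidates: very likely {(1, 2)}.
print({tuple(ids) for ids in buckets.values() if len(ids) > 1})
```

Indexing is linear in the number of tweets, which is what makes the hundreds-of-millions scale feasible; the k and b tuning discussed earlier (written K and B here) controls how similar two tweets must be before they are likely to share a bucket.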