Let's consider some other kinds of search now. So far, we have been indexing objects, such as documents, by the features they contain, which are words, or, in the case of records in a database, the contents of different columns. Our assumption has been that when we search for a document or a record, our query is a bag of words. In other words, features are what we search with.

But suppose our query is itself an object: for example, an image, as in Google Goggles or Google Image search, or a biometric, such as a fingerprint or an iris scan, as in the unique identification project of the Government of India. For those of you not familiar with this ambitious and equally controversial project, it aims to provide every resident of India with a unique biometric ID. We will return to this in a minute.

But coming back to the question: is searching for an object in an inverted index of features, whether they are words, image features, or other features in the case of fingerprints and irises, really the best way to search for objects? What do you think? I would say that no, it probably is not the best way to search for objects. As to why, I encourage you all to think about this and discuss it on the discussion forum.

Right now, though, we are going to discuss another very powerful method, called locality sensitive hashing, invented by Indyk and Motwani in the late '90s and described in fair detail in the Anand Rajaraman book, which is a reference for this course. This technique is particularly important in all kinds of big data applications because it somehow manages to compare, at least approximately, n objects in linear, that is, order-n, time. There are n squared pairs, but it manages to tell us which pairs are actually close together without actually comparing all pairs. Sounds like magic; let's see how.

The basic idea of locality sensitive hashing, or LSH as we shall refer to it going forward, is the following. It is hashing, so an object X is hashed to a value h(X), but in such a manner that if X and Y are two objects and X is close to Y, then the two hash values are equal with high probability. In this respect it is similar to normal hashing, where if X is equal to Y, then X and Y hash, with high probability, to the same value. But LSH deals with distances and closeness, not just equality, and that is what makes it interesting: it is hashing that is sensitive to things being close together, or being in the same locality.
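As a concrete illustration of that collision property, here is a minimal sketch in Python. The lecture does not commit to a particular locality-sensitive family at this point, so the sketch assumes minhashing of word sets, the standard family for Jaccard similarity covered in the same Rajaraman and Ullman reference; the helper make_minhash and the example documents are this sketch's own inventions, not anything from the lecture.

```python
import random

def make_minhash(seed, prime=4294967311):
    # One random hash function: h(S) = min over x in S of (a * hash(x) + b) mod prime.
    rng = random.Random(seed)
    a, b = rng.randrange(1, prime), rng.randrange(prime)
    return lambda s: min((a * hash(x) + b) % prime for x in s)

# Two close objects (they share most of their words) and one far object.
doc_x = {"locality", "sensitive", "hashing", "big", "data", "search"}
doc_y = {"locality", "sensitive", "hashing", "big", "data", "index"}
doc_z = {"completely", "unrelated", "words", "here"}

trials = 1000
close = sum(make_minhash(i)(doc_x) == make_minhash(i)(doc_y) for i in range(trials))
far = sum(make_minhash(i)(doc_x) == make_minhash(i)(doc_z) for i in range(trials))

# For minhashing, P[h(X) == h(Y)] is approximately the Jaccard similarity of X and Y,
# so the close pair collides often and the far pair essentially never does.
print("close pair collision rate:", close / trials)  # around 5/7, roughly 0.7
print("far pair collision rate:  ", far / trials)    # roughly 0.0
```

The closer two objects are under the chosen similarity measure, the more often their hash values agree, which is exactly the locality-sensitive property just described.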
Conversely, obviously, if X is not close to Y, that is, X is far from Y, then the two hash values will not be equal, with high probability.

Of course, constructing the hash functions themselves is quite tricky. Here we have to do interesting things, like take random functions from a locality-sensitive family and combine them in particular ways; a small sketch of this combining step follows below. Rajaraman and Ullman's chapter three describes this process fairly well, and in considerable detail.

An example is biometric matching. In the case of the UID project, for example, more than a billion people are going to get biometric IDs. Of these, more than 200 million have already been enrolled. At the same time, let me make the following disclaimer: what the UID actually uses internally is proprietary, and I am not intending to suggest that they use locality sensitive hashing. This is merely a motivating example.
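To make the combining step concrete, here is a minimal sketch of the banding construction described in Rajaraman and Ullman's chapter three, again assuming minhash signatures over word sets. The parameter choices (bands, rows) and names such as candidate_pairs are illustrative assumptions of this sketch, and, per the disclaimer above, none of this reflects what the UID system actually does.

```python
from collections import defaultdict
from itertools import combinations
import random

def minhash_signature(s, num_hashes, prime=4294967311):
    # Fixed seed so that every object is hashed with the same sequence of random functions.
    rng = random.Random(42)
    sig = []
    for _ in range(num_hashes):
        a, b = rng.randrange(1, prime), rng.randrange(prime)
        sig.append(min((a * hash(x) + b) % prime for x in s))
    return sig

def candidate_pairs(objects, bands=10, rows=2):
    """AND together `rows` hash values inside each band, OR across `bands` bands:
    only objects landing in the same bucket in at least one band become candidates."""
    sigs = {name: minhash_signature(s, bands * rows) for name, s in objects.items()}
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for name, sig in sigs.items():
            key = tuple(sig[band * rows:(band + 1) * rows])  # AND: all rows must agree
            buckets[key].append(name)
        for names in buckets.values():                        # OR: any shared bucket suffices
            candidates.update(combinations(sorted(names), 2))
    return candidates

objects = {
    "x": {"locality", "sensitive", "hashing", "big", "data", "search"},
    "y": {"locality", "sensitive", "hashing", "big", "data", "index"},
    "z": {"completely", "unrelated", "words", "here"},
}
# Almost certainly prints {("x", "y")}: the close pair surfaces as a candidate,
# while the far pairs are never compared at all.
print(candidate_pairs(objects))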