So we mentioned that we're going to look at locality sensitive hashing, or LSH, as an alternative to KD-trees. Well, now let's dig into the details of what locality sensitive hashing does.

Just like KD-trees, we're going to think of partitioning our data into bins, but the way in which locality sensitive hashing does this is fundamentally different from the axis-aligned cuts made by KD-trees. In this section we're going to return to an example that we used in the classification course: imagine we have reviews of restaurants. We have the text from those reviews, and we're just going to count how many times we see the word "awesome" versus the word "awful". And when we're thinking about retrieving similar reviews, we're just going to look at these counts.

Okay, well, just like in classification, we can think of dividing this set of points using a simple line. This is like the decision boundary we saw in the classification course. Here I'm just writing the equation of that line, and we're defining a term that we call the score. It says that for any input, in terms of the count of "awesome" and the count of "awful", I'm going to score that review. Everything that falls below this line has a negative score, and everything above this line has a positive score.

But we're going to use these scores in a different way than we did in the classification course. Here, if the sign of the score is negative, then all of those points fall into what we're calling bin 0. But if the sign is positive, then those points are assigned to bin 1. So we're going to translate the sign of the score into a binary index.

Then we can use this for nearest neighbor search in the following way. If we get this green query point, we're only going to search for nearest neighbors in its bin, so all these other orange points. Here we're only searching for nearest neighbors among the points whose score is less than zero, which in terms of the bins corresponds to all the points that were labeled as having bin index 0.
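To make this concrete, here is a minimal Python sketch of the scoring and bin-assignment step. The weights on the two counts are illustrative placeholders (the narration doesn't give the exact coefficients of the line in the figure), and the helper names are hypothetical.

```python
import numpy as np

# Illustrative weights for the separating line: score = 1.0*#awesome - 1.5*#awful.
# These coefficients are placeholders, not the ones drawn in the lecture figure.
WEIGHTS = np.array([1.0, -1.5])

def score(counts):
    """Score a review represented as [#awesome, #awful]."""
    return np.dot(WEIGHTS, counts)

def bin_index(counts):
    """Translate the sign of the score into a binary index: negative -> 0, otherwise -> 1."""
    return 0 if score(counts) < 0 else 1

# A review with 1 "awesome" and 3 "awful"s scores negative, so it lands in bin 0.
print(score([1, 3]), bin_index([1, 3]))   # -3.5, 0
```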
Alternatively, if the query were to have fallen into this other region, the blue region, we would have looked over all the points that had bin index 1, because all of their scores were greater than 0.

In practice, the way we're going to do this search over all data points that have the same bin index as the query point is using something called a hash table. What a hash table does is, for each of our bin values, bin 0 or bin 1, we call that value the key, and then we associate with that key a set of values, a list of the indices of the data points that have that bin value. Said another way, we take all of the data points that have bin value 0 and list their indices in the first element of this table, and we take all the data points that have bin value 1 and list their indices in the second element of this table. Then, when we get a given query point, we look at the bin value of that query, based on the sign of its score, and we just search over the list of points that are in that bin.

So this provides a really, really simple way of doing nearest neighbor search. It's a very simple idea of throwing down a line and then just searching the points that fall on the same side of that line as the query, either above or below it. But a question is, does this return an exact nearest neighbor? And the answer is no. For example, here we have this green query point, and let's say that this orange point, which is just above the line and very close to that green point, is actually the true nearest neighbor. Well, if we look at this green point and we only search the blue points, which have the same bin value as the green point, then we're not going to find that orange point.
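Here is a rough sketch, continuing the assumptions above, of how the hash table can be built and queried: each bin index is a key, its value is the list of data-point indices in that bin, and a query is answered by a brute-force search over only the candidates in its own bin. The function names and the use of Euclidean distance are illustrative choices, not specifics from the course.

```python
from collections import defaultdict
import numpy as np

WEIGHTS = np.array([1.0, -1.5])          # same illustrative line as in the sketch above

def bin_index(point):
    """0 if the point scores negative under the line, 1 otherwise."""
    return 0 if np.dot(WEIGHTS, point) < 0 else 1

def build_hash_table(data):
    """Key: bin index. Value: list of indices of the data points that fall in that bin."""
    table = defaultdict(list)
    for i, point in enumerate(data):
        table[bin_index(point)].append(i)
    return table

def nearest_in_same_bin(query, data, table):
    """Brute-force nearest neighbor search restricted to the query's own bin."""
    best_idx, best_dist = None, np.inf
    for i in table[bin_index(query)]:
        dist = np.linalg.norm(data[i] - query)
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx, best_dist

# Counts of (#awesome, #awful) for a few reviews, then one query
data = np.array([[5.0, 1.0], [0.0, 4.0], [2.0, 3.0], [4.0, 0.0]])
table = build_hash_table(data)
print(nearest_in_same_bin(np.array([1.0, 3.0]), data, table))
```

Restricting the search to one bin is exactly why the answer is only approximate: a true nearest neighbor sitting just on the other side of the line lands in the other bin and is never examined.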