1
00:00:00,000 --> 00:00:05,964
In this video, I'm going to describe a 
technique called semantic hashing that 

2
00:00:05,964 --> 00:00:11,850
provides an extremely efficient way of 
finding documents similar to a query 

3
00:00:11,850 --> 00:00:15,723
document. 
The idea is to convert the document into 

4
00:00:15,723 --> 00:00:19,946
a memory address. 
And in that memory to organize things so 

5
00:00:19,946 --> 00:00:25,165
that if you go to a particular address 
and look at the nearby addresses, you'll 

6
00:00:25,165 --> 00:00:30,384
find documents that are very similar. 
This is much like a supermarket where if 

7
00:00:30,384 --> 00:00:35,736
you go to a location where a particular 
product is stored and look around, you'll 

8
00:00:35,736 --> 00:00:40,218
find similar products. 
People have known for a long time that if 

9
00:00:40,218 --> 00:00:45,624
you could get binary descriptors of 
images, you'd have a very good way of 

10
00:00:45,624 --> 00:00:50,511
retrieving images quickly. 
Some binary descriptors are easy to get. 

11
00:00:50,511 --> 00:00:54,510
For example, is it an indoor scene or an 
outdoor scene? 

12
00:00:54,510 --> 00:00:58,560
Is it a color image or a black and white
image?

13
00:00:58,560 --> 00:01:04,010
But it's much harder to get a list of 
say, 30 binary descriptors which are more 

14
00:01:04,010 --> 00:01:08,356
or less orthogonal to one another, which 
is what we really need. 

15
00:01:08,356 --> 00:01:12,288
This is a problem that machine learning 
can help us with. 

16
00:01:12,288 --> 00:01:17,532
We're going to start by looking at the 
equivalent problem for documents, but 

17
00:01:17,532 --> 00:01:22,706
then we're going to apply it to images. 
So consider, instead of getting real 

18
00:01:22,706 --> 00:01:27,270
valued codes for documents, 
getting binary codes from the word

19
00:01:27,270 --> 00:01:30,990
counts of documents.
We do this by training a deep 

20
00:01:30,990 --> 00:01:34,790
auto-encoder that has logistic units in
its code layer.

21
00:01:36,440 --> 00:01:42,533
That by itself is not sufficient because 
the logistic units will be used in their 

22
00:01:42,533 --> 00:01:48,774
middle ranges where they have real values 
in order to convey as much information as 

23
00:01:48,774 --> 00:01:54,199
possible about the 2,000 word counts. 
To prevent that, we add noise to the 

24
00:01:54,199 --> 00:02:00,217
inputs to the code units during the fine 
tuning stage. So, we first train it as a 

25
00:02:00,217 --> 00:02:06,304
stack of restricted Boltzmann machines. 
We can unroll these Boltzmann machines by 

26
00:02:06,304 --> 00:02:11,374
using the transposes of the weight
matrices for the decoder, and then we 

27
00:02:11,374 --> 00:02:16,095
fine tune it with back propagation. 
And as we're doing that, we add 

28
00:02:16,095 --> 00:02:20,185
additional Gaussian noise to the inputs 
to the code units. 

29
00:02:20,185 --> 00:02:25,755
In order to be resistant to that noise, 
the code units need to be either firmly 

30
00:02:25,755 --> 00:02:29,773
on or firmly off. 
And so the noise will encourage the 

31
00:02:29,773 --> 00:02:35,273
learning to avoid the middle region of 
the logistic where it conveys a lot of 

32
00:02:35,273 --> 00:02:39,362
information, but it's very sensitive to 
noise in its inputs. 

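To see why noise pushes the code units away from their middle range, here is a minimal sketch. The function names and the noise size of 1.0 are illustrative assumptions, not values from the lecture. It compares how far the logistic output moves when the same noise is added to a mid-range input and to a saturated one.

```python
import math

def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def output_shift(x, noise):
    """Change in the logistic output when `noise` is added to its input."""
    return abs(logistic(x + noise) - logistic(x))

# Same input noise, two operating points (both values are assumptions).
mid_shift = output_shift(0.0, 1.0)  # unit in its middle range
sat_shift = output_shift(6.0, 1.0)  # unit firmly on

print(f"middle range: {mid_shift:.4f}, saturated: {sat_shift:.4f}")
```

The mid-range unit moves by far more than the saturated one, so a code that must survive the noise is better off with units that are firmly on or firmly off.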
33
00:02:39,362 --> 00:02:44,932
At test time, we simply threshold the 
logistic units in the middle layer to get 

34
00:02:44,932 --> 00:02:48,968
binary values. 
So, if we can train an auto-encoder like 

35
00:02:48,968 --> 00:02:54,832
this, we will be able to convert the 
counts for a bag of words into a small 

36
00:02:54,832 --> 00:02:59,835
number of binary values. 
In other words, we'll have learned a set 

37
00:02:59,835 --> 00:03:05,230
of binary features that are good for 
reconstructing the bag of words. 

38
00:03:05,230 --> 00:03:10,818
Later on, Alex Krizhevsky discovered that 
we don't actually have to add Gaussian 

39
00:03:10,818 --> 00:03:16,546
noise to the inputs to the 30 code units. 
Instead, we can just make them stochastic 

40
00:03:16,546 --> 00:03:19,549
binary units. 
So, during the forward pass, we 

41
00:03:19,549 --> 00:03:24,760
stochastically pick a binary value using 
the output of the logistic. 

42
00:03:24,760 --> 00:03:30,093
And then, during the backward pass, we 
pretend that we've transmitted the real 

43
00:03:30,093 --> 00:03:35,287
valued probability from the logistic,
and that gives us a smooth gradient for 

44
00:03:35,287 --> 00:03:39,235
back propagation. 
Once we've got these short binary codes, 

45
00:03:39,235 --> 00:03:44,983
we could of course do a sequential search 
where for each known document, we store a 

46
00:03:44,983 --> 00:03:48,031
code. 
And then when a query document arrives, 

47
00:03:48,031 --> 00:03:53,364
we first extract its code, if it's not 
one of our known documents, and then we 

48
00:03:53,364 --> 00:03:57,520
compare the code with the codes of all 
the stored documents. 

49
00:03:57,520 --> 00:04:03,040
The comparisons can be very fast, because 
they can use special bit operations on a 

50
00:04:03,040 --> 00:04:07,160
typical CPU which can compare many bits 
in parallel. 

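The sequential search just described can be sketched as follows. This is a rough illustration under assumptions: codes are stored as Python integers, and the names `hamming_distance` and `closest_documents` are hypothetical. The XOR marks the bits where two codes differ, and counting the set bits gives the Hamming distance.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of bit positions in which two binary codes differ."""
    return bin(code_a ^ code_b).count("1")

def closest_documents(query_code: int, stored_codes: dict) -> list:
    """Sequential search: rank every stored document by distance to the query."""
    return sorted(
        stored_codes,
        key=lambda doc_id: hamming_distance(query_code, stored_codes[doc_id]),
    )

# Toy 4-bit codes; a real system would use longer codes such as 30 bits.
stored = {"a": 0b1111, "b": 0b0011, "c": 0b0010}
ranking = closest_documents(0b0011, stored)
```

Each comparison is cheap, but the loop still touches every stored document.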
51
00:04:07,160 --> 00:04:12,266
But we have to go through a very long 
list of documents, possibly billions. 

52
00:04:12,266 --> 00:04:17,511
There's a much faster thing we can do.

53
00:04:17,511 --> 00:04:21,480
We can treat the code as if it was a 
memory address. 

54
00:04:21,480 --> 00:04:27,432
So, the idea is that we take a document, 
and we use our deep auto-encoder as a 

55
00:04:27,432 --> 00:04:33,240
hash function that converts a document 
into a 30-bit address. Now, we have a

56
00:04:33,240 --> 00:04:38,101
memory with 30 bit addresses. 
And in that memory, each address will 

57
00:04:38,101 --> 00:04:42,519
have a pointer back to the documents that 
have that address. 

58
00:04:42,519 --> 00:04:48,042
If several documents have the same 
address, we can make a little list there. 

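A minimal sketch of that memory, assuming the codes have already been computed by the auto-encoder. A Python dictionary stands in for the memory with 30-bit addresses, and `build_table` is a hypothetical name.

```python
from collections import defaultdict

def build_table(doc_codes: dict) -> dict:
    """Map each binary address to the list of documents that hash to it."""
    table = defaultdict(list)
    for doc_id, code in doc_codes.items():
        table[code].append(doc_id)  # documents sharing an address share a list
    return table

# Toy codes for illustration only.
table = build_table({"doc_a": 0b101, "doc_b": 0b101, "doc_c": 0b010})
```

Only addresses that are actually occupied take up space here, which is a design choice of this sketch rather than something the lecture specifies.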
59
00:04:48,042 --> 00:04:53,860
Now, if the auto-encoder is successful in
making similar documents have similar 

60
00:04:53,860 --> 00:04:58,500
addresses, we have a very fast way of 
finding similar documents. 

61
00:04:58,500 --> 00:05:04,484
We simply take the query document, you go 
to the address in memory that corresponds 

62
00:05:04,484 --> 00:05:08,594
to its binary code, and then you look at 
nearby addresses. 

63
00:05:08,594 --> 00:05:13,785
In other words, you start flipping bits 
in that address to access nearby 

64
00:05:13,785 --> 00:05:17,147
addresses. 
And you could imagine a little Hamming

65
00:05:17,147 --> 00:05:21,420
ball of nearby addresses that differ by 
just a few bits. 

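Flipping bits to visit the Hamming ball can be sketched like this. The radius of 2 and the names `hamming_ball` and `lookup` are assumptions for illustration. The table is any dictionary from address to a list of document ids.

```python
from itertools import combinations

def hamming_ball(address: int, n_bits: int = 30, radius: int = 2):
    """Yield every address within `radius` bit flips of `address`."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            neighbor = address
            for p in positions:
                neighbor ^= 1 << p  # flip bit p
            yield neighbor

def lookup(table: dict, address: int, n_bits: int = 30, radius: int = 2) -> list:
    """Collect the documents stored at all addresses in the Hamming ball."""
    found = []
    for neighbor in hamming_ball(address, n_bits, radius):
        found.extend(table.get(neighbor, []))
    return found

# Toy 4-bit table: "c" is two flips away from the query, so radius 1 misses it.
toy_table = {0b0001: ["a"], 0b0111: ["b"], 0b1111: ["c"]}
hits = lookup(toy_table, 0b0011, n_bits=4, radius=1)
```

With 30 bits and a radius of 2, the ball holds 1 + 30 + 435 = 466 addresses, however many documents are stored.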
66
00:05:21,420 --> 00:05:26,970
What we expect to find at those nearby 
addresses is semantically similar 

67
00:05:26,970 --> 00:05:30,772
documents. 
So, we've completely avoided searching a 

68
00:05:30,772 --> 00:05:34,498
big list. 
We simply compute a memory address, flip 

69
00:05:34,498 --> 00:05:37,920
a few bits, and look up the similar 
documents. 

70
00:05:37,920 --> 00:05:43,287
It's extremely efficient especially if we 
have a very large database of say, a 

71
00:05:43,287 --> 00:05:47,002
billion documents. 
We've completely avoided the serial 

72
00:05:47,002 --> 00:05:51,957
search through a billion items. 
I sometimes call this supermarket search 

73
00:05:51,957 --> 00:05:55,604
because it's like what you would do in a 
supermarket. 

74
00:05:55,604 --> 00:06:00,351
Suppose you went to an unfamiliar 
supermarket and you wanted to find 

75
00:06:00,351 --> 00:06:03,241
anchovies. 
You might ask the teller at the 

76
00:06:03,241 --> 00:06:07,140
supermarket, where do you keep the cans 
of tuna fish? 

77
00:06:07,140 --> 00:06:11,749
You'd then go to that address in the 
supermarket and you'd look around. 

78
00:06:11,749 --> 00:06:16,878
Hopefully, near there are things like cans
of salmon and maybe cans of anchovies. 

79
00:06:16,878 --> 00:06:21,422
Of course, if you're unlucky, the 
anchovies might have been stored in a 

80
00:06:21,422 --> 00:06:24,928
completely different place, 
next to the pizza toppings. 

81
00:06:24,928 --> 00:06:28,780
And that's the downside of this kind of 
search. 

82
00:06:28,780 --> 00:06:34,140
Now, a supermarket is essentially a
2-D surface.

83
00:06:34,140 --> 00:06:39,385
So, it's really a 1-D string of shelves,
which have height and that gives you 2-D, 

84
00:06:39,385 --> 00:06:44,631
and so you only have two dimensions in 
which to locate things. And that's not 

85
00:06:44,631 --> 00:06:49,809
sufficient to put all the things you'd 
like to be near one another, near one 

86
00:06:49,809 --> 00:06:52,739
another. 
You'd like, for example, to have the 

87
00:06:52,739 --> 00:06:58,053
vegetarian version of things nearby, or 
the Kosher version of things nearby, or 

88
00:06:58,053 --> 00:07:03,026
the slightly out of date version of 
things nearby. And in 2-D you can't do 

89
00:07:03,026 --> 00:07:07,003
all that. 
But what we have here is a 30 dimensional 

90
00:07:07,003 --> 00:07:12,082
supermarket and that's a hugely more 
complex space where it's very easy to 

91
00:07:12,082 --> 00:07:17,093
have things near an item for many 
different reasons because of similarity 

92
00:07:17,093 --> 00:07:21,366
along many different dimensions. 
Here's another view of what we're doing 

93
00:07:21,366 --> 00:07:25,117
in semantic hashing.
Most of the fast retrieval methods work

94
00:07:25,117 --> 00:07:30,381
by intersecting stored lists that are 
associated with cues extracted from the 

95
00:07:30,381 --> 00:07:33,342
query. 
So, Google, for example, will have a list 

96
00:07:33,342 --> 00:07:37,290
of all the documents that contain some 
particular rare word. 

97
00:07:37,290 --> 00:07:42,554
And when you use that rare word in your 
query, they will immediately have access 

98
00:07:42,554 --> 00:07:45,923
to that list. 
They then have to intersect that list 

99
00:07:45,923 --> 00:07:51,016
with other lists in order to find a 
document that satisfies all the terms in 

100
00:07:51,016 --> 00:07:54,124
your query. 
Now, computers actually have special 

101
00:07:54,124 --> 00:07:59,350
hardware that can intersect 32 very long 
lists in a single machine instruction. 

102
00:07:59,350 --> 00:08:05,534
The hardware is called the memory bus. 
So, each bit in a 32-bit binary address 

103
00:08:05,534 --> 00:08:09,470
specifies a list of half the addresses in 
memory. 

104
00:08:09,470 --> 00:08:13,608
For example, if the bit is on and it's 
the first bit in the address, it 

105
00:08:13,608 --> 00:08:17,629
specifies the top half of memory. 
If the bit is off, it specifies the 

106
00:08:17,629 --> 00:08:21,234
bottom half of memory. 
What the memory bus is doing is 

107
00:08:21,234 --> 00:08:26,922
intersecting 32 lists to find the one 
location that satisfies all 32 values in 

108
00:08:26,922 --> 00:08:31,027
the binary code. 
So, we can think of semantic hashing as a 

109
00:08:31,027 --> 00:08:36,715
way of using machine learning to map the 
retrieval problem onto the type of list

110
00:08:36,715 --> 00:08:42,155
intersection computers are good at.
As long as our 32 bits correspond to

111
00:08:42,155 --> 00:08:45,530
meaningful properties of documents or 
images, 

112
00:08:45,530 --> 00:08:50,180
then we can find similar ones very fast 
with no search at all.