In this video, I'm going to describe a technique called semantic hashing that provides an extremely efficient way of finding documents similar to a query document. The idea is to convert each document into a memory address, and to organize the memory so that if you go to a particular address and look at the nearby addresses, you'll find documents that are very similar. This is much like a supermarket, where if you go to the location where a particular product is stored and look around, you'll find similar products.

People have known for a long time that if you could get binary descriptors of images, you'd have a very good way of retrieving images quickly. Some binary descriptors are easy to get. For example, is it an indoor scene or an outdoor scene? Is it a color image or a black-and-white image? But it's much harder to get a list of, say, 30 binary descriptors that are more or less orthogonal to one another, and that's what we really need. This is a problem that machine learning can help us with. We're going to start by looking at the equivalent problem for documents, and then we'll apply it to images.

So consider, instead of getting real-valued codes for documents, getting binary codes from the word counts of documents. We do this by training a deep auto-encoder that has logistic units in its code layer. That by itself is not sufficient, because the logistic units will be used in their middle range, where they take real values, in order to convey as much information as possible about the 2,000 word counts. To prevent that, we add noise to the inputs to the code units during the fine-tuning stage. So we first train the network as a stack of restricted Boltzmann machines. We unroll these Boltzmann machines, using the transposes of the weight matrices for the decoder, and then we fine-tune with backpropagation. As we do that, we add additional Gaussian noise to the inputs to the code units. To be resistant to that noise, the code units need to be either firmly on or firmly off, so the noise encourages the learning to avoid the middle region of the logistic, where it conveys a lot of information but is very sensitive to noise in its inputs. At test time, we simply threshold the logistic units in the middle layer to get binary values. So if we can train an auto-encoder like this, we can convert the counts for a bag of words into a small number of binary values. In other words, we'll have learned a set of binary features that are good for reconstructing the bag of words.

Later on, Alex Krizhevsky discovered that we don't actually have to add Gaussian noise to the inputs to the 30 code units. Instead, we can just make them stochastic binary units. During the forward pass, we stochastically pick a binary value using the output of the logistic. Then, during the backward pass, we pretend we transmitted the real-valued probability from the logistic, and that gives us a smooth gradient for backpropagation.

Once we've got these short binary codes, we could of course do a sequential search: for each known document, we store its code. Then, when a query document arrives, we first extract its code, if it's not one of our known documents, and we compare that code with the codes of all the stored documents. The comparisons can be very fast, because they can use special bit operations on a typical CPU that compare many bits in parallel. But we still have to go through a very long list of documents, possibly billions.
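As an aside, here is a minimal PyTorch sketch of the two code-layer tricks just described: adding Gaussian noise to the inputs of the code units during fine-tuning, and Krizhevsky's stochastic binary alternative. The class name, layer sizes, and noise level are illustrative assumptions, not details from the lecture.

```python
import torch
import torch.nn as nn

class BinaryCodeLayer(nn.Module):
    """Code layer of a deep auto-encoder that is pushed toward binary codes.

    mode="noise":      add Gaussian noise to the inputs of the logistic units
                       during fine-tuning, so they learn to sit firmly on or off.
    mode="stochastic": Krizhevsky's trick -- sample a binary value on the
                       forward pass, backpropagate through the real-valued
                       probability (a straight-through estimator).
    """

    def __init__(self, in_features: int, code_bits: int = 30,
                 mode: str = "stochastic", noise_std: float = 4.0):
        super().__init__()
        self.linear = nn.Linear(in_features, code_bits)
        self.mode = mode
        self.noise_std = noise_std  # illustrative value, not from the lecture

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.linear(x)
        if not self.training:
            # At test time, simply threshold the logistics to get binary values.
            return (torch.sigmoid(logits) > 0.5).float()
        if self.mode == "noise":
            # Noise on the inputs to the code units makes the sensitive
            # middle range of the logistic unattractive.
            logits = logits + self.noise_std * torch.randn_like(logits)
        p = torch.sigmoid(logits)
        if self.mode == "stochastic":
            b = torch.bernoulli(p)  # stochastic binary value on the forward pass
            # Straight-through: the forward value is b, but the gradient flows
            # through p, as if the real-valued probability had been transmitted.
            return p + (b - p).detach()
        return p
```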
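And here is a sketch of that brute-force sequential search, comparing codes with XOR and a population count so that many bits are compared in one operation. The function names and the integer representation of codes are assumptions for illustration.

```python
def hamming(a: int, b: int) -> int:
    # XOR leaves a 1 wherever the two codes disagree; count those bits.
    return (a ^ b).bit_count()  # int.bit_count() needs Python >= 3.10

def sequential_search(query_code: int, stored_codes: list[int],
                      k: int = 10) -> list[int]:
    """Return the indices of the k stored codes closest to the query code."""
    by_distance = sorted(range(len(stored_codes)),
                         key=lambda i: hamming(query_code, stored_codes[i]))
    return by_distance[:k]
```

Even with fast bitwise comparisons, this still touches every stored code, which is exactly the cost the next step removes.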
There's a much faster thing we can do: we can treat the code as if it were a memory address. The idea is that we take a document and use our deep auto-encoder as a hash function that converts the document into a 30-bit address. Now we have a memory with 30-bit addresses, and at each address we keep a pointer back to the documents that have that address. If several documents have the same address, we make a little list there. If the auto-encoder is successful in making similar documents have similar addresses, we have a very fast way of finding similar documents. We simply take the query document, go to the address in memory that corresponds to its binary code, and look at nearby addresses. In other words, we start flipping bits in that address to access nearby addresses. You can imagine a little Hamming ball of nearby addresses that differ from it by just a few bits. What we expect to find at those nearby addresses is semantically similar documents. So we've completely avoided searching a big list: we simply compute a memory address, flip a few bits, and look up the similar documents. This is extremely efficient, especially if we have a very large database of, say, a billion documents, because we've completely avoided the serial search through a billion items.

I sometimes call this supermarket search, because it's like what you would do in a supermarket. Suppose you went to an unfamiliar supermarket and you wanted to find anchovies. You might ask the teller where they keep the cans of tuna fish. You'd then go to that location in the supermarket and look around. Hopefully, near there you'd find things like cans of salmon, and maybe cans of anchovies. Of course, if you're unlucky, the anchovies might be stored in a completely different place, next to the pizza toppings, and that's the downside of this kind of search. Now, a supermarket is essentially a 2-D surface. Really it's a 1-D string of shelves, which have height, and that gives you 2-D, so you only have two dimensions in which to locate things. That's not sufficient to put all the things you'd like to be near one another, near one another. You'd like, for example, to have the vegetarian version of things nearby, or the kosher version of things nearby, or the slightly out-of-date version of things nearby, and in 2-D you can't do all that. But what we have here is a 30-dimensional supermarket, and that's a hugely more complex space, where it's very easy to have things near an item for many different reasons, because of similarity along many different dimensions.
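Concretely, here is a minimal sketch of this address-based lookup, assuming a doc_to_code function that wraps the trained auto-encoder; the table layout and function names are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

CODE_BITS = 30

def build_table(docs, doc_to_code):
    """Map each 30-bit address to the list of documents stored there."""
    table = defaultdict(list)
    for doc in docs:
        table[doc_to_code(doc)].append(doc)
    return table

def hamming_ball(code: int, radius: int):
    """Yield every address within `radius` bit flips of `code`."""
    yield code
    for r in range(1, radius + 1):
        for bits in combinations(range(CODE_BITS), r):
            flipped = code
            for b in bits:
                flipped ^= 1 << b  # flip one bit of the address
            yield flipped

def lookup(query_doc, table, doc_to_code, radius: int = 2):
    """No serial search: compute the query's address, probe nearby addresses."""
    code = doc_to_code(query_doc)
    hits = []
    for addr in hamming_ball(code, radius):
        hits.extend(table.get(addr, []))
    return hits
```

With a radius of two over 30 bits, that is only 1 + 30 + 435 = 466 memory probes, whether the table holds a thousand documents or a billion.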
Here's another view of what we're doing in semantic hashing. Most fast retrieval methods work by intersecting stored lists that are associated with cues extracted from the query. Google, for example, will have a list of all the documents that contain some particular rare word, and when you use that rare word in your query, they immediately have access to that list. They then have to intersect that list with other lists to find the documents that satisfy all the terms in your query. Now, computers actually have special hardware that can intersect 32 very long lists in a single machine instruction. That hardware is called the memory bus. Each bit in a 32-bit binary address specifies a list containing half the addresses in memory. For example, if the first bit of the address is on, it specifies the top half of memory; if it's off, it specifies the bottom half. What the memory bus is doing is intersecting 32 such lists to find the one location that satisfies all 32 values in the binary code. So we can think of semantic hashing as a way of using machine learning to map the retrieval problem onto the kind of list intersection that computers are very good at. As long as our 32 bits correspond to meaningful properties of documents or images, we can find similar ones very fast, with no search at all.
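Here is a toy illustration of that equivalence: a direct address lookup gives the same answer as intersecting, for each bit, the list of addresses that share that bit's value. This is purely illustrative (real hardware does it in a single memory access), and it uses 8 bits rather than 32 to keep the sets small.

```python
BITS = 8  # small for the demo; the lecture's example uses 32

def addresses_matching_bit(position: int, value: int) -> set[int]:
    """The 'stored list' for one bit: the half of all addresses whose bit
    at `position` equals `value`."""
    return {a for a in range(2 ** BITS) if (a >> position) & 1 == value}

code = 0b10110010
lists = [addresses_matching_bit(i, (code >> i) & 1) for i in range(BITS)]

# Intersecting all the per-bit lists leaves exactly one address: the code.
assert set.intersection(*lists) == {code}
```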