In this video, I'm going to talk about applying deep autoencoders to document retrieval. There was a method developed some time ago called latent semantic analysis, which amounts to applying principal components analysis to vectors of word counts extracted from documents. The codes produced by latent semantic analysis can then be used for judging similarity between documents, so they can be used for document retrieval. Obviously, if deep autoencoders work much better than PCA, we would expect to be able to extract much better codes using a deep autoencoder than using latent semantic analysis. And Ruslan Salakhutdinov and I showed that that was indeed the case. Using a big database of documents, we showed that ten components extracted with a deep autoencoder are actually worth more than 50 components extracted with a linear method like latent semantic analysis. We also showed that if you make the code very small, with just two components, you can use those two components to visualize each document as a point in a two-dimensional map. And this, again, works much better than just extracting the first two principal components.

To find documents that are similar to a query document, the first thing we do is convert each document into a big bag of words. In other words, we have a vector of word counts that ignores the order of the words. This clearly throws away quite a lot of information, but it also retains a lot of information about the topic of the document. We ignore words like "the" or "over", which are called stop words, because they don't carry much information about the topic. So if you look on the right, I've done the counts for various words, and they're actually the counts for the document on the left. If you look at which words we have nonzero counts for, they are vector, and count, and query, and reduce, and bag, and word, and that tells you quite a lot about what the document is about.

Now, we could compare the word counts of the query document with the word counts of millions of other documents, but that would involve comparing quite big vectors. In fact, we use vectors of size 2,000, so that would be slow. Alternatively, we could reduce each query vector to a much smaller vector that still contains most of the information about the content.

So, here's how we do the reduction. We take a deep autoencoder and compress the 2,000 word counts down to ten real numbers, from which we can reconstruct the 2,000 word counts, although we can't reconstruct them very well. We train the neural network to reproduce its input vector as its output vector as well as possible, and that forces it to put as much information about the input as possible into those ten numbers. We can then compare documents using just ten numbers, and that's going to be much faster.

There's one problem with this, which is that word counts aren't quite the same as pixels or real values. What we do is divide the counts in a bag of words by the total number of non-stop words, and that converts the vector of counts into a probability vector whose entries add up to one. You can think of it as the probability of getting a particular word if you picked a non-stop word at random from the document. At the output of the autoencoder, we use a great big 2,000-way softmax, and our target values are the word probabilities we get when we convert the count vector into a probability vector. There's one further trick we have to do. We treat the word counts as probabilities when we're reconstructing them.
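To make the preprocessing step concrete, here is a minimal Python sketch of turning a document into a bag-of-words count vector and then into the probability vector used as the softmax target. The tiny stop-word list and six-word vocabulary are purely illustrative stand-ins for the real stop-word list and the 2,000-word vocabulary, and there is no stemming or other preprocessing here.

```python
from collections import Counter

# Illustrative stand-ins; the real system uses a full stop-word list
# and a 2,000-word vocabulary.
STOP_WORDS = {"the", "a", "of", "to", "and", "over", "we", "each"}
VOCAB = ["vector", "count", "query", "reduce", "bag", "word"]

def bag_of_words(text):
    """Count each vocabulary word, ignoring word order and stop words."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    counts = Counter(t for t in tokens if t in VOCAB)
    return [counts[w] for w in VOCAB]

def to_probabilities(counts):
    """Divide by the total count N so the entries sum to one (the softmax targets)."""
    n = sum(counts)
    return [c / n for c in counts] if n > 0 else counts

doc = "we reduce each query vector to a small bag of word counts and compare each count vector"
counts = bag_of_words(doc)      # raw counts: the N observations
probs = to_probabilities(counts)  # counts / N: the probability vector
print(counts, probs)
```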
But when we're using them to activate the first hidden layer, we multiply all the weights by N, and that's because we have N different observations from that probability distribution. If we left them as probabilities, the input units would have very small activities and wouldn't provide much input to the first hidden layer. So we have this funny property that for the first restricted Boltzmann machine, the bottom-up weights are N times bigger than the top-down weights.

So, how well does this work? We trained using bags of 2,000 words on 4,000 business documents from the Reuters data set. These documents had been hand-labelled with about 100 different categories. We first trained a stack of restricted Boltzmann machines, and then we fine-tuned with backpropagation using a 2,000-way softmax as the output. Then we tested on a different set of 4,000 documents. To test, you pick one of the test documents to be the query, and then you rank-order all the other test documents using the cosine of the angle between the ten-dimensional vectors that the autoencoder's code gives you (there's a short sketch of this ranking procedure at the end of this section). You repeat this for each of the 4,000 possible test documents, and then you plot the number of documents you're going to retrieve, that is, how far down that ranked list you're going to go, against the proportion that are in the same hand-labelled class as the query document. This is not a very good measure of the quality of the retrieval, but we're going to use the same measure for latent semantic analysis, so at least it's a fair comparison.

So, here's the accuracy of the retrieval as a function of the number of retrieved documents. You can see that an autoencoder using a code of just ten real numbers is doing better than latent semantic analysis using 50 real numbers. And, of course, it's five times less work per document once you've got the code. Latent semantic analysis with ten real numbers is much worse.

We can also do the same thing where we reduce to two real numbers, and then, instead of doing retrieval, we just plot all the documents in a map, coloring the two-dimensional point that corresponds to the two numbers by the class of the document. So we took the major classes of documents, gave those major classes different colors, and then used PCA on the log of one plus the count. The point of taking the log is that it suppresses very big counts, which tends to make PCA work better. This is the distribution you get. As you can see, there is some separation of the classes: the green class is in one place, the red class is in a slightly different place, but the classes are very mixed up.

Then we did the same thing using a deep autoencoder to reduce the documents to two numbers and, again, plotted the documents in a two-dimensional space using those two numbers as the coordinates. And here's what we got. It's a much better layout, and it tells you much more about the structure of the data set. You can see the different classes, and you can see that they're quite well separated. We assume that the documents in the middle are ones which didn't have many words in them, and therefore it was hard to distinguish between the classes. A visual display like this could be very useful. If, for example, you saw that one of the green dots was the accounts and earnings reports from Enron, you probably wouldn't want to buy shares in a company that has a green dot nearby.
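Here is a minimal NumPy sketch of the retrieval test described above, assuming `codes` holds ten-dimensional autoencoder codes for the test documents and `labels` their hand-labelled classes. The random data at the bottom is only for illustration and stands in for codes produced by a trained autoencoder.

```python
import numpy as np

def cosine_rank(codes, query_index):
    """Rank all other documents by cosine similarity to the query's code, best first."""
    unit = codes / np.linalg.norm(codes, axis=1, keepdims=True)  # unit-length codes
    sims = unit @ unit[query_index]       # cosine of the angle to the query
    order = np.argsort(-sims)             # descending similarity
    return order[order != query_index]    # drop the query document itself

def precision_at(codes, labels, query_index, n_retrieved):
    """Proportion of the top-n retrieved documents in the same class as the query."""
    retrieved = cosine_rank(codes, query_index)[:n_retrieved]
    return np.mean(labels[retrieved] == labels[query_index])

# Toy usage: 4,000 documents, ten-dimensional codes, 100 classes (all random here).
rng = np.random.default_rng(0)
codes = rng.normal(size=(4000, 10))
labels = rng.integers(0, 100, size=4000)
print(precision_at(codes, labels, query_index=0, n_retrieved=50))
```

Averaging this precision over every test document as the query, for a range of retrieval sizes, gives the curve described above.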