In this video, I'm going to talk about applying deep autoencoders to document retrieval. There was a method developed some time ago called latent semantic analysis, which amounts to applying principal components analysis to vectors of word counts extracted from documents. The codes produced by latent semantic analysis can then be used for judging similarity between documents, so they can be used for document retrieval. Obviously, if deep autoencoders work much better than PCA, we would expect to be able to extract much better codes using a deep autoencoder than using latent semantic analysis. And Ruslan Salakhutdinov and I showed that that was indeed the case. Using a big database of documents, we showed that ten components extracted with a deep autoencoder are actually worth more than 50 components extracted with a linear method like latent semantic analysis. We also showed that if you make the code very small, with just two components, you can use those two components for visualizing documents as points in a two-dimensional map. And this, again, works much better than just extracting the first two principal components.

To find documents that are similar to a query document, the first thing we do is convert each document into a big bag of words. In other words, we have a vector of word counts that ignores the order of the words. This clearly throws away quite a lot of information, but it also retains a lot of information about the topic of the document. We ignore words like "the" or "over", which are called stop words, because they don't carry much information about the topic. So if you look on the right, I've done the counts for various words, and they're actually the counts for the document on the left. If you look at which words have nonzero counts, they are "vector", "count", "query", "reduce", "bag", and "word", and that tells you quite a lot about what the document is about.

Now, we could compare the word counts of the query document with the word counts of millions of other documents, but that would involve comparing quite big vectors. In fact, we use vectors of size 2,000, so that would be slow. Alternatively, we could reduce each query vector to a much smaller vector that still contains most of the information about the content. So, here's how we do the reduction.
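To make the bag-of-words step concrete, here is a minimal sketch in Python. The tiny vocabulary, the stop-word list, and the tokenizer are assumptions for illustration only; the lecture just specifies that a fixed set of 2,000 non-stop words is counted and word order is ignored.

```python
from collections import Counter

# Hypothetical, tiny stand-ins for the 2,000-word vocabulary and the stop-word list.
VOCABULARY = ["vector", "count", "query", "reduce", "bag", "word"]
STOP_WORDS = {"the", "a", "an", "of", "to", "over", "and", "in", "is"}

def bag_of_words(document: str, vocabulary=VOCABULARY) -> list[int]:
    """Return a vector of word counts over a fixed vocabulary, ignoring word order."""
    tokens = [t.strip(".,;:!?\"'").lower() for t in document.split()]
    counts = Counter(t for t in tokens if t and t not in STOP_WORDS)
    return [counts[w] for w in vocabulary]

# Example: the counts for a short query document.
doc = "We reduce each query vector to a small bag of words and count each word."
print(dict(zip(VOCABULARY, bag_of_words(doc))))
```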
We take the deep autoencoder and compress the 2,000 word counts down to ten real numbers, from which we can reconstruct the 2,000 word counts, although we can't reconstruct them very well. We train the neural network to reproduce its input vector as its output vector as well as possible, and that forces it to put as much information about the input into those ten numbers as possible. We can then compare documents using just ten numbers, and that's going to be much faster.

There's one problem with this, which is that word counts aren't quite the same as pixels or real values. What we do is divide the counts in a bag of words by the total number of non-stop words, and that converts the vector of counts into a probability vector where the numbers add up to one. You can think of it as the probability of getting a particular word if you picked a word at random from the document, as long as that word is not a stop word. So, at the output of the autoencoder, we use a great big 2,000-way softmax, and our target values are the word probabilities we get when we reduce the count vector to a probability vector.

There's one further trick we have to do. We treat the word counts as probabilities when we're reconstructing them, but when we're using them to activate the first hidden layer, we multiply all the weights by N. That's because we have N different observations from that probability distribution. If we left them as probabilities, the input units would have very small activities and wouldn't provide much input to the first hidden layer. So we have this funny property that, for the first restricted Boltzmann machine, the bottom-up weights are N times bigger than the top-down weights.

So, how well does this work? We trained using bags of 2,000 words on 4,000 business documents from the Reuters data set, and these documents had been hand-labelled with about 100 different categories. We first trained a stack of restricted Boltzmann machines, and then we fine-tuned with backpropagation, using a 2,000-way softmax as the output. Then we tested it on a different set of 4,000 documents. To test, you pick one document to be the query, one of the test documents, and then you rank-order all the other test documents using the cosine of the angle between the ten-dimensional vectors that the autoencoder code gives you.
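As a rough numerical sketch of the probability conversion and the multiply-by-N trick just described (the layer sizes, random weights, fake counts, and logistic hidden units are assumptions for illustration, not the exact network from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, n_hidden = 2000, 500                        # assumed sizes for illustration
W = rng.normal(0.0, 0.01, (vocab_size, n_hidden))       # bottom-up weights of the first RBM
b = np.zeros(n_hidden)                                  # hidden biases

counts = rng.poisson(0.05, vocab_size).astype(float)    # a fake bag-of-words count vector
N = counts.sum()                                        # total number of non-stop words

# Targets for the 2,000-way softmax output: the count vector turned into probabilities.
target_probs = counts / N

# Input to the first hidden layer: treat the probabilities as if they were observed N
# times, i.e. scale the bottom-up input by N (equivalently, make the bottom-up weights
# N times bigger than the top-down weights).
hidden_input = N * (target_probs @ W) + b
hidden_probs = 1.0 / (1.0 + np.exp(-hidden_input))      # logistic hidden units
```

Note that `N * (target_probs @ W)` is just `counts @ W`, which is why the trick can be stated either as scaling the input or as making the bottom-up weights N times bigger than the top-down ones.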
You repeat this for each of the 4,000 possible test documents, and then you plot the number of documents you're going to retrieve, that is, how far down that ranked list you're going to go, against the proportion that are in the same hand-labelled class as the query document. This is not a very good measure of the quality of the retrieval, but we're going to use the same measure for comparing with LSA, so at least it's a fair comparison.

So, here's the accuracy of the retrieval as a function of the number of retrieved documents. You can see that an autoencoder using a code with just ten real numbers is doing better than latent semantic analysis using 50 real numbers. And, of course, it's five times less work per document once you've got the code. Latent semantic analysis with ten real numbers is much worse.

We can also do the same thing where we reduce to two real numbers. Then, instead of doing retrieval, we're just going to plot all the documents in a map, coloring the two-dimensional point that corresponds to the two numbers produced by PCA by the class of the document. So we took the major classes of the documents, gave those major classes different colors, and then used PCA on the log of one plus the count. The point of doing that is that it suppresses very big counts, which tends to make PCA work better. This is the distribution you get. As you can see, there is some separation of the classes: the green class is in one place, the red class is in a slightly different place, but the classes are very mixed up.

Then we did the same thing using a deep autoencoder to reduce the documents to two numbers and, again, plotting the documents in a two-dimensional space using those two numbers as the coordinates. And here's what we got. It's a much better layout. It tells you much more about the structure of the data set. You can see the different classes, and you can see that they're quite well separated. We assume that the documents in the middle are ones which didn't have many words in them, and therefore it was hard to distinguish between the classes. A visual display like this could be very useful. If, for example, you saw that one of the green dots was the accounts and earnings reports from Enron, you probably wouldn't want to buy shares in a company with a nearby green dot.
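Here is a minimal sketch of the retrieval-and-scoring loop described at the start of this comparison, assuming you already have a ten-dimensional autoencoder code and a hand-labelled class for each test document; the random codes and labels below are placeholders, not real data.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two code vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieval_accuracy(codes: np.ndarray, labels: np.ndarray, n_retrieved: int) -> float:
    """For each query document, rank all the other documents by cosine similarity of
    their codes and return the average fraction of the top n_retrieved documents
    that share the query's hand-labelled class."""
    fractions = []
    for q in range(len(codes)):
        sims = np.array([cosine_similarity(codes[q], codes[d])
                         for d in range(len(codes)) if d != q])
        other_labels = np.delete(labels, q)               # labels aligned with sims
        top = other_labels[np.argsort(-sims)[:n_retrieved]]
        fractions.append(np.mean(top == labels[q]))
    return float(np.mean(fractions))

# Hypothetical usage with fake 10-D autoencoder codes and class labels.
rng = np.random.default_rng(0)
codes = rng.normal(size=(4000, 10))
labels = rng.integers(0, 100, size=4000)
print(retrieval_accuracy(codes[:200], labels[:200], n_retrieved=10))  # small subset: O(n^2) loop
```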