In this video, we're going to see what happens when the Hessian-free optimizer is used to optimize a recurrent neural network containing multiplicative connections, where the network is trained to predict the next character in Wikipedia. The network is trained on millions of characters and it works remarkably well. It learns a lot about English and becomes very good at completing sentences in interesting ways.

Ilya Sutskever used five million strings of 100 characters each, taken from English Wikipedia. For each string, he starts predicting after eleven characters. So the recurrent network starts off in a default state, reads eleven characters, changing its hidden state each time, and then it's ready to start predicting. It gets trained by back-propagating the errors it makes in prediction. He used the Hessian-free optimizer, and it took about a month on a very fast GPU board to get a really good model.

His current best recurrent neural network for character prediction is probably the best single model there is for predicting characters. You can do better than this model by combining many different models, using a neural network to decide which one to use, but for a single model it is as good as it gets.

It works in a very different way from the best other models. Ilya's model can balance quotes and brackets over long distances. Any model that relies on matching a specific previous context can't do that. So, for example, if it has a bracket and it wants to close it 35 characters later, then to do that properly a model that relies on matching previous contexts would have to match all 35 intervening characters, and it's very unlikely that it has that whole string stored.

Once the model's learned, you can see what it knows by generating strings from the model. Of course, you have to be very careful not to over-interpret what it says. The way we generate strings is we start in the default hidden state. We then give it a burn-in sequence: we feed it characters and let it update its hidden state after each character. Then we let it start predicting. We look at the probability distribution it predicts for the next character, and we pick a character randomly from that distribution. So if it predicts that the probability of a Q is one in a thousand, we pick a Q one time in a thousand. We then tell the net that whatever character we picked was the character that actually occurred, and ask it to predict the next character. In other words, we're telling it that whatever it guesses is correct. We let it continue generating characters until we've got as many as we want, and then we look at the strings it produces to see what it knows.
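To make that sampling loop concrete, here is a minimal sketch in Python. The model interface (`rnn.default_state()` and `rnn.step()`, returning a probability distribution over the character vocabulary) is a hypothetical stand-in for whatever trained character-level RNN you have, not the actual code used in the lecture.

```python
import numpy as np

def sample_from_model(rnn, burn_in, vocab, n_chars, rng=None):
    """Generate n_chars characters from a trained character-level RNN.

    Assumes a hypothetical interface:
      rnn.default_state()          -> initial hidden state
      rnn.step(char_index, state)  -> (probs over vocab, new hidden state)
    `vocab` is the list of characters, `burn_in` the priming string.
    """
    rng = rng or np.random.default_rng()
    char_to_index = {c: i for i, c in enumerate(vocab)}

    # Start in the default hidden state and feed the burn-in characters,
    # updating the hidden state after each one.
    state = rnn.default_state()
    probs = None
    for c in burn_in:
        probs, state = rnn.step(char_to_index[c], state)

    generated = []
    for _ in range(n_chars):
        # Pick the next character at random from the predicted distribution.
        idx = rng.choice(len(vocab), p=probs)
        generated.append(vocab[idx])
        # Tell the net that this character actually occurred,
        # and get its prediction for the one after it.
        probs, state = rnn.step(idx, state)
    return "".join(generated)
```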
So here's an example of a string produced by Ilya's network after some burn-in. This was selected from a much longer passage of text, but it's a continuous passage, which shows you that it works pretty well.

You'll notice that it has weird semantic associations. So, "Opus Paul at Rome": no person would ever say that, but we understand that Opus and Paul and Rome are all highly interconnected. You'll also notice it doesn't really have any long-range thematic structure; it pretty much changes topic after each full stop.

One amazing thing is that it produces very few non-words. What that means is that, even though it's predicting probabilities of characters, as soon as you've got enough characters that there's only one way to complete them as an English word, it will predict the next character almost perfectly. If that wasn't the case, it would produce non-words. Even when it does produce a non-word, like the word in red, it's a very good non-word. I'm not absolutely certain, or I wasn't when I first saw it, that "ephemerable" wasn't an English word.

You'll also notice it's produced a closing bracket without an opening one, so it doesn't always balance brackets; it just does it quite frequently. And notice, at the end, that it's produced an opening quote and then a closing quote much later. That's consistent behavior on its part: it really did produce that closing quote because it had an open quote earlier on. If you look at this text, you can see there's a lot of good local syntax, so little strings of three or four words look perfectly reasonable. There's also lots of semantic knowledge.

So one thing we can do is test the model by giving it carefully designed strings to see what it knows. I tried giving it a non-word. The word "thrunge", T H R U N G E, is not an English word, but most English speakers, when they see that word, would expect it to be a verb because of its form. So I gave it opening contexts in which that word might be used as a verb, to see what character was most likely to come next.
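A sketch of that kind of probe, assuming the same hypothetical `rnn.default_state()` / `rnn.step()` interface as in the earlier sampling sketch: feed a context string through the net and look at the distribution it predicts for the very next character.

```python
def next_char_distribution(rnn, context, vocab):
    """Rank the characters the model expects to follow `context`, most likely first."""
    char_to_index = {c: i for i, c in enumerate(vocab)}
    state = rnn.default_state()
    probs = None
    # Feed the context through the net, updating the hidden state each time.
    for c in context:
        probs, state = rnn.step(char_to_index[c], state)
    # Sort the vocabulary by the probability predicted for the next character.
    return sorted(zip(vocab, probs), key=lambda pair: -pair[1])

# For example, compare the top few predictions for:
#   next_char_distribution(rnn, "Sheila thrunge", vocab)[:5]
#   next_char_distribution(rnn, "People thrunge", vocab)[:5]
```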
So, if you give it "Sheila thrunge" and ask for the next character, the most frequent one is an s, which suggests that it knows Sheila is singular, just from reading Wikipedia. If you give it "People thrunge", the most frequent next character is a space, not an s, which suggests that it knows "people" is plural.

I then tried giving it a list of names. So I used capitals for the names, with a comma in between, and I put a capital T for Thrunge so it looked like a name, to see what it would do with that. It actually completed it as a name, and if you look at the name it made, it's not a bad name. It indicates that it knows an awful lot about names in many languages.

You can also give it "The meaning of life is" and then see what comes next. If it produced "42", that wouldn't be very interesting, because I'm fairly sure that will be somewhere in Wikipedia. It produces random things, but in its first ten tries it produced "The meaning of life is literally recognition", which is syntactically and semantically sensible. We then trained the model some more and presented it with "The meaning of life is" again. We took the first ten things it produced, and I'm going to show you the most interesting one. The completion it produced suggests that by reading Wikipedia it really is beginning to understand something about the meaning of life. That's probably just wild over-interpretation, though. So here's its completion.

So, what does the model know after it's read all these characters in Wikipedia? Well, it certainly knows about words. It almost always produces English words. It will produce strings of initials, typically in capitals, and it can produce numbers and dates and things like that. But it doesn't produce non-words very often; it produces them extremely rarely, and when it does produce them, they're typically very plausible non-words. It also knows a lot about proper names, like Frangelini Del Rey. It knows about dates and numbers, and the contexts in which they occur.

It's good at balancing quotes and brackets, and in fact it can actually count brackets. If you give it no opening brackets, it's very unlikely to produce a closing bracket. If you give it one opening bracket, it's quite likely to produce a closing bracket in the next twenty characters or so. If you give it two opening brackets, it'll produce a closing bracket very quickly. Giving it three doesn't seem to make it any faster.
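A rough way you might test that bracket-counting claim, reusing `sample_from_model` and the hypothetical model interface from the earlier sketch: generate from prefixes containing different numbers of unmatched opening brackets and measure how many characters it takes before a closing bracket appears.

```python
def mean_chars_until_closing(rnn, vocab, prefix, n_trials=50, max_chars=200):
    """Average number of generated characters before a ')' appears after `prefix`."""
    rng = np.random.default_rng(0)
    counts = []
    for _ in range(n_trials):
        text = sample_from_model(rnn, prefix, vocab, max_chars, rng=rng)
        pos = text.find(")")
        counts.append(pos if pos >= 0 else max_chars)
    return sum(counts) / len(counts)

# Compare prefixes with zero, one, and two unmatched opening brackets:
# for prefix in ["He said that ", "He said (that ", "He said ((that "]:
#     print(prefix, mean_chars_until_closing(rnn, vocab, prefix))
```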
It clearly knows a lot about syntax, because it's able to produce these little strings of English words that are sensible, but it's very hard to pin down exactly what form this knowledge has. It's not like trigram models, which have just learned little sequences of words, or rather have a table that contains little sequences of words. It's actually synthesizing strings of words, and it's synthesizing them with sensible syntax. It's very hard to say, though, what form that syntactic knowledge takes. It's not a bunch of rules like a linguist has; it's much more like what's in the linguist's head when he speaks a language.

It knows a lot of weak semantic associations. So, for example, it only ever produced the word Wittgenstein once, and it produced that soon after producing the word Plato. So it knows that Plato and Wittgenstein are associated. Well, that's a pretty good assumption. It clearly knows that cabbage is associated with vegetable. It doesn't know much about the precise ways in which these things are associated. People are like that too, if you get them to respond very fast. So I'm going to ask you a question, and I want you to shout out the answer. You're sitting there watching this video, and this experiment will work by far the best if you actually shout the answer out loud, and you have to shout it out really fast. You get rewarded for responding very quickly. It doesn't matter what you say; you just have to respond fast. So the question is: what do cows drink? Most people, when given that question, shout out "milk". Now, most cows don't drink milk most of the time. We say milk because it's associated with both drink and with cow, but it's not logical to say milk.

Recently, Tomas Mikolov and his collaborators have been training large recurrent neural networks to predict the next word in large data sets. They use the same technique as the feed-forward neural nets: they first convert a word to a real-valued feature vector, and then use those feature vectors as input to the rest of the network. They do better than the feed-forward neural nets, and they also do better than the best other models. And when you average them with the best other models, they do better still.
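Just to illustrate the idea of converting each word to a real-valued feature vector before the recurrent part of the network, here is a minimal sketch. The sizes and the plain tanh recurrence are illustrative assumptions, not Mikolov's actual architecture.

```python
import numpy as np

vocab_size, embed_dim, hidden_dim = 10000, 100, 200   # illustrative sizes only
rng = np.random.default_rng(0)

# Each word index is mapped to a learned real-valued feature vector...
embedding = rng.normal(0.0, 0.01, (vocab_size, embed_dim))
# ...and those feature vectors are the input to the recurrent part of the net.
W_xh = rng.normal(0.0, 0.01, (embed_dim, hidden_dim))
W_hh = rng.normal(0.0, 0.01, (hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def word_rnn_step(word_index, h):
    x = embedding[word_index]                    # word -> feature vector
    return np.tanh(x @ W_xh + h @ W_hh + b_h)    # update the hidden state

h = np.zeros(hidden_dim)
for w in [12, 7, 401]:                           # a toy sequence of word indices
    h = word_rnn_step(w, h)
```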
So those are the best language models there are currently. One interesting property of the RNNs is that they require less training data than the other methods to reach a given level of performance. More importantly, as the data sets get bigger, the RNNs improve faster than the other methods. Methods like trigrams, for example, do get better with bigger data sets, but it's a very slow process: you need to double the size of the data set to get a small improvement. RNNs can make much better use of the data, which means it's going to be very hard to beat them as data sets get bigger. I think it may be the same story as for object recognition with large, deep neural nets: once the neural nets get ahead, they can make better use of faster computers and bigger data sets, and so it's going to be very hard for other methods to catch up.