In this video, we're going to see what happens when the Hessian-free optimizer is used to optimize a recurrent neural network containing multiplicative connections, and the network is trained to predict the next character in Wikipedia text. The network is trained on millions of characters, and it works remarkably well. It learns a lot about English and becomes very good at completing sentences in interesting ways.

Ilya Sutskever used five million strings of 100 characters each, taken from English Wikipedia. For each string, prediction starts after the eleventh character. So the recurrent network starts off in a default state, reads eleven characters, changing its hidden state each time, and then it's ready to start predicting. It gets trained by backpropagating the errors it makes in prediction. He used the Hessian-free optimizer, and it took about a month on a very fast GPU board to get a really good model. His current best recurrent neural network for character prediction is probably the best single model there is for predicting characters. You can do better than this model by combining many different models and using a neural network to decide which one to use, but for a single model, it is as good as it gets.

It works in a very different way from the best other models. Ilya's model can balance quotes and brackets over long distances; any model that relies on matching a specific previous context can't do that. So, for example, if it has an opening bracket and it wants to close it 35 characters later, then to do that properly a model that relies on matching previous contexts would have to match all 35 intervening characters, and it's very unlikely to have that whole string stored.

Once the model's learned, you can see what it knows by generating strings from the model. Of course, you have to be very careful not to over-interpret what it says. The way we generate strings is we start in the default hidden state. We then give it a burn-in sequence: we feed it characters and let it update its hidden state after each character. And then we let it start predicting. We look at the probability distribution it predicts for the next character, and we pick a character randomly from that distribution. So if it predicts that the probability of a Q is one in a thousand, we pick a Q one time in a thousand. We then tell the net that whatever character we picked was the character that actually occurred, and ask it to predict the next character. In other words, we're telling it that whatever it guesses is correct. We let it continue generating characters until we've got as many as we want, and then we look at the strings it produces to see what it knows (there's a short code sketch of this sampling loop below).

So here's an example of a string produced by Ilya's network after some burn-in. This was selected from a much longer passage of text, but it's a continuous passage, which shows you that it works pretty well. You'll notice that it has weird semantic associations. So, "Opus Paul at Rome": no person would ever say that, but we understand that Opus and Paul and Rome are all highly interconnected. You'll notice it doesn't really have any long-range thematic structure; it pretty much changes topic after each full stop. One amazing thing is that it produces very few non-words. What that means is that, even though it's predicting probabilities of characters, as soon as you've got enough characters that there's only one way to complete them as an English word, it will predict the next character almost perfectly. If that wasn't the case, it would produce non-words.
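That sampling loop is simple to write down. Here's a minimal sketch, assuming a hypothetical trained model object `rnn` with a `default_state()` method and a `step(state, char_index)` method that returns the updated hidden state and a probability distribution over the next character; none of these names come from the lecture.

```python
import numpy as np

def generate(rnn, burn_in, n_chars, vocab):
    """Sample text from a trained character-level RNN.

    Hypothetical interface (not from the lecture):
      rnn.default_state()          -> initial hidden state
      rnn.step(state, char_index)  -> (new_state, probs over the next character)
    `burn_in` is a non-empty warm-up string; `vocab` is the list of characters.
    """
    char_to_ix = {c: i for i, c in enumerate(vocab)}

    # Start in the default hidden state and feed the burn-in sequence,
    # letting the net update its hidden state after each character.
    state = rnn.default_state()
    for c in burn_in:
        state, probs = rnn.step(state, char_to_ix[c])

    out = []
    for _ in range(n_chars):
        # Pick the next character at random from the predicted distribution,
        # e.g. a 'q' given probability 1/1000 is chosen one time in a thousand.
        ix = np.random.choice(len(vocab), p=probs)
        out.append(vocab[ix])
        # Tell the net the sampled character is what "actually occurred"
        # and ask it to predict the character after that.
        state, probs = rnn.step(state, ix)
    return "".join(out)
```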
Even when it does produce a non-word, like the word in red, it's a very good non-word. I'm not absolutely certain, or I wasn't when I first saw it, that "ephemerable" wasn't an English word. You'll also notice it's produced a closing bracket without an opening one, so it doesn't always balance brackets; it just does it quite frequently. You'll also notice at the end that it's produced an opening quote and then a closing quote much later. That's consistent behavior on its part: it really did produce that closing quote because it had an opening quote earlier on. If you look at this text, you can see there's a lot of good local syntax, so little strings of three or four words look perfectly reasonable. There's also lots of semantic knowledge.

One thing we can do is test the model by giving it carefully designed strings to see what it knows. So I tried giving it a non-word. The word "thrunge", T-H-R-U-N-G-E, is not an English word, but most English speakers, when they see it, would expect it to be a verb because of its form. So I gave it opening contexts in which it might plausibly be a verb, to see what character was most likely to come next. If you give it "Sheila thrunge" and ask for the next character, the most frequent one is an s, which suggests that it knows that Sheila is singular, just from reading Wikipedia. If you give it "People thrunge", the most frequent next character is a space, not an s, which suggests that it knows that people is plural.

I then tried giving it a list of names. I used capitals for the names, with a comma in between, and I put a capital T on Thrunge so it looked like a name, to see what it would do with that. It actually completed it as a name, and if you look at the name it made, it's not a bad name. It indicates that it knows an awful lot about names in many languages.

You can also give it "the meaning of life is" and then see what comes next. If it produced 42, that wouldn't be very interesting, because I'm fairly sure that will be somewhere in Wikipedia. It produces random things, but in its first ten tries, it produced "the meaning of life is literary recognition", which is syntactically and semantically sensible. We then trained the model some more and presented it with "the meaning of life is" again. We took the first ten things it produced, and I'm going to show you the most interesting one. The completion it produced for "the meaning of life is" suggests that, by reading Wikipedia, it really is beginning to understand something about the meaning of life. That's probably just wild over-interpretation, though. So here's its completion.

So, what does the model know after it's read all these characters of Wikipedia? Well, it certainly knows about words: it almost always produces English words. It will produce strings of initials, typically in capitals, and it can produce numbers and dates and things like that. But it produces non-words extremely rarely, and when it does produce them, they're typically very plausible non-words. It also knows a lot about proper names, like Frangelini Del Rey. It knows about dates and numbers and the contexts in which they occur. It's good at balancing quotes and brackets, and in fact it can actually count brackets. If you give it no opening brackets, it's very unlikely to produce a closing bracket. If you give it one opening bracket, it's quite likely to produce a closing bracket in the next twenty characters or so. If you give it two opening brackets, it'll produce a closing bracket very quickly. Giving it three doesn't seem to make it any faster.
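These probes ("Sheila thrunge", "People thrunge", strings with different numbers of opening brackets) all amount to the same operation: feed a carefully designed prefix and inspect the distribution the model predicts for the character that follows it. A minimal sketch, reusing the hypothetical `rnn` interface from the sampling snippet above:

```python
def next_char_distribution(rnn, context, vocab):
    """Feed a hand-crafted probe string and return the model's predicted
    distribution over the character that follows it (same hypothetical
    `rnn` interface as in the sampling sketch above)."""
    char_to_ix = {c: i for i, c in enumerate(vocab)}
    state = rnn.default_state()
    probs = None
    for c in context:
        state, probs = rnn.step(state, char_to_ix[c])
    return probs

# Probes from the lecture: if the model has learned number agreement, 's'
# should be the most likely character after "Sheila thrunge", and a space
# should be most likely after "People thrunge".
# probs = next_char_distribution(rnn, "Sheila thrunge", vocab)
```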
It clearly knows a lot about syntax, because it's able to produce these little strings of English words that are sensible, but it's very hard to pin down exactly what form this knowledge takes. It's not like trigram models, which have just learned little sequences of words, or rather have a table that contains little sequences of words. It's actually synthesizing strings of words, and it's synthesizing them with sensible syntax. It's very hard to say, though, what form that syntactic knowledge has. It's not a bunch of rules like a linguist has; it's much more like what's in a linguist's head when he speaks a language.

It knows a lot of weak semantic associations. So, for example, it only ever produced the word Wittgenstein once, and it produced it soon after producing the word Plato. So it knows that Plato and Wittgenstein are associated. Well, that's a pretty good assumption. It clearly knows that cabbage is associated with vegetable. It doesn't know much about the precise ways in which these things are associated. People are like that too, if you get them to respond very fast. So I'm going to ask you a question, and I want you to shout out the answer. You're sitting there watching this video, and this experiment will work by far the best if you actually shout the answer out loud, and you have to shout it out really fast. You get rewarded for responding very quickly; it doesn't matter what you say, you just have to respond fast. So the question is: what do cows drink? Most people, when given that question, shout out "milk". Now, most cows don't drink milk most of the time. We say milk because it's associated with both drink and with cow, but it's not logical to say milk.

Recently, Tomas Mikolov and his collaborators have been training large recurrent neural networks to predict the next word in large data sets. They use the same technique as the feedforward neural nets: they first convert each word to a real-valued feature vector and then use those feature vectors as input to the rest of the network. They do better than the feedforward neural nets, and they also do better than the best other models; and when you average them with the best other models, they do better still. So those are the best language models there are currently.

One interesting property of the RNNs is that they require less training data than the other methods to reach a given level of performance. More importantly, as the data sets get bigger, the RNNs improve faster than the other methods. Methods like trigrams, for example, do get better with bigger data sets, but it's a very slow process: you need to double the size of the data set to get a small improvement. The RNNs can make much better use of the data, which means it's going to be very hard to beat them as data sets get bigger. I think it may be the same story as for object recognition with large, deep neural nets: once the neural nets get ahead, they can make better use of faster computers and bigger data sets, and so it's going to be very hard for other methods to catch up.
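To make that architecture concrete, here's a rough sketch of the forward pass of such a word-level recurrent language model: each word is looked up as a real-valued feature vector, those vectors drive a recurrent hidden state, and the state predicts a distribution over the next word. The sizes and parameter names are illustrative assumptions, not from the lecture, and training by backpropagation through time is omitted.

```python
import numpy as np

# Illustrative sizes; not from the lecture.
V, D, H = 10000, 100, 200          # vocabulary, word-vector, and hidden sizes
rng = np.random.default_rng(0)
E   = rng.normal(0, 0.01, (V, D))  # table of real-valued feature vectors, one per word
W_x = rng.normal(0, 0.01, (D, H))  # input-to-hidden weights
W_h = rng.normal(0, 0.01, (H, H))  # hidden-to-hidden (recurrent) weights
W_o = rng.normal(0, 0.01, (H, V))  # hidden-to-output weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def next_word_distribution(word_indices):
    """Forward pass: return the predicted distribution over the next word."""
    h = np.zeros(H)
    for w in word_indices:
        x = E[w]                        # look up the word's feature vector
        h = np.tanh(x @ W_x + h @ W_h)  # update the recurrent hidden state
    return softmax(h @ W_o)             # distribution over the next word
```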