We're now going to apply Hessian-free optimization to the task of modeling character strings from Wikipedia. The idea is, you read a lot of Wikipedia and then try to predict the next character. Before we get to see what the model learns, I want to describe why we need multiplicative connections and how we can implement those multiplicative connections efficiently in a recurrent neural network. I need to start by explaining why we chose to model character strings rather than strings of words, which is what you normally do when you're trying to model language. The web is composed of character strings. Any learning method that's powerful enough to understand what's going on in the world by reading the web ought to find it trivial to learn which strings make words. As we'll see, this turns out to be true. So we're going to be very ambitious here: we want something that will read Wikipedia and understand the world.

If we have to pre-process the text in Wikipedia into words, it's going to be a big hassle. There are all sorts of problems. The first problem is morphemes. The smallest units of meaning, according to linguists, are morphemes, so we'd have to break up a word into these morphemes if we want to deal with it sensibly. The problem is, it's not quite clear what morphemes are. There are things that are a bit like morphemes, but that a linguist wouldn't call a morpheme. In English, if you take any word that starts with the letters sn, it has a very high chance of meaning something to do with the lips or nose, particularly the upper lip or nose. So words like snarl, and sneeze, and snot, and snog, and snort. There are too many of these words for it to be just coincidence. Many people say, yes, but what about snow? That's got nothing to do with the upper lip or nose. But ask yourself something: why is snow such a good word for cocaine? Then there are words that come in several pieces. Normally we'd want to treat New York as one lexical item, but if we're talking about the new York minster roof, then we might want to treat new and York as two separate lexical items. And then there are languages like Finnish. Finnish is an agglutinative language, so it puts together lots of morphemes to make great big words. Here's an example of a word in Finnish that takes about five words in English to say the same thing. I have no idea what this word means, but despite my lack of understanding, it makes the point.

So here's an obvious kind of recurrent net we might use to try and model character strings. It has a hidden state, and in this case we're going to use 1500 hidden units. The hidden state dynamics is that the hidden state at time t provides input to determine the hidden state at time t+1, and the character also provides some input. So we add together the effect of the current character with the effect of the previous hidden state to get the new hidden state. Then, when we arrive at a new hidden state, we try and predict the next character. We have a single softmax over the 86 characters, and we get the hidden state to try and assign high probability to the correct next character, and low probability to the others. And we train the whole system by backpropagating, from that softmax, the log probability of getting the correct character. We backpropagate that through the hidden-to-output connections, back through the character-to-hidden connections, and back through the hidden-to-hidden connections, and so on, all the way back to the beginning of the string.
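To make that architecture concrete, here is a minimal NumPy sketch of one time step of this kind of character-level recurrent net. It is not the lecture's actual code: the parameter names, the tanh nonlinearity, and the initialization scales are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of the "obvious" character-level RNN described above:
# additive character input, 1500 hidden units, and a softmax over 86 characters.
H, V = 1500, 86                        # hidden units, character vocabulary size
rng = np.random.default_rng(0)
W_hh = rng.normal(0, 0.01, (H, H))     # hidden-to-hidden weights
W_ch = rng.normal(0, 0.01, (V, H))     # character-to-hidden weights
W_ho = rng.normal(0, 0.01, (H, V))     # hidden-to-output weights
b_h, b_o = np.zeros(H), np.zeros(V)

def step(h_prev, char_id):
    # Add the effect of the current character to the effect of the previous
    # hidden state to get the new hidden state, then predict the next character.
    h = np.tanh(h_prev @ W_hh + W_ch[char_id] + b_h)
    logits = h @ W_ho + b_o
    p = np.exp(logits - logits.max())
    return h, p / p.sum()              # softmax over the 86 characters

# Training backpropagates -log p[next_char] from the softmax, back through these
# connections, all the way to the beginning of the string (backprop through time).
```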
It's a lot easier to predict one of 86 characters than one of 100,000 words, so it's easier to use a softmax at the output; we don't have the problem of a great big softmax. Now I want to explain why we didn't use that kind of recurrent net, but instead used a different kind of net that worked quite a lot better. You could arrange all possible character strings into a tree, with a branching ratio of 86 in our case. What I'm showing here is a tiny little subtree of that great big tree. In fact, this little subtree will occur many times, but with different things represented by the dot, dot, dot before the fix. So this represents that we had a whole bunch of characters, then we had f, and then i, and then x. Now if we get an i, we're going to go to the left; if we get an e, we're going to go to the right, and so on. Each time we get a character, we move one step down in this tree to a new node. There are exponentially many nodes in the tree of all character strings of length n, so this is going to be a very big tree. We couldn't possibly store it all. If we could store it all, what we'd like to do is put a probability on each of those arrows, and that would be the probability of producing that letter, given the context represented by the node.

In an RNN, we try to deal with the fact that the full tree is enormous by using a hidden state vector to represent each of these nodes. So now what the next character has to do is take the hidden state vector that's representing the whole string of characters ending in fix, and operate on that hidden state vector to produce the appropriate new hidden state vector. So when you see an i, you want to turn the hidden state vector into a new hidden state vector. A nice thing about implementing these nodes in the character tree by using the hidden state of a recurrent neural network is that we can share a lot of structure. For example, by the time we arrive at the node that says f, i, x, we may have decided that it's probably a verb. And if it's a verb, then i is quite likely, because of the ending i, n, g. That knowledge that i is quite likely after a verb can be shared with lots of other nodes that don't have f, i, x in them. So we can get i to operate on the part of the state that represents that it's a verb, and that can be shared between all the verbs.

Notice that it's really the conjunction of the current state we're at and the character that determines where we want to go. We don't want i to give us a state that's expecting to get an n next if it wasn't a verb. So we don't want to say that i tends to make you expect an n next. We really want to say: if you already think it's a verb, then when you see an i, you should expect an n next. It's the conjunction of the fact that we think it's a verb and that we saw an i that gets us into this state labeled f, i, x, i, that's expecting to see an n. So we're going to try and capture that by using multiplicative connections. Instead of using the character inputs to the recurrent net to give extra additive input to the hidden units, we're going to use those characters to swap in a whole hidden-to-hidden weight matrix. The character is going to determine the transition matrix. Now, if we did that in the naive way, we'd have each of the 86 characters define a 1500x1500 matrix, and that would be a lot of parameters. If we have that many parameters, we're likely to overfit, unless we run it on a huge amount of text, for which we might not have time.
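To see how expensive that naive scheme would be, here is an illustrative NumPy sketch (not the lecture's implementation) in which every character simply selects its own full 1500x1500 transition matrix:

```python
import numpy as np

# A sketch of the *naive* multiplicative scheme described above: each of the
# 86 characters swaps in its own full 1500x1500 hidden-to-hidden matrix.
H, V = 1500, 86
rng = np.random.default_rng(0)
W_per_char = rng.standard_normal((V, H, H), dtype=np.float32)  # one matrix per character
W_per_char *= 0.01
print(W_per_char.size)   # 193,500,000 parameters -- far too many to fit well

def step_naive(h_prev, char_id):
    # The current character determines which transition matrix is used this step.
    W_hh = W_per_char[char_id]
    return np.tanh(h_prev @ W_hh)
```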
So the question is: can we achieve this kind of multiplicative interaction, where the character determines the hidden-to-hidden weight matrix, using many fewer parameters, by making use of the fact that characters have things in common? For example, all of the digits are quite similar to each other in the way in which they make the hidden state evolve. So we want to have a different transition matrix for each of the 86 characters, but we want those 86 character-specific weight matrices to share parameters, and that's a reasonable thing to do because we know that characters like 8 and 9 should have very similar transition matrices.

Here's how we're going to do it. We're going to have things called factors, and they're going to be denoted by a little triangle with an f above it. What a factor means is that Group a and Group b interact multiplicatively to provide input to Group c. What each factor does is first compute a weighted sum for each of its two input groups. We take the vector state of Group a, which I'll just call a, and we multiply it by the weights on the connections coming into the factor. In other words, we take the scalar product of the vector a and the weight vector u, and that gives us a number at the left-hand vertex of the triangle. Similarly, we take the vector state of Group b and multiply it by the weight vector w, and we get another number at the bottom vertex of the triangle. We now multiply those two numbers together, and that gives us a scalar. And we use that scalar to scale the outgoing weights v in order to provide input to Group c. So the input to Group c is just the product of the two numbers that come into the two vertices of the triangle, times the outgoing weight vector v. We can write that as an equation: the vector input that factor f provides to Group c is the scalar input to f from Group b (obtained by multiplying the state of Group b by the weights w_f), times the scalar input to f from Group a (obtained by multiplying the state of Group a by the weights u_f), times the weight vector v_f. In symbols, the input to Group c from factor f is (b·w_f)(a·u_f) v_f. And of course, we're going to have a whole bunch of those factors.

There's another way we can think about these factors that gives more insight into what's going on. Each factor actually defines a very simple kind of transition matrix: a transition matrix that has rank one. The equation we had on the previous slide treats a factor as computing two scalar products, multiplying them together, and then using that as a weight on the outgoing vector v. We can rearrange that equation so that we keep one scalar product, and for the last part we take the outer product of the weight vector u and the weight vector v, which gives us a matrix. The scalar product computed by multiplying b by w_f is just a coefficient on that matrix. So we get a scalar coefficient, we multiply a rank-one matrix by that scalar coefficient to give us a scaled matrix, and then we multiply the current hidden state a by this matrix to determine the input that factor f gives to the next hidden state.
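Here is a small NumPy sketch of a single factor, showing that the "two scalar products scaling an outgoing vector" view and the "scaled rank-one matrix" view compute exactly the same input to Group c. The vector names follow the lecture; the sizes and random values are illustrative.

```python
import numpy as np

# One multiplicative factor: Group a is the previous hidden state, Group b is the
# one-hot current character, Group c receives the factor's input.
H, V = 1500, 86
rng = np.random.default_rng(0)
a = rng.normal(size=H)           # state of Group a (previous hidden state)
b = np.zeros(V); b[42] = 1.0     # state of Group b (one-hot current character)
u = rng.normal(size=H)           # weights from Group a into the factor
w = rng.normal(size=V)           # weights from Group b into the factor
v = rng.normal(size=H)           # outgoing weights from the factor to Group c

# View 1: two scalar products, multiplied together, used to scale the outgoing
# weight vector v.
c_input = (b @ w) * (a @ u) * v

# View 2: the same factor seen as the rank-one matrix outer(u, v), with the
# scalar product of b and w as its coefficient, applied to the hidden state a.
c_input_rank1 = (b @ w) * (a @ np.outer(u, v))

assert np.allclose(c_input, c_input_rank1)
```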
If we sum that up over all the factors, the total input to Group c is just a sum over all factors of a scalar times a rank-one matrix, and that sum is a great big matrix. That's the transition matrix, and it gets multiplied by the current hidden state to produce the new hidden state. So we can see that we've synthesized the transition matrix out of the rank-one matrices provided by the factors, and what the current character in Group b has done is determine the weight on each of those rank-one matrices. That is, b times w_f determines the scalar coefficient to put on each of the rank-one matrices out of which we're going to compose this great big character-specific weight matrix.

So here's a picture of the whole system. We have a number of factors; in fact, we'll have about 1500 factors. The character input is different in that only one character is active at a time, so for each factor there's only one relevant weight at a time. That weight from the current character k to factor f, which is called w_kf, is the gain that's used on the rank-one matrix obtained by taking the outer product of u_f and v_f. So the character determines the gain w_kf, you multiply the rank-one matrix (the outer product of u_f and v_f) by that gain, you add together the scaled matrices for all the different factors, and that's your transition matrix.
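Putting it together, here is an illustrative NumPy sketch (again, assumed names and sizes, not the lecture's code) of synthesizing the character-specific transition matrix from all the factors, alongside the cheaper factored computation that never forms the big matrix explicitly:

```python
import numpy as np

# Synthesizing the character-specific transition matrix from the factors.
# With a one-hot character, only the gains w_kf matter; each one scales the
# rank-one matrix outer(u_f, v_f).
H, V, F = 1500, 86, 1500          # hidden units, characters, factors
rng = np.random.default_rng(0)
U = rng.normal(0, 0.01, (F, H)).astype(np.float32)      # u_f: hidden state -> factor
Wcf = rng.normal(0, 0.01, (V, F)).astype(np.float32)    # w_kf: character -> factor gain
Vout = rng.normal(0, 0.01, (F, H)).astype(np.float32)   # v_f: factor -> new hidden state

def transition_matrix(char_id):
    # Sum over factors of w_kf * outer(u_f, v_f): the big character-specific matrix.
    gains = Wcf[char_id]                    # one gain per factor for this character
    return (U * gains[:, None]).T @ Vout    # H x H

def step_factored(h_prev, char_id):
    # In practice the big matrix is never formed: project the hidden state onto the
    # u_f's, scale each factor by its gain, then project back out through the v_f's.
    return np.tanh(((h_prev @ U.T) * Wcf[char_id]) @ Vout)

h = rng.normal(size=H).astype(np.float32)
assert np.allclose(np.tanh(h @ transition_matrix(7)), step_factored(h, 7), atol=1e-5)
```

The factored form uses roughly 2 x 1500 x 1500 plus 86 x 1500 parameters instead of 86 full matrices, which is the parameter sharing the lecture is after.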