In this video, I am going to describe an approach to training recurrent neural networks called Long Short Term Memory. You can consider the dynamic state of a neural network to be a short term memory, and the idea is that you want to make that short term memory last for a long time. This is done by creating special modules that are designed to allow information to be gated in, and then gated out when needed. In the intermediate period the gate is closed, so whatever arrives during that period doesn't interfere with the remembered state. Long short term memory has been very successful for tasks like recognizing handwriting, where it has won a number of competitions.

In 1997, Hochreiter & Schmidhuber published a paper in Neural Computation that solved the problem of getting a recurrent neural network to remember things for a long time. Their recurrent nets could remember things for hundreds of time steps. They did this by designing a memory cell that used logistic and linear units with multiplicative interactions. Information gets into the memory cell whenever a logistic write gate is turned on. The rest of the recurrent network determines the state of that write gate: when the rest of the network wants information to be stored, it turns the write gate on, and whatever the current input from the rest of the net to the memory cell is gets stored in the memory cell. The information stays in the memory cell as long as its keep gate is on. Again, the rest of the system determines the state of a logistic keep gate, and if it keeps it on, the information will stay there. Finally, the information gets read out of the memory cell so that it goes off to the rest of the recurrent neural network and influences future states; it's read by turning on a read gate, which again is a logistic unit controlled by the rest of the neural network.

The memory cell actually stores an analog value, so we can think of it as a linear neuron that keeps writing that value back to itself at each time step with a weight of one, so the information just stays there. That weight of one is determined by the keep gate: the rest of the system determines the state of that logistic keep gate, and if it puts it into a state of one, or close to one, the information just cycles around, and a stored value such as 1.73 will stay there. As soon as the rest of the system wants to get rid of that value, all it has to do is set the keep gate to zero, and the information will disappear. To store information in the memory cell, the rest of the system has to turn on the write gate, and then whatever input is being provided to the memory cell from the rest of the system gets written into the memory cell. Similarly, to read the information from the memory cell, the rest of the system turns on the logistic read gate, and then the value in the memory cell comes out and affects the rest of the recurrent neural network. The point of using logistic units is that we can backpropagate through them, because they have nice derivatives, and that means we can learn to use this kind of circuit over many time steps.
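To make the gating concrete, here is a minimal sketch in Python of a memory cell of this kind. The gate values are passed in directly rather than being computed by logistic units driven by the rest of the network, so this illustrates only the multiplicative gating, not the full wiring of the original architecture.

```python
class MemoryCell:
    """A single memory cell: a linear unit that writes its value back to
    itself with an effective weight equal to the keep gate, plus
    multiplicative write and read gates.  In the full architecture each
    gate would be a logistic unit whose state is set by the rest of the
    recurrent network; here the gate values are simply passed in."""

    def __init__(self):
        self.state = 0.0  # the stored analog value

    def step(self, x, write_gate, keep_gate, read_gate):
        # Keep (or erase) the old state, and add in the current input
        # scaled by the write gate.
        self.state = keep_gate * self.state + write_gate * x
        # The read gate controls how much of the stored value is sent
        # out to influence the rest of the network.
        return read_gate * self.state
```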
So I'm going to show you now a picture of what backpropagation through a memory cell looks like. First, we do a forward pass. At the initial time, let's suppose that the keep gate is set to zero, so we wipe out whatever information was in the memory cell before, and the write gate is set to one. So the value of 1.7 that is coming from the rest of the recurrent neural network gets written into the memory cell. We're not going to read it at this time, so the read gate is set to zero. We then set the keep gate to one, or rather the rest of the neural network has to set the keep gate to one, and that means the value is written back into the memory cell: it's stored. At the next time step, we set the write gate to zero and the read gate to zero, so the information isn't influenced by what's going on in the rest of the net, and it doesn't influence what's going on in the rest of the net. It's insulated. Again, at the next time step, the keep gate is set to one, so the information is stored for one more time step. Then we set the write gate to zero, so no information is written in, but we now retrieve the information by setting the read gate to one. The value of 1.7 then comes out of the memory cell and goes off to influence the rest of the network. And if we don't need it anymore, the keep gate can be set to zero and the information will be removed.

Now, if you look at the 1.7 that comes out when we do the retrieval, and you look at the path back to the 1.7 that came in, along that path are little triangular symbols, and next to each triangular symbol is a one. That means the effective weight on that connection is one. So as we go back along that path, whatever error derivative we have for the 1.7 when it's retrieved gets backpropagated to the 1.7 when it was stored. If you would rather have retrieved a bigger value, to make the right things happen now, you can send that information back and tell the cell it should have stored a bigger value. And notice that as long as the relevant gates have values of one, there's no attenuation in this backpropagated signal. It has just the properties we want. Of course, if they're logistic gates there will be some slight attenuation, but it can be very small, and so information can travel back through hundreds of time steps.
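Here is the same forward pass written out as a small, self-contained Python sketch (restating the cell update from above), together with the backward-pass point: the derivative of the retrieved value with respect to the stored one is just the product of the gate values along that path, which is exactly one when the gates are fully open.

```python
def step(state, x, write_gate, keep_gate, read_gate):
    # One time step of the memory cell sketched earlier:
    # returns (new stored value, value read out to the rest of the net).
    new_state = keep_gate * state + write_gate * x
    return new_state, read_gate * new_state

# The gate schedule from the walkthrough: store 1.7, hold it while insulated
# from the rest of the net, read it out, then let it go.
schedule = [
    # (input, write, keep, read)
    (1.7, 1.0, 0.0, 0.0),  # keep=0 wipes the old contents, write=1 stores 1.7
    (0.0, 0.0, 1.0, 0.0),  # keep=1: the value is written back to itself
    (0.0, 0.0, 1.0, 0.0),  # stored for one more time step, still insulated
    (0.0, 0.0, 1.0, 1.0),  # read=1: the 1.7 comes out to influence the net
    (0.0, 0.0, 0.0, 0.0),  # keep=0: the value isn't needed, so it disappears
]

state, outputs = 0.0, []
for x, w, k, r in schedule:
    state, out = step(state, x, w, k, r)
    outputs.append(out)
print(outputs)   # [0.0, 0.0, 0.0, 1.7, 0.0] -- retrieved at the fourth step

# Backward pass: d(value retrieved)/d(value stored) is the product of the
# effective weights along the path, i.e. the write gate at storage, the keep
# gates in between, and the read gate at retrieval.
gates_on_path = [1.0, 1.0, 1.0, 1.0, 1.0]
gradient = 1.0
for g in gates_on_path:
    gradient *= g
print(gradient)  # 1.0 -- no attenuation over the intervening time steps
```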
Now, let's look at a task that a recurrent neural network with long short term memory is very good at, and a very natural task for a recurrent neural network: reading cursive handwriting. The input is just a sequence of the x and y coordinates of the tip of the pen, plus some information about whether the pen is on the paper or not. The output is a sequence of recognized characters. Graves & Schmidhuber showed in 2009 that recurrent neural networks with long short term memory are extremely good at this task. As far as I know, they're currently the best systems there are, and I believe Canada Post is starting to use them for reading handwriting. Graves & Schmidhuber, in 2009, also had a version that didn't use pen coordinates as input: it used a sequence of small images. That means it can deal with optical input where the timing of the pen isn't known; it can look at images after they've been written and read them.

So I'm now going to show you a demonstration of Alex Graves's system working on pen coordinates. In the movie that follows, you're going to see four streams of information. The top row shows the characters as they're recognized. The system never revises its output, so if it has to make a difficult decision, it delays it for a little while so that it can see a little distance into the future to help it resolve ambiguities. The second row shows the states of a subset of the memory cells, and you should notice how they get reset when the system recognizes a character. The third row shows the actual writing, and all the net sees is the x and y coordinates of the tip of the pen: just two numbers, plus some information about whether the pen is up or down. Finally, the fourth row shows something much more complicated: the gradient backpropagated all the way to the x,y locations. What you get to see, for the most active character, is this: if you backpropagate from that character and ask what would make it more active, you see which bits of the input are affecting the probability that it's that character. That lets you see how the decisions depend on things that happened in the past. So here's the movie.
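To make that fourth row a little more concrete: the computation amounts to taking the derivative of the most active character's score with respect to every input coordinate. The sketch below estimates such derivatives numerically for a made-up scoring function, which is only a hypothetical stand-in for the real network; just the shape of the computation is meant to carry over.

```python
def toy_char_score(points):
    # Hypothetical stand-in for the activity of the most active character:
    # any differentiable function of the pen trajectory would do here.
    return sum(x * y for x, y in points)

def input_saliency(score_fn, points, eps=1e-5):
    # Finite-difference estimate of d(score)/d(x_i) and d(score)/d(y_i)
    # for every pen position: large values mark the bits of input that
    # most affect the character's score.
    base = score_fn(points)
    grads = []
    for i, (x, y) in enumerate(points):
        bumped_x = points[:i] + [(x + eps, y)] + points[i + 1:]
        bumped_y = points[:i] + [(x, y + eps)] + points[i + 1:]
        grads.append(((score_fn(bumped_x) - base) / eps,
                      (score_fn(bumped_y) - base) / eps))
    return grads

trajectory = [(0.0, 1.0), (0.5, 1.2), (1.0, 0.8)]  # made-up pen coordinates
print(input_saliency(toy_char_score, trajectory))
# roughly [(1.0, 0.0), (1.2, 0.5), (0.8, 1.0)]
```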