In this video I'm going to describe echo state networks. These use a clever trick to make it much easier to learn a recurrent neural network. They initialize the connections in the recurrent neural network in such a way that it has a big reservoir of coupled oscillators. So if you provide input to it, it converts that input into the states of these oscillators, and then you can predict the output you want from the states of these oscillators. The only thing you have to learn is how to couple the output to the oscillators. This entirely gets rid of the problem of learning hidden to hidden connections, or even input to hidden connections. However, to get these networks to be good at complicated tasks, you need a very big hidden state. As we'll see at the end of the video, there's no reason not to use the initialization that was carefully designed for echo state networks, and then to use backpropagation through time with momentum to train the networks to be even better at the tasks they're doing.

One interesting and quite recent idea about training recurrent neural networks is to not train the hidden to hidden connections at all, but to just fix them randomly and hope that you can learn sequences by just training the way they affect the outputs. This has strong similarities with old ideas about perceptrons. A very simple way to train a feedforward neural network is to make the early layers of feature detectors just be random. You put in sensibly sized random weights, and then all you learn is the last layer, so you're learning a linear model from the activities of the hidden units in the last layer to the outputs. And of course it's much faster to learn a linear model. This relies on the idea that a big, random expansion of the input vector can often make it easy for a linear model to fit the data, when it couldn't fit the data well just looking at the raw inputs. In the little neural network here, those red weights are fixed at random. They expand the input vector, and then, using that expanded representation, we try to fit a linear model. This actually has some quite strong similarities with support vector machines, which are really just a very efficient way of doing this.

Those same ideas, many years later, were recycled for recurrent neural networks. The idea is to give the input to hidden connections and the hidden to hidden connections random values that are carefully chosen, and just learn the final layer of hidden to output connections. The learning is then very simple if you use linear output units, and it can be done extremely fast. This approach is only ever going to work if you set the random connections very carefully, so that the recurrent neural network doesn't die out with no activity and doesn't explode.

So the way they set the random connections in an echo state network is they set the hidden to hidden weights so that the length of the activity vector stays about the same after each iteration. For those of you used to linear systems and matrices, you're setting it so the spectral radius is one. That is, the largest eigenvalue of the matrix of hidden to hidden weights has magnitude one, or it would be one if it was a linear system, and you want to achieve the same property in a non-linear system. If you set those weights to be about the right magnitude, then an input can echo around in the recurrent state for a long time. It's also important to use sparse connectivity.
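To make the spectral radius and sparsity idea concrete, here is a minimal sketch in Python with NumPy of how a reservoir might be initialized. It is not from the video; the hidden size, sparsity level, and weight range are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 500      # assumed reservoir size
sparsity = 0.05     # assumed fraction of non-zero hidden-to-hidden weights

# A few quite large weights, nearly all zero: a sparse random hidden-to-hidden matrix.
W = rng.uniform(-1.0, 1.0, size=(n_hidden, n_hidden))
W *= rng.random((n_hidden, n_hidden)) < sparsity

# Rescale so the spectral radius (largest eigenvalue magnitude) is one,
# so that activity neither dies out nor explodes as it echoes around.
spectral_radius = np.max(np.abs(np.linalg.eigvals(W)))
W /= spectral_radius
```

With tanh hidden units, practitioners often scale to a spectral radius slightly below one, but the idea is the same: keep the length of the activity vector roughly constant from one step to the next.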
So instead of having lots of medium sized weights, we have a few quite large weights, and nearly all the weights are zero in the hidden to hidden connections. What this does is create a lot of loosely coupled oscillators, so information can hang around in one part of the net without being propagated to other parts of the net too quickly. It's also important to choose the scale of the input to hidden connections very carefully. Those connections need to drive the states of the loosely coupled oscillators, but they mustn't wipe out the information that those oscillators contain about the recent history. Fortunately, the learning is very fast in echo state networks, so we can afford to experiment with the scales of these important connections. You can think of it as a little learning loop that is just learning the scales of those connections, via a kind of feedback that involves the experimenter. It also helps to tune the level of sparseness needed in the hidden to hidden connections, and again, because the learning is so fast, you can afford to experiment with that. That's important, because it's often necessary to do those experiments to get the system to work well.

So I'm now going to show you a simple example, taken from the web, of an echo state network. It has an input sequence which is a real value that varies with time and specifies the frequency of a sine wave for the output of the echo state network. So you'd like this thing to generate sine waves, and the input is going to specify the frequency. The target output sequence is a sine wave with the frequency specified by the input, and it's going to be learned simply by fitting a linear model that takes the states of the hidden units and from those tries to predict the correct scalar output value.

So here's a picture, taken from Scholarpedia, of an echo state network doing this task. The input signal is the desired frequency of the sine wave. The output signal after it's learned, or the teacher signal when it's learning, is a sine wave with the frequency specified by the input. The stuff in the middle is a big dynamical reservoir: the inputs coming from the input signal drive those loosely coupled oscillators and cause complicated dynamics that go on for a long time. The output weights learn to map that complicated dynamics to the particular dynamics you want for the output. All the other pictures are showing you the actual dynamics of individual units inside the dynamical reservoir. One thing to notice is that there are also connections from the output back to the reservoir. Those aren't always needed, but they help to tell the reservoir what has been produced so far.

So here's an example of what the system actually produces after it's learned. You can see that at the beginning it's producing a sine wave in phase with the target. At the end, it's producing a sine wave of the right frequency, but the phase is wrong, and that's because we weren't telling it what phase the sine wave should be in. So it's satisfying the requirement of producing an appropriate frequency.

There are some very good aspects of echo state networks. They can be trained very fast, because they just fit a linear model. They also demonstrate how important it is to initialize the hidden to hidden weights sensibly. And they can do quite impressive modeling of one-dimensional time series. That's where they excel. They can look at a time series for a while, and then predict it very well a long time into the future.
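Here is a hedged sketch of the kind of setup just described: a fixed, sparse random reservoir driven by a frequency signal, with only a linear hidden-to-output readout fitted by ridge regression. The toy input signal, the weight scales, and the ridge penalty are illustrative assumptions, not the actual setup from the Scholarpedia figure, and the output-to-reservoir feedback connections mentioned above are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
n_hidden, T = 300, 2000

# Fixed random reservoir (sparse, spectral radius just below one) and input weights.
W = rng.uniform(-1, 1, (n_hidden, n_hidden)) * (rng.random((n_hidden, n_hidden)) < 0.05)
W *= 0.95 / np.max(np.abs(np.linalg.eigvals(W)))
W_in = rng.uniform(-0.5, 0.5, n_hidden)      # input scale chosen by hand

# Toy task: the input is a slowly varying frequency, the target is a sine wave
# whose frequency follows the input (its phase is left unspecified, as in the video).
freq = 0.01 + 0.02 * (1 + np.sin(2 * np.pi * np.arange(T) / 500)) / 2
target = np.sin(2 * np.pi * np.cumsum(freq))

# Drive the reservoir with the input and record the hidden states.
x = np.zeros(n_hidden)
states = np.empty((T, n_hidden))
for t in range(T):
    x = np.tanh(W @ x + W_in * freq[t])
    states[t] = x

# Learn only the hidden-to-output weights: a linear model fit by ridge regression.
ridge = 1e-6
W_out = np.linalg.solve(states.T @ states + ridge * np.eye(n_hidden), states.T @ target)
prediction = states @ W_out
```

Because the readout is just a linear fit, experimenting with the input scale and the sparsity level, as the video suggests, only costs one cheap regression per setting.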
What they're not so good at is modeling high-dimensional data, like frames of acoustic coefficients or frames of video. In order to model data like that, they need many more hidden units than a recurrent neural network where you train the hidden to hidden connections. Recently, Ilya Sutskever tried something which is fairly obvious, which is to initialize a recurrent neural network using all the tricks developed by the people doing echo state networks. Once you've done that, you could learn quite well just by learning the hidden to output connections. But then, presumably, you could learn even better if you also learned to make the hidden to hidden weights better. So Ilya tried using the echo state network initialization, but then training with backpropagation through time. He used RMSprop with momentum, and he discovered that this is actually a very effective way to train recurrent neural networks.
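As a rough illustration of that last idea (not Sutskever's actual code), here is a PyTorch sketch that initializes an RNN's recurrent weights echo-state style and then trains all the weights with backpropagation through time using RMSprop with momentum. The sizes, sparsity, and learning rate are assumptions.

```python
import torch

n_in, n_hidden, n_out = 1, 200, 1
rnn = torch.nn.RNN(n_in, n_hidden, nonlinearity='tanh', batch_first=True)
readout = torch.nn.Linear(n_hidden, n_out)

# Echo-state-style initialization: sparse random recurrent weights,
# rescaled so their spectral radius is one.
with torch.no_grad():
    W = (torch.rand(n_hidden, n_hidden) * 2 - 1) * (torch.rand(n_hidden, n_hidden) < 0.05)
    W /= torch.linalg.eigvals(W).abs().max()
    rnn.weight_hh_l0.copy_(W)

# Then train all the weights, not just the readout, with backpropagation
# through time, using RMSprop with momentum as the optimizer.
optimizer = torch.optim.RMSprop(
    list(rnn.parameters()) + list(readout.parameters()), lr=1e-3, momentum=0.9)
loss_fn = torch.nn.MSELoss()

def training_step(inputs, targets):
    """inputs: (batch, time, n_in); targets: (batch, time, n_out)."""
    hidden_states, _ = rnn(inputs)           # unroll the RNN over the whole sequence
    loss = loss_fn(readout(hidden_states), targets)
    optimizer.zero_grad()
    loss.backward()                          # backpropagation through time
    optimizer.step()
    return loss.item()
```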