In this video, I'm going to talk about the
back propagation through time algorithm.
It's the standard way to train or
recurrence your own network.
The algorithm is really quite simple once
you have seen the equivalents between a
recurrent neural network and a feed
forward neural network that has one layer
for each time step.
I'll also talk about ways of providing
input, and desired outputs, to recurrent
neural networks.
So the diagram shows a simple recurrent
net with three interconnected neurons.
We're going to assume there's a time delay
of one in using each of those connections
and that the network runs in discrete
time, so the clock that has integer ticks.
The key to understanding how to train a
recurrent network is to see that a
recurrent network is really just the same
as a feed forward network, where you've
expanded the recurrent network in time.
So the recurrent network starts off in
some initial state.
Shown at the bottom there, times zero.
And then uses the way some of these
connections to get a new state, shown at
time one.
You then uses the same weights again to
get another new state, and it uses the
same weights again to get another new
state and so on.
So it's really just a lead feed forward
network, where the weight is a constraint
to be the same at every layer.
Now backprop is good at learning when
there are weight constraints.
We saw this for convolutional nets and
just to remind you, we can actually
incorporate any linear constraint quite
easily in backprop. So we compute the
gradients as usual, as if the weights were
not constrained.
And then we modify the gradients, so that
we maintain the constraints.
So if we want W1 to equal W2, we start off
with an equal and then we need to make
sure that the changing W1 is equal to the
changing W2.
And we do that by simply taking the
derivative of the area with respect to W1,
the derivative with respect to W2, and
adding or averaging them, and then
applying the same quantity for updating
both W1 and W2.
So if the weights started off satisfying
the constraints they'll continue to
satisfy the constraints.
The backpropagation through time algorithm
is just the name for what happens when you
think of a recurrent net as a lead feet
forward net with shared weights, and you
train it with backpropagation.
So, we can think of that algorithm in the
time domain.
The forward pass builds up a stack of
activities at each time slice.
And the backward pass peels activities off
that stack and computes error derivatives
each time step backwards.
That's why it's called back propagation
through time.
After the backward pass we can add
together the derivatives at all the
different time step for each particular
weight.
And then change all the copies of that
weight by the same amount which is
proportional to the sum or average of all
those derivatives.
There is an irritating extra issue.
If we don't specify the initial state of
the all the units, for example, if some of
them are hidden or output units, then we
have to start them off in some particular
state.
We could just fix those initial states to
have some default value like 0.5, but that
might make the system work not quite as
well as it would otherwise work if it had
some more sensible initial value.
So we can actually learn the initial
states.
We treat them like parameters rather than
activities and we learn them the same way
as learned the weights.
We start off with an initial random guess
for the initial states.
That is the initial states of all the
units that aren't input units And then at
the end of each training sequence we back
propagate through time all the way back to
the initial states.
And that gives us the gradient of the
error function with respects to the
initial state.
We then just, adjust the initial states by
following, that gradient.
We go downhill in the gradient, and that
gives us new initial states that are
slightly different.
There's many ways in which we can provide
the input to a recurrent neural net.
We could, for example, specify the initial
state of all the units.
That's the most natural thing to do when
we think of a recurrent net, like a feed
forward net with constrained weights.
We could specify the initial state of just
a subset of the units or we can specify
the states at every time stamp of the
subset of the units and that's probably
the most natural way to input sequential
data.
Similarly, there's many way we can specify
targets for a recurrent network.
When we think of it as feed forward
network with constrained weights, the
natural thing to do is to specify the
desired final states for all of the units.
If we're trying to train it to settle to
some attractor, we might want to specify
the desired states not just for the final
time steps but for several time steps.
That will cause it to actually settle down
there, rather than passing through some
state and going off somewhere else.
So by specifying several states of the
end, we can force it to learn attractors
and it's quite easy as we back propagate
to add in derivatives that we get from
each time stamp.
So the back propegation starts at the top,
with the derivatives for the final time
stamp.
And then as we go back through the line
before the top we add in the derivatives
for that man, and so on.
So it's really very little extra effort to
have derivatives at many different layers.
Or we could specify the design activity of
a subset of units which we might think of
as output units.
And that's a very natural way to train a
recurrent neural network that is meant to
be providing a continuous output.