In this video, I'm going to talk about the backpropagation through time algorithm. It's the standard way to train a recurrent neural network. The algorithm is really quite simple once you have seen the equivalence between a recurrent neural network and a feedforward neural network that has one layer for each time step. I'll also talk about ways of providing input, and desired outputs, to recurrent neural networks.

So the diagram shows a simple recurrent net with three interconnected neurons. We're going to assume there's a time delay of one in using each of those connections, and that the network runs in discrete time, so the clock has integer ticks.

The key to understanding how to train a recurrent network is to see that a recurrent network is really just the same as a feedforward network, where you've expanded the recurrent network in time. So the recurrent network starts off in some initial state, shown at the bottom there, at time zero. It then uses the weights on those connections to get a new state, shown at time one. It then uses the same weights again to get another new state, and it uses the same weights again to get another new state, and so on. So it's really just a layered feedforward network, where the weights are constrained to be the same at every layer.

Now backprop is good at learning when there are weight constraints. We saw this for convolutional nets, and just to remind you, we can actually incorporate any linear constraint quite easily in backprop. We compute the gradients as usual, as if the weights were not constrained, and then we modify the gradients so that we maintain the constraints.

So if we want w1 to equal w2, we start off with them equal, and then we need to make sure that the change in w1 is equal to the change in w2. We do that by simply taking the derivative of the error with respect to w1 and the derivative with respect to w2, adding or averaging them, and then applying that same quantity to update both w1 and w2. So if the weights started off satisfying the constraint, they'll continue to satisfy the constraint.

The backpropagation through time algorithm is just the name for what happens when you think of a recurrent net as a layered feedforward net with shared weights, and you train it with backpropagation.
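To make that concrete, here is a minimal sketch in NumPy of the w1 = w2 example above; the function name, the gradients g1 and g2, and the learning rate are illustrative assumptions rather than anything given in the lecture.

```python
import numpy as np

# Minimal sketch of the tied-weight trick (illustrative names, not from the lecture).
# w1 and w2 are two copies of the same underlying weight; g1 and g2 are the error
# derivatives computed for each copy as if the copies were unconstrained.
def update_tied_weights(w1, w2, g1, g2, learning_rate=0.1):
    assert np.allclose(w1, w2), "the copies must start off equal"
    g = 0.5 * (g1 + g2)            # add or average the two derivatives
    delta = -learning_rate * g     # the same update is applied to both copies,
    return w1 + delta, w2 + delta  # so they stay equal after the step
```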
So we can think of that algorithm in the time domain. The forward pass builds up a stack of activities at each time slice, and the backward pass peels activities off that stack and computes error derivatives at each time step, working backwards. That's why it's called backpropagation through time. After the backward pass, we can add together the derivatives at all the different time steps for each particular weight, and then change all the copies of that weight by the same amount, which is proportional to the sum or average of all those derivatives.

There is an irritating extra issue. If we don't specify the initial state of all the units, for example if some of them are hidden or output units, then we have to start them off in some particular state. We could just fix those initial states to have some default value like 0.5, but that might make the system work not quite as well as it would if it had some more sensible initial values. So we can actually learn the initial states. We treat them like parameters rather than activities, and we learn them the same way as we learn the weights. We start off with an initial random guess for the initial states, that is, the initial states of all the units that aren't input units. Then, at the end of each training sequence, we backpropagate through time all the way back to the initial states, and that gives us the gradient of the error function with respect to the initial states. We then just adjust the initial states by following that gradient: we go downhill in the gradient, and that gives us new initial states that are slightly different.

There are many ways in which we can provide input to a recurrent neural net. We could, for example, specify the initial state of all the units; that's the most natural thing to do when we think of a recurrent net as a feedforward net with constrained weights. We could specify the initial state of just a subset of the units, or we could specify the states at every time step of a subset of the units, and that's probably the most natural way to input sequential data.

Similarly, there are many ways we can specify targets for a recurrent network. When we think of it as a feedforward network with constrained weights, the natural thing to do is to specify the desired final states for all of the units.
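Putting those pieces together, here is a minimal sketch of one step of backpropagation through time for a tiny fully recurrent net of tanh units with no external input, a squared-error target on the final state, and a learned initial state. The unit type, the cost function, and every name in the code are illustrative assumptions, not details from the lecture.

```python
import numpy as np

def bptt_step(W, h0, target, T, lr=0.1):
    """One training step of backpropagation through time for a tiny fully
    recurrent net of tanh units with no external input, a squared-error
    target on the final state, and a learned initial state h0."""
    # Forward pass: build up a stack of activities, one entry per time slice.
    h = [h0]
    for _ in range(T):
        h.append(np.tanh(W @ h[-1]))

    # Backward pass: peel activities off the stack, summing the derivative
    # for the one shared weight matrix over every time step.
    dW = np.zeros_like(W)
    dh = h[-1] - target                  # dE/dh_T for squared error
    for t in range(T, 0, -1):
        da = dh * (1.0 - h[t] ** 2)      # back through the tanh nonlinearity
        dW += np.outer(da, h[t - 1])     # same weight matrix at every layer
        dh = W.T @ da                    # pass the derivative to the slice below

    # dh is now dE/dh0, the gradient with respect to the initial state, so the
    # initial state can be adjusted by gradient descent just like the weights.
    return W - lr * dW, h0 - lr * dh
```

The forward loop is the stack of activities; the backward loop sums the derivative for the single shared weight matrix over all time steps and ends with the gradient used to learn the initial state.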
If we're trying to train it to settle to some attractor, we might want to specify the desired states not just for the final time step but for several time steps. That will cause it to actually settle down there, rather than passing through that state and going off somewhere else. So by specifying several states at the end, we can force it to learn attractors, and it's quite easy, as we backpropagate, to add in the derivatives that we get at each time step. The backpropagation starts at the top, with the derivatives for the final time step, and then as we go back through the layer before the top we add in the derivatives for that time step, and so on. So it's really very little extra effort to have derivatives at many different layers.

Or we could specify the desired activity of a subset of units, which we might think of as output units. And that's a very natural way to train a recurrent neural network that is meant to be providing a continuous output.
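Continuing the same illustrative sketch, the backward pass changes very little when targets are given at several time steps: we just add in each slice's error derivative as we reach it. The dictionary-of-targets interface below is an assumption made for illustration.

```python
import numpy as np

def backward_with_targets(W, h, targets):
    """Backward pass of the sketch above when desired states are supplied at
    several time steps (e.g. the last few, to force the net to settle there).
    h is the stack of activities from the forward pass, h[0] ... h[T], and
    targets maps a time step to its desired state (illustrative names)."""
    T = len(h) - 1
    dW = np.zeros_like(W)
    dh = np.zeros_like(h[0])
    for t in range(T, 0, -1):
        if t in targets:
            dh = dh + (h[t] - targets[t])  # add in this slice's error derivative
        da = dh * (1.0 - h[t] ** 2)        # back through the tanh
        dW += np.outer(da, h[t - 1])       # shared weights: sum over time steps
        dh = W.T @ da
    return dW, dh                          # dh is the gradient for the initial state
```

Restricting the targets to a subset of output units would simply mean masking that per-slice error term so it only touches those units.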