In this video, I'm going to talk about the backpropagation through time algorithm. It's the standard way to train a recurrent neural network. The algorithm is really quite simple once you have seen the equivalence between a recurrent neural network and a feedforward neural network that has one layer for each time step. I'll also talk about ways of providing input, and desired outputs, to recurrent neural networks.

So the diagram shows a simple recurrent net with three interconnected neurons. We're going to assume there's a time delay of one in using each of those connections, and that the network runs in discrete time, so the clock has integer ticks.

The key to understanding how to train a recurrent network is to see that a recurrent network is really just the same as a feedforward network, where you've expanded the recurrent network in time. So the recurrent network starts off in some initial state, shown at the bottom there, at time zero. It then uses the weights on those connections to get a new state, shown at time one. It then uses the same weights again to get another new state, and it uses the same weights again to get another new state, and so on. So it's really just a layered feedforward network, where the weights are constrained to be the same at every layer.

Now backprop is good at learning when there are weight constraints. We saw this for convolutional nets, and just to remind you, we can actually incorporate any linear constraint quite easily in backprop. We compute the gradients as usual, as if the weights were not constrained, and then we modify the gradients so that we maintain the constraints.

So if we want w1 to equal w2, we start off with them equal, and then we need to make sure that the change in w1 is equal to the change in w2. We do that by simply taking the derivative of the error with respect to w1 and the derivative with respect to w2, adding or averaging them, and then applying that same quantity to update both w1 and w2. So if the weights started off satisfying the constraint, they'll continue to satisfy the constraint.

The backpropagation through time algorithm is just the name for what happens when you think of a recurrent net as a layered feedforward net with shared weights, and you train it with backpropagation.
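To make that concrete, here is a minimal sketch in NumPy of the w1 = w2 example above; the function name, the gradients g1 and g2, and the learning rate are illustrative assumptions rather than anything given in the lecture.

```python
import numpy as np

# Minimal sketch of the tied-weight trick (illustrative names, not from the lecture).
# w1 and w2 are two copies of the same underlying weight; g1 and g2 are the error
# derivatives computed for each copy as if the copies were unconstrained.
def update_tied_weights(w1, w2, g1, g2, learning_rate=0.1):
    assert np.allclose(w1, w2), "the copies must start off equal"
    g = 0.5 * (g1 + g2)            # add or average the two derivatives
    delta = -learning_rate * g     # the same update is applied to both copies,
    return w1 + delta, w2 + delta  # so they stay equal after the step
```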
So we can think of that algorithm in the time domain. The forward pass builds up a stack of activities at each time slice, and the backward pass peels activities off that stack and computes error derivatives at each time step, working backwards. That's why it's called backpropagation through time. After the backward pass, we can add together the derivatives at all the different time steps for each particular weight, and then change all the copies of that weight by the same amount, which is proportional to the sum or average of all those derivatives.

There is an irritating extra issue. If we don't specify the initial state of all the units, for example if some of them are hidden or output units, then we have to start them off in some particular state. We could just fix those initial states to have some default value like 0.5, but that might make the system work not quite as well as it would if it had some more sensible initial values. So we can actually learn the initial states. We treat them like parameters rather than activities, and we learn them the same way as we learn the weights. We start off with an initial random guess for the initial states, that is, the initial states of all the units that aren't input units. Then, at the end of each training sequence, we backpropagate through time all the way back to the initial states, and that gives us the gradient of the error function with respect to the initial states. We then just adjust the initial states by following that gradient: we go downhill in the gradient, and that gives us new initial states that are slightly different.

There are many ways in which we can provide input to a recurrent neural net. We could, for example, specify the initial state of all the units; that's the most natural thing to do when we think of a recurrent net as a feedforward net with constrained weights. We could specify the initial state of just a subset of the units, or we could specify the states at every time step of a subset of the units, and that's probably the most natural way to input sequential data.

Similarly, there are many ways we can specify targets for a recurrent network. When we think of it as a feedforward network with constrained weights, the natural thing to do is to specify the desired final states for all of the units.
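Putting those pieces together, here is a minimal sketch of one step of backpropagation through time for a tiny fully recurrent net of tanh units with no external input, a squared-error target on the final state, and a learned initial state. The unit type, the cost function, and every name in the code are illustrative assumptions, not details from the lecture.

```python
import numpy as np

def bptt_step(W, h0, target, T, lr=0.1):
    """One training step of backpropagation through time for a tiny fully
    recurrent net of tanh units with no external input, a squared-error
    target on the final state, and a learned initial state h0."""
    # Forward pass: build up a stack of activities, one entry per time slice.
    h = [h0]
    for _ in range(T):
        h.append(np.tanh(W @ h[-1]))

    # Backward pass: peel activities off the stack, summing the derivative
    # for the one shared weight matrix over every time step.
    dW = np.zeros_like(W)
    dh = h[-1] - target                  # dE/dh_T for squared error
    for t in range(T, 0, -1):
        da = dh * (1.0 - h[t] ** 2)      # back through the tanh nonlinearity
        dW += np.outer(da, h[t - 1])     # same weight matrix at every layer
        dh = W.T @ da                    # pass the derivative to the slice below

    # dh is now dE/dh0, the gradient with respect to the initial state, so the
    # initial state can be adjusted by gradient descent just like the weights.
    return W - lr * dW, h0 - lr * dh
```

The forward loop is the stack of activities; the backward loop sums the derivative for the single shared weight matrix over all time steps and ends with the gradient used to learn the initial state.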
If we're trying to train it to settle to some attractor, we might want to specify the desired states not just for the final time step but for several time steps. That will cause it to actually settle down there, rather than passing through that state and going off somewhere else. So by specifying several states at the end, we can force it to learn attractors, and it's quite easy, as we backpropagate, to add in the derivatives that we get at each time step. The backpropagation starts at the top, with the derivatives for the final time step, and then as we go back through the layer before the top we add in the derivatives for that time step, and so on. So it's really very little extra effort to have derivatives at many different layers.

Or we could specify the desired activity of a subset of units, which we might think of as output units. And that's a very natural way to train a recurrent neural network that is meant to be providing a continuous output.
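Continuing the same illustrative sketch, the backward pass changes very little when targets are given at several time steps: we just add in each slice's error derivative as we reach it. The dictionary-of-targets interface below is an assumption made for illustration.

```python
import numpy as np

def backward_with_targets(W, h, targets):
    """Backward pass of the sketch above when desired states are supplied at
    several time steps (e.g. the last few, to force the net to settle there).
    h is the stack of activities from the forward pass, h[0] ... h[T], and
    targets maps a time step to its desired state (illustrative names)."""
    T = len(h) - 1
    dW = np.zeros_like(W)
    dh = np.zeros_like(h[0])
    for t in range(T, 0, -1):
        if t in targets:
            dh = dh + (h[t] - targets[t])  # add in this slice's error derivative
        da = dh * (1.0 - h[t] ** 2)        # back through the tanh
        dW += np.outer(da, h[t - 1])       # shared weights: sum over time steps
        dh = W.T @ da
    return dW, dh                          # dh is the gradient for the initial state
```

Restricting the targets to a subset of output units would simply mean masking that per-slice error term so it only touches those units.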