In this video, I'm going to describe how a recurrent neural network solves a toy problem. It's a problem that's chosen to demonstrate what it is you can do with recurrent neural networks that you cannot do conveniently with feed-forward neural networks. The problem is adding up two binary numbers. After the recurrent neural network has learned to solve the problem, it's interesting to look at its hidden states and see how they relate to the hidden states in a finite state automaton that's solving the same problem.

So consider the problem of adding up two binary numbers. We could train a feed-forward neural network to do that, and the diagram on the right shows a network that gets some inputs and produces some outputs. But there are problems with using a feed-forward neural network. We have to decide in advance what the maximum number of digits is, both for the two input numbers and for the output number. And more importantly, the processing that we apply to the different bits of the input numbers doesn't generalize. That is, when we learn how to add up the last two digits and deal with the carries, that knowledge is in some weights. And as we go to a different part of a long binary number, the knowledge will have to be in different weights, so we won't get automatic generalization. As a result, although you can train a feed-forward neural network and it will eventually learn to do binary addition on fixed-length numbers, it's not an elegant way to solve the problem.

This is a picture of the algorithm for binary addition. The states shown here are like the states in a hidden Markov model, except they're not really hidden. The system is in one state at a time. When it enters a state it performs an action, so it either prints a one or prints a zero, and when it's in a state it gets some input, which is the two digits in the next column. That input causes it to go into a new state. So if you look at the top right, it's in the carry state and it's just printed a one. If it sees a one, one, it goes back into the same state and prints another one. If, however, it sees a one, zero or a zero, one, it goes into the carry state but prints a zero. If it sees a zero, zero, it goes into the no-carry state and prints a one. And so on.
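To make the automaton concrete, here is a minimal Python sketch of the four-state machine just described. The encoding of each state as a (carry, printed digit) pair and the function name are my own choices for illustration; the columns are read least-significant digit first.

```python
def fsa_binary_add(a_bits, b_bits):
    """Add two binary numbers with the four-state automaton described above.

    Bits are given least-significant column first.  Each state is a
    (carry, printed digit) pair: entering a state 'prints' its digit.
    """
    state = (0, None)                       # start with no carry, nothing printed yet
    output = []
    for a, b in zip(a_bits, b_bits):        # read the two digits in the next column
        column_sum = a + b + state[0]       # column digits plus the incoming carry
        state = (column_sum // 2, column_sum % 2)   # new (carry, digit to print)
        output.append(state[1])             # entering the state prints its digit
    if state[0]:                            # a leftover carry prints one final 1
        output.append(1)
    return output

# 6 + 6 = 12:  110 + 110 = 1100 (shown least-significant bit first)
print(fsa_binary_add([0, 1, 1], [0, 1, 1]))   # [0, 0, 1, 1]
```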
So a recurrent neural net for binary addition needs to have two input units and one output unit. It's given two input digits at each time step, and it also has to produce an output at each time step. And the output is the output for the column that it took in two time steps ago. The reason we need a delay of two time steps is that it takes one time step to update the hidden units based on the inputs, and another time step to produce the output from the hidden state.

So the net looks like this. I only gave it three hidden units. That's sufficient to do the job; it would learn faster with more hidden units, but it can do it with three. The three hidden units are fully interconnected, and they have connections in both directions that don't necessarily have the same weight. In fact, in general they don't have the same weight. The connections between the hidden units allow the pattern of activity at one time step to influence the hidden activity pattern at the next time step. The input units have feed-forward connections to the hidden units, and that's how it sees the two digits in a column. And similarly, the hidden units have feed-forward connections to the output unit, and that's how it produces its output.
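Here is a minimal sketch of that wiring, not the trained network from the lecture: two input units feed three fully interconnected hidden units, which feed one logistic output unit. The weights below are random placeholders (in the lecture they are learned, not set by hand), and the two-step delay shows up in how the training targets would be aligned, not in the wiring itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Wiring of the toy adder network described above: 2 input units,
# 3 fully interconnected hidden units, 1 output unit.
W_in  = rng.standard_normal((3, 2))   # input digits -> hidden units (feed-forward)
W_hh  = rng.standard_normal((3, 3))   # hidden -> hidden (recurrent; not symmetric)
W_out = rng.standard_normal((1, 3))   # hidden units -> output unit (feed-forward)

def forward(columns):
    """Run the net over a sequence of 2-digit input columns.

    During training, the target for the output at time t would be the sum
    digit for the column read two time steps earlier, reflecting the
    two-step delay described above.
    """
    h = np.zeros(3)
    outputs = []
    for x in columns:
        h = np.tanh(W_in @ np.asarray(x, dtype=float) + W_hh @ h)  # update hidden state
        y = 1 / (1 + np.exp(-(W_out @ h)[0]))                      # logistic output unit
        outputs.append(y)
    return outputs

print(forward([(0, 1), (1, 1), (1, 0)]))  # three columns in, three (untrained) outputs out
```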
It's interesting to look at what the recurrent neural network learns. It learns four distinct patterns of activity in its three hidden units, and these patterns correspond to the nodes in the finite state automaton for binary addition. We must not confuse the units in a neural network with the nodes in a finite state automaton. The nodes in the finite state automaton correspond to the activity vectors of the recurrent neural network. The automaton is restricted to being in exactly one state at each time, and similarly, the hidden units are restricted to have exactly one activity vector at each time in the recurrent neural network.

So a recurrent neural network can emulate a finite state automaton, but it's exponentially more powerful in its representation. With N hidden neurons, it has 2^N possible binary activity vectors. Of course, it only has N^2 weights, so it can't necessarily make full use of all that representational power. But if the bottleneck is in the representation, a recurrent neural network can do much better than a finite state automaton. This is important when the input stream has two separate things going on at once. A finite state automaton needs to square its number of states in order to deal with the fact that there are two things going on at once, whereas a recurrent neural network only needs to double its number of hidden units. By doubling the number of hidden units, it does of course square the number of binary vector states that it has.
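Just to spell out that counting argument, here is a small illustration with numbers of my own choosing, not figures from the lecture.

```python
# With N hidden units there are 2**N possible binary activity vectors,
# but only about N**2 recurrent weights to set them up with.
for n in (3, 6, 12):
    print(f"{n:>2} hidden units: {2**n:>5} binary activity vectors, {n*n:>4} recurrent weights")

# Doubling the hidden units squares the number of activity vectors,
# which is what a finite state automaton would need to do to its explicit state count.
assert 2 ** (2 * 3) == (2 ** 3) ** 2
```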