In this video I'm going to describe various kinds of architectures for neural networks. What I mean by an architecture is the way in which the neurons are connected together.

By far the commonest type of architecture in practical applications is a feed-forward neural network, where the information comes into the input units and flows in one direction through hidden layers until it reaches the output units. A much more interesting kind of architecture is a recurrent neural network, in which information can flow around in cycles. These networks can remember information for a long time. They can exhibit all sorts of interesting oscillations, but they are much more difficult to train, in part because they are so much more complicated in what they can do. Recently, however, people have made a lot of progress in training recurrent neural networks, and they can now do some fairly impressive things. The last kind of architecture that I'll describe is a symmetrically connected network, one in which the weights are the same in both directions between two units.

The commonest type of neural network in practical applications is a feed-forward neural network. This has some input units in the first layer at the bottom, some output units in the last layer at the top, and one or more layers of hidden units in between. If there's more than one layer of hidden units, we call them deep neural networks. These networks compute a series of transformations between their input and their output. So at each layer, you get a new representation of the input in which things that were similar in the previous layer may have become less similar, or things that were dissimilar in the previous layer may have become more similar. In speech recognition, for example, we'd like the same thing said by different speakers to become more similar, and different things said by the same speaker to become less similar, as we go up through the layers of the network. In order to achieve this, we need the activities of the neurons in each layer to be a non-linear function of the activities in the layer below.
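As a concrete illustration of the layer-by-layer computation just described, here is a minimal feed-forward sketch in Python with NumPy. The layer sizes, the tanh non-linearity, and the random untrained weights are illustrative assumptions, not details from the lecture.

```python
import numpy as np

def feed_forward(x, weights, biases):
    """Propagate an input vector through successive layers.

    Each layer applies a linear map followed by a non-linearity, so every
    layer gives a new, non-linear representation of the layer below.
    """
    activation = x
    for W, b in zip(weights, biases):
        activation = np.tanh(W @ activation + b)  # non-linear function of the previous layer
    return activation

# Illustrative network: 4 inputs -> two hidden layers -> 2 outputs
# (more than one hidden layer, so it counts as "deep").
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]
weights = [rng.standard_normal((n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

output = feed_forward(rng.standard_normal(4), weights, biases)
print(output)
```

The only essential point the sketch is meant to show is that the activities in each layer are a non-linear function of the activities in the layer below.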
Recurrent neural networks are much more powerful than feed-forward neural networks. They have directed cycles in their connection graph. What this means is that if you start at a node or a neuron and follow the arrows, you can sometimes get back to the neuron you started at. They can have very complicated dynamics, and this can make them very difficult to train. There's a lot of interest at present in finding efficient ways of training recurrent neural networks, because they are so powerful if we can train them. They're also more biologically realistic. Recurrent neural networks with multiple hidden layers are really just a special case of a general recurrent neural network that has some of its hidden-to-hidden connections missing.

Recurrent networks are a very natural way to model sequential data. So what we do is we have connections between the hidden units, and the hidden units act like a network that's very deep in time. At each time step, the states of the hidden units determine the states of the hidden units at the next time step. One way in which they differ from feed-forward nets is that we use the same weights at every time step. So if you look at those red arrows, where the hidden units are determining the next state of the hidden units, the weight matrix depicted by each red arrow is the same at each time step. They also get inputs at every time step, and often give outputs at every time step, and those will use the same weight matrices too. Recurrent networks have the ability to remember information in their hidden state for a long time. Unfortunately, it's quite hard to train them to use that ability. However, recent algorithms have been able to do that.

So just to show you what recurrent neural nets can now do, I'm going to show you a net designed by Ilya Sutskever. It's a special kind of recurrent neural net, slightly different from the kind in the diagram on the previous slide, and it's used to predict the next character in a sequence. Ilya trained it on lots and lots of strings from English Wikipedia: it sees English characters and tries to predict the next English character. He actually used 86 different characters, to allow for punctuation, digits, capital letters and so on. After you've trained it, one way of seeing how well it can do is to see whether it assigns high probability to the next character that actually occurs. Another way of seeing what it can do is to get it to generate text. So what you do is you give it a string of characters and get it to predict probabilities for the next character, and then you pick the next character from that probability distribution.
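To make the weight sharing and the sampling procedure concrete, here is a minimal character-level recurrent net sketch in Python with NumPy. This is not Sutskever's model, which was a special kind of recurrent net trained on Wikipedia with 86 characters; the tiny alphabet, the hidden size, and the random untrained weights are assumptions chosen only to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in alphabet; the model in the lecture used 86 characters.
vocab = list("abcdefgh ")
V, H = len(vocab), 16

# The same three weight matrices are reused at every time step.
W_xh = rng.standard_normal((H, V)) * 0.1   # input -> hidden
W_hh = rng.standard_normal((H, H)) * 0.1   # hidden -> hidden (the "red arrows")
W_hy = rng.standard_normal((V, H)) * 0.1   # hidden -> output

def one_hot(i):
    v = np.zeros(V)
    v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(n_chars, start_index=0):
    """Generate characters one at a time, sampling from the predicted distribution."""
    h = np.zeros(H)
    i = start_index
    out = []
    for _ in range(n_chars):
        # The hidden state is a function of the current input and the previous
        # hidden state, using the same weights at every time step.
        h = np.tanh(W_xh @ one_hot(i) + W_hh @ h)
        p = softmax(W_hy @ h)        # distribution over the next character
        i = rng.choice(V, p=p)       # sample, rather than always taking the most likely one
        out.append(vocab[i])
    return "".join(out)

print(generate(40))   # untrained weights, so the output is gibberish
```

With trained weights, a loop like this is what produces generated text: the same weight matrices at every step, and the next character drawn from the predicted distribution.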
It's no use picking the most likely character. If you do that, after a while it starts saying "the United States of the United States of the United States of the United States". That tells you something about Wikipedia. But if you pick from the probability distribution, so that if it says there's a one in 100 chance it was a Z, you pick a Z one time in 100, then you see much more about what it's learned.

The next slide shows an example of the text that it generates, and it's interesting to notice how much is learned just by reading Wikipedia and trying to predict the next character. So remember, this text was generated one character at a time. Notice that it makes reasonably sensible sentences, and they're composed almost entirely of real English words. Occasionally it makes a non-word, but they're typically sensible ones. And notice that, within a sentence, it has some thematic consistency. So the phrase "Several perishing intelligence agents is in the Mediterranean region" has problems, but it's almost good English. Notice also the thing it says at the end, "such that it is the blurring of appearing on any well-paid type of box printer". There's a certain sort of thematic thing there about appearance and printing, and the syntax is pretty good. And remember, that's one character at a time.

Quite different from recurrent nets are symmetrically connected networks. In these, the connections between units have the same weight in both directions. John Hopfield and others realized that symmetric networks are much easier to analyze than recurrent networks. This is mainly because they're more restricted in what they can do, and that's because they obey an energy function. So they cannot, for example, model cycles. You can't get back to where you started in one of these symmetric networks.
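The lecture only says that symmetrically connected networks obey an energy function. As a hedged illustration, here is the standard Hopfield energy for binary units in Python with NumPy; the network size and the random symmetric weights are illustrative assumptions.

```python
import numpy as np

def hopfield_energy(s, W, b):
    """Energy of a binary state vector s in a symmetrically connected network.

    E = -sum_i b_i s_i - sum_{i<j} w_ij s_i s_j,
    with W symmetric and a zero diagonal (no self-connections).
    """
    return -b @ s - 0.5 * s @ W @ s

# Illustrative 4-unit network with symmetric weights (w_ij == w_ji).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)
b = np.zeros(4)

s = np.array([1.0, -1.0, 1.0, -1.0])
print(hopfield_energy(s, W, b))
```

With symmetric weights and no self-connections, flipping one unit at a time can never increase this energy, which is why the dynamics settle into stable states rather than cycling.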