In this video, I am going to give an overview of various types of models that have been used for sequences. I'll start with the simplest kind of model, autoregressive models, which just try to predict the next term of the sequence from previous terms. I'll talk about more elaborate variants of them that use hidden units. And then I'll talk about more interesting kinds of models that have hidden state and hidden dynamics. These include linear dynamical systems and hidden Markov models. Most of these are quite complicated kinds of models, and I don't expect you to understand all the details of them. The main point of mentioning them is to be able to show how recurrent neural networks are related to models of that kind.

When we're using machine learning to model sequences, we often want to turn one sequence into another sequence. For example, we might want to turn English words into French words, or we might want to take a sequence of sound pressures and turn it into a sequence of word identities, which is what's happening in speech recognition.

Sometimes we don't have a separate target sequence, and in that case we can get a teaching signal by trying to predict the next term in the input sequence. So the target output sequence is simply the input sequence advanced by one time step. This seems much more natural than trying to predict one pixel in an image from all the other pixels, or one patch of an image from the rest of the image. One reason it probably seems more natural is that for temporal sequences there is a natural order in which to do the predictions, whereas for images it's not clear what you should predict from what. But in fact a similar approach works very well for images.

When we predict the next term in a sequence, it blurs the distinction between supervised and unsupervised learning that I made at the beginning of the course. We use methods that were designed for supervised learning to predict the next term, but we don't require a separate teaching signal. So in that sense, it's unsupervised.

I'm now going to give a quick review of some of the other models of sequences, before we get on to using recurrent neural nets to model sequences. A nice simple model for sequences that doesn't have any memory is an autoregressive model. What that does is take some previous terms in the sequence and try to predict the next term, basically as a weighted average of those previous terms. The previous terms might be individual values or they might be whole vectors. A linear autoregressive model would just take a weighted average of those to predict the next term. We can make that considerably more complicated by adding hidden units. So in a feedforward neural net, we might take some previous input terms, put them through some hidden units, and predict the next term.
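To make that concrete, here is a minimal sketch, not from the lecture, of a memoryless linear autoregressive predictor in NumPy. The function names, the least-squares fitting, and the toy sine-wave data are all my own illustrative choices.

```python
import numpy as np

def fit_linear_ar(series, k):
    """Fit weights w and bias b so that x[t] is approximated by
    w . [x[t-k], ..., x[t-1]] + b, i.e. a weighted sum of the k previous terms."""
    X = np.array([series[t - k:t] for t in range(k, len(series))])
    y = np.array(series[k:])
    Xb = np.hstack([X, np.ones((len(X), 1))])   # add a bias column
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef[:-1], coef[-1]                  # weights, bias

def predict_next(series, w, b):
    """Predict the next term from the last len(w) observed terms."""
    return float(np.dot(w, series[-len(w):]) + b)

# Toy usage: predict a noisy sine wave from its four previous values.
t = np.arange(200)
x = np.sin(0.2 * t) + 0.05 * np.random.randn(200)
w, b = fit_linear_ar(x, k=4)
print(predict_next(x, w, b))
```

The feedforward variant described above would simply replace the weighted sum with a small network: put the k previous terms through a hidden layer and read the prediction off the output.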
Memoryless models are only one subclass of models that can be used for sequences. We can think about ways of generating sequences, and one very natural way to generate a sequence is to have a model that has some hidden state with its own internal dynamics. So the hidden state evolves according to its internal dynamics, and the hidden state also produces observations, and we get to see those observations. That's a much more interesting kind of model. It can store information in its hidden state for a long time. Unlike the memoryless models, there's no simple bound on how far back we have to look before we can be sure something is no longer affecting things.

If the dynamics of the hidden state is noisy, and the way it generates outputs from its hidden state is noisy, then by observing the output of a generative model like this you can never know for sure what its hidden state was. The best you can do is to infer a probability distribution over the space of all possible hidden state vectors. You can know that it's probably in some part of the space and not another part of the space, but you can't pin it down exactly.

So with a generative model like this, if you get to observe what it produces and you now try to infer what the hidden state was, in general that's very hard. But there are two types of hidden state model for which the computation is tractable; that is, there's a fairly straightforward computation that allows you to infer the probability distribution over the hidden state vectors that might have caused the data. Of course, when we do this and apply it to real data, we're assuming that the real data was generated by our model. That's typically what we do when we're modeling things: we assume the data was generated by the model, and then we infer what state the model must have been in in order to generate that data.
The next three slides are mainly intended for people who already know about the two types of hidden state model I'm going to describe. The point of the slides is to make it clear how recurrent neural networks differ from those standard models. If you can't follow the details of the two standard models, don't worry too much; that's not the main point.

One standard model is a linear dynamical system. It's very widely used in engineering. This is a generative model that has a real-valued hidden state. The hidden state has linear dynamics, shown by those red arrows on the right, and the dynamics has Gaussian noise, so the hidden state evolves probabilistically. There may also be driving inputs, shown at the bottom there, which directly influence the hidden state. So the inputs influence the hidden state directly, and the hidden state determines the output. To predict the next output of a system like this, we need to be able to infer its hidden state. These kinds of systems are used, for example, for tracking missiles. In fact, one of the earliest uses of Gaussian distributions was for trying to track planets from noisy observations. Gauss actually figured out that, if you assume Gaussian noise, you could do a good job of that.

One nice property a Gaussian has is that if you linearly transform a Gaussian you get another Gaussian. Because all the noise in a linear dynamical system is Gaussian, it turns out that the distribution over the hidden state, given the observations so far, that is, given the outputs so far, is also a Gaussian. It's a full-covariance Gaussian, and it's quite complicated to compute what it is, but it can be computed efficiently. There's a technique called Kalman filtering, which is an efficient recursive way of updating your representation of the hidden state given a new observation.

So, to summarize: given observations of the output of the system, we can't be sure what hidden state it was in, but we can estimate a Gaussian distribution over the possible hidden states it might have been in. Always assuming, of course, that our model is a correct model of the reality we're observing.
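To show what that recursion looks like, here is a minimal sketch of a single Kalman filter step, assuming the usual textbook parameterisation of a linear dynamical system: hidden state h[t] = A h[t-1] + B u[t] + process noise, observation y[t] = C h[t] + observation noise. The matrix symbols and function names are my own, not the lecture's.

```python
import numpy as np

def kalman_step(mu, P, u, y, A, B, C, Q, R):
    """One recursive update of the Gaussian belief (mean mu, covariance P) over the
    hidden state, after seeing driving input u and observation y."""
    # Predict: push the current belief through the linear dynamics.
    mu_pred = A @ mu + B @ u
    P_pred = A @ P @ A.T + Q
    # Update: correct the prediction using the new observation.
    S = C @ P_pred @ C.T + R                 # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    P_new = (np.eye(len(mu)) - K @ C) @ P_pred
    return mu_new, P_new
```

Applied recursively over a sequence of observations, this maintains exactly the full-covariance Gaussian over the hidden state described above.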
A different kind of hidden state model, which uses discrete distributions rather than Gaussians, is a hidden Markov model. And because it's based on discrete mathematics, computer scientists love these ones.

In a hidden Markov model, the hidden state consists of a one-of-N choice. So there are a number of things called states, and the system is always in exactly one of those states. The transitions between states are probabilistic. They're controlled by a transition matrix, which is simply a bunch of probabilities that say: if you're in state one at time one, what's the probability of going to state three at time two? The output model is also stochastic, so the state that the system is in doesn't completely determine what output it produces. There's some variation in the output that each state can produce. Because of that, we can't be sure which state produced a given output. In a sense, the states are hidden behind this probabilistic veil, and that's why they're called hidden. Historically, the reason hidden units in a neural network are called hidden is that I liked this term. It sounded mysterious, so I stole it for neural networks.

It is easy to represent the probability distribution across N states with N numbers. So the nice thing about a hidden Markov model is that we can represent the probability distribution across its discrete states. Even though we don't know for sure what state it's in, we can easily represent the probability distribution.

To predict the next output from a hidden Markov model, we need to infer what hidden state it's probably in, and so we need to get our hands on that probability distribution. It turns out there's an easy method based on dynamic programming that allows us to take the observations we've made and, from those, compute the probability distribution across the hidden states. Once we have that distribution, there is a nice elegant learning algorithm for hidden Markov models, and that's what made them so appropriate for speech. In the 1970s, they took over speech recognition.
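The dynamic-programming method referred to here is the forward algorithm; below is a minimal, normalised sketch of its filtering form, assuming discrete observations. The names pi, T, and E are my own labels for the initial distribution, transition matrix, and emission probabilities.

```python
import numpy as np

def filtered_state_distributions(pi, T, E, observations):
    """Return P(state at time t | observations up to time t) for every t.
    pi: (N,) initial state distribution; T: (N, N) transition matrix with
    T[i, j] = P(next state j | current state i); E: (N, M) emission matrix with
    E[i, o] = P(observation o | state i)."""
    alpha = pi * E[:, observations[0]]
    alpha /= alpha.sum()                      # normalise to a distribution
    history = [alpha]
    for o in observations[1:]:
        alpha = (alpha @ T) * E[:, o]         # propagate, then weight by evidence
        alpha /= alpha.sum()
        history.append(alpha)
    return np.array(history)
```

Predicting the next output then just means pushing this distribution through the transition and emission probabilities.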
There's a fundamental limitation of HMMs. It's easiest to understand this limitation if we consider what happens when a hidden Markov model generates data. At each time step when it's generating, it selects one of its hidden states. So if it's got N hidden states, the temporal information stored in the hidden state is at most log(N) bits. That's all it knows about what it's done so far.

So now let's consider how much information a hidden Markov model can convey to the second half of an utterance it produces from the first half. Imagine it's already produced the first half of an utterance, and now it's going to have to produce the second half. Remember, its memory of what it said for the first half is just which of the N states it's in, so its memory only has log(N) bits of information in it. To produce a second half that's compatible with the first half, it must make the syntax fit; for example, the number and tense must agree. It also needs to make the semantics fit: it can't have the second half of the sentence be about something totally different from the first half. The intonation also needs to fit; it would seem very silly if the intonation contour completely changed halfway through the sentence. There are a lot of other things that also have to fit: the accent of the speaker, the rate they're speaking at, how loudly they're speaking, and the vocal tract characteristics of the speaker. All of those things must fit between the second half of the sentence and the first half. So if you wanted a hidden Markov model to actually generate a sentence, the hidden state has to be able to convey all that information from the first half to the second half.

Now the problem is that all of those aspects could easily come to a hundred bits of information. So the first half of the sentence needs to convey a hundred bits of information to the second half, and that means the hidden Markov model needs 2^100 states, and that's just too many.

So that brings us to recurrent neural networks. They have a much more efficient way of remembering information. They're very powerful because they combine two properties. They have a distributed hidden state: several different units can be active at once, so they can remember several different things at once; they don't just have one active unit. They're also nonlinear. A linear dynamical system has a whole hidden state vector, so it's got more than one value at a time, but those values are constrained to act in a linear way so as to make inference easy. In a recurrent neural network we allow the dynamics to be much more complicated.
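To put rough numbers on that contrast (my own arithmetic, not a slide from the lecture):

```latex
% Memory capacity of an N-state HMM versus a distributed binary hidden state.
\[
  \text{an HMM with } N \text{ states carries at most } \log_2 N \text{ bits per time step,}
\]
\[
  \text{so carrying } 100 \text{ bits requires } N \ge 2^{100} \approx 1.3 \times 10^{30} \text{ states,}
\]
\[
  \text{whereas } H \text{ binary hidden units already give } 2^{H} \text{ joint states, i.e. up to } H \text{ bits.}
\]
```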
With enough neurons and enough time, a recurrent neural network can compute anything that can be computed by your computer. It's a very powerful device.

So linear dynamical systems and hidden Markov models are both stochastic models. That is, the dynamics and the production of observations from the underlying state both involve intrinsic noise. And the question is: do models need to be like that? Well, one thing to notice is that the posterior probability distribution over hidden states, in either a linear dynamical system or a hidden Markov model, is a deterministic function of the data you've seen so far. That is, the inference algorithm for these systems ends up with a probability distribution; that probability distribution is just a bunch of numbers, and those numbers are a deterministic function of the data so far. In a recurrent neural network, you also get a bunch of numbers that are a deterministic function of the data so far. And it might be a good idea to think of those numbers, the ones that constitute the hidden state of a recurrent neural network, as being rather like the probability distribution in these simple stochastic models.

So what kinds of behavior can recurrent neural networks exhibit? Well, they can oscillate. That's obviously good for things like motion control, where, when you're walking for example, you want a regular oscillation, which is your stride. They can settle to point attractors. That might be good for retrieving memories. Later on in the course we'll look at Hopfield nets, where settling to point attractors is used to store memories. The idea is that you have a rough idea of what you're trying to retrieve; you then let the system settle down to a stable point, and those stable points correspond to the things you know about. So by settling to that stable point you retrieve a memory.

They can also behave chaotically if you set the weights in the appropriate regime. Often, chaotic behavior is bad for information processing, because in information processing you want to be able to behave reliably; you want to achieve something. But there are some circumstances where it's a good idea: if you're up against a much smarter adversary, you probably can't outwit them, so it might be a good idea just to behave randomly, and one way to get the appearance of randomness is to behave chaotically.
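As a concrete illustration of the earlier point that the RNN's hidden state is a deterministic, nonlinear, distributed function of the inputs seen so far, here is a minimal sketch of a vanilla recurrent network update. The weight names and the tanh nonlinearity are standard choices of my own, not something specified in the lecture.

```python
import numpy as np

def rnn_hidden_states(inputs, W_in, W_rec, b, h0=None):
    """Return the hidden state vector after each input vector in `inputs`."""
    h = np.zeros(W_rec.shape[0]) if h0 is None else h0
    states = []
    for x in inputs:
        # Deterministic update: no noise anywhere, unlike an LDS or HMM.
        h = np.tanh(W_in @ x + W_rec @ h + b)
        states.append(h)
    return np.array(states)

# Toy usage with random weights: 3-dimensional inputs, 5 hidden units.
rng = np.random.default_rng(0)
W_in, W_rec, b = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), np.zeros(5)
hs = rnn_hidden_states(rng.normal(size=(10, 3)), W_in, W_rec, b)
print(hs.shape)   # (10, 5): one distributed, real-valued state per time step
```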
One nice thing about RNNs, which a long time ago I thought was going to make them very powerful, is that an RNN could learn to implement lots of little programs, using different subsets of its hidden state. Each of these little programs could capture a nugget of knowledge, and all of them could run in parallel and interact with each other in complicated ways.

Unfortunately, the computational power of recurrent neural networks makes them very hard to train. For many years we couldn't exploit the computational power of recurrent neural networks. There were some heroic efforts. For example, Tony Robinson managed to make quite a good speech recognizer using recurrent nets. He had to do a lot of work implementing them on a parallel computer built out of transputers. And it was only recently that people managed to produce recurrent neural networks that outperformed Tony Robinson's