In this video, I'm going to give an overview of various types of models that have been used for sequences. I'll start with the simplest kind of model: autoregressive models, which just try to predict the next term of the sequence from previous terms. I'll talk about more elaborate variants of them that use hidden units. And then I'll talk about more interesting kinds of models that have hidden state and hidden dynamics. These include linear dynamical systems and hidden Markov models. Most of these are quite complicated kinds of models, and I don't expect you to understand all the details of them. The main point of mentioning them is to be able to show how recurrent neural networks are related to models of that kind.

When we're using machine learning to model sequences, we often want to turn one sequence into another sequence. For example, we might want to turn English words into French words, or we might want to take a sequence of sound pressures and turn it into a sequence of word identities, which is what happens in speech recognition. Sometimes we don't have a separate target sequence, and in that case we can get a teaching signal by trying to predict the next term in the input sequence. So the target output sequence is simply the input sequence advanced by one time step. This seems much more natural than trying to predict one pixel in an image from all the other pixels, or one patch of an image from the rest of the image. One reason it probably seems more natural is that for temporal sequences there is a natural order in which to do the predictions, whereas for images it's not clear what you should predict from what. But in fact a similar approach works very well for images. When we predict the next term in a sequence, it blurs the distinction between supervised and unsupervised learning that I made at the beginning of the course: we use methods that were designed for supervised learning to predict the next term, but we don't require a separate teaching signal, so in that sense it's unsupervised.

I'm now going to give a quick review of some of the other models of sequences, before we get on to using recurrent neural nets to model sequences. A nice simple model for sequences that doesn't have any memory is an autoregressive model. What that does is take some previous terms in the sequence and try to predict the next term, basically as a weighted average of those previous terms. The previous terms might be individual values or they might be whole vectors, and a linear autoregressive model would just take a weighted average of them to predict the next term. We can make that considerably more complicated by adding hidden units. So in a feedforward neural net, we might take some previous input terms, put them through some hidden units, and predict the next term.
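As an illustration of the memoryless linear autoregressive model described above, here is a minimal sketch in Python with NumPy. The order k, the least-squares fitting, and the toy sine-wave data are my own choices for illustration, not anything specified in the lecture.

```python
import numpy as np

def fit_linear_ar(x, k):
    """Fit a linear autoregressive model of order k by least squares.

    Each target x[t] is predicted as a weighted sum of the k previous
    values x[t-k], ..., x[t-1], plus a bias term.
    """
    X = np.array([x[t - k:t] for t in range(k, len(x))])  # lagged values
    y = np.array(x[k:])                                   # next-term targets
    X = np.hstack([X, np.ones((len(y), 1))])              # append bias column
    w, *_ = np.linalg.lstsq(X, y, rcond=None)             # least-squares weights
    return w

def predict_next(x, w):
    """Predict the next term from the last k observed values."""
    k = len(w) - 1
    return np.dot(np.append(x[-k:], 1.0), w)

# Toy usage: a noisy sine wave.
t = np.arange(200)
series = np.sin(0.2 * t) + 0.05 * np.random.randn(200)
w = fit_linear_ar(series, k=5)
print(predict_next(series, w), np.sin(0.2 * 200))  # prediction vs. true next value
```

A feedforward net that predicts the next term from a fixed window of previous terms is the same idea, with the weighted average replaced by a learned nonlinear function of the window.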
Memoryless models are only one subclass of the models that can be used for sequences. We can think about ways of generating sequences, and one very natural way to generate a sequence is to have a model with some hidden state that has its own internal dynamics. The hidden state evolves according to its internal dynamics, and it also produces observations, and we get to see those observations. That's a much more interesting kind of model: it can store information in its hidden state for a long time. Unlike the memoryless models, there's no simple bound on how far back we have to look before we can be sure that something is no longer affecting things.

If the dynamics of the hidden state is noisy and the way it generates outputs from its hidden state is noisy, then by observing the output of a generative model like this, you can never know for sure what its hidden state was. The best you can do is to infer a probability distribution over the space of all possible hidden state vectors. You can know that it's probably in some part of the space and not another part, but you can't pin it down exactly. So with a generative model like this, if you get to observe what it produces and you then try to infer what the hidden state was, in general that's very hard. But there are two types of hidden state model for which the computation is tractable: that is, there's a fairly straightforward computation that allows you to infer the probability distribution over the hidden state vectors that might have caused the data. Of course, when we do this and apply it to real data, we're assuming that the real data was generated by our model. That's typically what we do when we're modeling things: we assume the data was generated by the model and then we infer what state the model must have been in, in order to generate that data.

The next three slides are mainly intended for people who already know about the two types of hidden state model I'm going to describe. The point of the slides is to make it clear how recurrent neural networks differ from those standard models. If you can't follow the details of the two standard models, don't worry too much; that's not the main point.

One standard model is a linear dynamical system, which is very widely used in engineering. This is a generative model that has real-valued hidden state. The hidden state has linear dynamics, shown by the red arrows on the right, and the dynamics has Gaussian noise, so the hidden state evolves probabilistically. There may also be driving inputs, shown at the bottom, which directly influence the hidden state. So the inputs influence the hidden state directly, and the hidden state determines the output. To predict the next output of a system like this, we need to be able to infer its hidden state. These kinds of systems are used, for example, for tracking missiles. In fact, one of the earliest uses of Gaussian distributions was for trying to track planets from noisy observations; Gauss figured out that if you assume Gaussian noise, you can do a good job of that.

One nice property a Gaussian has is that if you linearly transform a Gaussian, you get another Gaussian. Because all the noise in a linear dynamical system is Gaussian, it turns out that the distribution over the hidden state, given the observations so far (that is, given the outputs so far), is also a Gaussian. It's a full-covariance Gaussian, and it's quite complicated to compute what it is, but it can be computed efficiently. The technique is called Kalman filtering: an efficient recursive way of updating your representation of the hidden state given a new observation. So, to summarize: given observations of the output of the system, we can't be sure what hidden state it was in, but we can estimate a Gaussian distribution over the possible hidden states it might have been in, always assuming, of course, that our model is a correct model of the reality we're observing.
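Here is a minimal sketch of the kind of recursive update Kalman filtering performs, for a linear dynamical system with dynamics matrix A, observation matrix C, and Gaussian process and observation noise covariances Q and R. The particular matrices in the toy example are placeholders I've made up, not anything from the lecture.

```python
import numpy as np

def kalman_step(mu, P, y, A, C, Q, R):
    """One Kalman filter step: given the Gaussian belief N(mu, P) over the
    hidden state at the previous time step and a new observation y, return
    the Gaussian belief over the hidden state at the current time step."""
    # Predict: push the belief through the linear dynamics.
    mu_pred = A @ mu
    P_pred = A @ P @ A.T + Q
    # Update: correct the prediction using the new observation.
    S = C @ P_pred @ C.T + R                 # covariance of the predicted observation
    K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    P_new = (np.eye(len(mu)) - K @ C) @ P_pred
    return mu_new, P_new

# Toy usage: 2-D hidden state (position, velocity), 1-D observation of position.
A = np.array([[1.0, 1.0], [0.0, 1.0]])   # constant-velocity dynamics
C = np.array([[1.0, 0.0]])               # we only observe position
Q = 0.01 * np.eye(2)                     # process noise
R = np.array([[0.5]])                    # observation noise
mu, P = np.zeros(2), np.eye(2)
for y in [0.9, 2.1, 2.8, 4.2]:
    mu, P = kalman_step(mu, P, np.array([y]), A, C, Q, R)
print(mu)   # inferred mean of the hidden state (position and velocity)
```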
A different kind of hidden state model, which uses discrete distributions rather than Gaussian distributions, is a hidden Markov model. And because it's based on discrete mathematics, computer scientists love these ones. In a hidden Markov model, the hidden state consists of a one-of-N choice. So there are a number of things called states, and the system is always in exactly one of those states. The transitions between states are probabilistic: they're controlled by a transition matrix, which is simply a bunch of probabilities that say, if you're in state one at time one, what's the probability of going to state three at time two. The output model is also stochastic, so the state that the system is in doesn't completely determine what output it produces; there's some variation in the output that each state can produce. Because of that, we can't be sure which state produced a given output. In a sense, the states are hidden behind this probabilistic veil, and that's why they're called hidden. In fact, that's where the name for the hidden units in a neural network came from: I liked the term, it sounded mysterious, so I stole it from hidden Markov models.

It's easy to represent the probability distribution across N states with N numbers. So the nice thing about a hidden Markov model is that even though we don't know for sure what state it's in, we can easily represent the probability distribution across its discrete states. To predict the next output from a hidden Markov model, we need to infer which hidden state it's probably in, and so we need to get our hands on that probability distribution. It turns out there's an easy method based on dynamic programming that allows us to take the observations we've made and, from those, compute the probability distribution across the hidden states. Once we have that distribution, there's a nice, elegant learning algorithm for hidden Markov models, and that's what made them so appropriate for speech: in the 1970s, they took over speech recognition.
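The dynamic-programming method referred to above is usually called the forward algorithm. Here is a minimal sketch of it; the two-state transition, emission, and initial probabilities in the toy example are made up for illustration.

```python
import numpy as np

def forward(observations, init, trans, emit):
    """Forward algorithm: compute p(hidden state at the final step | observations).

    init[i]     = p(state i at time 0)
    trans[i, j] = p(state j at t+1 | state i at t)
    emit[i, o]  = p(observation symbol o | state i)
    """
    alpha = init * emit[:, observations[0]]      # joint p(state, first observation)
    for o in observations[1:]:
        alpha = (alpha @ trans) * emit[:, o]     # propagate, then weight by the new observation
    return alpha / alpha.sum()                   # normalise to a distribution over states

# Toy usage: 2 hidden states, 2 possible observation symbols.
init  = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
emit  = np.array([[0.9, 0.1],
                  [0.3, 0.7]])
print(forward([0, 1, 1, 0], init, trans, emit))  # belief about the current hidden state
```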
There's a fundamental limitation of HMMs, and it's easiest to understand it if we consider what happens when a hidden Markov model generates data. At each time step when it's generating, it selects one of its hidden states. So if it's got N hidden states, the temporal information stored in the hidden state is at most log(N) bits. That's all it knows about what it's done so far. Now consider how much information a hidden Markov model can convey to the second half of an utterance it produces from the first half. Imagine it's already produced the first half of an utterance, and now it has to produce the second half. Its memory of what it said in the first half is just which of the N states it's in, so its memory holds only log(N) bits of information. To produce a second half that's compatible with the first half, the syntax must fit: for example, the number and tense must agree. The semantics also need to fit; the second half of the sentence can't be about something totally different from the first half. The intonation needs to fit too; it would sound very silly if the intonation contour completely changed halfway through the sentence. And there are lots of other things that also have to fit: the accent of the speaker, the rate they're speaking at, how loudly they're speaking, and the vocal tract characteristics of the speaker. All of those things must fit between the second half of the sentence and the first half.

So if you want a hidden Markov model to actually generate a sentence, the hidden state has to be able to convey all that information from the first half to the second half. The problem is that all of those aspects could easily come to a hundred bits of information. So the first half of the sentence needs to convey a hundred bits of information to the second half, which means the hidden Markov model would need 2^100 states, and that's just too many.

That brings us to recurrent neural networks, which have a much more efficient way of remembering information. They're very powerful because they combine two properties. They have distributed hidden state: several different units can be active at once, so they can remember several different things at once; they don't just have one active unit. And they're nonlinear. A linear dynamical system also has a whole hidden state vector, so it has more than one value at a time, but those values are constrained to act in a linear way so as to make inference easy. In a recurrent neural network, we allow the dynamics to be much more complicated. With enough neurons and enough time, a recurrent neural network can compute anything that can be computed by your computer. It's a very powerful device.

Linear dynamical systems and hidden Markov models are both stochastic models: the dynamics and the production of observations from the underlying state both involve intrinsic noise. The question is whether models need to be like that. One thing to notice is that the posterior probability distribution over hidden states, in either a linear dynamical system or a hidden Markov model, is a deterministic function of the data you've seen so far. That is, the inference algorithm for these systems ends up with a probability distribution; that probability distribution is just a bunch of numbers, and those numbers are a deterministic function of the data so far. In a recurrent neural network, you also get a bunch of numbers that are a deterministic function of the data so far, and it might be a good idea to think of those numbers as constituting the hidden state of the recurrent neural network. They're very like the probability distributions of these simpler stochastic models.

So what kinds of behavior can recurrent neural networks exhibit? Well, they can oscillate. That's obviously good for things like motor control, where, when you're walking for example, you want a nice regular oscillation, which is your stride. They can settle to point attractors. That might be good for retrieving memories, and later in the course we'll look at Hopfield nets, which use settling to point attractors to store memories. The idea is that you have a rough idea of what you're trying to retrieve; you let the system settle down to a stable point, and those stable points correspond to the things you know about, so by settling to a stable point you retrieve a memory. They can also behave chaotically if you set the weights in the appropriate regime. Often, chaotic behavior is bad for information processing, because in information processing you want to behave reliably; you want to achieve something. But there are some circumstances where it's a good idea: if you're up against a much smarter adversary, you probably can't outwit them, so it might be better just to behave randomly, and one way to get the appearance of randomness is to behave chaotically.
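To make that deterministic, distributed, nonlinear hidden state concrete, here is a minimal sketch of a vanilla recurrent network's forward pass. The tanh nonlinearity, the weight shapes, and the random toy data are conventional choices of mine, not something specified in the lecture.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla RNN over a sequence of input vectors.

    The hidden state h is a distributed, real-valued vector, and each update
    is a deterministic nonlinear function of the previous state and the input."""
    h = np.zeros(W_hh.shape[0])
    outputs = []
    for x in inputs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)   # nonlinear deterministic dynamics
        outputs.append(W_hy @ h + b_y)           # prediction at this time step
    return outputs, h

# Toy usage: 3-dimensional inputs, 5 hidden units, 1-dimensional outputs.
rng = np.random.default_rng(0)
W_xh, W_hh = 0.1 * rng.standard_normal((5, 3)), 0.1 * rng.standard_normal((5, 5))
W_hy, b_h, b_y = 0.1 * rng.standard_normal((1, 5)), np.zeros(5), np.zeros(1)
inputs = [rng.standard_normal(3) for _ in range(10)]
outputs, final_h = rnn_forward(inputs, W_xh, W_hh, W_hy, b_h, b_y)
print(final_h)   # the hidden state: a deterministic function of the whole input sequence
```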
One nice thing about RNNs, which a long time ago I thought was going to make them very powerful, is that an RNN could learn to implement lots of little programs, using different subsets of its hidden state. Each of these little programs could capture a nugget of knowledge, and all of them could run in parallel and interact with each other in complicated ways. Unfortunately, the computational power of recurrent neural networks also makes them very hard to train, and for many years we couldn't exploit that power. There were some heroic efforts; for example, Tony Robinson managed to make quite a good speech recognizer using recurrent nets, though he had to do a lot of work implementing them on a parallel computer built out of transputers. It was only recently that people managed to produce recurrent neural networks that outperformed Tony Robinson's.