In this video, I'll describe the first way we discovered for getting Sigmoid Belief Nets to learn efficiently. It's called the wake-sleep algorithm, and it should not be confused with Boltzmann machines. Boltzmann machines have two phases, a positive and a negative phase, that could plausibly be related to wake and sleep. But the wake-sleep algorithm is a very different kind of learning, mainly because it's for directed graphical models like Sigmoid Belief Nets, rather than for undirected graphical models like Boltzmann machines. The ideas behind the wake-sleep algorithm led to a whole new area of machine learning called variational learning, which didn't take off until the late 1990s, despite early examples like the wake-sleep algorithm, and is now one of the main ways of learning complicated graphical models in machine learning. The basic idea behind these variational methods sounds crazy. The idea is that since it's hard to compute the correct posterior distribution, we'll compute some cheap approximation to it, and then we'll do maximum likelihood learning anyway. That is, we'll apply the learning rule that would be correct if we'd got a sample from the true posterior, and hope that it works even though we haven't. Now, you could reasonably expect this to be a disaster, but actually the learning comes to your rescue. If you look at what's driving the weights during learning when you use an approximate posterior, there are actually two terms driving the weights. One term is driving them to get a better model of the data, that is, to make the Sigmoid Belief Net more likely to generate the observed data in the training set. But there's another term added to that, which is driving the weights towards sets of weights for which the approximate posterior being used is a good fit to the real posterior. It does this by manipulating the real posterior to try to make it fit the approximate posterior. It's because of this effect that variational learning of these models works quite nicely. Back in the mid-90s, when we first came up with it, we thought this was an interesting new theory of how the brain might learn. That idea has since been taken up by Karl Friston, who strongly believes this is what's going on in real neural learning. So, we're now going to look in more detail at how we can use an approximation to the posterior distribution for learning.

To summarize, it's hard to learn complicated models like Sigmoid Belief Nets because it's hard to get samples from the true posterior distribution over hidden configurations given a data vector; it's hard even to get an unbiased sample from that posterior. So, the crazy idea is that we're going to use samples from some other distribution and hope that the learning will still work. And as we'll see, that turns out to be true for Sigmoid Belief Nets. The distribution that we're going to use is a distribution that ignores explaining away. We're going to assume (wrongly) that the posterior over hidden configurations factorizes into a product of distributions for each separate hidden unit. In other words, we're going to assume that, given the data, the units in each hidden layer are independent of one another, as they are in a Restricted Boltzmann machine. In a Restricted Boltzmann machine this is correct, whereas in a Sigmoid Belief Net it's wrong. So, let's quickly look at what a factorial distribution is.
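A hedged way to write down the bound behind that "two terms" argument, in my own notation rather than anything shown in the lecture: for a data vector v, generative weights W, and a factorial approximation q(h|v) to the true posterior p(h|v),

```latex
\log p(v)
  \;=\; \underbrace{\mathbb{E}_{q(h|v)}\!\big[\log p(v,h)\big] + H\big(q(h|v)\big)}_{\text{variational bound } \mathcal{F}(q,\,W)}
  \;+\; \underbrace{\mathrm{KL}\big(q(h|v)\,\big\|\,p(h|v)\big)}_{\ge\, 0}
```

Increasing F with respect to the weights does two things at once: it pushes log p(v) up, giving a better model of the training data, and it pushes the KL term down, moving the true posterior towards the factorial approximation. Those are the two terms described above.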
In a factorial distribution, the probability of a whole vector is just the product of the probabilities of its individual terms. So, suppose we have three hidden units in a layer and they have probabilities of turning on of 0.3, 0.6, and 0.8. If we want to compute the probability of the hidden layer having the state (1, 0, 1), we compute that by multiplying 0.3 by (1 - 0.6) by 0.8. So, the probability of a configuration of the hidden layer is just the product of the individual probabilities. That's why it's called factorial. In general, a distribution over binary vectors of length n has two to the n degrees of freedom. Actually, it's only two to the n minus one, because the probabilities must add to one. A factorial distribution, by contrast, only has n degrees of freedom. It's a much simpler beast.

So now, I'm going to describe the wake-sleep algorithm, which makes use of this idea of using the wrong distribution. In this algorithm, we have a neural net that has two different sets of weights. It's really a generative model, and so the weights shown in green (for generative) are the weights of the model. Those are the weights that define the probability distribution over data vectors. We've also got some extra weights, the weights shown in red (for recognition), and these are the weights used for approximately getting the posterior distribution. That is, we're going to use these weights to get a factorial distribution at each hidden layer that approximates the posterior, but not very well. So, in this algorithm there's a wake phase. In the wake phase, you put data in at the visible layer, the bottom, and you do a forward pass through the network using the recognition weights. In each hidden layer, you make a stochastic binary decision for each hidden unit independently about whether it should be on or off. So, the forward pass gets us stochastic binary states for all of the hidden units. Then, once we have those stochastic binary states, we treat them as though they were a sample from the true posterior distribution given the data, and we do maximum likelihood learning. But what we're doing maximum likelihood learning for is not the recognition weights that we just used to get the approximate sample. It's the generative weights that define our model. So, you drive the system in the forward pass with the recognition weights, but you learn the generative weights. In the sleep phase, you do the opposite. You drive the system with the generative weights. That is, you start with a random vector at the top hidden layer: you generate the binary states of those hidden units from their prior, in which they're independent. Then you go down through the system, generating the states of one layer at a time. And here you're using the generative model correctly; that's how the generative model says you should generate data. And so, you can generate an unbiased sample from the model. Having used the generative weights to generate an unbiased sample, you then say, let's see if we can recover the hidden states from the data. That is, let's see if we can recover the hidden states at layer h2 from the hidden states at layer h1. So, you train the recognition weights to try and recover the hidden states that actually generated the states in the layer below. It's just the opposite of the wake phase: we're now using the generative weights to drive the system, and we're learning the recognition weights.
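As a concrete but unofficial sketch of the two phases just described, here is one wake-sleep update in NumPy for a net with two hidden layers. The layer sizes, the learning rate, and names like R1, G1, wake_phase are my own choices; the delta-rule form of the updates is the standard one for this algorithm, with each phase driving the net with one set of weights and learning the other.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    # Stochastic binary decision for each unit, made independently.
    return (rng.random(p.shape) < p).astype(float)

# Layer sizes (arbitrary, for illustration): visible v, hidden layers h1, h2.
n_v, n_h1, n_h2 = 6, 4, 3
lr = 0.05

# Recognition weights (shown in red in the lecture): bottom-up, v -> h1 -> h2.
R1 = np.zeros((n_v, n_h1)); R2 = np.zeros((n_h1, n_h2))
rb1 = np.zeros(n_h1);       rb2 = np.zeros(n_h2)

# Generative weights (shown in green): top-down, h2 -> h1 -> v, plus biases
# that act as the independent prior over the top layer.
G2 = np.zeros((n_h2, n_h1)); G1 = np.zeros((n_h1, n_v))
gb2 = np.zeros(n_h2); gb1 = np.zeros(n_h1); gbv = np.zeros(n_v)

def wake_phase(v):
    global G1, G2, gb1, gb2, gbv
    # Drive the net bottom-up with the RECOGNITION weights to get stochastic
    # binary states for every hidden unit...
    h1 = sample(sigmoid(v @ R1 + rb1))
    h2 = sample(sigmoid(h1 @ R2 + rb2))
    # ...then treat them as if they were a sample from the true posterior and
    # update the GENERATIVE weights so each layer predicts the layer below it.
    p_v  = sigmoid(h1 @ G1 + gbv)
    p_h1 = sigmoid(h2 @ G2 + gb1)
    p_h2 = sigmoid(gb2)                       # top-layer prior
    G1  += lr * np.outer(h1, v - p_v);   gbv += lr * (v - p_v)
    G2  += lr * np.outer(h2, h1 - p_h1); gb1 += lr * (h1 - p_h1)
    gb2 += lr * (h2 - p_h2)

def sleep_phase():
    global R1, R2, rb1, rb2
    # Drive the net top-down with the GENERATIVE weights to get an unbiased
    # sample ("dream") from the model...
    h2 = sample(sigmoid(gb2))
    h1 = sample(sigmoid(h2 @ G2 + gb1))
    v  = sample(sigmoid(h1 @ G1 + gbv))
    # ...then update the RECOGNITION weights to recover the hidden states that
    # actually generated each layer from the layer below.
    q_h1 = sigmoid(v @ R1 + rb1)
    q_h2 = sigmoid(h1 @ R2 + rb2)
    R1 += lr * np.outer(v, h1 - q_h1);  rb1 += lr * (h1 - q_h1)
    R2 += lr * np.outer(h1, h2 - q_h2); rb2 += lr * (h2 - q_h2)

# Alternate the two phases on some (here random) binary training data.
data = rng.integers(0, 2, size=(20, n_v)).astype(float)
for epoch in range(100):
    for v in data:
        wake_phase(v)
        sleep_phase()
```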
It turns out that if you start with random weights and alternate between wake phases and sleep phases, it learns a pretty good model. There are flaws in this algorithm. The first flaw is a rather minor one: the recognition weights are learning to invert the generative model, but at the beginning of learning they're learning to invert it in parts of the space where there isn't any data, because when you generate from the model, you're generating stuff that looks very different from the real data, since the weights aren't any good yet. That's a waste, but it's not a big problem. The serious problem with this algorithm is that the recognition weights not only don't follow the gradient of the log probability of the data, they don't even follow the gradient of the variational bound on that probability. And because they're not following the right gradient, we get incorrect mode averaging, which I'll explain in the next slide. A final problem is that we know the true posterior over the top hidden layer is bound to be far from independent, because of explaining away effects, and yet we're forced to approximate it with a distribution that assumes independence. This independence approximation might not be so bad for intermediate hidden layers, because if we're lucky, the explaining away effects that come from below will be partially cancelled out by prior effects that come from above. You'll see that in much more detail later. Despite all these problems, Karl Friston thinks this is how the brain works. When we initially came up with the algorithm, we thought it was an interesting new theory of the brain. I currently believe it's got too many problems to be how the brain works and that we'll find better algorithms.

So now let me explain mode averaging, using the little model with the earthquake and the truck that we saw before. Suppose we run the sleep phase and generate data from this model. Most of the time, the top two units will be off, because they're very unlikely to turn on under their prior, and, because they're off, the visible unit will be firmly off, because its bias is -20. Just occasionally, with a probability of about e to the -10, one of the two top units will turn on, and it will be equally often the left one or the right one. When that unit turns on, there's a probability of a half that the visible unit will turn on. So, if you think about the occasions on which the visible unit turns on, half of those occasions have the left-hand hidden unit on, the other half have the right-hand hidden unit on, and almost none of them have neither or both units on. So now think about what the learning does to the recognition weights. On half the occasions when the visible unit is on, the left-hand hidden unit is on, so the recognition weights will learn to predict that it's on with a probability of 0.5, and the same for the right-hand unit. So the recognition weights will learn to produce a factorial distribution of (0.5, 0.5) over the hidden layer, and that factorial distribution puts a quarter of its mass on the configuration (1,1) and another quarter of its mass on the configuration (0,0), and both of those are extremely unlikely configurations given that the visible unit was on. It would have been better just to pick one mode; that is, it would have been better for the recognition model just to go for truck, or just to go for earthquake.
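To check those numbers, here is a small sketch that enumerates the true posterior for the earthquake/truck net. The hidden biases of -10 and the visible bias of -20 come from the description above; the +20 weights from each hidden cause to the visible unit are an assumption implied by the "probability of a half" statement, and everything else is my own scaffolding.

```python
import numpy as np
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Two hidden causes (earthquake, truck) with bias -10 each, one visible unit
# with bias -20, and an assumed weight of +20 from each hidden cause to it.
bias_h, bias_v, w = -10.0, -20.0, 20.0

# Joint probability p(h1, h2, v=1) for each hidden configuration.
joint = {}
for h1, h2 in product([0, 1], repeat=2):
    p_h = sigmoid(bias_h) ** (h1 + h2) * (1 - sigmoid(bias_h)) ** (2 - h1 - h2)
    p_v = sigmoid(bias_v + w * h1 + w * h2)
    joint[(h1, h2)] = p_h * p_v

# True posterior p(h1, h2 | v=1): almost all the mass is on (1,0) and (0,1).
z = sum(joint.values())
posterior = {k: p / z for k, p in joint.items()}
print(posterior)   # roughly {(0,0): ~0, (0,1): ~0.5, (1,0): ~0.5, (1,1): ~0}

# The factorial approximation learned by the sleep phase turns each hidden unit
# on with probability 0.5, so it spreads its mass evenly: 0.25 on every
# configuration, including the two hopeless ones, (0,0) and (1,1).
factorial = {(h1, h2): 0.5 * 0.5 for h1, h2 in product([0, 1], repeat=2)}
print(factorial)
```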
That's the best recognition model you can have if you're forced to have a factorial model. So even though the hidden configurations we're dealing with are best represented as the corners of a square, the picture shows them as if they were a one-dimensional continuous value. The true posterior is bimodal: it's focused on (1,0) or (0,1), and that's shown in black. The approximation you learn if you use the sleep phase of the wake-sleep algorithm is the red curve, which gives all four states of the hidden units equal probability. The best solution would be to pick one of those two modes and give it all the probability mass. That's the best solution because, in variational learning, we're manipulating the true posterior to make it fit the approximation we're using. Normally in learning we manipulate an approximation to fit the true thing, but here it's backwards.
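One compact way to state that last point, again in my own notation and not something spelled out in the lecture: following the variational bound asks the factorial q to minimize KL(q || p), which prefers putting all the mass on one mode (just truck, or just earthquake), whereas the sleep phase fits the recognition model to samples from the generative model, which amounts to minimizing KL(p || q) and gives the mode-averaged (0.5, 0.5) answer shown in red.

```latex
\text{bound-following:}\quad q^{\ast} \;=\; \arg\min_{q\ \text{factorial}} \ \mathrm{KL}\big(q(h|v)\,\big\|\,p(h|v)\big)
\qquad\qquad
\text{sleep phase:}\quad q^{\ast} \;=\; \arg\min_{q\ \text{factorial}} \ \mathrm{KL}\big(p(h|v)\,\big\|\,q(h|v)\big)
```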