In this video, I'll talk about a different way of learning sigmoid belief nets. This different method arose in an unexpected way. I stopped working on sigmoid belief nets and went back to Boltzmann machines, and discovered that restricted Boltzmann machines could actually be learned fairly efficiently. Given that a restricted Boltzmann machine can efficiently learn a layer of nonlinear features, it was tempting to take those features, treat them as data, and apply another restricted Boltzmann machine to model the correlations between those features. And one can continue like this, stacking one Boltzmann machine on top of the next to learn many layers of nonlinear features. This eventually led to a big resurgence of interest in deep neural nets.

The issue then arose: once you have stacked up lots of restricted Boltzmann machines, each of which is learned by modeling the patterns of feature activities produced by the previous Boltzmann machine, do you just have a set of separate restricted Boltzmann machines, or can they all be combined into one model? Anybody sensible would expect that if you combined a set of restricted Boltzmann machines to make one model, what you'd get would be a multilayer Boltzmann machine. However, a brilliant graduate student of mine called Yee-Whye Teh figured out that that's not what you get. You actually get something that looks much more like a sigmoid belief net. This was a big surprise. It was very surprising to me that we'd actually solved the problem of how to learn deep sigmoid belief nets by giving up on it and focusing on learning undirected models like Boltzmann machines.

Using the efficient learning algorithm for restricted Boltzmann machines, it's easy to train a layer of features that receive input directly from the pixels. We can then treat the patterns of activation of those feature detectors as if they were pixels, and learn another layer of features in a second hidden layer. We can repeat this as many times as we like, with each new layer of features modelling the correlated activity of the features in the layer below. It can be proved that each time we add another layer of features, we improve a variational lower bound on the log probability that the combined model would generate the data. The proof is actually complicated, and it only applies if you do everything just right, which you don't do in practice. But the proof is very reassuring, because it suggests that something sensible is going on when you stack up restricted Boltzmann machines like this. The proof is based on a neat equivalence between a restricted Boltzmann machine and an infinitely deep belief net.

So here's a picture of what happens when you learn two restricted Boltzmann machines, one on top of the other, and then combine them to make one overall model, which I call a deep belief net. First we learn one Boltzmann machine with its own weights. Once that's been trained, we take the hidden activity patterns of that Boltzmann machine when it's looking at data, and we treat each hidden activity pattern as data for training a second Boltzmann machine. So we just copy the binary states up to the second Boltzmann machine, and then we learn another Boltzmann machine. Now, one interesting thing about this is that if we start the second Boltzmann machine off with W2 being the transpose of W1, and with as many hidden units in h2 as there are units in v, then the second Boltzmann machine will already be a pretty good model of h1, because it's just the first model upside down. And a restricted Boltzmann machine doesn't really care which layer you call visible and which you call hidden; it's just a bipartite graph whose weights model the correlations between the two layers of units.
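To make the stacking procedure concrete, here is a minimal NumPy sketch (my own, not from the lecture) of greedy layer-by-layer training: each restricted Boltzmann machine is trained with one-step contrastive divergence, and its hidden activities then become the data for the next machine. The function names, the CD-1 settings, and the initialization are illustrative choices, and biases and practical refinements are kept to a bare minimum.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p, rng):
    # Sample binary states with the given Bernoulli probabilities.
    return (rng.random(p.shape) < p).astype(np.float64)

def train_rbm_cd1(data, n_hidden, epochs=10, lr=0.05, seed=0):
    """Train one RBM with one-step contrastive divergence (CD-1).
    `data` is a (n_cases, n_visible) array of binary values or probabilities."""
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities and samples given the data.
        h_prob = sigmoid(data @ W + b_hid)
        h_sample = sample_bernoulli(h_prob, rng)
        # Negative phase: one step of alternating Gibbs sampling.
        v_recon = sigmoid(h_sample @ W.T + b_vis)
        h_recon = sigmoid(v_recon @ W + b_hid)
        # Update with the difference between the two sets of correlations.
        n = data.shape[0]
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n
        b_vis += lr * (data - v_recon).mean(axis=0)
        b_hid += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_vis, b_hid

def stack_rbms(data, layer_sizes):
    """Greedy layer-by-layer training: each RBM's hidden activities
    become the training data for the next RBM in the stack."""
    rbms, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b_vis, b_hid = train_rbm_cd1(layer_input, n_hidden)
        rbms.append((W, b_vis, b_hid))
        # Hidden probabilities of this RBM are treated as "data" for the next one.
        layer_input = sigmoid(layer_input @ W + b_hid)
    return rbms
```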
After we've learned those two Boltzmann machines, we're going to compose them to form a single model, and the single model looks like this. Its top two layers are just the same as the top restricted Boltzmann machine, so that's an undirected model with symmetric connections. But its bottom two layers form a directed model, like a sigmoid belief net. What we've done is take the symmetric connections between v and h1, throw away the up-going part, and keep just the down-going part. Understanding why we've done that is quite complicated, and it will be explained in video 13F. The resulting combined model is clearly not a Boltzmann machine, because its bottom layer of connections is not symmetric. It's a graphical model that we call a deep belief net, in which the lower layers are just like a sigmoid belief net and the top two layers form a restricted Boltzmann machine. So it's a kind of hybrid model.

If we do it with three Boltzmann machines stacked up, we get a hybrid model that looks like this. The top two layers are again a restricted Boltzmann machine, and the layers below are directed layers, as in a sigmoid belief net. To generate data from this model, the correct procedure is, first of all, to go backwards and forwards between h2 and h3 to reach equilibrium in that top-level restricted Boltzmann machine. This involves alternating Gibbs sampling, where you update all of the units in h3 in parallel, then update all of the units in h2 in parallel, then go back and update all of the units in h3 in parallel, and so on. You go backwards and forwards like that for a long time, until you've got an equilibrium sample from the top-level restricted Boltzmann machine. So the top-level restricted Boltzmann machine is defining the prior distribution over h2. Once you've done that, you simply go once from h2 to h1 using the generative connections w2, and then, whatever binary pattern you get in h1, you go once more to get generated data, using the weights w1. So we're performing a top-down pass from h2 to get the states of all the other layers, just as in a sigmoid belief net. The bottom-up connections, shown in red at the lower levels, are not part of the generative model. They're actually the transposes of the corresponding weights, so they're the transpose of w1 and the transpose of w2, and they're going to be used for inference, but they're not part of the model.
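Here is a rough sketch (again my own, not the lecture's code) of the generative procedure just described for a net with three hidden layers: prolonged alternating Gibbs sampling between h2 and h3 in the top-level restricted Boltzmann machine, followed by a single top-down pass through the directed generative weights w2 and then w1. The weight shapes, biases, and the number of Gibbs iterations are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(np.float64)

def generate_from_dbn(W1, W2, W3, b1, b2, b3, b_vis, n_gibbs=1000, seed=0):
    """Generate one visible vector from a deep belief net whose top-level RBM
    joins h2 and h3.  Assumed shapes: W1 is (n_visible, |h1|), W2 is (|h1|, |h2|),
    W3 is (|h2|, |h3|); b1, b2, b3, b_vis are the corresponding unit biases."""
    rng = np.random.default_rng(seed)
    # 1. Alternating Gibbs sampling between h2 and h3 to get an (approximate)
    #    equilibrium sample from the top-level restricted Boltzmann machine.
    h2 = sample(np.full(W3.shape[0], 0.5), rng)
    for _ in range(n_gibbs):
        h3 = sample(sigmoid(h2 @ W3 + b3), rng)
        h2 = sample(sigmoid(h3 @ W3.T + b2), rng)
    # 2. A single top-down pass through the directed generative weights:
    #    h2 -> h1 using w2, then h1 -> visible units using w1.
    h1 = sample(sigmoid(h2 @ W2.T + b1), rng)
    v = sample(sigmoid(h1 @ W1.T + b_vis), rng)
    return v
```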
Now, before I explain why stacking up Boltzmann machines is a good idea, I need to sort out what it means to average two factorial distributions. It may surprise you to know that if I average two factorial distributions, I do not get a factorial distribution. What I mean by averaging here is taking a mixture of the distributions: you first pick one of the two at random, and then you generate from whichever one you picked.

Suppose we have an RBM with four hidden units and we give it a visible vector v1. Given this visible vector, the posterior distribution over those four hidden units is factorial. Let's suppose the distribution is that the first and second units each have a probability of 0.9 of turning on, and the last two each have a probability of 0.1 of turning on. What it means for this to be factorial is that, for example, the probability that the first two units would both be on in a sample from this distribution is exactly 0.81. Now suppose we have a different visible vector, v2, and the posterior distribution over the same four hidden units is now 0.1, 0.1, 0.9, 0.9, which I chose just to make the math easy. If we average those two distributions, the mean probability of each hidden unit being on is indeed the average of the means for each distribution, so the means are 0.5, 0.5, 0.5, 0.5. But what you get is not the factorial distribution defined by those four probabilities. To see that, consider the binary vector 1, 1, 0, 0 over the hidden units. In the posterior for v1, that vector has a probability of 0.9 to the fourth power, because it's 0.9 times 0.9 times (1 - 0.1) times (1 - 0.1), so that's about 0.66. In the posterior for v2, this vector is extremely unlikely: it has a probability of 1 in 10,000. If we average those two probabilities for that particular vector, we get a probability of about 0.33, and that's much bigger than the probability assigned to the vector 1, 1, 0, 0 by a factorial distribution with means of 0.5. That probability would be 0.5 to the fourth power, which is much smaller. So the point of all this is that when you average two factorial posteriors, you get a mixture distribution that is not factorial.
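The arithmetic in that aside is easy to check numerically. This small sketch (mine, not from the lecture) compares the probability that the mixture of the two factorial posteriors assigns to the vector 1, 1, 0, 0 with the probability assigned by a single factorial distribution whose means are all 0.5.

```python
import numpy as np

def factorial_prob(binary_vector, means):
    """Probability of a binary vector under a factorial (independent units) distribution."""
    v = np.asarray(binary_vector, dtype=float)
    m = np.asarray(means, dtype=float)
    return float(np.prod(np.where(v == 1, m, 1.0 - m)))

post_v1 = [0.9, 0.9, 0.1, 0.1]   # factorial posterior over the hidden units given v1
post_v2 = [0.1, 0.1, 0.9, 0.9]   # factorial posterior over the hidden units given v2
target  = [1, 1, 0, 0]

p1 = factorial_prob(target, post_v1)                   # 0.9**4, about 0.66
p2 = factorial_prob(target, post_v2)                   # 0.1**4 = 0.0001
mixture = 0.5 * p1 + 0.5 * p2                          # about 0.33
factorial_with_mean_half = factorial_prob(target, [0.5] * 4)   # 0.5**4 = 0.0625

print(mixture, factorial_with_mean_half)
# The mixture gives (1,1,0,0) far more probability than the factorial
# distribution with means of 0.5, so the mixture is not factorial.
```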
Now, let's look at why the greedy learning works; that is, why it's a good idea to learn one restricted Boltzmann machine and then learn a second restricted Boltzmann machine that models the patterns of activity in the hidden units of the first one. The weights of the bottom-level restricted Boltzmann machine actually define four different distributions, and of course they define them in a consistent way. The first distribution is the probability of the visible units given the hidden units, and the second is the probability of the hidden units given the visible units. Those are the two distributions we use for running our alternating Markov chain, which updates the visibles given the hiddens and then updates the hiddens given the visibles. If we run that chain long enough, we get a sample from the joint distribution of v and h, so the weights clearly also define the joint distribution. They also define the joint distribution more directly, in terms of e to the minus the energy, but for nets with a large number of units we can't compute that. If we take the joint distribution p(v, h) and just ignore v, we have a distribution over h. That's the prior distribution over h defined by this restricted Boltzmann machine. Similarly, if we ignore h, we have the prior distribution over v defined by the restricted Boltzmann machine.

Now we're going to pick a rather surprising pair of those four distributions. We're going to define the probability that the restricted Boltzmann machine assigns to a visible vector v as the sum over all hidden vectors of the probability it assigns to h times the probability of v given h. This seems like a silly thing to do, because defining p(h) is just as hard as defining p(v). Nevertheless, we're going to define p(v) that way. Now, if we leave p(v|h) alone but learn a better model of p(h), that is, learn some new parameters that give us a better model of p(h), and substitute that in place of the old model we had of p(h), we'll actually improve our model of v. And what we mean by a better model of p(h) is a prior over h that fits the aggregated posterior better. The aggregated posterior is the average, over all vectors in the training set, of the posterior distribution over h. So what we're going to do is use our first RBM to get this aggregated posterior, and then use our second RBM to build a better model of this aggregated posterior than the first RBM has. If we start the second RBM off as the first one upside down, it will start with the same model of the aggregated posterior as the first RBM has, and then, if we change the weights, we can only make things better. So that's an explanation of what's happening when we stack up RBMs.

Once we've learned a stack of Boltzmann machines and combined them together to make a deep belief net, we can then fine-tune the whole composite model using a variation of the wake-sleep algorithm. So we first learn many layers of features by stacking up RBMs, and then we want to fine-tune both the bottom-up recognition weights and the top-down generative weights to get a better generative model. We do this in three phases. First, we do a stochastic bottom-up pass, and we adjust the top-down generative weights of the lower layers to be good at reconstructing the feature activities in the layer below. That's just as in the standard wake-sleep algorithm. Then, in the top-level RBM, we go backwards and forwards a few times, sampling the hiddens of that RBM, then the visibles of that RBM, then the hiddens again, and so on, just like the learning algorithm for RBMs. Having done a few iterations of that, we do contrastive divergence learning: we update the weights of the RBM using the difference between the correlations when activity first got to that RBM and the correlations after a few iterations in that RBM. Then, in the third phase, we take the states of the visible units of that top-level RBM, which are its lower-layer units, and starting there we do a top-down stochastic pass using the directed lower connections, which form a sigmoid belief net. Having generated some data from that sigmoid belief net, we adjust the bottom-up recognition weights to be good at reconstructing the feature activities in the layer above. So that's just the sleep phase of the wake-sleep algorithm. The difference from the standard wake-sleep algorithm is that the top-level RBM acts as a much better prior over the top layer than just a layer of units that are assumed to be independent, which is what you get with a sigmoid belief net. Also, rather than generating data by sampling from the prior, what we're actually doing is looking at a training case, going up to the top-level RBM, and just running a few iterations before we generate data.
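As a rough illustration of those three phases, here is a sketch (my own simplification, not the lecture's code) of one contrastive wake-sleep update for the smallest interesting case: a single directed layer between the visible units and h1, with separate recognition weights R1 and generative weights G1, underneath a top-level RBM joining h1 and h2 with weights Wtop. Biases, mini-batches, and many practical details are omitted, and the variable names are my own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(np.float64)

def contrastive_wake_sleep_step(v, R1, G1, Wtop, lr=0.01, cd_steps=3, seed=0):
    """One fine-tuning step for a small DBN.
    Assumed shapes: R1 is (n_visible, |h1|), G1 is (|h1|, n_visible),
    Wtop is (|h1|, |h2|); v is a single binary visible vector."""
    rng = np.random.default_rng(seed)

    # Wake phase: stochastic bottom-up pass, then adjust the generative
    # weights to be good at reconstructing the layer below.
    h1 = sample(sigmoid(v @ R1), rng)
    v_pred = sigmoid(h1 @ G1)
    G1 += lr * np.outer(h1, v - v_pred)

    # Top-level RBM: a few iterations of alternating Gibbs sampling,
    # then a contrastive-divergence update on its weights.
    h2_pos = sigmoid(h1 @ Wtop)
    h1_neg, h2_s = h1, sample(h2_pos, rng)
    for _ in range(cd_steps):
        h1_neg = sample(sigmoid(h2_s @ Wtop.T), rng)
        h2_s = sample(sigmoid(h1_neg @ Wtop), rng)
    h2_neg = sigmoid(h1_neg @ Wtop)
    Wtop += lr * (np.outer(h1, h2_pos) - np.outer(h1_neg, h2_neg))

    # Sleep phase: top-down stochastic pass through the directed weights,
    # then adjust the recognition weights to reconstruct the layer above.
    v_gen = sample(sigmoid(h1_neg @ G1), rng)
    h1_pred = sigmoid(v_gen @ R1)
    R1 += lr * np.outer(v_gen, h1_neg - h1_pred)
    return R1, G1, Wtop
```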
So now we're going to look at an example where we first learn some RBMs, stacking them up, then fine-tune with contrastive wake-sleep, and then look to see how good it is as a generative model and how good it is at recognizing things. First of all, we're going to use 500 binary hidden units to learn to model all 10 digit classes in images of 28 by 28 pixels. We learn that RBM without knowing what the labels are, so it's unsupervised learning. We then take the patterns of activity that those 500 hidden units have when they're looking at data, treat those patterns of activity as data, and learn another RBM that also has 500 units. Those two layers are learned without knowing what the labels are.

Once we've done that, we actually tell it the labels. So the first two hidden layers are learned without labels, and then we add a big top layer and we give it the 10 labels. You can think of it as concatenating those 10 labels with the 500 units that represent features, except that the 10 labels are really one softmax unit. We then train that top-level RBM to model the concatenation of the softmax unit for the 10 labels with the 500 feature activities that were produced by the two layers below. Once we've trained the top-level RBM, we can fine-tune the whole system using contrastive wake-sleep. Then we'll have a very good generative model, and that's the model I showed you in the introductory video for this course. If you go back and find that introduction video, you'll see what happens when we run the model: you'll see how good it is at recognition, and you'll also see that it's very good at generation. In that introductory video I promised that I would eventually explain how it worked, and I think you've now seen enough to know what's going on when this model is learned.
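As a final small detail, here is a sketch (mine; the shapes and names are assumptions) of how the visible data for that top-level RBM might be assembled: a one-hot vector for the softmax label unit concatenated with the 500 feature activities produced by the two lower layers.

```python
import numpy as np

def top_rbm_visible(label, feature_activities, n_classes=10):
    """Concatenate a one-hot label vector (the softmax unit) with the feature
    activities from the penultimate layer to form one visible vector for the
    top-level RBM."""
    one_hot = np.zeros(n_classes)
    one_hot[label] = 1.0
    return np.concatenate([one_hot, feature_activities])

# Example: label 7 plus 500 feature probabilities gives a 510-dimensional vector.
v_top = top_rbm_visible(7, np.random.default_rng(0).random(500))
print(v_top.shape)  # (510,)
```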