In this video, I will introduce Restricted Boltzmann Machines. These have a much simplified architecture in which there are no connections between hidden units. This makes it very easy to get the equilibrium distribution of the hidden units if the visible units are given. That is, once you've clamped the datavector on the visible units, The equilibrium distribution of the hidden units can be computed exactly in one step because they're all independent of one another, given the states of the visible units. The proper Boltzmann machine learning algorithm is still slow for a restricted Boltzmann machine. But in 1998, I discovered a very surprising shortcut that leads to the first efficient learning algorithm for Boltzmann machines. Even though this algorithm has theoretical problems, it works quite well in practice. And it led to a revival of interest in Boltzmann machine learning. In a restricted Boltzmann machine, we restrict the connectivity of the network in order to make both inference and learning easier. So, it only has one layer of hidden units and there's no connections between the hidden units. There's also no connections between the visible units. So, the architecture looks like that, it's what computer scientists call a bipartite graph. There's two pieces, and within each piece, there's no connections. The good thing about an RBM is that if you clamp a datavector in the visible units, you can reach thermal equilibrium in one step. That means with a datavector clamped, we can quickly compute the expected value of vihj because we can compute the exact probability with each j will turn on, and that is independent of all the other units in the hidden layer. The probability that j will turn on is just the logistic function of the input that it gets from the visible units and quite independent of what other hidden units are doing. So, we can compute that probability all in parallel and that's a tremendous win. If you want to make a good model of a set of binary vectors, then the right algorithm to use for a restricted Boltzmann machine is one introduced by Tieleman in 2008 that's based on earlier work by Neal. In the positive phase, you clamp the datavector on the visible units. You then compute the exact value of the expectation vihj for all pairs of invisible in the hidden unit. And you could do that cuz vi is fixed, and you can compute vj exactly. And then, for every connected pair of units, you average the expected value of vihj over all the data vectors in the mini batch. For the negative phase, you keep a set of fantasy particles that is global configurations. And then, you update each fantasy particle a few times by using alternating parallel updates. So, after each weight update, you update the fantasy particles a little bit and that should bring them back to close to equilibrium. And then, for every connected pair of units, you average vihj over all the fantasy particles, and that gives you your negative statistics. This algorithm actually works very well, and allows RBMs to build good density models or sets of binary vectors. Now, I am going to go on to our learning algorithm that is not as good at building density model but is much faster. So, I'm going to start with a picture of an inefficient learning algorithm for restrictive Boltzmann machines. We're going to start by clamping a datavector on the visible units, and we're going to call that time t0. So, we're going to use times now, not to denote weight updates, but to denote steps in a Markov chain. Given that visible vector, we now update the hidden units. So, we choose binary states for the hidden units and we measure the expected value, vihj, for all pairs of visible and binary units that are connected. And I'll call that vihj zero to indicate that it's measured at time zero, With the hidden units being determined by the visible units. And, of course, we can update all the hidden units in parallel. We then use the hidden vector to update all the visible units in parallel, and again we update all the hidden units in parallel. So, the visible vector t1 = one, we'll call a reconstruction, or a one-step reconstruction, And we can keep going with the alternating chain that way, Updating visible units, and then hidden units, Each set being updated in parallel. And after we've gone for a long time, We'll get to some state of the visible units, or I'll call t infinity to indicate it needs to be a long time and the system will be at thermal equilibrium, and now, we can measure the correlation of vi and hj after the chains run for a long time and I'll call that vihj infinity. And the visible state we have after a long time, I'll call it fantasy. So now, the learning rule is simply, we change Wij by the learning rate times the difference between vihj at time zero and vihj at time infinity. And, of course, the problem with this algorithm is that we have to run this chain for a long time before it reaches thermal equilibrium. And if we don't run it for long enough, the learning may go wrong. In fact, that last statement is very misleading. It turns out that even if we only run the chain for a short time, the learning still works. So, here's the very surprising shortcut. You just run the chain up, down, and up again. So, from the data, you generate a hidden state, from that. You generate a reconstruction, and from that, you generate another hidden state. And you may have a statistics once you've done that. So, instead of using the statistics measured at equilibrium, we're using the statistics measured after doing one full update of the Markov chain. The learning rule is, and the same as before, except this much quicker to compute, and this is clearly is not doing maximum likelihood learning because the term we are using for negative statistics is wrong. But the learning, nevertheless, works quite well. Next week, we'll understand a bit more about why it works well. But for now, we'll just see that it does. So, the obvious question is why does actual cut work at all? And here's the reasoning. If we start the chain at the data, the Markov chain will wander away from the data and towards its equilibrium distribution. That is towards things that is initial weights like more than the data. We can see what direction it's wandering in after only a few steps. And if we know the initial weights aren't very good, it's a waste of time to go all the way to equilibrium. We know how to change them to stop it wandering away from the data without going all the way to equilibrium. All we need to do is lower the probability of the reconstructions of confabulations as a psychologist would call them, it produces after one full step, and then, raise the probability of the data. That will stop it wandering away from the data. Once the data and the places it goes to after one full step have the same distribution, then the learning will stop. So, here's a picture of what's going on. Here's the energy surface in the space of global configurations. Here's a data point on the energy surface, and by data point, I mean, both the visible vector and the particular hidden vector that we got by stochastic updating the hidden units. So, that hidden vector is a function of what the data point is. So, starting at that data point, we run the Markov chain for one full step to get a new visible vector and the hidden vector that goes with it. So, a reconstruction of the data point plus the hidden vector that goes with that reconstruction. We then change the weights to pull the energy down at the data point, and pull to the energy up the reconstruction. And the effect of that would be to make the surface look like this. And you'll notice we're beginning to construct an energy minimum at the data. You'll also notice that far away from the data, things have stayed pretty much as they were before. So, this shortcut of only doing one full step to get the reconstruction fails for places that are far away from the data. We need to worry about regions of the data-space that the model likes but which are very far from any data point. These low energy holes cause the normalization term to be big, and we can't sense them if we use the shortcut. If we use persistent particles, where we remembered their states, and after each update, we updated them a few more times, then they would eventually find these holes. They'd move into the holes, and the learning would cause the holes to fill up. A good compromise between speed and correctness is to start with small weights and to use CD1, that is contrust divergence with one full step to get the negative data. Once the weights have grown a bit, the Markov chain is mixing more slowly, and now we can use CD3. Once the weights have grown more, we can use CD5, or nine, or ten. So, by increasing the number of steps as the weights grow, we can keep the learning working reasonably well, even though the mixing rate of the Markov chain is going