In this video, I'm going to explain how a Boltzmann machine models a set of binary data vectors. I'm going to start by explaining why we might want to model a set of binary data vectors, and what we could do with such a model if we had it. And then I'm going to show how the probabilities assigned to binary data vectors are determined by the weights in a Boltzmann machine.
Stochastic Hopfield nets with hidden units, which we also call Boltzmann machines, are good at modelling binary data. So, given a set of binary training vectors, they can use the hidden units to fit a model that assigns a probability to every possible binary vector. There are several reasons why you might like to be able to do that. If, for example, you had several distributions of binary vectors, you might like to look at a new binary vector and decide which distribution it came from. So you might have different kinds of documents, and you might represent a document by a number of binary features, each of which says whether there is more than zero occurrences of a particular word in that document. For different kinds of documents, you would expect different frequencies of the different words, and maybe you'll see different correlations between words. So you could use a set of hidden units to model the distribution for each kind of document. And then you could assign a test document to the appropriate class by seeing which class of document is most likely to have produced that binary vector.
You could also use Boltzmann machines for monitoring complex systems to detect unusual behavior. Suppose, for example, that you have a nuclear power station, and all of the dials were binary. So you get a whole bunch of binary numbers that tell you something about the state of the power station. What you'd like to do is notice that it's in an unusual state, a state that's not like states you've seen before. And you don't want to use supervised learning for that.
You really don't want to have any examples of states that cause it to blow up. You'd rather be able to detect that it's going into such a state without ever having seen such a state before. And you could do that by building a model of the normal states and noticing that this state is different from the normal states.
If you have models of several different distributions, you can compute the posterior probability that a particular distribution produced the observed data by using Bayes' theorem. So, given the observed data, the probability that it came from model i, under the assumption that it came from one of your models and that the models are equally likely beforehand, is the probability that model i would have produced that data, divided by the sum of the same quantity over all the models.
Now I want to talk about two ways of producing models of data, in particular binary vectors. The most natural way to think about generating a binary vector is to first generate the states of some latent variables, and then use the latent variables to generate the binary vector. So in a causal model, we use two sequential steps. There are latent variables, or hidden units, and we first pick the states of the latent variables from their prior distributions. Often in a causal model, these will be independent in the prior. So their probability of turning on, if they were binary latent variables, would just depend on some bias that each one of them has. Then, once we've picked a state for those, we would use those to generate the states of the visible units by using weighted connections in this model. So this is a kind of neural network causal generative model. It's using logistic units, and it uses biases for the hidden units and weights on the connections between hidden and visible units to assign a probability to every possible visible vector.
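The two-step process just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_causal`, the example biases and weights, and the presence of visible biases are all made up for the sketch and are not taken from the lecture.

```python
import math
import random


def sigmoid(x):
    """Logistic function used by each stochastic binary unit."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_causal(hidden_biases, weights, visible_biases, rng):
    """Draw one (h, v) pair from a two-layer causal model of logistic units.

    weights[k][i] is the (hypothetical) weight from hidden unit k to
    visible unit i. Visible biases are an assumption of this sketch.
    """
    # Step 1: pick each hidden state independently from its prior,
    # which depends only on that unit's bias.
    h = [1 if rng.random() < sigmoid(b) else 0 for b in hidden_biases]

    # Step 2: pick each visible state given the hidden states,
    # using the weighted connections from hidden to visible units.
    v = []
    for i, b in enumerate(visible_biases):
        total_input = b + sum(weights[k][i] * h[k] for k in range(len(h)))
        v.append(1 if rng.random() < sigmoid(total_input) else 0)
    return h, v


# Illustrative numbers only: 2 hidden units, 3 visible units.
rng = random.Random(0)
h, v = sample_causal(
    hidden_biases=[0.0, 1.0],
    weights=[[2.0, -1.0, 0.5], [0.0, 1.0, -2.0]],
    visible_biases=[0.0, 0.0, 0.0],
    rng=rng,
)
```

The hidden states are sampled first and never revisited, which is what makes the model causal: information flows in one direction, from latent variables to data.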
In this causal model, the probability of generating a particular vector v is just the sum, over all the possible hidden states, of the probability of generating that hidden state times the probability of generating v given that you've already generated that hidden state. So that's a causal model. Factor analysis, for example, is a causal model using continuous variables. And it's probably the most natural way to think about generating data. In fact, some people, when they say generative model, mean a causal model like this.
But there's a completely different kind of model. A Boltzmann machine is an energy-based model, and in this kind of model you don't generate data causally. It's not a causal generative model. Instead, everything is defined in terms of the energies of joint configurations of visible and hidden units. There are two ways of relating the energy of a joint configuration to its probability. You can simply define the probability of a joint configuration of the visible and hidden variables to be proportional to e to the negative energy of that joint configuration. Or you can define it procedurally, by saying that the probability is the probability of finding the network in that state after we've updated all the stochastic binary units for long enough that we've reached thermal equilibrium. The good news is that those two definitions agree.
The energy of a joint configuration of the visible and hidden units has five terms in it. I've written down the negative energy, to save having to put in lots of minus signs. So the negative energy of the joint configuration v, h, that is, with vector v on the visible units and h on the hidden units, has bias terms, where v_i is the binary state of the i-th unit in vector v, and b_k is the bias of the k-th unit, in this case a hidden unit. So that's the first two terms.
Then there's the visible-visible interactions. To avoid counting each of those interactions twice, we can just say we're going to sum over indices i and j and make sure that i is always less than j. That'll avoid counting the interaction of something with itself, and also avoid counting pairs twice, and so we don't have to put a half in front. Then there's the visible-hidden interactions, where w_ik is the weight on a visible-hidden interaction. And then there's the hidden-to-hidden interactions.
So the way we use the energies to define probabilities is that the probability of a joint configuration over v and h is proportional to e to the minus E(v, h). To make that an equality, we need to normalize the right-hand side by the sum of the same quantity over all possible configurations of the visible and hidden units, and that's what the divisor is. That's often called the partition function. That's what physicists call it. And notice it has exponentially many terms. To get the probability of a configuration of the visible units alone, we have to sum over all possible configurations of the hidden units. So p(v) is the sum, over all possible h, of e to the minus the energy you get with that h, normalized by the partition function.
I want to give you an example of how we compute the probabilities of the different visible vectors, because that'll give you a good feel for what's involved. It's all very well to see the equations, but I find that I understand it much better when I've worked through the computation. So let's take a network with two hidden units and two visible units. And we'll ignore biases, so we've just got three weights here. To keep things simple, I'm not going to connect the visible units to each other. So the first thing we do is write down all possible states of the visible units. I need to put them in different colors, and I'm going to write each state four times, because for each state of the visible units, there are four possible states of the hidden units that could go with it.
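Before going through the numbers, the definitions given so far can be collected in one place. This is a reconstruction from the spoken description rather than a copy of the slide: the symbols b_i, w_ij and w_kl, the index l, and the dummy configurations u and g in the normalizer are naming assumptions; only v_i, b_k and w_ik are named in the talk.

```latex
% Negative energy of a joint configuration: two bias terms,
% then visible-visible, visible-hidden and hidden-hidden interactions.
-E(\mathbf{v},\mathbf{h}) =
    \sum_{i \in \mathrm{vis}} v_i b_i
  + \sum_{k \in \mathrm{hid}} h_k b_k
  + \sum_{i<j} v_i v_j w_{ij}
  + \sum_{i,k} v_i h_k w_{ik}
  + \sum_{k<l} h_k h_l w_{kl}

% Probability of a joint configuration, normalized by the partition function.
p(\mathbf{v},\mathbf{h}) =
  \frac{e^{-E(\mathbf{v},\mathbf{h})}}
       {\sum_{\mathbf{u},\mathbf{g}} e^{-E(\mathbf{u},\mathbf{g})}}

% Probability of a visible vector: sum out the hidden configurations.
p(\mathbf{v}) =
  \frac{\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}}
       {\sum_{\mathbf{u},\mathbf{g}} e^{-E(\mathbf{u},\mathbf{g})}}
```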
Four visible states, each with four hidden states, gives us sixteen possible joint configurations. Now, for each of those joint configurations, we're going to compute its negative energy, minus E. So if you look at the first line, where all of the units are on, the negative energy will be +2 - 1 + 1, which is +2. And we do this for all sixteen possible joint configurations. We then take the negative energies and we exponentiate them, and that gives us unnormalized probabilities. So these are the unnormalized probabilities of the configurations. Their probabilities are proportional to these numbers. If we add all of those up, we get 39.7, and if we then divide everything by 39.7, we get the probabilities of the joint configurations. There they all are. Now, if we want the probability of a particular visible configuration, we have to sum over all the hidden configurations that could go with it. And so we add up the numbers in each block. And now we've computed the probability of each possible visible vector in a Boltzmann machine that has these three weights in it.
Now let's ask how we get a sample from the model when the network's bigger than that. Obviously, in the network we just looked at, we can figure out the probability of everything because it's small. But when the network's big, we can't do these exponentially large computations. So, if there are more than a few hidden units, we can't actually compute that partition function. There are too many terms in it. But we can use Markov chain Monte Carlo to get samples from the model, by starting from a random global configuration and then picking units at random and updating them stochastically based on their energy gaps, those energy gaps being determined by the states of all the other units in the network. If we keep doing that until the Markov chain reaches its stationary distribution, then we have a sample from the model.
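Returning briefly to the small worked example: its numbers can be reproduced by brute-force enumeration, which is exactly the computation that stops being feasible for larger networks. The transcript gives only the weight values (+2, -1, +1) and says the visible units are not connected to each other; which connection carries which weight is an assumption in this sketch, chosen because it reproduces the quoted total of 39.7. The names `neg_energy`, `p_joint` and `p_visible` are illustrative.

```python
import itertools
import math

# Assumed assignment of the three weights (not stated in the transcript):
W_V1_H1 = 2.0   # visible unit 1 <-> hidden unit 1
W_H1_H2 = -1.0  # hidden unit 1 <-> hidden unit 2
W_V2_H2 = 1.0   # visible unit 2 <-> hidden unit 2


def neg_energy(v1, v2, h1, h2):
    """Negative energy -E(v, h); biases are ignored, as in the example."""
    return W_V1_H1 * v1 * h1 + W_H1_H2 * h1 * h2 + W_V2_H2 * v2 * h2


# All four states of a pair of binary units, starting with both on.
states = list(itertools.product([1, 0], repeat=2))

# Unnormalized probability e^{-E} for each of the 16 joint configurations.
unnorm = {
    (v, h): math.exp(neg_energy(v[0], v[1], h[0], h[1]))
    for v in states
    for h in states
}

# Partition function: the sum over every joint configuration.
Z = sum(unnorm.values())

# Normalized joint probabilities, then marginals over the hidden units.
p_joint = {config: u / Z for config, u in unnorm.items()}
p_visible = {v: sum(p_joint[(v, h)] for h in states) for v in states}

for v in states:
    print(v, round(p_visible[v], 3))
```

With n binary units the dictionary `unnorm` has 2^n entries, so this approach is only workable for toy networks like this one.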
The probability of a sample obtained this way is related to its energy by the Boltzmann distribution. That is, the probability of the sample is proportional to e to the minus the energy.
What about getting a sample from the posterior distribution over hidden configurations when given a data vector? It turns out we're going to need that for learning. The number of possible hidden configurations is again exponential, so again we use Markov chain Monte Carlo. And it's just the same as getting a sample from the model, except that we keep the visible units clamped to the data vector we're interested in. So we only update the hidden units. The reason we need to get samples from the posterior distribution, given a data vector, is that we might want to know a good explanation for the observed data, and we might want to base our actions on that good explanation. But we also need it for learning.
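Both kinds of sampling described above can be sketched with one routine that optionally clamps some units. This is a minimal illustration, not a definitive implementation: it assumes temperature 1, so a unit turns on with probability given by the logistic function of its energy gap, and it assumes a fixed number of updates is enough to get close to the stationary distribution. The function name `gibbs_sample`, the unit ordering, and the weight matrix (the same assumed assignment of +2, -1, +1 as in the small example) are all hypothetical.

```python
import math
import random


def gibbs_sample(weights, biases, n_sweeps, clamped=None, rng=None):
    """Run random-scan stochastic updates and return the final binary state.

    weights: symmetric matrix (list of lists) with a zero diagonal.
    clamped: optional dict {unit index: fixed state}; clamped units are
             never updated, which gives a posterior sample over the rest.
    """
    if rng is None:
        rng = random.Random(0)
    clamped = clamped or {}
    n = len(biases)

    # Start from a random global configuration, except for clamped units.
    state = [clamped[i] if i in clamped else rng.randint(0, 1) for i in range(n)]
    free = [i for i in range(n) if i not in clamped]

    for _ in range(n_sweeps * len(free)):
        i = rng.choice(free)  # pick a unit at random
        # Energy gap: total input from the bias and all other units.
        gap = biases[i] + sum(weights[i][j] * state[j] for j in range(n) if j != i)
        p_on = 1.0 / (1.0 + math.exp(-gap))
        state[i] = 1 if rng.random() < p_on else 0
    return state


# Units 0, 1 are visible (v1, v2); units 2, 3 are hidden (h1, h2).
W = [
    [0, 0, 2, 0],
    [0, 0, 0, 1],
    [2, 0, 0, -1],
    [0, 1, -1, 0],
]
b = [0, 0, 0, 0]
rng = random.Random(0)

# A sample from the model: every unit is free to change.
model_sample = gibbs_sample(W, b, n_sweeps=50, rng=rng)

# A sample from the posterior over hidden units given the data vector (1, 1).
posterior_sample = gibbs_sample(W, b, n_sweeps=50, clamped={0: 1, 1: 1}, rng=rng)
```

How many updates are needed before the chain is close enough to equilibrium depends on the network; the value of 50 sweeps here is arbitrary and only plausible because the example is tiny.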