In this video I'm going to talk about some advanced material. It's not really appropriate for a first course on neural networks, but I know that some of you are particularly interested in the emergence of deep learning. And the content of this video is mathematically very pretty, so I couldn't resist putting it in. The insight that stacking up restricted Boltzmann machines gives you something like a sigmoid belief net can actually be seen without doing any math, just by noticing that a restricted Boltzmann machine is actually the same thing as an infinitely deep sigmoid belief net with shared weights. Once again, weight sharing leads to something very interesting.

I'm now going to describe a very interesting explanation of why layer-by-layer learning works. It depends on the fact that there is an equivalence between restricted Boltzmann machines, which are undirected networks with symmetric connections, and infinitely deep directed networks in which every layer uses the same weight matrix. This equivalence also gives insight into why contrastive divergence learning works.

So an RBM is really just an infinitely deep sigmoid belief net with a lot of shared weights, and the Markov chain that we run when we want to sample from an RBM can be viewed as exactly the same thing as that sigmoid belief net. So here's the picture. We have a very deep sigmoid belief net. In fact, infinitely deep. We use the same weights at every layer. We have to have all the V layers being the same size as each other, and all the H layers being the same size as each other, but V and H can be different sizes. The distribution generated by this very deep network with replicated weights is exactly the equilibrium distribution that you get by alternating between sampling from P(v|h) and P(h|v), where both P(v|h) and P(h|v) are defined by the same weight matrix W. And that's exactly what you do when you take a restricted Boltzmann machine and run a Markov chain to get a sample from the equilibrium distribution. So a top-down pass starting from infinitely high up in this directed net is exactly equivalent to letting a restricted Boltzmann machine settle to equilibrium, and both define the same distribution: the sample you get at v0 if you run this infinite directed net would be an equilibrium sample of the equivalent RBM.
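To make that equivalence concrete, here's a minimal numpy sketch; it's my own illustration, not something from the lecture. It assumes binary units, ignores bias terms, and stores W with shape (num_visible, num_hidden), so that `v @ W` plays the role of multiplying by W transpose.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    # Turn a vector of probabilities into sampled binary states.
    return (rng.random(p.shape) < p).astype(np.float64)

def infer_h_given_v(v, W):
    # The "trivial" inference step: multiply by W transpose,
    # pass through the logistic sigmoid, then sample.
    return sample_bernoulli(sigmoid(v @ W))

def rbm_equilibrium_sample(W, num_steps=1000):
    # Alternating Gibbs sampling in the RBM.  Run for long enough and
    # the visible sample comes from the equilibrium distribution, which
    # is exactly the distribution the infinite directed net with tied
    # weights defines at v0.
    v = sample_bernoulli(np.full(W.shape[0], 0.5))
    for _ in range(num_steps):
        h = infer_h_given_v(v, W)               # sample from p(h | v)
        v = sample_bernoulli(sigmoid(h @ W.T))  # sample from p(v | h)
    return v
```

A top-down ancestral pass through the infinite directed net and this alternating chain are the same computation, which is the whole point of the equivalence.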
Now let's look at inference in this infinitely deep sigmoid belief net. In inference we start at v0, and then we have to infer the state of h0. Normally this would be a difficult thing to do because of explaining away. If, for example, hidden units k and j both had big positive weights to visible unit i, then we would expect that when we observe that i is on, k and j become anti-correlated in the posterior distribution. That's explaining away. However, in this net, k and j are completely independent of one another when we do inference given v0. So the inference is trivial: we just multiply v0 by the transpose of W, put whatever we get through the logistic sigmoid, and then sample. That gives us binary states for the units in h0. But the question is, how could they possibly be independent, given explaining away? The answer is that the model above h0 implements what I call a complementary prior: a prior distribution over h0 that exactly cancels out the correlations of explaining away. So for the example shown, the prior will implement positive correlations between k and j. Explaining away will cause negative correlations, and those will exactly cancel.

So what's really going on is that when we multiply v0 by the transpose of the weights, we're not just computing the likelihood term. We're computing the product of a likelihood term and a prior term, and that's what you need to do to get the posterior. It normally comes as a big surprise to people that when you multiply by W transpose, it's the posterior you're computing: the product of the prior and the likelihood. So what's happening in this net is that the complementary prior implemented by all the stuff above h0 exactly cancels out explaining away, and that makes inference very simple. And that's true at every layer of this net, so we can do inference layer by layer and get an unbiased sample at each layer. We start by multiplying v0 by W transpose to get the binary state of h0. Then once we've computed the binary state of h0, we multiply that by W, put the result through the logistic sigmoid and sample, and that gives us a binary state for v1, and so on all the way up. So generating from this model is equivalent to running the alternating Markov chain of a restricted Boltzmann machine to equilibrium, and performing inference in this model is exactly the same process in the opposite direction. This is a very special kind of sigmoid belief net in which inference is as easy as generation. So here I've shown the generative weights that define the model, and also their transposes, which are the weights we use for inference.

Now what I want to show is how we get the Boltzmann machine learning algorithm out of the learning algorithm for directed sigmoid belief nets. The learning rule for a sigmoid belief net says that we should first get a sample from the posterior; that's what s_j and s_i are, samples from the posterior distribution. Then we should change the generative weight in proportion to the product of the presynaptic activity s_j and the difference between the postsynaptic activity s_i and the probability p_i of turning on unit i given the binary states of the layer that s_j is in. Now if we ask how to compute p_i, something very interesting happens. If you look at inference in this network on the right, we first infer a binary state for h0. Once we've chosen that binary state, we then infer a binary state for v1 by multiplying h0 by W, putting the result through the logistic, and then sampling. So if you think about how s_i^1 was generated: it was a sample from what we get if we put h0 through the weight matrix W and then through the logistic. And that's exactly what we'd have to do in order to compute p_i^0: we'd have to take the binary activities in h0 and, going downwards through the green weights W, compute the probability of turning on unit i given the binary states of its parents. So the point is that the process that goes from h0 to v1 is identical to the process that goes from h0 to v0, and so s_i^1 is an unbiased sample of p_i^0. That means we can replace it in the learning rule.

So we end up with a learning rule that looks like this. Because we have replicated weights, each of these lines is the term in the learning rule that comes from one of those green weight matrices. For the first green weight matrix, the learning rule is the presynaptic state s_j^0 times the difference between the postsynaptic state s_i^0 and the probability that the binary states in h0 would turn on unit i. We could call that probability p_i^0, but a sample with that probability is s_i^1, so an unbiased estimate of the derivative can be got by plugging in s_i^1 on that first line of the learning rule. Similarly, for the second weight matrix, the learning rule is s_i^1 times (s_j^0 minus p_j^0), and an unbiased estimate of p_j^0 is s_j^1, which gives an unbiased estimate of the learning rule for the second weight matrix. And if you just keep going for all the weight matrices, you get an infinite series in which all the terms except the very first term and the very last term cancel out. So you end up with the Boltzmann machine learning rule, which is just s_j^0 s_i^0 minus s_j^∞ s_i^∞.
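To put that in symbols (my notation, not the lecture's slides): with $s_j^0$ the binary states of h0 and $s_i^0$ the states of v0, the rule for the first weight matrix and the substitution just described are

$$
\Delta w_{ij} \;\propto\; s_j^0\,(s_i^0 - p_i^0),
\qquad
p_i^0 \;=\; \sigma\!\Big(\sum_j w_{ij}\, s_j^0\Big),
\qquad
\mathbb{E}\big[s_i^1\big] \;=\; p_i^0 ,
$$

so $s_j^0\,(s_i^0 - s_i^1)$ is an unbiased estimate of the first term of the series.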
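Writing out the whole series in the same notation makes the cancellation visible:

$$
\frac{\partial \log p(v^0)}{\partial w_{ij}}
\;\approx\;
s_j^0\,(s_i^0 - s_i^1)
\;+\; s_i^1\,(s_j^0 - s_j^1)
\;+\; s_j^1\,(s_i^1 - s_i^2)
\;+\;\cdots
\;=\;
s_j^0\, s_i^0 \;-\; s_j^\infty s_i^\infty .
$$

Each intermediate product appears once with a plus sign and once with a minus sign, so only the first and last terms survive, and that is the Boltzmann machine learning rule (the "≈" is because each term is an unbiased sample-based estimate rather than an exact expectation).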
So let's go back and look at how we would learn an infinitely deep sigmoid belief net. We would start by making all the weight matrices the same: we tie all the weight matrices together, and we learn using those tied weights. That's exactly equivalent to learning a restricted Boltzmann machine. The diagram on the right and the diagram on the left are identical; we can think of the symmetric arrow in the diagram on the left as just a convenient shorthand for an infinite directed net with tied weights. So we first learn that restricted Boltzmann machine. Now, we ought to learn it using maximum likelihood learning, but actually we're just going to use contrastive divergence learning. We're going to take a shortcut.

Once we've learned the first restricted Boltzmann machine, we freeze the bottom-level weights. We freeze the generative weights that define the model, and we also freeze the weights we're going to use for inference to be the transpose of those generative weights. We keep all the other weights tied together, but now we allow them to be different from the weights in the bottom layer. Learning the remaining weights, still tied together, is exactly equivalent to learning another restricted Boltzmann machine: namely, a restricted Boltzmann machine with h0 as its visible units and v1 as its hidden units, where the data is the aggregated posterior across h0. That is, if we want to sample a data vector to train this network, we put in a real data vector v0, we do inference through those frozen weights, we get a binary vector at h0, and we treat that as data for training the next restricted Boltzmann machine. And we can go up for as many layers as we like. When we get fed up, we just end up with a restricted Boltzmann machine at the top, which is equivalent to saying that all the weights in the infinite directed net above there are still tied together, but the weights below have now all become different.
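Here's a minimal sketch of that greedy, layer-by-layer procedure, reusing the sigmoid and sample_bernoulli helpers from the earlier sketch. It uses the one-step contrastive divergence shortcut mentioned above; the function names and hyperparameters are mine, not the lecture's.

```python
def cd1_update(v0, W, lr):
    # One-step contrastive divergence for a binary RBM (biases omitted).
    h0 = sample_bernoulli(sigmoid(v0 @ W))    # inference through W transpose
    v1 = sample_bernoulli(sigmoid(h0 @ W.T))  # one-step reconstruction
    h1 = sigmoid(v1 @ W)                      # probabilities suffice here
    return W + lr * (np.outer(v0, h0) - np.outer(v1, h1))

def train_stack(data, layer_sizes, epochs=10, lr=0.01):
    # Greedily train a stack of RBMs.  After each RBM is trained its
    # weights are frozen, and inference through those frozen weights
    # turns the data into "aggregated posterior" data for the next RBM.
    frozen = []
    for num_hidden in layer_sizes:
        W = 0.01 * rng.standard_normal((data.shape[1], num_hidden))
        for _ in range(epochs):
            for v in data:
                W = cd1_update(v, W, lr)
        frozen.append(W)                              # freeze this layer
        data = sample_bernoulli(sigmoid(data @ W))    # map data upward
    return frozen
```

A real implementation would add biases, mini-batches and momentum; the point here is just the shape of the procedure: train, freeze, map the data upward, repeat.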
Now, the explanation of why the inference procedure was correct involved the idea of a complementary prior created by the weights in the layers above. But of course, when we change the weights in the layers above and leave the bottom layer of weights fixed, the prior created by those changed weights is no longer exactly complementary. So our inference procedure, using the frozen weights in the bottom layer, is no longer exactly correct. The good news is that it's nearly always very close to correct, and with the incorrect inference procedure we still get a variational bound on the log probability of the data. The higher layers have changed because they've learned a prior for the bottom hidden layer that's closer to the aggregated posterior distribution, and that makes the model better.

So changing the higher weights makes the inference that we're doing at the bottom hidden layer incorrect, but gives us a better model. And if you look at those two effects, we can prove that the improvement you get in the variational bound from having a better model is always greater than the loss you get from the inference being slightly incorrect. So in this variational bound, you win when you learn the weights in the higher layers, assuming that you do it with correct maximum likelihood learning.

So now let's go back to what's happening in contrastive divergence learning. We have the infinite net on the right and we have a restricted Boltzmann machine on the left, and they're equivalent. If we were to do maximum likelihood learning for the restricted Boltzmann machine, it would be maximum likelihood learning for the infinite sigmoid belief net. But what we're going to do is cut things off: we're going to ignore the small derivatives for the weights that you get in the higher layers of the infinite sigmoid belief net. So we cut it off where that dotted red line is. And now if we look at the derivatives, the derivatives we're going to get have two terms. The first term comes from the bottom layer of weights; we've seen that before, the derivative for the bottom layer of weights is just the first line here. The second term comes from the next layer of weights; that's this line here. We need to compute the activities in h1 in order to compute the s_j^1 in that second line, but we're not actually computing derivatives for the third layer of weights. And when we take those first two terms and combine them, we get exactly the learning rule for one-step contrastive divergence. So what's going on in contrastive divergence is that we're combining the weight derivatives for the lower layers and ignoring the weight derivatives in the higher layers.
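To spell out the algebra of that combination in the notation from before: keeping only the two terms below the cut and cancelling gives

$$
s_j^0\,(s_i^0 - s_i^1) \;+\; s_i^1\,(s_j^0 - s_j^1)
\;=\;
s_j^0\, s_i^0 \;-\; s_j^1\, s_i^1 ,
$$

which is exactly the one-step contrastive divergence update: pairwise statistics measured on the data minus the same statistics measured on the one-step reconstruction.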
The question is, why can we get away with ignoring those higher derivatives? When the weights are small, the Markov chain mixes very fast. If the weights are zero, it mixes in one step. And if the Markov chain mixes fast, the higher layers will be close to the equilibrium distribution; that is, they will have forgotten what the input was at the bottom layer. And now we have a nice property: if the higher layers are sampled from the equilibrium distribution, we know that the derivatives of the log probability of the data with respect to the weights must average out to zero. That's because the current weights are a perfect model of the equilibrium distribution: the equilibrium distribution is generated using those weights, and if you want to generate samples from the equilibrium distribution, those are the best possible weights you could have. So we know the derivative there is zero.

As the weights get larger, we might have to run more iterations of contrastive divergence, which corresponds to taking into account more layers of that infinite sigmoid belief net. That will allow contrastive divergence to continue to be a good approximation to maximum likelihood, and so if we're trying to learn a density model, that makes a lot of sense: as the weights grow, you run CD for more and more steps. If there's a statistician around and you want to give them a guarantee, then in the limit you'll run CD for infinitely many steps, and then you have an asymptotic convergence result, which is the thing that keeps statisticians happy. Of course, it's completely irrelevant, because you'll never reach a point like that.

There is, however, an interesting point here. If our purpose in using CD is to build a stack of restricted Boltzmann machines that learn multiple layers of features, it turns out that we don't need a good approximation to maximum likelihood. For learning multiple layers of features, CD1 is just fine. In fact, it's probably better than doing maximum likelihood.
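For completeness, here's how the one-step update from the earlier sketch generalizes to CD-k, which corresponds to taking more layers of the infinite sigmoid belief net into account; again this reuses the earlier helpers, and the names are my own.

```python
def cdk_update(v0, W, k=1, lr=0.01):
    # CD-k: run k steps of alternating Gibbs sampling before measuring
    # the negative statistics.  k=1 recovers cd1_update; larger k keeps
    # more layers of the infinite net and gets closer to maximum likelihood.
    h0 = sample_bernoulli(sigmoid(v0 @ W))
    v, h = v0, h0
    for _ in range(k):
        v = sample_bernoulli(sigmoid(h @ W.T))
        h = sample_bernoulli(sigmoid(v @ W))
    return W + lr * (np.outer(v0, h0) - np.outer(v, h))
```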