In this video I'm going to talk about some advanced material. It's not really appropriate for a first course on neural networks, but I know that some of you are particularly interested in the emergence of deep learning. And the content of this video is mathematically very pretty, so I couldn't resist putting it in. The insight that stacking up restricted Boltzmann machines gives you something like a sigmoid belief net can actually be seen without doing any math, just by noticing that a restricted Boltzmann machine is actually the same thing as an infinitely deep sigmoid belief net with shared weights. Once again, weight sharing leads to something very interesting.

I'm now going to describe a very interesting explanation of why layer-by-layer learning works. It depends on the fact that there is an equivalence between restricted Boltzmann machines, which are undirected networks with symmetric connections, and infinitely deep directed networks in which every layer uses the same weight matrix. This equivalence also gives insight into why contrastive divergence learning works.

So an RBM is really just an infinitely deep sigmoid belief net with a lot of shared weights, and the Markov chain that we run when we want to sample from an RBM can be viewed as exactly the same thing as that sigmoid belief net. So here's the picture. We have a very deep sigmoid belief net. In fact, infinitely deep. We use the same weights at every layer. We have to have all the V layers being the same size as each other, and all the H layers being the same size as each other, but V and H can be different sizes. The distribution generated by this very deep network with replicated weights is exactly the equilibrium distribution that you get by alternating between sampling from P(v|h) and P(h|v), where both P(v|h) and P(h|v) are defined by the same weight matrix W. And that's exactly what you do when you take a restricted Boltzmann machine and run a Markov chain to get a sample from the equilibrium distribution. So a top-down pass starting from infinitely high up in this directed net is exactly equivalent to letting a restricted Boltzmann machine settle to equilibrium, and both define the same distribution: the sample you get at v0 if you run this infinite directed net would be an equilibrium sample of the equivalent RBM.
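To make that equivalence concrete, here's a minimal numpy sketch; it's my own illustration, not something from the lecture. It assumes binary units, ignores bias terms, and stores W with shape (num_visible, num_hidden), so that `v @ W` plays the role of multiplying by W transpose.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    # Turn a vector of probabilities into sampled binary states.
    return (rng.random(p.shape) < p).astype(np.float64)

def infer_h_given_v(v, W):
    # The "trivial" inference step: multiply by W transpose,
    # pass through the logistic sigmoid, then sample.
    return sample_bernoulli(sigmoid(v @ W))

def rbm_equilibrium_sample(W, num_steps=1000):
    # Alternating Gibbs sampling in the RBM.  Run for long enough and
    # the visible sample comes from the equilibrium distribution, which
    # is exactly the distribution the infinite directed net with tied
    # weights defines at v0.
    v = sample_bernoulli(np.full(W.shape[0], 0.5))
    for _ in range(num_steps):
        h = infer_h_given_v(v, W)               # sample from p(h | v)
        v = sample_bernoulli(sigmoid(h @ W.T))  # sample from p(v | h)
    return v
```

A top-down ancestral pass through the infinite directed net and this alternating chain are the same computation, which is the whole point of the equivalence.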
Now let's look at inference in this infinitely deep sigmoid belief net. In inference we start at v0, and then we have to infer the state of h0. Normally this would be a difficult thing to do because of explaining away. If, for example, hidden units k and j both had big positive weights to visible unit i, then we would expect that when we observe that i is on, k and j become anti-correlated in the posterior distribution. That's explaining away. However, in this net, k and j are completely independent of one another when we do inference given v0. So the inference is trivial: we just multiply v0 by the transpose of W, put whatever we get through the logistic sigmoid, and then sample. That gives us binary states for the units in h0. But the question is, how could they possibly be independent, given explaining away? The answer is that the model above h0 implements what I call a complementary prior: a prior distribution over h0 that exactly cancels out the correlations of explaining away. So for the example shown, the prior will implement positive correlations between k and j. Explaining away will cause negative correlations, and those will exactly cancel.

So what's really going on is that when we multiply v0 by the transpose of the weights, we're not just computing the likelihood term. We're computing the product of a likelihood term and a prior term, and that's what you need to do to get the posterior. It normally comes as a big surprise to people that when you multiply by W transpose, it's the posterior you're computing: the product of the prior and the likelihood. So what's happening in this net is that the complementary prior implemented by all the stuff above h0 exactly cancels out explaining away, and that makes inference very simple. And that's true at every layer of this net, so we can do inference layer by layer and get an unbiased sample at each layer. We start by multiplying v0 by W transpose to get the binary state of h0. Then once we've computed the binary state of h0, we multiply that by W, put the result through the logistic sigmoid and sample, and that gives us a binary state for v1, and so on all the way up. So generating from this model is equivalent to running the alternating Markov chain of a restricted Boltzmann machine to equilibrium, and performing inference in this model is exactly the same process in the opposite direction. This is a very special kind of sigmoid belief net in which inference is as easy as generation. So here I've shown the generative weights that define the model, and also their transposes, which are the weights we use for inference.

Now what I want to show is how we get the Boltzmann machine learning algorithm out of the learning algorithm for directed sigmoid belief nets. The learning rule for a sigmoid belief net says that we should first get a sample from the posterior; that's what s_j and s_i are, samples from the posterior distribution. Then we should change the generative weight in proportion to the product of the presynaptic activity s_j and the difference between the postsynaptic activity s_i and the probability p_i of turning on unit i given the binary states of the layer that s_j is in. Now if we ask how to compute p_i, something very interesting happens. If you look at inference in this network on the right, we first infer a binary state for h0. Once we've chosen that binary state, we then infer a binary state for v1 by multiplying h0 by W, putting the result through the logistic, and then sampling. So if you think about how s_i^1 was generated: it was a sample from what we get if we put h0 through the weight matrix W and then through the logistic. And that's exactly what we'd have to do in order to compute p_i^0: we'd have to take the binary activities in h0 and, going downwards through the green weights W, compute the probability of turning on unit i given the binary states of its parents. So the point is that the process that goes from h0 to v1 is identical to the process that goes from h0 to v0, and so s_i^1 is an unbiased sample of p_i^0. That means we can replace it in the learning rule.

So we end up with a learning rule that looks like this. Because we have replicated weights, each of these lines is the term in the learning rule that comes from one of those green weight matrices. For the first green weight matrix, the learning rule is the presynaptic state s_j^0 times the difference between the postsynaptic state s_i^0 and the probability that the binary states in h0 would turn on unit i. We could call that probability p_i^0, but a sample with that probability is s_i^1, so an unbiased estimate of the derivative can be got by plugging in s_i^1 on that first line of the learning rule. Similarly, for the second weight matrix, the learning rule is s_i^1 times (s_j^0 minus p_j^0), and an unbiased estimate of p_j^0 is s_j^1, which gives an unbiased estimate of the learning rule for the second weight matrix. And if you just keep going for all the weight matrices, you get an infinite series in which all the terms except the very first term and the very last term cancel out. So you end up with the Boltzmann machine learning rule, which is just s_j^0 s_i^0 minus s_j^∞ s_i^∞.
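To put that in symbols (my notation, not the lecture's slides): with $s_j^0$ the binary states of h0 and $s_i^0$ the states of v0, the rule for the first weight matrix and the substitution just described are

$$
\Delta w_{ij} \;\propto\; s_j^0\,(s_i^0 - p_i^0),
\qquad
p_i^0 \;=\; \sigma\!\Big(\sum_j w_{ij}\, s_j^0\Big),
\qquad
\mathbb{E}\big[s_i^1\big] \;=\; p_i^0 ,
$$

so $s_j^0\,(s_i^0 - s_i^1)$ is an unbiased estimate of the first term of the series.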
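Writing out the whole series in the same notation makes the cancellation visible:

$$
\frac{\partial \log p(v^0)}{\partial w_{ij}}
\;\approx\;
s_j^0\,(s_i^0 - s_i^1)
\;+\; s_i^1\,(s_j^0 - s_j^1)
\;+\; s_j^1\,(s_i^1 - s_i^2)
\;+\;\cdots
\;=\;
s_j^0\, s_i^0 \;-\; s_j^\infty s_i^\infty .
$$

Each intermediate product appears once with a plus sign and once with a minus sign, so only the first and last terms survive, and that is the Boltzmann machine learning rule (the "≈" is because each term is an unbiased sample-based estimate rather than an exact expectation).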
So let's go back and look at how we would learn an infinitely deep sigmoid belief net. We would start by making all the weight matrices the same: we tie all the weight matrices together, and we learn using those tied weights. That's exactly equivalent to learning a restricted Boltzmann machine. The diagram on the right and the diagram on the left are identical; we can think of the symmetric arrow in the diagram on the left as just a convenient shorthand for an infinite directed net with tied weights. So we first learn that restricted Boltzmann machine. Now, we ought to learn it using maximum likelihood learning, but actually we're just going to use contrastive divergence learning. We're going to take a shortcut.

Once we've learned the first restricted Boltzmann machine, we freeze the bottom-level weights. We freeze the generative weights that define the model, and we also freeze the weights we're going to use for inference to be the transpose of those generative weights. We keep all the other weights tied together, but now we allow them to be different from the weights in the bottom layer. Learning the remaining weights, still tied together, is exactly equivalent to learning another restricted Boltzmann machine: namely, a restricted Boltzmann machine with h0 as its visible units and v1 as its hidden units, where the data is the aggregated posterior across h0. That is, if we want to sample a data vector to train this network, we put in a real data vector v0, we do inference through those frozen weights, we get a binary vector at h0, and we treat that as data for training the next restricted Boltzmann machine. And we can go up for as many layers as we like. When we get fed up, we just end up with a restricted Boltzmann machine at the top, which is equivalent to saying that all the weights in the infinite directed net above there are still tied together, but the weights below have now all become different.
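Here's a minimal sketch of that greedy, layer-by-layer procedure, reusing the sigmoid and sample_bernoulli helpers from the earlier sketch. It uses the one-step contrastive divergence shortcut mentioned above; the function names and hyperparameters are mine, not the lecture's.

```python
def cd1_update(v0, W, lr):
    # One-step contrastive divergence for a binary RBM (biases omitted).
    h0 = sample_bernoulli(sigmoid(v0 @ W))    # inference through W transpose
    v1 = sample_bernoulli(sigmoid(h0 @ W.T))  # one-step reconstruction
    h1 = sigmoid(v1 @ W)                      # probabilities suffice here
    return W + lr * (np.outer(v0, h0) - np.outer(v1, h1))

def train_stack(data, layer_sizes, epochs=10, lr=0.01):
    # Greedily train a stack of RBMs.  After each RBM is trained its
    # weights are frozen, and inference through those frozen weights
    # turns the data into "aggregated posterior" data for the next RBM.
    frozen = []
    for num_hidden in layer_sizes:
        W = 0.01 * rng.standard_normal((data.shape[1], num_hidden))
        for _ in range(epochs):
            for v in data:
                W = cd1_update(v, W, lr)
        frozen.append(W)                              # freeze this layer
        data = sample_bernoulli(sigmoid(data @ W))    # map data upward
    return frozen
```

A real implementation would add biases, mini-batches and momentum; the point here is just the shape of the procedure: train, freeze, map the data upward, repeat.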
Now, the explanation of why the inference procedure was correct involved the idea of a complementary prior created by the weights in the layers above. But of course, when we change the weights in the layers above and leave the bottom layer of weights fixed, the prior created by those changed weights is no longer exactly complementary. So our inference procedure, using the frozen weights in the bottom layer, is no longer exactly correct. The good news is that it's nearly always very close to correct, and with the incorrect inference procedure we still get a variational bound on the log probability of the data. The higher layers have changed because they've learned a prior for the bottom hidden layer that's closer to the aggregated posterior distribution, and that makes the model better.

So changing the higher weights makes the inference that we're doing at the bottom hidden layer incorrect, but gives us a better model. And if you look at those two effects, we can prove that the improvement you get in the variational bound from having a better model is always greater than the loss you get from the inference being slightly incorrect. So in this variational bound, you win when you learn the weights in the higher layers, assuming that you do it with correct maximum likelihood learning.

So now let's go back to what's happening in contrastive divergence learning. We have the infinite net on the right and we have a restricted Boltzmann machine on the left, and they're equivalent. If we were to do maximum likelihood learning for the restricted Boltzmann machine, it would be maximum likelihood learning for the infinite sigmoid belief net. But what we're going to do is cut things off: we're going to ignore the small derivatives for the weights that you get in the higher layers of the infinite sigmoid belief net. So we cut it off where that dotted red line is. And now if we look at the derivatives, the derivatives we're going to get have two terms. The first term comes from the bottom layer of weights; we've seen that before, the derivative for the bottom layer of weights is just the first line here. The second term comes from the next layer of weights; that's this line here. We need to compute the activities in h1 in order to compute the s_j^1 in that second line, but we're not actually computing derivatives for the third layer of weights. And when we take those first two terms and combine them, we get exactly the learning rule for one-step contrastive divergence. So what's going on in contrastive divergence is that we're combining the weight derivatives for the lower layers and ignoring the weight derivatives in the higher layers.
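To spell out the algebra of that combination in the notation from before: keeping only the two terms below the cut and cancelling gives

$$
s_j^0\,(s_i^0 - s_i^1) \;+\; s_i^1\,(s_j^0 - s_j^1)
\;=\;
s_j^0\, s_i^0 \;-\; s_j^1\, s_i^1 ,
$$

which is exactly the one-step contrastive divergence update: pairwise statistics measured on the data minus the same statistics measured on the one-step reconstruction.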
The question is, why can we get away with ignoring those higher derivatives? When the weights are small, the Markov chain mixes very fast. If the weights are zero, it mixes in one step. And if the Markov chain mixes fast, the higher layers will be close to the equilibrium distribution; that is, they will have forgotten what the input was at the bottom layer. And now we have a nice property: if the higher layers are sampled from the equilibrium distribution, we know that the derivatives of the log probability of the data with respect to the weights must average out to zero. That's because the current weights are a perfect model of the equilibrium distribution: the equilibrium distribution is generated using those weights, and if you want to generate samples from the equilibrium distribution, those are the best possible weights you could have. So we know the derivative there is zero.

As the weights get larger, we might have to run more iterations of contrastive divergence, which corresponds to taking into account more layers of that infinite sigmoid belief net. That will allow contrastive divergence to continue to be a good approximation to maximum likelihood, and so if we're trying to learn a density model, that makes a lot of sense: as the weights grow, you run CD for more and more steps. If there's a statistician around and you want to give them a guarantee, then in the limit you'll run CD for infinitely many steps, and then you have an asymptotic convergence result, which is the thing that keeps statisticians happy. Of course, it's completely irrelevant, because you'll never reach a point like that.

There is, however, an interesting point here. If our purpose in using CD is to build a stack of restricted Boltzmann machines that learn multiple layers of features, it turns out that we don't need a good approximation to maximum likelihood. For learning multiple layers of features, CD1 is just fine. In fact, it's probably better than doing maximum likelihood.
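For completeness, here's how the one-step update from the earlier sketch generalizes to CD-k, which corresponds to taking more layers of the infinite sigmoid belief net into account; again this reuses the earlier helpers, and the names are my own.

```python
def cdk_update(v0, W, k=1, lr=0.01):
    # CD-k: run k steps of alternating Gibbs sampling before measuring
    # the negative statistics.  k=1 recovers cd1_update; larger k keeps
    # more layers of the infinite net and gets closer to maximum likelihood.
    h0 = sample_bernoulli(sigmoid(v0 @ W))
    v, h = v0, h0
    for _ in range(k):
        v = sample_bernoulli(sigmoid(h @ W.T))
        h = sample_bernoulli(sigmoid(v @ W))
    return W + lr * (np.outer(v0, h0) - np.outer(v, h))
```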