In this video, I'll go into more detail about how we can speed up the Boltzmann machine learning algorithm by using cleverer ways of keeping Markov chains near the equilibrium distribution, or by using what are called mean field methods. The material is quite advanced and so it's not really part of the course. There won't be any quizzes on it and it's not on the final test. You can safely skip this video. It's included for people who are really interested in how to get deep Boltzmann machines to work well.

There are better ways of collecting the statistics than the method that Terry Sejnowski and I originally came up with. If we start from a random state, it may take a long time to reach thermal equilibrium. Also, there's no easy test for whether you've reached thermal equilibrium, so we don't know how long we need to run for. So the idea is: why not start from whatever state you ended up in the last time you saw that particular data vector? We remember the interpretation of the data vector in the hidden units, and we start from there. This stored state, the interpretation of the data vector, is called a particle. Using particles that persist gives us a warm start, and that has a big advantage: if we were at equilibrium before and we only updated the weights a little bit, it will only take a few updates of the units in a particle to bring it back to equilibrium. We can use particles both for the positive phase, where we have a clamped data vector, and for the negative phase, where nothing is clamped.

So here's the method for collecting statistics introduced by Radford Neal in 1992. In the positive phase, you have a set of data-specific particles, one or a few per training case. Each particle has a current value that is a configuration of the hidden units, plus which data vector it goes with. You sequentially update all the hidden units a few times in each particle, with the relevant data vector clamped. Then, for every connected pair of units, you average the probability of the two units both being on over all these particles. In the negative phase, you keep a set of fantasy particles; these are global configurations. Again, after each weight update, you sequentially update all the units in each fantasy particle a few times; now you're updating the visible units as well. Then, for every connected pair of units, you average s_i s_j over all the fantasy particles. The learning rule is that the change in the weights is proportional to the difference between the average you got with the data, averaged over all training cases, and the average you got with the fantasy particles when nothing was clamped. This works better than the learning rule that Terry Sejnowski and I introduced, at least for full-batch learning.

However, it's difficult to apply this approach to mini-batches. The reason is that by the time we get back to the same data vector, if we're using mini-batch learning, the weights will have been updated many times. So the stored data-specific particle for that data vector won't be anywhere near thermal equilibrium anymore: the hidden units won't be in thermal equilibrium with the visible units of the particle given the new weights. And again, we don't know how long we're going to have to run for before we get close to equilibrium again. We can overcome this by making a strong assumption about how we understand the world. It's a kind of epistemological assumption.
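As an aside before we get to that assumption, here's a minimal sketch of the persistent-particle way of collecting statistics just described, written in Python/NumPy. It assumes a fully connected Boltzmann machine with a symmetric, zero-diagonal weight matrix W and biases b; the function names, and the use of sampled binary states for the positive statistics (rather than averaging exact probabilities), are simplifications of mine, not part of Neal's description.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(state, W, b, units, rng):
    # Sequentially resample the given units of one particle (one sweep of Gibbs sampling).
    for i in units:
        p_on = sigmoid(b[i] + W[i] @ state)   # W has a zero diagonal, so unit i doesn't feed itself
        state[i] = 1.0 if rng.random() < p_on else 0.0

def collect_statistics(W, b, data_particles, fantasy_particles,
                       hidden_idx, all_idx, n_sweeps=5, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    # Positive phase: visible units stay clamped, only the hidden units are resampled.
    for s in data_particles:
        for _ in range(n_sweeps):
            gibbs_sweep(s, W, b, hidden_idx, rng)
    # Negative phase: fantasy particles have everything resampled, visible units included.
    for s in fantasy_particles:
        for _ in range(n_sweeps):
            gibbs_sweep(s, W, b, all_idx, rng)
    pos = np.mean([np.outer(s, s) for s in data_particles], axis=0)
    neg = np.mean([np.outer(s, s) for s in fantasy_particles], axis=0)
    return pos, neg   # the weight change is proportional to (pos - neg)
```

With mini-batch learning, though, the stored data-specific particles go stale between visits, which is exactly the problem the assumption below is meant to get around.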
We're going to assume that when a data vector is clamped, the set of good explanations, that is, the states of the hidden units that act as interpretations of that data vector, is uni-modal. That means we're saying that, for a given data vector, there aren't two very different explanations of it. We assume that for a sensory input there is one correct explanation, and if we have a good model of the data, our model will give us one energy minimum for that data point. This is a restriction on the kinds of models we're willing to learn: we're going to use a learning algorithm that's incapable of learning models in which a data vector has many very different interpretations. Provided we're willing to make this assumption, we can use a very efficient method for approaching thermal equilibrium, or an approximation to thermal equilibrium, with the data. It's called the mean field approximation.

If we want to get the statistics right, we need to update the units stochastically and sequentially. The update rule is that the probability of turning on unit i is the logistic function of the total input it receives from the other units plus its bias, where s_j, the state of another unit, is a stochastic binary value. Instead of using that rule, we can decide not to keep binary states for unit i, but to keep a real value between zero and one which we call a probability. That probability at time t + 1 is the output of the logistic function applied to the bias plus the sum, over the other units, of their probabilities at time t times the weights: p_i(t + 1) = logistic(b_i + sum_j w_ij p_j(t)). So we're replacing the stochastic binary value by a real-valued probability. That's not quite right, because the stochastic binary value sits inside a non-linear function. If it were a linear function, things would be fine; but because the logistic is non-linear, we don't get exactly the right answer when we put probabilities, instead of fluctuating binary values, inside it. However, it works pretty well. It can go wrong by giving us biphasic oscillations, because now we're updating everything in parallel. We can normally deal with those by using what's called damped mean field, where we compute that p_i at time t + 1 but don't go all the way there: we go to a point between where we are now and where the update wants us to go. So in damped mean field, we go to lambda times where we are now, plus one minus lambda times the place the update rule tells us to go, and that kills the oscillations.

Now we can get an efficient mini-batch learning procedure for Boltzmann machines, and this is what Russ Salakhutdinov realized. In the positive phase, we initialize all the probabilities at 0.5, clamp a data vector on the visible units, and update all the hidden units in parallel using mean field until convergence. For mean field, you can recognize convergence as when the probabilities stop changing. Once we've converged, we record p_i p_j for every connected pair of units. In the negative phase, we do what we were doing before: we keep a set of fantasy particles, each of which has a value that's a global configuration, and after each weight update we sequentially update all the units in each fantasy particle a few times. Then, for every connected pair of units, we average s_i s_j, these stochastic binary values, over all the fantasy particles. The difference between those averages is the learning rule; that is, we change the weights by an amount proportional to that difference.
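To make the positive phase concrete, here's a small sketch of damped mean field in Python/NumPy. It assumes visible-to-hidden weights W_vh, symmetric zero-diagonal hidden-to-hidden weights W_hh, and hidden biases b_h; those names, and the particular convergence test, are choices of mine for illustration rather than anything from the lecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def damped_mean_field(v, W_vh, W_hh, b_h, lam=0.5, max_iters=50, tol=1e-4):
    # Settle the hidden probabilities with the data vector v clamped on the visible units.
    p = np.full(b_h.shape, 0.5)                        # initialize all probabilities at 0.5
    for _ in range(max_iters):
        target = sigmoid(b_h + W_vh.T @ v + W_hh @ p)  # fully parallel mean-field update
        new_p = lam * p + (1.0 - lam) * target         # damping suppresses biphasic oscillations
        if np.max(np.abs(new_p - p)) < tol:            # "converged" once probabilities stop changing
            return new_p
        p = new_p
    return p

# Positive statistics for this training case are then products of probabilities,
# e.g. np.outer(v, p) for visible-hidden pairs and np.outer(p, p) for hidden-hidden pairs.
```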
If we want to make the updates of the fantasy particles more parallel, we can change the architecture of the Boltzmann machine. We use a special architecture that allows alternating parallel updates for the fantasy particles: there are no connections within a layer and no skip-layer connections, but we allow ourselves lots of hidden layers. So the architecture looks like this. We call it a deep Boltzmann machine. It's really a general Boltzmann machine with lots of missing connections: if all those skip-layer connections were present, we wouldn't really have layers at all, it would just be a general Boltzmann machine. But in this special architecture, there's something nice we can do. We can update the states of, for example, the first hidden layer and the third hidden layer, given the current states of the visible units and the second hidden layer. Then we can update the states of the visible units and the second hidden layer, then go back and update the other layers, and go backwards and forwards like this. So we can update half of the units in parallel, and that will be a correct update.

So one question is: if we have a deep Boltzmann machine like that, trained by using mean field for the positive phase and updating fantasy particles by alternating between even layers and odd layers for the negative phase, can we learn good models of things like the MNIST digits, or indeed more complicated things? One way to tell whether you've learned a good model is, after learning, to remove all the input and just generate samples from your model: you run the Markov chain for a long time until it has burned in, and then you look at the samples you get. Russ Salakhutdinov used a deep Boltzmann machine to model MNIST digits, using mean field for the positive phase and alternating updates of the layers of the particles for the negative phase. The real data looks like this, and the data he got from his model looks like this. You can see they're actually fairly similar: the model is producing things very like the MNIST digits, so it's learned a pretty good model.

So here's a puzzle. When he was learning that, he was using mini-batches with 100 data examples, and he was also using 100 fantasy particles, the same 100 fantasy particles for every mini-batch. The puzzle is: how can we estimate the negative statistics with only 100 negative examples to characterize the whole space? For all interesting problems, the space of global configurations is going to be highly multi-modal, so how do we manage to find and represent all the modes with only 100 particles? There's an interesting answer to this. The learning interacts with the Markov chain that's being used to gather the negative statistics, that is, the chain used to update the fantasy particles, and it interacts with it so as to give it a much higher effective mixing rate. That means we cannot analyze the learning by thinking of it as an outer loop that updates the weights and an inner loop that gathers statistics with a fixed set of weights: the learning is affecting how effective that inner loop is. The reason for this is that wherever the fantasy particles outnumber the positive data, the energy surface is raised, and this has an effect on the mixing rate of the Markov chain. It makes the fantasy particles rush around hyper-actively, moving around much faster than the mixing rate of the Markov chain defined by the current, static weights.
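Coming back to the alternating update described above for a moment, here's a rough sketch of one negative-phase sweep over a deep Boltzmann machine fantasy particle, again in Python/NumPy. The layer and weight layout (layers[0] is the visible layer, weights[l] connects layer l to layer l + 1) is an assumption of mine for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def alternating_gibbs_sweep(layers, weights, biases, rng):
    # One negative-phase sweep over a DBM fantasy particle.
    # All even-numbered layers are sampled in parallel given the odd-numbered ones, then vice versa.
    L = len(layers)
    for parity in (0, 1):
        for l in range(parity, L, 2):
            total = biases[l].copy()
            if l > 0:
                total += weights[l - 1].T @ layers[l - 1]   # input from the layer below
            if l < L - 1:
                total += weights[l] @ layers[l + 1]         # input from the layer above
            layers[l] = (rng.random(total.shape) < sigmoid(total)).astype(float)
    return layers
```

Each half-step is a valid block Gibbs update because, with no within-layer or skip-layer connections, layers of the same parity are conditionally independent given the other half; after each weight update you would run a few of these sweeps on every fantasy particle.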
So, here's a picture that shows what's going on with the fantasy particles and the energy surface. If there's a mode in the energy surface that has more fantasy particles than data, the energy surface will be raised until the fantasy particles escape from that mode. The mode on the left has four fantasy particles and only two data points, so the effect of the learning is going to be to raise the energy there. That energy barrier might be much too high for a Markov chain to be able to cross, so the mixing rate would be very slow. But the learning will actually spill those red particles out of that energy minimum by raising the minimum: the minimum gets filled in, and the fantasy particles escape and go off somewhere else, to some other deep minimum. So we can get out of minima that the Markov chain would not be able to get out of, at least not in a reasonable time. What's going on here is that the energy surface is really being used for two different purposes. It represents our model, but it's also being manipulated by the learning algorithm to make the Markov chain mix faster, or rather, to have the effect of a faster-mixing Markov chain. Once the fantasy particles have filled up one hole, they'll rush off somewhere else and deal with the next problem. An analogy is that they're like investigative journalists who rush in to investigate some nasty problem, and as soon as the publicity has caused that problem to be fixed, instead of saying, okay, everything is fine now, they rush off to find the next nasty problem.