In the previous video, I showed how a Boltzmann machine can be used as a probabilistic model of a set of binary data vectors. In this video, we're finally going to get around to the Boltzmann machine learning algorithm. It's a very simple learning algorithm with an elegant theoretical justification, but in practice it turned out to be extremely slow and noisy, and just wasn't practical. For many years, people thought that Boltzmann machines would never be practical devices. Then we found several different ways of greatly speeding up the learning algorithm, and now it is much more practical. It has, in fact, been used as part of the winning entry in a million-dollar machine learning competition, which I'll talk about in a later video.

The Boltzmann machine learning algorithm is an unsupervised learning algorithm. Unlike the typical use of backpropagation, where we have an input vector and provide a desired output, in Boltzmann machine learning we just give it the input vector. There are no labels. What the algorithm is trying to do is build a model of a set of input vectors, though it might be better to think of them as output vectors.

What we want to do is maximize the product of the probabilities that the Boltzmann machine assigns to a set of binary vectors: the ones in the training set. This is equivalent to maximizing the sum of the log probabilities that the Boltzmann machine assigns to the training vectors. It's also equivalent to maximizing the probability that we'd obtain exactly the N training cases if we ran the Boltzmann machine in the following way: we let it settle to its stationary distribution with no external input and sample the visible vector once; then we let it settle again and sample the visible vector again; and so on, N times.

Now, here is probably the most important reason why the learning could be difficult. Consider a chain of hidden units, with a visible unit attached to each end, and a training set that consists of (1, 0) and (0, 1). In other words, we want the two visible units to be in opposite states. The way to achieve that is to make sure that the product of all the weights along the chain is negative. If, for example, all the weights are positive, turning on the visible unit at one end will tend to turn on the first hidden unit (via w1), which will tend to turn on the second hidden unit, and so on, until the last hidden unit tends to turn on the other visible unit. If one of those weights is negative, we instead get an anti-correlation between the two visible units.

What this means is that if we're thinking about learning the weight w1, we need to know about the other weights. To know how to change w1, we need information about w3, because if w3 is negative, what we want to do with w1 is the opposite of what we want to do with w1 if w3 is positive. So given that one weight needs to know about other weights in order to change even in the right direction, it's very surprising that there's a very simple learning algorithm, one that only requires local information. It turns out that everything one weight needs to know about all the other weights and about the data is contained in the difference of two correlations.
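To state that precisely, here is the result in symbols. In the notation below, s_i denotes the binary state of unit i, the angle brackets denote expected values at thermal equilibrium, and I'm assuming the usual Boltzmann machine energy function with weights w_ij and biases b_i (the exact form of the energy is an assumption of this write-up, not something spelled out in this section):

\[
E(\mathbf{v},\mathbf{h}) \;=\; -\sum_i s_i\, b_i \;-\; \sum_{i<j} s_i\, s_j\, w_{ij}
\]

\[
\frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} \;=\; \langle s_i s_j \rangle_{\mathbf{v}} \;-\; \langle s_i s_j \rangle_{\text{model}}
\]

\[
\Delta w_{ij} \;\propto\; \langle s_i s_j \rangle_{\text{data}} \;-\; \langle s_i s_j \rangle_{\text{model}}
\]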
In words: if you take the log probability that the Boltzmann machine assigns to a visible vector v, and ask about the derivative of that log probability with respect to a weight w_ij, it's the difference of two expected values of the product of the states of units i and j. That is, how often are i and j on together when v is clamped on the visible units and the network is at thermal equilibrium, minus the same quantity when v is not clamped on the visible units. Because the derivative of the log probability of a visible vector is this simple difference of correlations, we can make the change in the weight proportional to the expected product of the two activities, averaged over all visible vectors in the training set (that's the term we call "data"), minus the product of the same two activities when you're not clamping anything and the network has reached thermal equilibrium with no external interference.

So this is a very interesting learning rule. The first term says: raise the weights in proportion to the product of the activities the units have when you're presenting data. That's the simplest form of what's known as a Hebbian learning rule. Donald Hebb, a long time ago, in the 1940s, suggested that synapses in the brain might use a rule like that. But if you use only that rule, the synapse strengths keep getting stronger, the weights all become very positive, and the whole system blows up. You have to somehow keep things under control, and this learning algorithm keeps things under control with the second term: it reduces the weights in proportion to how often the two units are on together when you're sampling from the model's distribution. You can also think of the first term as being like the storage term for a Hopfield net, and the second term as being like the term for getting rid of spurious minima. In fact, that's the correct way to think about it, and this rule tells you exactly how much unlearning to do.

One obvious question is: why is the derivative so simple? Well, the probability of a global configuration at thermal equilibrium, that is, once you've let the network settle down, is an exponential function of its energy: the probability is proportional to e to the minus the energy. So when we settle to equilibrium, we get a linear relationship between the log probability and the energy function. Now, the energy function is linear in the weights, so we have a linear relationship between the weights and the log probability. And since we're trying to manipulate log probabilities by manipulating weights, that's a good thing to have: it's a log-linear model. In fact, the relationship is very simple: the derivative of the energy with respect to a particular weight w_ij is just the product of the two activities that the weight connects.

So what's happening here is that the process of settling to thermal equilibrium is propagating information about the weights. We don't need an explicit backpropagation stage. We do need two stages: we need to settle with the data, and we need to settle with no data. But notice that the network is behaving in pretty much the same way in those two phases; a unit deep within the network is doing the same thing, just with different boundary conditions. With backprop, the forward pass and the backward pass are really rather different.

Another question you could ask is: what is that negative phase for? I've already said it's like the unlearning we do in a Hopfield net to get rid of spurious minima, but let's look at it in more detail.
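It helps to have the relevant quantity written out first. Using the same notation as above, with u ranging over all visible vectors and g over all hidden vectors, the probability the machine assigns to a visible vector v is:

\[
p(\mathbf{v}) \;=\; \frac{\displaystyle\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}}{\displaystyle\sum_{\mathbf{u}} \sum_{\mathbf{g}} e^{-E(\mathbf{u},\mathbf{g})}}
\]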
In words: the probability of a visible vector is a sum, over all hidden vectors, of e to the minus the energy of that visible and hidden vector together, normalized by the same quantity summed over all visible vectors as well. Looking at the top line, what the first term in the learning rule is doing is decreasing the energy of terms in that sum that are already large. It finds those terms by settling to thermal equilibrium with the vector v clamped, so that it can find an h that goes nicely with v, that is, gives a nice low energy with v. Having sampled those vectors h, it then changes the weights to make that energy even lower. The second phase of the learning, the negative phase, is doing the same thing, but for the partition function: the normalizing term on the bottom line. It finds global configurations, combinations of visible and hidden states, that give low energy and are therefore large contributors to the partition function. And having found those global configurations, it tries to raise their energy so that they contribute less. So the first term is making the top line big, and the second term is making the bottom line small.

Now, in order to run this learning rule, you need to collect those statistics. You need to collect what we call the positive statistics, the ones you get when data is clamped on the visible units, and also the negative statistics, the ones you get when nothing is clamped, which you're going to use for unlearning. An inefficient way to collect these statistics was suggested by me and Terry Sejnowski in 1983. The idea is this. In the positive phase, you clamp a data vector on the visible units, set the hidden units to random binary states, and then keep updating the hidden units, one unit at a time, until the network reaches thermal equilibrium at a temperature of one. (We actually did that by starting at a high temperature and reducing it, but that's not the main point here.) Once you've reached thermal equilibrium, you sample how often two units are on together; that is, you measure the correlation of i and j with that visible vector clamped. You then repeat that over all the visible vectors, so that the correlation you're sampling is averaged over all the data.

In the negative phase, you don't clamp anything; the network is free from external interference. You set all of the units, both visible and hidden, to random binary states, and then you update the units, one at a time, until the network reaches thermal equilibrium at a temperature of one, just like in the positive phase. Again, you sample the correlation of every pair of units i and j, and you repeat that many times.

Now, it's very difficult to know how many times you need to repeat it. Certainly in the negative phase, you expect the energy landscape to have many different minima that are fairly well separated but have about the same energy. The reason you expect that is that we're going to be using Boltzmann machines to do things like model a set of images, and you expect there to be reasonable images, all of which have about the same energy, and then very unreasonable images, which have much higher energy. So you expect a small fraction of the space to be these low-energy states, and a very large fraction of the space to be the bad, high-energy states. If you have multiple modes, it's very unclear how many times you need to repeat this process to be able to sample all those modes.
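To make the two phases concrete, here is a minimal sketch in Python/NumPy of collecting the positive and negative statistics by Gibbs sampling in a fully connected Boltzmann machine. This is an illustration under assumptions, not the original 1983 procedure: the fixed update counts (n_steps, n_chains) stand in for "reaching thermal equilibrium", a single sample is taken per chain rather than a long average, and all names and sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_update(s, W, b, units):
    # Update the listed units one at a time at temperature 1:
    # p(s_i = 1 | all other units) = sigmoid(total input to unit i).
    for i in units:
        p_on = sigmoid(W[i] @ s + b[i])
        s[i] = 1.0 if rng.random() < p_on else 0.0

def positive_statistics(data, W, b, n_vis, n_steps=50):
    # <s_i s_j> with each training vector clamped on the visible units.
    n = W.shape[0]
    stats = np.zeros_like(W)
    for v in data:
        s = np.concatenate([np.asarray(v, float),
                            rng.integers(0, 2, n - n_vis).astype(float)])
        for _ in range(n_steps):                 # crude stand-in for settling
            gibbs_update(s, W, b, range(n_vis, n))  # hidden only; data stays clamped
        stats += np.outer(s, s)                  # one equilibrium sample per vector
    return stats / len(data)

def negative_statistics(W, b, n_vis, n_chains=20, n_steps=50):
    # <s_i s_j> with nothing clamped: samples from the model's own distribution.
    n = W.shape[0]
    stats = np.zeros_like(W)
    for _ in range(n_chains):
        s = rng.integers(0, 2, n).astype(float)  # random start, all units free
        for _ in range(n_steps):
            gibbs_update(s, W, b, range(n))
        stats += np.outer(s, s)
    return stats / n_chains

def learning_step(data, W, b, n_vis, lr=0.1):
    # Raise weights by the data correlations, lower them by the model
    # correlations (the unlearning term). Outer products are symmetric,
    # so W stays symmetric; the diagonal is zeroed (no self-connections).
    dW = lr * (positive_statistics(data, W, b, n_vis)
               - negative_statistics(W, b, n_vis))
    np.fill_diagonal(dW, 0.0)
    W += dW

# Toy usage: 4 visible and 2 hidden units, random binary "data".
n_vis, n_hid = 4, 2
W = np.zeros((n_vis + n_hid, n_vis + n_hid))
b = np.zeros(n_vis + n_hid)
data = rng.integers(0, 2, (10, n_vis)).astype(float)
learning_step(data, W, b, n_vis)
```

The fixed step counts are exactly where the difficulty described above bites: if the model's distribution has many well-separated low-energy modes, no modest number of updates or chains is guaranteed to visit them all, so the negative statistics can be very noisy.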