In this video, I'm going to explain how adding noise can help systems escape from local minima, and I'm going to show what you have to do to the units in a Hopfield net to add noise in the appropriate way. First, I'm going to introduce the idea that we can find better minima by using noise.

A Hopfield net always makes decisions that reduce the energy, or, if the state of the unit doesn't change, leave the energy the same. This makes it impossible to climb out of a local minimum. So, if you look at the landscape here, if we get into the local minimum A, there's no way we're going to get over the energy barrier to reach the better minimum B, because we can't go uphill in energy. If we add random noise, we can escape from poor minima, especially shallow ones, that is, ones that don't have big energy barriers around them.

It turns out, rather than using a fixed noise level, the most effective strategy is to start with a lot of noise, which allows you to explore the space on a coarse scale and find the generally good regions, and then to decrease the noise level. With a lot of noise, you can cross big barriers. As you decrease the noise level, you start concentrating on the best nearby minima. If you slowly reduce the noise so that the system ends up in a deep minimum, that's called simulated annealing. This idea was put forward by Kirkpatrick at around the same time as Hopfield nets were proposed.

The reason simulated annealing works is that the temperature, in a physical system or in a simulated system with an energy function, affects the transition probabilities. In a high-temperature system, the probability of going uphill from B to A is lower than the probability of going downhill from A to B, but it's not much lower. In effect, the temperature flattens the energy landscape. The little black dots are meant to be particles, and what we are imagining is particles moving about according to the transition probabilities that you get with an energy function and a temperature. This might be a typical distribution if you run the system at a high temperature, where it's easier to cross barriers, but it's also hard to stay in a deep minimum once you've got there. If you run the system at a much lower temperature, then the probability of crossing barriers gets much smaller, but the ratio gets much better. That is, the ratio of the probability of going from A to B versus the probability of going from B to A is much better in the low-temperature system. So, if we ran it long enough, we would expect all of the particles to end up in B. But if we just run it for a long time at a low temperature, it will take a very long time for particles to escape from A. It turns out a good compromise is to start at a high temperature and gradually reduce the temperature.

The way we get noise into a Hopfield net is to replace the binary threshold units by binary stochastic units that make biased random decisions. The amount of noise is controlled by something called temperature, which you'll see in a minute in the equation. Raising the noise level is equivalent to decreasing all the energy gaps between configurations. So, this is our normal logistic equation, but with the energy gap scaled by a temperature. If the temperature is very high, that exponent will be roughly zero, so the right-hand side will be one over one plus one, and so the probability of the unit turning on will be about a half. It will be in its on and off states more or less equally often.
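To pin down the equation being described (the slide itself isn't reproduced in this transcript, so the symbols ΔE_i for the energy gap of unit i and T for the temperature are my own labels), the binary stochastic unit turns on with probability

```latex
p(s_i = 1) \;=\; \frac{1}{1 + e^{-\Delta E_i / T}}
```

When T is very large, the exponent is close to zero and the probability is close to one half, exactly as described above.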
As we lower the temperature, depending on the sign of delta E, the unit will become either more and more firmly on or more and more firmly off. At zero temperature, which is what we use in a Hopfield net, the sign of delta E determines whether the exponent goes to minus infinity or plus infinity, so the right-hand side will be either zero or one. The unit will behave deterministically, and that's a binary threshold unit: it will always adopt whichever of the two states has the lower energy. The energy gap, which we saw on a previous slide, is just the difference in the energy of the whole system depending on whether unit i is off or unit i is on.

Although simulated annealing is a very powerful method for improving searches that get stuck in local optima, and although it was influential in leading Terry Sejnowski and me to the ideas behind Boltzmann machines, it's actually a big distraction from understanding Boltzmann machines. So, I'm not going to talk about it anymore in this course, even though it's a very interesting idea. From now on, I'm going to use binary stochastic units that have a temperature of one, that is, the standard logistic function of the energy gap.

One concept that you need to understand in order to understand the learning procedure for Boltzmann machines is the concept of thermal equilibrium. And because we're setting the temperature to one, this is the concept of thermal equilibrium at a fixed temperature. It's a difficult concept. Most people think it means the system has settled down and isn't changing anymore; that's normally what equilibrium means. But it's not the states of the individual units that settle down. The individual units are still rattling around at thermal equilibrium, unless the temperature is zero. The thing that settles down is the probability distribution over configurations. That's a difficult concept the first time you meet it, so I'm going to give you an example. The probability distribution settles to a particular distribution called the stationary distribution. The stationary distribution is determined by the energy function of the system, and, in fact, in the stationary distribution, the probability of any configuration is proportional to e to the minus its energy.

A nice intuitive way to think about thermal equilibrium is to imagine a huge ensemble of identical systems that all have exactly the same energy function. So, imagine a very large number of stochastic Hopfield nets, all with the same weights. In that huge ensemble, we can define the probability of a configuration as the fraction of the systems that are in that configuration. Now we can understand what's happening as we approach thermal equilibrium. We can start with any distribution we like over all these identical systems. We could make them all be in the same configuration; that's the distribution with a probability of one on one configuration and zero on everything else. Or we could start them off with an equal number of systems in each possible configuration; that's a uniform distribution. Then, we keep applying our stochastic update rule, which, in the case of a stochastic Hopfield net, means you pick a unit, you look at its energy gap, and you make a random decision based on that energy gap about whether to turn it on or turn it off. Then you go and pick another unit, and so on. We keep applying that stochastic rule.
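As a concrete illustration of the update rule just described, here is a minimal sketch of one stochastic update, written as a small Python function. The function name, the 0/1 state convention, and the sign convention for the energy gap (ΔE_i = b_i + Σ_j w_ij s_j, i.e. the energy with unit i off minus the energy with it on) are my assumptions, not something given in the lecture.

```python
import numpy as np

def stochastic_update(s, W, b, T=1.0, rng=np.random.default_rng()):
    """Pick one unit at random and make a biased random decision about its state.

    s : binary (0/1) state vector, modified in place
    W : symmetric weight matrix with zero diagonal
    b : bias vector
    T : temperature (T = 1 gives the standard logistic; T -> 0 recovers the
        deterministic binary threshold rule)
    """
    i = rng.integers(len(s))
    # Energy gap: global energy with unit i off minus global energy with it on.
    delta_E = b[i] + W[i] @ s
    p_on = 1.0 / (1.0 + np.exp(-delta_E / T))
    s[i] = 1 if rng.random() < p_on else 0
    return s
```

Repeatedly calling this at T = 1 is the "keep applying that stochastic rule" step; calling it while slowly lowering T would give a simulated-annealing schedule, but, as noted above, the rest of the course keeps the temperature fixed at one.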
After we've run the systems stochastically in this way, we may eventually reach a situation where the fraction of the systems in each configuration remains constant. In fact, that's what will happen if we have symmetric connections. That's the stationary distribution that physicists call thermal equilibrium. Any given system keeps changing its configuration: as we apply the update rule, the states of its units keep flipping between zero and one. But the fraction of systems in any particular configuration doesn't change, and that's because we have many, many more systems than we have configurations.

Here's an analogy just to help with the concept. Imagine a very large casino in Las Vegas with lots of card dealers; in fact, we have many more than 52 factorial card dealers. We start with all the card packs in the standard order they come in from the manufacturer. Let's suppose that has the ace of spades, then the king of spades, then the queen of spades. Then the dealers all start shuffling, and they do random shuffles; they don't do fancy shuffles that bring the packs back to the same order again. After a few shuffles, there's still a good chance that the king of spades will be next to the queen of spades in any given pack. So the packs have not yet forgotten where they started; their initial order is still influencing their current order. If we keep shuffling, eventually the initial order will be irrelevant. The packs will have forgotten where they started, and, in fact, in this example, there will be an equal number of packs in each of the 52 factorial possible orders. Once this has happened, if we carry on shuffling, there will still be an equal number of packs in each of the 52 factorial orders. That's why it's called equilibrium: the fraction of packs in any one configuration doesn't change, even though the individual packs are still changing. The thing that's wrong with this analogy is that once we've reached equilibrium here, all configurations have equal energy, and so they all have the same probability. In general, we're interested in reaching equilibrium for systems where some configurations have lower energy than others.
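To make the ensemble picture concrete, here is a small sketch, with an invented three-unit network (the weights, biases, and all variable names are illustrative assumptions, not values from the lecture), that runs many identical stochastic Hopfield nets, all started in the same configuration, and checks that the fraction of systems in each of the eight configurations settles close to the Boltzmann probabilities proportional to e to the minus the energy.

```python
import itertools
import numpy as np

# A tiny stochastic Hopfield net (3 units) so all 8 configurations can be enumerated.
# The weights and biases below are arbitrary illustrative values.
rng = np.random.default_rng(0)
W = np.array([[0.0, 2.0, -1.0],
              [2.0, 0.0, 0.5],
              [-1.0, 0.5, 0.0]])   # symmetric, zero diagonal
b = np.array([0.2, -0.3, 0.1])

def energy(s):
    # Hopfield/Boltzmann energy: E = -sum_i b_i s_i - sum_{i<j} w_ij s_i s_j
    return -s @ b - 0.5 * s @ W @ s

# A large ensemble of identical nets, all started in the same configuration (0, 0, 0).
n_systems, n_steps = 20000, 200
S = np.zeros((n_systems, 3), dtype=int)
for _ in range(n_steps):
    i = rng.integers(3, size=n_systems)                    # each system picks one unit
    delta_E = b[i] + np.einsum('kj,kj->k', W[i], S)        # energy gap for that unit
    p_on = 1.0 / (1.0 + np.exp(-delta_E))                  # temperature T = 1
    S[np.arange(n_systems), i] = (rng.random(n_systems) < p_on).astype(int)

# Compare the fraction of systems in each configuration with exp(-E) / Z.
configs = np.array(list(itertools.product([0, 1], repeat=3)))
boltzmann = np.exp(-np.array([energy(c) for c in configs]))
boltzmann /= boltzmann.sum()
empirical = np.bincount(S @ np.array([4, 2, 1]), minlength=8) / n_systems
for c, p_emp, p_eq in zip(configs, empirical, boltzmann):
    print(c, f"fraction={p_emp:.3f}", f"boltzmann={p_eq:.3f}")
```

The point of the printout is the one made above: each individual system keeps flipping its units, but the fraction of the ensemble in each configuration stops changing, and at temperature one that fraction is proportional to e to the minus the configuration's energy, so lower-energy configurations end up holding more of the systems.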