In this video, I'm going to explain how adding noise can help systems escape from local minima, and I'm going to show what you have to do to the units in a Hopfield net to add noise in the appropriate way. First, I'm going to introduce the idea that we can find better minima by using noise.

A Hopfield net always makes decisions that reduce the energy, or, if the state of the unit doesn't change, leave the energy the same. This makes it impossible to climb out of a local minimum. So, if you look at the landscape here, if we get into the local minimum A, there's no way we're going to get over the energy barrier to reach the better minimum B, because we can't go uphill in energy. If we add random noise, we can escape from poor minima, especially shallow ones, that is, ones that don't have big energy barriers around them.

It turns out, rather than using a fixed noise level, the most effective strategy is to start with a lot of noise, which allows you to explore the space on a coarse scale and find the generally good regions, and then to decrease the noise level. With a lot of noise, you can cross big barriers. As you decrease the noise level, you start concentrating on the best nearby minima. If you slowly reduce the noise so that the system ends up in a deep minimum, that's called simulated annealing. This idea was put forward by Kirkpatrick at around the same time as Hopfield nets were proposed.

The reason simulated annealing works is that the temperature, in a physical system or in a simulated system with an energy function, affects the transition probabilities. In a high-temperature system, the probability of going uphill from B to A is lower than the probability of going downhill from A to B, but it's not much lower. In effect, the temperature flattens the energy landscape. The little black dots are meant to be particles, and what we are imagining is particles moving about according to the transition probabilities that you get with an energy function and a temperature. This might be a typical distribution if you run the system at a high temperature, where it's easier to cross barriers, but it's also hard to stay in a deep minimum once you've got there. If you run the system at a much lower temperature, then the probability of crossing barriers gets much smaller, but the ratio gets much better. That is, the ratio of the probability of going from A to B versus the probability of going from B to A is much better in the low-temperature system. So, if we ran it long enough, we would expect all of the particles to end up in B. But if we just run it for a long time at a low temperature, it will take a very long time for particles to escape from A. It turns out a good compromise is to start at a high temperature and gradually reduce the temperature.

The way we get noise into a Hopfield net is to replace the binary threshold units by binary stochastic units that make biased random decisions. The amount of noise is controlled by something called temperature, which you'll see in a minute in the equation. Raising the noise level is equivalent to decreasing all the energy gaps between configurations. So, this is our normal logistic equation, but with the energy gap scaled by a temperature. If the temperature is very high, that exponent will be roughly zero, so the right-hand side will be one over one plus one, and so the probability of the unit turning on will be about a half. It will be in its on and off states more or less equally often.
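To pin down the equation being described (the slide itself isn't reproduced in this transcript, so the symbols ΔE_i for the energy gap of unit i and T for the temperature are my own labels), the binary stochastic unit turns on with probability

```latex
p(s_i = 1) \;=\; \frac{1}{1 + e^{-\Delta E_i / T}}
```

When T is very large, the exponent is close to zero and the probability is close to one half, exactly as described above.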
As we lower the temperature, depending on the sign of delta E, the unit will become either more and more firmly on or more and more firmly off. At zero temperature, which is what we use in a Hopfield net, the sign of delta E determines whether the exponent goes to minus infinity or plus infinity, so the right-hand side will be either zero or one. The unit will behave deterministically, and that's a binary threshold unit: it will always adopt whichever of the two states has the lower energy. The energy gap, which we saw on a previous slide, is just the difference in the energy of the whole system depending on whether unit i is off or unit i is on.

Although simulated annealing is a very powerful method for improving searches that get stuck in local optima, and although it was influential in leading Terry Sejnowski and me to the ideas behind Boltzmann machines, it's actually a big distraction from understanding Boltzmann machines. So, I'm not going to talk about it anymore in this course, even though it's a very interesting idea. From now on, I'm going to use binary stochastic units that have a temperature of one, that is, the standard logistic function of the energy gap.

One concept that you need to understand in order to understand the learning procedure for Boltzmann machines is the concept of thermal equilibrium. And because we're setting the temperature to one, this is the concept of thermal equilibrium at a fixed temperature. It's a difficult concept. Most people think it means the system has settled down and isn't changing anymore; that's normally what equilibrium means. But it's not the states of the individual units that settle down. The individual units are still rattling around at thermal equilibrium, unless the temperature is zero. The thing that settles down is the probability distribution over configurations. That's a difficult concept the first time you meet it, so I'm going to give you an example. The probability distribution settles to a particular distribution called the stationary distribution. The stationary distribution is determined by the energy function of the system, and, in fact, in the stationary distribution, the probability of any configuration is proportional to e to the minus its energy.

A nice intuitive way to think about thermal equilibrium is to imagine a huge ensemble of identical systems that all have exactly the same energy function. So, imagine a very large number of stochastic Hopfield nets, all with the same weights. In that huge ensemble, we can define the probability of a configuration as the fraction of the systems that are in that configuration. Now we can understand what's happening as we approach thermal equilibrium. We can start with any distribution we like over all these identical systems. We could make them all be in the same configuration; that's the distribution with a probability of one on one configuration and zero on everything else. Or we could start them off with an equal number of systems in each possible configuration; that's a uniform distribution. Then, we keep applying our stochastic update rule, which, in the case of a stochastic Hopfield net, means you pick a unit, you look at its energy gap, and you make a random decision based on that energy gap about whether to turn it on or turn it off. Then you go and pick another unit, and so on. We keep applying that stochastic rule.
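As a concrete illustration of the update rule just described, here is a minimal sketch of one stochastic update, written as a small Python function. The function name, the 0/1 state convention, and the sign convention for the energy gap (ΔE_i = b_i + Σ_j w_ij s_j, i.e. the energy with unit i off minus the energy with it on) are my assumptions, not something given in the lecture.

```python
import numpy as np

def stochastic_update(s, W, b, T=1.0, rng=np.random.default_rng()):
    """Pick one unit at random and make a biased random decision about its state.

    s : binary (0/1) state vector, modified in place
    W : symmetric weight matrix with zero diagonal
    b : bias vector
    T : temperature (T = 1 gives the standard logistic; T -> 0 recovers the
        deterministic binary threshold rule)
    """
    i = rng.integers(len(s))
    # Energy gap: global energy with unit i off minus global energy with it on.
    delta_E = b[i] + W[i] @ s
    p_on = 1.0 / (1.0 + np.exp(-delta_E / T))
    s[i] = 1 if rng.random() < p_on else 0
    return s
```

Repeatedly calling this at T = 1 is the "keep applying that stochastic rule" step; calling it while slowly lowering T would give a simulated-annealing schedule, but, as noted above, the rest of the course keeps the temperature fixed at one.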
After we've run the systems stochastically in this way, we may eventually reach a situation where the fraction of the systems in each configuration remains constant. In fact, that's what will happen if we have symmetric connections. That's the stationary distribution that physicists call thermal equilibrium. Any given system keeps changing its configuration: as we apply the update rule, the states of its units keep flipping between zero and one. But the fraction of systems in any particular configuration doesn't change, and that's because we have many, many more systems than we have configurations.

Here's an analogy just to help with the concept. Imagine a very large casino in Las Vegas with lots of card dealers; in fact, we have many more than 52 factorial card dealers. We start with all the card packs in the standard order they come in from the manufacturer. Let's suppose that has the ace of spades, then the king of spades, then the queen of spades. Then the dealers all start shuffling, and they do random shuffles; they don't do fancy shuffles that bring the packs back to the same order again. After a few shuffles, there's still a good chance that the king of spades will be next to the queen of spades in any given pack. So the packs have not yet forgotten where they started; their initial order is still influencing their current order. If we keep shuffling, eventually the initial order will be irrelevant. The packs will have forgotten where they started, and, in fact, in this example, there will be an equal number of packs in each of the 52 factorial possible orders. Once this has happened, if we carry on shuffling, there will still be an equal number of packs in each of the 52 factorial orders. That's why it's called equilibrium: the fraction of packs in any one configuration doesn't change, even though the individual packs are still changing. The thing that's wrong with this analogy is that once we've reached equilibrium here, all configurations have equal energy, and so they all have the same probability. In general, we're interested in reaching equilibrium for systems where some configurations have lower energy than others.
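To make the ensemble picture concrete, here is a small sketch, with an invented three-unit network (the weights, biases, and all variable names are illustrative assumptions, not values from the lecture), that runs many identical stochastic Hopfield nets, all started in the same configuration, and checks that the fraction of systems in each of the eight configurations settles close to the Boltzmann probabilities proportional to e to the minus the energy.

```python
import itertools
import numpy as np

# A tiny stochastic Hopfield net (3 units) so all 8 configurations can be enumerated.
# The weights and biases below are arbitrary illustrative values.
rng = np.random.default_rng(0)
W = np.array([[0.0, 2.0, -1.0],
              [2.0, 0.0, 0.5],
              [-1.0, 0.5, 0.0]])   # symmetric, zero diagonal
b = np.array([0.2, -0.3, 0.1])

def energy(s):
    # Hopfield/Boltzmann energy: E = -sum_i b_i s_i - sum_{i<j} w_ij s_i s_j
    return -s @ b - 0.5 * s @ W @ s

# A large ensemble of identical nets, all started in the same configuration (0, 0, 0).
n_systems, n_steps = 20000, 200
S = np.zeros((n_systems, 3), dtype=int)
for _ in range(n_steps):
    i = rng.integers(3, size=n_systems)                    # each system picks one unit
    delta_E = b[i] + np.einsum('kj,kj->k', W[i], S)        # energy gap for that unit
    p_on = 1.0 / (1.0 + np.exp(-delta_E))                  # temperature T = 1
    S[np.arange(n_systems), i] = (rng.random(n_systems) < p_on).astype(int)

# Compare the fraction of systems in each configuration with exp(-E) / Z.
configs = np.array(list(itertools.product([0, 1], repeat=3)))
boltzmann = np.exp(-np.array([energy(c) for c in configs]))
boltzmann /= boltzmann.sum()
empirical = np.bincount(S @ np.array([4, 2, 1]), minlength=8) / n_systems
for c, p_emp, p_eq in zip(configs, empirical, boltzmann):
    print(c, f"fraction={p_emp:.3f}", f"boltzmann={p_eq:.3f}")
```

The point of the printout is the one made above: each individual system keeps flipping its units, but the fraction of the ensemble in each configuration stops changing, and at temperature one that fraction is proportional to e to the minus the configuration's energy, so lower-energy configurations end up holding more of the systems.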