In this video, I'm going to talk about the storage capacity of Hopfield nets. Their ability to store a lot of memories is limited by what are called spurious memories. These occur when two nearby energy minima combine to make a new minimum in the wrong place. Attempts to remove these spurious minima eventually led to a very interesting way of doing learning in things considerably more complicated than a basic Hopfield net. At the end of the video, I'll also talk about a curious historical rediscovery, in which physicists trying to increase the capacity of Hopfield nets rediscovered the perceptron convergence procedure.

After Hopfield invented Hopfield nets as memory storage devices, the field became obsessed with the storage capacity of a Hopfield net. Using the Hopfield storage rule for a fully connected net, the capacity is about 0.15N memories. That is, if you have N binary threshold units, the number of memories you can store is about 0.15N before memories start getting confused with one another. So that's the number you can store and still hope to retrieve them sensibly. Each memory is a random configuration of the N units, so it has N bits of information in it. And so the total information being stored in a Hopfield net is about 0.15N² bits.

This doesn't make efficient use of the bits that are required to store the weights. In other words, if you look at how many bits the computer is using to store the weights, it's using well over 0.15N² bits, and therefore this kind of distributed memory in local energy minima is not making efficient use of the bits in the computer. We can analyze how many bits we should be able to store if we were making efficient use of the bits in the computer. There are N² weights and biases in the net, and after storing M memories, each connection weight has an integer value in the range -M to M. That's because we increase it by one or decrease it by one each time we store a memory, assuming that we use states of -1 and 1. Now, of course, not all values will be equiprobable, so we could compress that information. But ignoring that, the number of bits it would take us to store a connection weight in a naive way is log₂(2M + 1), because that's the number of alternative connection weights, and we take the log to the base two. And so the total number of bits of computer memory that we use is of the order of N² log₂(2M + 1). Notice that that scales logarithmically with M, whereas if you store things in the way that Hopfield suggests, you get this constant 0.15 instead of something that scales logarithmically. So we're not so worried about the fact that the constant is a lot less than two; what we're worried about is this logarithmic scaling. That shows we ought to be able to do something better.

If we ask what limits the capacity of a Hopfield net, what is it that causes it to break down, then it's the merging of energy minima. Each time we memorize a binary configuration, we hope that we'll create a new energy minimum. So we might have a state space, with all the states of the net depicted horizontally and the energy depicted vertically, and we might have one energy minimum for the blue pattern and another for the green pattern. But if those two patterns are nearby, what will happen is we won't get two separate minima: they'll merge to create one minimum at an intermediate location.
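As a brief concrete aside, here is a minimal sketch in Python with NumPy of the Hopfield storage rule with states of -1 and 1, the energy function whose minima we're talking about, and the naive bit count for the weights. The function names and the choice of N = 100, M = 15 are illustrative assumptions of mine, not anything from the lecture.

    import numpy as np

    def hopfield_store(patterns):
        # Hopfield storage rule: add the outer product of each +/-1 pattern
        # with itself; keep the diagonal (self-connections) at zero.
        n = patterns.shape[1]
        W = np.zeros((n, n))
        for p in patterns:
            W += np.outer(p, p)
        np.fill_diagonal(W, 0)
        return W

    def energy(W, s):
        # Hopfield energy E = -1/2 * s^T W s (biases omitted in this sketch).
        return -0.5 * s @ W @ s

    def weight_bits(n, m):
        # Naive storage cost of the weights after m memories: each of the
        # n*n integer weights lies in [-m, m], so it needs log2(2m + 1) bits,
        # about n^2 * log2(2m + 1) bits in total.
        return n * n * np.log2(2 * m + 1)

    rng = np.random.default_rng(0)
    n, m = 100, 15                      # m is about 0.15 * n memories
    patterns = rng.choice([-1, 1], size=(m, n))
    W = hopfield_store(patterns)
    print(energy(W, patterns[0]))       # stored patterns sit at low energy
    print(m * n)                        # ~1,500 bits of memories stored
    print(weight_bits(n, m))            # ~49,500 bits spent on the weights

With these illustrative numbers, the net stores about 1,500 bits of memories while spending roughly 49,500 bits on the weights, which is the kind of inefficiency being pointed at here.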
When two nearby minima merge like that, we can't distinguish those two separate memories, and indeed we'll recall something that's a blend of them rather than the individual memories. That's what limits the capacity of a Hopfield net: that kind of merging of nearby minima. One thing I should mention is that this picture is a big misrepresentation. The states of a Hopfield net are really the corners of a hypercube, and it's not very good to show the corners of a hypercube as if they were a continuous one-dimensional horizontal space.

One very interesting idea that came out of thinking about how to improve the capacity of the Hopfield net is the idea of unlearning. This was first suggested by Hopfield, Feinstein and Palmer, who proposed the following strategy: you let the net settle from a random initial state, and then you do unlearning. That is, whatever binary state it settles to, you apply the opposite of the storage rule. I think you can see, with the previous example, that if you let the net settle to that red merged minimum and did some unlearning there, you'd get back the two separate minima, because you'd pull up that red point. So by getting rid of deep spurious minima, we can actually increase the memory capacity. Hopfield, Feinstein and Palmer showed that this actually worked, but they didn't have a good analysis of what was really going on.

Francis Crick, one of the discoverers of the structure of DNA, and Graeme Mitchison proposed that unlearning might be what's going on during REM sleep, that is, rapid eye movement sleep. The idea was that during the day you store lots of things and you get spurious minima. Then at night, you put the network in a random state, you let it settle to a minimum, and you unlearn what you settled to. And that actually explains a big puzzle. This is a puzzle that doesn't seem to puzzle most people who study sleep, but it ought to. Each night you go to sleep and you dream for several hours. When you wake up in the morning, those dreams are all gone. Well, they're not quite all gone. The dream you had just before you woke up you can get into short-term memory, and you'll remember it for a while; if you think about it, you might remember it for a long time. But we know perfectly well that if we'd woken you up at other times in the night, you'd have been having other dreams, and in the morning they're just not there. So it looks like you're simply not storing what you're dreaming about, and the question is, why? In fact, why do you bother to dream at all? Dreaming is paradoxical in that the state of your brain looks extremely like the state of your brain when you're awake, except that it's not being driven by real input; it's being driven by a relay station just after the real input, called the thalamus. So the Crick and Mitchison theory at least explains, functionally, what the point of dreams is: to get rid of the spurious minima.

But there's another, more mathematical problem with unlearning, which is: how much unlearning should we do? Given what you've seen in the course so far, the real solution to that problem would be to show that unlearning is part of the process of fitting a model to data. If you do maximum likelihood fitting of that model, then unlearning will automatically come out of fitting the model, and what's more, you'll know exactly how much unlearning to do. So what we're going to try to do is derive unlearning as the right way to minimize a cost function, where the cost function is how well your neural net models the data that you saw during the day.
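As a rough sketch of the strategy just described (not Hopfield, Feinstein and Palmer's exact procedure), the following code lets the net settle from a random state with asynchronous binary threshold updates and then applies the opposite of the storage rule to whatever state it settled to. The unlearning rate and the number of settling sweeps are arbitrary choices of mine; how much unlearning to do is exactly the open question raised above.

    import numpy as np

    def settle(W, state, rng, n_sweeps=20):
        # Let the net settle: repeated asynchronous binary threshold updates.
        n = len(state)
        for _ in range(n_sweeps):
            for i in rng.permutation(n):
                state[i] = 1 if W[i] @ state >= 0 else -1
        return state

    def unlearn(W, rng, rate=0.01):
        # Unlearning: start from a random state, settle to an energy minimum,
        # then apply the opposite of the storage rule to that minimum.
        n = W.shape[0]
        state = rng.choice([-1, 1], size=n)
        state = settle(W, state, rng)
        W = W - rate * np.outer(state, state)   # opposite sign to storage
        np.fill_diagonal(W, 0)
        return W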
Before we get to that, I want to talk a little bit about ways that physicists discovered for increasing the capacity of the Hopfield net. As I said, this was a big obsession of the field. I think it's because physicists really love the idea that math they already know might explain how the brain works. That means postdoctoral fellows in physics who can't get a job in physics might be able to get a job in neuroscience. So there were a very large number of papers published in physics journals about Hopfield nets and their storage capacity.

Eventually, a very smart student called Elizabeth Gardner figured out that there's actually a much better storage rule if you're concerned about capacity, one that would use the full capacity of the weights. And I think this storage rule will be familiar to you. Instead of trying to store vectors in one go, what we're going to do is cycle through the training set many times. So we lose our nice online property, that you only have to go through the data once, but in return we gain more efficient storage. We're going to use the perceptron convergence procedure to train each unit to have the correct state given the states of all the other units in the global vector that we want to store. So you take your net, you put it into the memory state you want to store, and then you take each unit separately and ask: would this unit adopt the state I want for it, given the states of all the other units? If it would, you leave its incoming weights alone. If it wouldn't, you change its incoming weights in the way specified by the perceptron convergence procedure. And notice, these would be integer changes to the weights. You may have to do this several times, and of course, if you give it too many memories, this won't converge. You only get convergence with the perceptron convergence procedure if there is a set of weights that will solve the problem. But assuming there is, this is a much more efficient way to store memories in a Hopfield net.

This technique was also developed in another field, statistics, where it's called pseudo-likelihood. The idea is to get one thing right given all the other things. So with high-dimensional data, if you want to build a model of it, the idea is to build a model that tries to get the value on one dimension right given the values on all the other dimensions. The main difference between the perceptron convergence procedure as it's normally used and pseudo-likelihood is that in the Hopfield net the weights are symmetric, so we have to get two sets of gradients for each weight and average them. But apart from that, the way to use the full capacity of a Hopfield net is to use the perceptron convergence procedure and to go through the data several times.
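Here is a minimal sketch of the kind of storage rule being described: cycle through the patterns many times and, for each unit, make a perceptron-style update to its incoming weights whenever it would not adopt its target state given all the other units. To keep the weights symmetric, this sketch applies the same integer change from both ends of each connection, which corresponds to the two sets of gradients mentioned above (summed rather than averaged, a harmless rescaling). The stopping criterion and the pass count are my own choices.

    import numpy as np

    def perceptron_store(patterns, n_passes=100):
        # Cycle through the +/-1 patterns; for each unit, check whether it
        # would adopt its target state given the states of all the others.
        # If not, make an integer perceptron-style change to its incoming
        # weights, applied symmetrically so W stays symmetric.
        m, n = patterns.shape
        W = np.zeros((n, n))
        for _ in range(n_passes):
            errors = 0
            for p in patterns:
                for i in range(n):
                    if p[i] * (W[i] @ p) <= 0:      # unit i is not stable
                        errors += 1
                        W[i, :] += p[i] * p
                        W[:, i] += p[i] * p
                        W[i, i] = 0                 # no self-connection
            if errors == 0:   # every unit stable for every pattern: converged
                break
        return W

If no set of weights can make every unit stable for every pattern, the loop simply runs out of passes without converging, which matches the caveat above about giving the net too many memories.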