In this video, I'm going to talk about the storage capacity of Hopfield nets. Their ability to store a lot of memories is limited by what are called spurious memories. These occur when two nearby energy minima combine to make a new minimum in the wrong place. Attempts to remove these spurious minima eventually led to a very interesting way of doing learning in things considerably more complicated than a basic Hopfield net. At the end of the video, I'll also talk about a curious historical rediscovery, in which physicists trying to increase the capacity of Hopfield nets rediscovered the perceptron convergence procedure.

So, Hopfield invented Hopfield nets as memory storage devices, and the field became obsessed by the storage capacity of a Hopfield net. Using Hopfield's storage rule for a fully connected net, the capacity is about 0.15N memories. That is, if you have N binary threshold units, the number of memories you can store is about 0.15N before memories start getting confused with one another. So that's the number you can store and still hope to retrieve them sensibly.

Each memory is a random configuration of the N units, so it has N bits of information in it. And so the total information being stored in a Hopfield net is about 0.15N² bits.

This doesn't make efficient use of the bits that are required to store the weights. In other words, if you look at how many bits the computer is using to store the weights, it's using well over 0.15N² bits. And therefore, this kind of distributed memory in local energy minima is not making efficient use of the bits in the computer.

We can analyze how many bits we should be able to store if we were making efficient use of the bits in the computer, that is, of those N² weights and biases in the net. After storing M memories, each connection weight has an integer value in the range −M to M. That's because we increase it by one or decrease it by one each time we store a memory, assuming that we use states of −1 and +1. Now, of course, not all values will be equiprobable, so we could compress that information. But ignoring that, the number of bits it would take us to store a connection weight in a naive way is log₂(2M+1), because that's the number of alternative connection weight values, and we take the log to the base two.
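To make the storage rule and this bit-counting concrete, here's a minimal sketch in Python (the function and variable names are mine, not from the lecture):

```python
import numpy as np

def hopfield_store(patterns):
    """Hopfield's storage rule: add the outer product of each +/-1 pattern
    to the weights. Weights stay integer-valued and symmetric, with no
    self-connections."""
    n = patterns.shape[1]
    W = np.zeros((n, n), dtype=int)
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)
    return W

n, m = 100, 10                                 # n units, m stored memories
rng = np.random.default_rng(0)
patterns = rng.choice([-1, 1], size=(m, n))    # m random binary configurations
W = hopfield_store(patterns)

# After storing m memories, each weight is an integer in [-m, m], i.e. one
# of 2m + 1 possible values, so a naive code needs log2(2m + 1) bits each.
assert W.min() >= -m and W.max() <= m
print(np.log2(2 * m + 1))                      # bits per weight, uncompressed
```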
And so, the total number of bits of computer memory that we use is of the order of N² log₂(2M+1). So notice that that scales logarithmically with M, whereas if you store things in the way that Hopfield suggests, you get this constant, 0.15, instead of something that scales logarithmically. So we're not so worried about the fact that the constant is a lot less than two; what we're worried about is this logarithmic scaling. That shows we ought to be able to do something better.

If we ask what limits the capacity of a Hopfield net, what is it that causes it to break down, then it's the merging of energy minima. Each time we memorize a binary configuration, we hope that we'll create a new energy minimum. So we might have a state space, with all the states of the net being depicted horizontally here and the energy being depicted vertically. And we might have one energy minimum for the blue pattern and another for the green pattern. But if those two patterns are nearby, what will happen is we won't get two separate minima; they'll merge to create one minimum at an intermediate location. And that means we can't distinguish those two separate memories, and indeed we'll recall something that's a blend of them rather than the individual memories. That's what limits the capacity of a Hopfield net: that kind of merging of nearby minima.

One thing I should mention is that this picture is a big misrepresentation. The states of a Hopfield net are really the corners of a hypercube, and it's not very good to show the corners of a hypercube as if they were a continuous one-dimensional horizontal space.

One very interesting idea that came out of thinking about how to improve the capacity of the Hopfield net is the idea of unlearning. This was first suggested by Hopfield, Feinstein and Palmer, who proposed the following strategy: you let the net settle from a random initial state, and then you do unlearning. That is, whatever binary state it settles to, you apply the opposite of the storage rule. I think you can see, with the previous example, that if you let the net settle to that red merged minimum and did some unlearning there, you'd get back the two separate minima, because you'd pull up that red point. So, by getting rid of deep spurious minima, we can actually increase the memory capacity.
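Continuing the earlier sketch, the unlearning step might look like this (the number of rounds and the size of the unlearning step, eps, are arbitrary choices of mine; how much unlearning to do is exactly the question raised later in the video):

```python
import numpy as np

def settle(W, state, n_sweeps=20, rng=None):
    """Let the net settle: sweep over the units in random order, setting each
    binary threshold unit to the state favoured by its total input."""
    rng = rng or np.random.default_rng()
    state = state.copy()
    for _ in range(n_sweeps):
        for i in rng.permutation(len(state)):
            state[i] = 1 if W[i] @ state >= 0 else -1
    return state

def unlearn(W, n_rounds=5, eps=0.1, rng=None):
    """Hopfield, Feinstein and Palmer's unlearning: settle from a random
    initial state, then apply the opposite of the storage rule to whatever
    the net settled to, pulling up spurious minima."""
    rng = rng or np.random.default_rng()
    n = W.shape[0]
    W = W.astype(float)
    for _ in range(n_rounds):
        s = settle(W, rng.choice([-1, 1], size=n), rng=rng)
        W -= eps * np.outer(s, s)              # subtract, rather than add
        np.fill_diagonal(W, 0)
    return W
```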
Hopfield, Feinstein and Palmer showed that this actually worked, but they didn't have a good analysis of what was really going on.

Francis Crick, one of the discoverers of the structure of DNA, and Graeme Mitchison proposed that unlearning might be what's going on during REM sleep, that is, rapid eye movement sleep. So the idea was that during the day you store lots of things, and you get spurious minima. Then at night, you put the network in a random state, you settle to a minimum, and you unlearn what you settled to. And that actually explains a big puzzle. This is a puzzle that doesn't seem to puzzle most people who study sleep, but it ought to. Each night, you go to sleep and you dream for several hours. When you wake up in the morning, those dreams are all gone. Well, they're not quite all gone. The dream you had just before you woke up, you can get into short-term memory and you'll remember it for a while; and if you think about it, you might remember it for a long time. But we know perfectly well that if we'd woken you up at other times in the night, you'd have been having other dreams, and in the morning they're just not there. So it looks like you're simply not storing what you're dreaming about, and the question is, why? In fact, why do you bother to dream at all? Dreaming is paradoxical in that the state of your brain looks extremely like the state of your brain when you're awake, except that it's not being driven by real input. It's being driven by a relay station just after the real input, called the thalamus. So the Crick and Mitchison theory at least explains, functionally, what the point of dreams is: to get rid of the spurious minima.

But there's another problem with unlearning, which is a more mathematical problem, which is: how much unlearning should we do? Now, given what you've seen in the course so far, a real solution to that problem would be to show that unlearning is part of the process of fitting a model to data. And if you do maximum likelihood fitting of that model, then unlearning will automatically come out of fitting the model. And what's more, you'll know exactly how much unlearning to do. So what we're going to try and do is derive unlearning as the right way to minimize a cost function, where the cost function is how well your neural net models the data that you saw during the day.
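Looking ahead, here is a hedged sketch of how that works out. Assuming the model is the kind of energy-based model this course builds toward (a Boltzmann machine), the maximum-likelihood gradient for each weight is a Hebbian term measured on the data minus an identical-looking term measured on states the model generates by itself, and that second term is the unlearning, with its size fixed by the math:

```python
import numpy as np

def max_likelihood_step(W, data_states, model_samples, lr=0.01):
    """One maximum-likelihood gradient step (a sketch, assuming a
    Boltzmann-machine-style model): Hebbian learning on the states you saw
    during the day, minus an equal-sized anti-Hebbian 'unlearning' term on
    states the model settles to on its own at night."""
    positive = np.mean([np.outer(s, s) for s in data_states], axis=0)
    negative = np.mean([np.outer(s, s) for s in model_samples], axis=0)
    W = W + lr * (positive - negative)    # unlearning comes out automatically
    np.fill_diagonal(W, 0)
    return W
```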
Before we get to that, I want to talk a little bit about ways that physicists discovered for increasing the capacity of the Hopfield net. As I said, this was a big obsession with the field. I think it's because physicists really love the idea that math they already know might explain how the brain works. That means postdoctoral fellows in physics who can't get a job in physics might be able to get a job in neuroscience. So there were a very large number of papers published in physics journals about Hopfield nets and their storage capacity. Eventually, a very smart student called Elizabeth Gardner figured out that there's actually a much better storage rule if you're concerned about capacity, one that would use the full capacity of the weights. And I think this storage rule will be familiar to you.

Instead of trying to store vectors in one go, what we're going to do is cycle through the training set many times. So we lose our nice online property that you only have to go through the data once, but in return we gain more efficient storage. What we're going to do is use the perceptron convergence procedure to train each unit to have the correct state given the states of all the other units in the global vector that we want to store. So you take your net, you put it into the memory state you want to store, and then you take each unit separately and say: would this unit adopt the state I want for it, given the states of all the other units? If it would, you leave its incoming weights alone. If it wouldn't, you change its incoming weights in the way specified by the perceptron convergence procedure. And notice, these would be integer changes to the weights. You may have to do this several times, and of course, if you give it too many memories, this won't converge. You only get convergence with the perceptron convergence procedure if there is a set of weights that will solve the problem. But assuming there is, this is a much more efficient way to store memories in a Hopfield net.

This technique was also developed in another field: statistics. Statisticians call the technique pseudo-likelihood. The idea is to get one thing right given all the other things.
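Before going on, here's a minimal sketch of that storage procedure (the function name is mine; this is the plain per-unit version, ignoring for the moment that Hopfield weights are symmetric, which is dealt with next):

```python
import numpy as np

def perceptron_store(patterns, n_sweeps=100):
    """Gardner-style storage, sketched: cycle through the memories many
    times; for each unit, check whether it would adopt its target state
    given all the other units, and if not, apply the perceptron convergence
    update (an integer change) to its incoming weights."""
    n = patterns.shape[1]
    W = np.zeros((n, n), dtype=int)
    for _ in range(n_sweeps):
        stable = True
        for p in patterns:
            for i in range(n):
                if p[i] * (W[i] @ p) <= 0:     # unit i would get it wrong
                    W[i] += p[i] * p           # perceptron update, integer steps
                    W[i, i] = 0                # keep no self-connection
                    stable = False
        if stable:                             # every memory is now stable
            return W
    return W                                   # may not converge: too many memories
```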
So, coming back to pseudo-likelihood: with high-dimensional data, if you want to build a model of it, the idea is you build a model that tries to get the value on one dimension right given the values on all the other dimensions. The main difference between the perceptron convergence procedure as it's normally used and pseudo-likelihood is that, in the Hopfield net, the weights are symmetric, so we have to get two sets of gradients for each weight and average them. But apart from that, the way to use the full capacity of a Hopfield net is to use the perceptron convergence procedure and go through the data several times.
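In code, the symmetric variant might look like this, continuing the sketch above. Each weight w_ij receives a gradient from unit i's prediction and another from unit j's, so each unit's update is applied half to its row and half to its column, which averages the two and keeps the weights symmetric:

```python
import numpy as np

def perceptron_store_symmetric(patterns, n_sweeps=100):
    """Same sketch as before, but respecting the Hopfield net's symmetric
    weights: half of each unit's perceptron update goes to its incoming
    weights and half to the mirror-image outgoing weights."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for _ in range(n_sweeps):
        stable = True
        for p in patterns:
            for i in range(n):
                if p[i] * (W[i] @ p) <= 0:     # unit i in the wrong state
                    delta = (p[i] * p).astype(float)
                    delta[i] = 0.0             # no self-connection
                    W[i] += 0.5 * delta        # half to row i
                    W[:, i] += 0.5 * delta     # half to column i (its mirror)
                    stable = False
        if stable:
            return W
    return W
```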