1 00:00:00,000 --> 00:00:03,530 In this video, I'm going to introduce Hopfield Nets. 2 00:00:03,530 --> 00:00:09,783 Together with back propagation, these were one of the main reasons for the resurgence 3 00:00:09,783 --> 00:00:13,020 of interest in neural networks in the 1980s. 4 00:00:13,460 --> 00:00:18,799 Hopfield networks are beautifully simple devices that can be used for storing 5 00:00:18,799 --> 00:00:21,880 memories as distributed patterns of activity. 6 00:00:22,980 --> 00:00:27,517 We are now going to learn about a different kind of model from a feet 7 00:00:27,517 --> 00:00:31,721 forward neural net. These are sometimes called energy based 8 00:00:31,721 --> 00:00:36,520 models because their properties derive from a global energy function. 9 00:00:38,320 --> 00:00:43,339 So, a Hopfield net is one of the simplest kinds of energy-based model. 10 00:00:43,339 --> 00:00:49,160 It's composed of binary threshold units with recurrent connections between them. 11 00:00:50,420 --> 00:00:55,030 In general, if you have networks of non-linear units with recurrent 12 00:00:55,030 --> 00:01:00,535 connections, they're very hard to analyze. They can behave in many different ways. 13 00:01:00,535 --> 00:01:04,182 They can settle to a stable state. They can oscillate. 14 00:01:04,182 --> 00:01:09,962 They can even be chaotic which means that, unless you know their starting state with 15 00:01:09,962 --> 00:01:15,398 infinite precision, you can't predict the state they'll be in very far into the 16 00:01:15,398 --> 00:01:20,378 future. Fortunately, John Hopfield and various 17 00:01:20,378 --> 00:01:25,806 other groups, like Stephen Grossberg's group, realized that if the connections 18 00:01:25,806 --> 00:01:29,120 are symmetric, there's a global energy function. 19 00:01:30,340 --> 00:01:34,170 Each binary configuration of the whole network has an energy. 20 00:01:34,170 --> 00:01:39,508 And so what I mean by binary configuration is, an assignment of binary values to each 21 00:01:39,508 --> 00:01:43,401 neuron in the network. So, every neuron has a particular binary 22 00:01:43,401 --> 00:01:49,833 value in that configuration. The thing that Hopfield realized is that, 23 00:01:49,833 --> 00:01:54,882 if you set up the right energy function for binary threshold decision rule, is 24 00:01:54,882 --> 00:02:00,061 actually causing the network, to go down hill in energy, and if you keep applying 25 00:02:00,061 --> 00:02:02,780 that rule, it'll end up in a energy minima. 26 00:02:05,600 --> 00:02:10,602 So, everything's controlled by the energy function. The global energy of a 27 00:02:10,602 --> 00:02:15,674 configuration is the sum of a number of local contributions, and the main 28 00:02:15,674 --> 00:02:21,511 contributions have the form of the product of one connection weight, with the binary 29 00:02:21,511 --> 00:02:26,720 states of two nuerons. So, the energy function looks like this. 30 00:02:27,280 --> 00:02:32,276 Energy is bad, so low energy is good. And that's what those minus signs are 31 00:02:32,276 --> 00:02:36,058 doing in there. If you look at the main term here, it has 32 00:02:36,058 --> 00:02:40,920 a weight which is the symmetric connection strength between two neurons. 33 00:02:41,260 --> 00:02:45,557 And it has the activities of the two connected neurons. 34 00:02:45,557 --> 00:02:50,088 So, Si is a binary variable that has values of one or zero. 35 00:02:50,088 --> 00:02:55,480 Or in another kind of Hopfield net, it has values of one or minus one. 36 00:02:56,220 --> 00:03:02,435 In addition to that quadratic term that involves the states of two units, there's 37 00:03:02,435 --> 00:03:07,500 also a bias term that only involves the state of individual units. 38 00:03:10,040 --> 00:03:15,601 The quadratic energy function makes it possible for each unit to compute locally 39 00:03:15,601 --> 00:03:19,240 how changing its state will change the global energy. 40 00:03:21,380 --> 00:03:24,728 So, we first need to define the energy gap. 41 00:03:24,728 --> 00:03:30,628 The energy gap for a unit i is the difference in the global energy of the 42 00:03:30,628 --> 00:03:35,093 whole configuration depending on whether or not i is on. 43 00:03:35,093 --> 00:03:41,790 So, the energy gap can be actually defined as the difference between the energy when 44 00:03:41,790 --> 00:03:47,832 i is off and the energy when i is on. And that difference is what is just what 45 00:03:47,832 --> 00:03:51,400 is being computed by the binary threshold decision rule. 46 00:03:52,160 --> 00:03:56,668 So, if you look at the equation for the energy and you differentiate it with a 47 00:03:56,668 --> 00:04:01,408 respect to the state of the i-th unit, it's a funny thing to do cuz it's a binary 48 00:04:01,408 --> 00:04:05,859 variable. But, if you differentiate it, you'll see you get the binary threshold 49 00:04:05,859 --> 00:04:10,600 decision rule, but without the minus sign cuz that's for going down hill in energy. 50 00:04:13,780 --> 00:04:19,650 So, by following the binary threshold decision rule, a Hopfield net will go 51 00:04:19,650 --> 00:04:24,453 downhill in its global energy. One way to find an energy minimum in a 52 00:04:24,453 --> 00:04:29,488 Hopfield net is to start from a random state and then update the units one at a 53 00:04:29,488 --> 00:04:32,968 time in random order. So, we're doing a sequential update. 54 00:04:32,968 --> 00:04:37,816 And for each unit that you pick, you compute whichever of its two states gives 55 00:04:37,816 --> 00:04:42,975 the lowest global energy and you put it in that state independent of what state it 56 00:04:42,975 --> 00:04:46,704 was previously in. That's equivalent to saying you just used 57 00:04:46,704 --> 00:04:52,003 the binary threshold decision rule. So, let's look at a little example for the 58 00:04:52,003 --> 00:04:55,700 net on the right. We'll start with a random global state. 59 00:04:56,140 --> 00:04:58,953 This was a carefully selected random state, 60 00:04:58,953 --> 00:05:02,552 And that has an energy of -three, or a goodness of three. 61 00:05:02,552 --> 00:05:07,328 It's easier to think about negative energies which are called goodnesses. 62 00:05:07,328 --> 00:05:11,908 There aren't any biases here. So to compute the goodness, you just look 63 00:05:11,908 --> 00:05:16,423 at all pairs of units that are on and add in the weight between them. 64 00:05:16,423 --> 00:05:21,068 And in this configuration, there's only one pair of units that's active. 65 00:05:21,068 --> 00:05:25,060 And that has a weight of three, So we get a goodness of three. 66 00:05:25,920 --> 00:05:31,089 Now, let's start probing the units. Let's pick a unit at random, like that 67 00:05:31,089 --> 00:05:34,033 one. And ask, what state should that be in, 68 00:05:34,033 --> 00:05:37,480 given the current states of all the other units? 69 00:05:38,100 --> 00:05:44,531 So, if we look at total input to that, it gets an input of one -four + zero 70 00:05:44,531 --> 00:05:52,282 three, plus another zero three, so it gets a total input of -four. 71 00:05:52,286 --> 00:05:57,860 That's below zero, so we turn it off, i.e. It stays in the off state. 72 00:05:58,780 --> 00:06:05,774 And let's probe another unit. If we look at this unit, again, it gets a 73 00:06:05,774 --> 00:06:14,336 total input of one three + -one zero, so it gets a total input of three, so the 74 00:06:14,336 --> 00:06:18,880 binary threshold decision rule will make it turn on. 75 00:06:19,560 --> 00:06:25,737 Let's probe one more unit. This unit's more interesting. 76 00:06:25,737 --> 00:06:32,235 It's getting an input of one two + one -one + zero three + zero , -one, So 77 00:06:32,235 --> 00:06:36,838 that's a total input of one. So, it will now turn on. 78 00:06:36,838 --> 00:06:41,982 Previously it was off. And so, when it turns on, the global 79 00:06:41,982 --> 00:06:47,037 energy changes. We now have a global energy of -four or a 80 00:06:47,037 --> 00:06:51,640 goodness of four, And that's a local energy minimum. 81 00:06:51,640 --> 00:06:56,754 If you now try probing any of the units, you'll see that they don't want to change 82 00:06:56,754 --> 00:07:00,060 their current state. The next is settled to a minimum. 83 00:07:01,940 --> 00:07:06,895 However, the minimum it settled to is not the deepest energy minimum. 84 00:07:06,895 --> 00:07:10,247 It's just one of two minima that this net has. 85 00:07:10,247 --> 00:07:14,182 The deepest energy minimum is shown on the right here, 86 00:07:14,182 --> 00:07:19,428 And it's when the other triangle of units that support each other is on. 87 00:07:19,428 --> 00:07:24,019 That has a goodness of three + three + -one is five. 88 00:07:24,019 --> 00:07:27,080 So, that's a slightly better energy minima. 89 00:07:28,700 --> 00:07:33,656 If you look at that net, you can see the nets composed of these two triangles in 90 00:07:33,656 --> 00:07:38,860 which the units mostly support each other, although there's a bit of disagreement at 91 00:07:38,860 --> 00:07:43,816 the bottom. And each of those triangles mostly hates the other triangle via that 92 00:07:43,816 --> 00:07:49,205 connection at the top. The triangle on the left differs from the 93 00:07:49,205 --> 00:07:53,654 one on the right by having a weight of two, where the other one has a weight of 94 00:07:53,654 --> 00:07:56,188 three. So, the triangle on the right will give 95 00:07:56,188 --> 00:08:03,449 you the deepest minimum. So, if you ask, why did the decisions need 96 00:08:03,449 --> 00:08:08,122 to be sequential in the Hopfield net? The problem is that if units make 97 00:08:08,122 --> 00:08:12,606 simultaneous decisions, they could each think they were using energy but actually 98 00:08:12,606 --> 00:08:17,580 the energy could go up. With simultaneous parallel updating, we 99 00:08:17,580 --> 00:08:21,200 can get oscillations which always have a period of two. 100 00:08:22,240 --> 00:08:27,761 So, here's a little network where the units have biases of +five, and a weight 101 00:08:27,761 --> 00:08:31,830 between them of -100. So when both units are off, the next 102 00:08:31,830 --> 00:08:37,497 parallel step, if we update them both at the same time, will turn both units on 103 00:08:37,497 --> 00:08:42,220 because they each think they cam improve things by the bias term. 104 00:08:42,580 --> 00:08:45,822 But, as soon as you do that, you get this -100, 105 00:08:45,822 --> 00:08:48,805 And so you've actually made things much worse. 106 00:08:48,805 --> 00:08:53,020 So then, in the next parallel step, both units will turn off again. 107 00:08:54,260 --> 00:08:57,971 If we do the updates in parallel but with random timing. 108 00:08:57,971 --> 00:09:02,743 In other words, we don't wait for one update to communicate the state to 109 00:09:02,743 --> 00:09:05,659 everybody before we consider another update, 110 00:09:05,659 --> 00:09:10,961 But we do wait for random lengths of time between doing updates of a given unit. 111 00:09:10,961 --> 00:09:15,800 Then, those random timings will often destroy these bi-phase oscillations. 112 00:09:16,080 --> 00:09:21,206 That means that the idea that the updates have to be sequential isn't quite as bad 113 00:09:21,206 --> 00:09:27,738 as it seems from a biological perspective. Now, what Hopfield suggested was that we 114 00:09:27,738 --> 00:09:33,819 could make use of this kind of energy based model that settles to a minimum of 115 00:09:33,819 --> 00:09:39,428 it's energy for storing memories. So, we had a very influential paper in 116 00:09:39,428 --> 00:09:44,835 1982 that proposed that memories could be energy minima of a neural net with 117 00:09:44,835 --> 00:09:49,771 symmetric weights. The binary threshold decision rule can 118 00:09:49,771 --> 00:09:55,434 then take partial memories, and clean them up into full memories. So, the memory 119 00:09:55,434 --> 00:10:00,734 could be corrupted by part of it being wrong, or part of it could just be 120 00:10:00,734 --> 00:10:04,800 undecided, and we can use the net, to fill out the memory. 121 00:10:06,600 --> 00:10:12,521 The idea of memories as energy minima goes back a long way. The first example I know 122 00:10:12,521 --> 00:10:16,492 of is in a book called Principles of Literary Criticism by I. 123 00:10:16,492 --> 00:10:19,349 A. Richards, where he proposes that memories 124 00:10:19,349 --> 00:10:23,320 are like a large crystal that can sit on different faces. 125 00:10:25,920 --> 00:10:31,237 Using energy minima to represent memories gives a content erasable memory, as 126 00:10:31,237 --> 00:10:35,407 Hopfield realized. So, that you can access an item just by 127 00:10:35,407 --> 00:10:39,070 knowing part of its content. I can tell you a few properties of 128 00:10:39,070 --> 00:10:42,966 something that'll set the states of some of the neurons in the net. 129 00:10:42,966 --> 00:10:47,617 And if you've put the other neurons in random states and now go around applying 130 00:10:47,617 --> 00:10:51,512 the binary threshold rule, With a bit of luck, you'll feel like that 131 00:10:51,512 --> 00:10:54,420 memory to be some stored item that you know about. 132 00:10:55,060 --> 00:11:01,854 When Hopfield nets were proposed in 1982, that was a very interesting property. 133 00:11:01,854 --> 00:11:08,648 1982 was sixteen years before Google, now that we have Google, we regard this as 134 00:11:08,648 --> 00:11:13,370 perfectly obvious. Another property of Hopfield netsg is 135 00:11:13,370 --> 00:11:16,938 biologically interesting, is their robust against hardware damage. 136 00:11:16,938 --> 00:11:21,494 You could remove a few of the units in the netg and unlike the central processor of 137 00:11:21,494 --> 00:11:24,020 your computer, everything will still work fine. 138 00:11:25,380 --> 00:11:29,145 Psychologists have a nice analogy for this kind of memory. 139 00:11:29,145 --> 00:11:34,469 It's like reconstructing a dinosaur from just a few of its bones because you know 140 00:11:34,469 --> 00:11:38,104 something about how the bones are meant to fit together. 141 00:11:38,104 --> 00:11:43,233 So, the weights in the network give you information about how states of neurons 142 00:11:43,233 --> 00:11:46,739 fit together. And now, given the state of a few neurons, 143 00:11:46,739 --> 00:11:50,440 I can fill out the whole state to recover a whole memory. 144 00:11:52,420 --> 00:11:56,860 The storage rule for memories in the Hopfield net is very simple. 145 00:11:57,240 --> 00:12:02,378 The idea is, if we use activities of one and minus one, that we can store a binary 146 00:12:02,378 --> 00:12:07,643 statement by just incrementing the weights between any two units by the product of 147 00:12:07,643 --> 00:12:11,259 their activities. So, it's a very simple rule shown on the 148 00:12:11,259 --> 00:12:16,034 right. One nice thing about this rule, is that 149 00:12:16,034 --> 00:12:20,903 you just go through the data once and you're done. So, it really is the genuine 150 00:12:20,903 --> 00:12:25,834 online rule. that's because it's not error driven. You're not comparing what you 151 00:12:25,834 --> 00:12:30,390 would have predicted with what the right answer is, and then making small 152 00:12:30,390 --> 00:12:34,149 adjustments. The fact that it's not an error correction 153 00:12:34,149 --> 00:12:36,671 rule is both it's strength and it's weakness. 154 00:12:36,671 --> 00:12:41,098 It means it can be online, but as we'll see later, it also means it's not a very 155 00:12:41,098 --> 00:12:46,455 efficient way to store things. We can also have biases, and as usual, we 156 00:12:46,455 --> 00:12:50,080 treat the biases as weights from a permanently on unit. 157 00:12:51,520 --> 00:12:57,160 If you want to use states of zero and one for units, which is what we'll use later, 158 00:12:57,160 --> 00:13:00,600 the update rule is only slightly more complicated.