In the previous video, I showed how a Boltzmann machine can be used as a probabilistic model of a set of binary data vectors. In this video we're finally going to get around to the Boltzmann machine learning algorithm. This is a very simple learning algorithm with an elegant theoretical justification, but it turned out that in practice it was extremely slow and noisy, and just wasn't practical. For many years, people thought that Boltzmann machines would never be practical devices. Then we found several different ways of greatly speeding up the learning algorithm, and now the algorithm is much more practical. It has, in fact, been used as part of the winning entry for a million-dollar machine learning competition, which I'll talk about in a later video.

The Boltzmann machine learning algorithm is an unsupervised learning algorithm. Unlike the typical use of backpropagation, where we have an input vector and provide it with a desired output, in Boltzmann machine learning we just give it the input vector. There are no labels. What the algorithm is trying to do is build a model of a set of input vectors, though it might be better to think of them as output vectors.

What we want to do is maximize the product of the probabilities that the Boltzmann machine assigns to a set of binary vectors: the ones in the training set. This is equivalent to maximizing the sum of the log probabilities that the Boltzmann machine assigns to the training vectors. It's also equivalent to maximizing the probability that we'd obtain exactly the N training cases if we ran the Boltzmann machine in the following way: first, we let it settle to its stationary distribution, N different times, with no external input; then we sample the visible vector once; then we let it settle again and sample the visible vector again, and so on.

Now, here is the main reason why the learning could be difficult, probably the most important reason. Consider a chain of units: a chain of hidden units, with visible units attached to the two ends. Suppose we use a training set that consists of (1, 0) and (0, 1); in other words, we want the two visible units to be in opposite states. Then the way to achieve that is by making sure that the product of all the weights along the chain is negative.
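To see why the sign of the product matters, here is one way to make the claim precise. This is an aside, not derived in the lecture: it uses the standard result for a one-dimensional chain of symmetric ±1 units with no biases, a slight simplification of the binary 0/1 units on the slide. With weights w_1, ..., w_5 along the chain and nothing clamped, the equilibrium correlation between the two end units factorizes as

\[ \langle v_1 v_2 \rangle = \prod_{k=1}^{5} \tanh(w_k) \]

so the two visible units are anti-correlated exactly when the product of the weights is negative, and the right direction in which to change w_1 depends on the signs of the other weights.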
So, for example, if all of the weights are positive, turning on the first visible unit will tend to turn on the first hidden unit, and that will tend to turn on the second hidden unit, and so on, and the fourth hidden unit will tend to turn on the other visible unit. If one of those weights is negative, then we'll get an anti-correlation between the two visible units.

What this means is that if we're thinking about learning weight w1, we need to know about other weights. So there's w1; to know how to change that weight, we need to know w3. We need to have information about w3, because if w3 is negative, what we want to do with w1 is the opposite of what we want to do with w1 if w3 is positive.

So given that one weight needs to know about other weights in order to change even in the right direction, it's very surprising that there's a very simple learning algorithm, and that the learning algorithm only requires local information. It turns out that everything one weight needs to know about all the other weights and about the data is contained in the difference of two correlations. Another way of saying that is: if you take the log probability that the Boltzmann machine assigns to a visible vector v, and ask about the derivative of that log probability with respect to a weight w_ij, it's the expected value of the product of the states of units i and j when the network has settled to thermal equilibrium with v clamped on the visible units (that is, how often i and j are on together when v is clamped on the visible units and the network is at thermal equilibrium), minus the same quantity when v is not clamped on the visible units.

Because the derivative of the log probability of a visible vector is this simple difference of correlations, we can make the change in the weight proportional to the expected product of the activities, averaged over all visible vectors in the training set (that's the data term), minus the product of the same two activities when you're not clamping anything and the network has reached thermal equilibrium with no external interference.

So this is a very interesting learning rule. The first term in the learning rule says: raise the weights in proportion to the product of the activities the units have when you're presenting data. That's the simplest form of what's known as a Hebbian learning rule.
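Written out in the usual notation, the rule just described is

\[ \frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} = \langle s_i s_j \rangle_{\mathbf{v}} - \langle s_i s_j \rangle_{\text{model}}, \qquad \Delta w_{ij} \propto \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}} \]

where the angle brackets denote expectations at thermal equilibrium: the first is measured with v clamped on the visible units, the data term averages that quantity over all training vectors, and the model term is measured with nothing clamped.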
Donald Hebb, a long time ago, in the 1940s or 1950s, suggested that synapses in the brain might use a rule like that. But if you just use that rule, the synapse strengths will keep getting stronger, the weights will all become very positive, and the whole system will blow up. You have to keep things under control somehow, and this learning algorithm keeps things under control by using that second term: it reduces the weights in proportion to how often those two units are on together when you're sampling from the model's distribution.

You can also think of it this way: the first term is like the storage term for a Hopfield net, and the second term is like the term for getting rid of spurious minima. In fact, this is the correct way to think about it, and this rule tells you exactly how much unlearning to do.

One obvious question is: why is the derivative so simple? Well, the probability of a global configuration at thermal equilibrium, that is, once you've let the network settle down, is an exponential function of its energy; the probability is related to e to the minus energy. So when we settle to equilibrium, we get a linear relationship between the log probability and the energy function. Now, the energy function is linear in the weights, so we have a linear relationship between the weights and the log probability. And since we're trying to manipulate log probabilities by manipulating weights, that's a good thing to have: it's a log-linear model. In fact, the relationship is very simple: the derivative of the energy with respect to a particular weight w_ij is just the product of the two activities that the weight connects.

So what's happening here is that the process of settling to thermal equilibrium is propagating information about the weights. We don't need an explicit backpropagation stage. We do need two stages: we need to settle with the data, and we need to settle with no data. But notice that the network behaves in pretty much the same way in those two phases; a unit deep within the network is doing the same thing, just with different boundary conditions. With backprop, the forward pass and the backward pass are really rather different.

Another question you could ask is: what is that negative phase for? I've already said it's like the unlearning we do in a Hopfield net to get rid of spurious minima.
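For reference, here are the two equations behind these points, in standard Boltzmann machine notation (s_i is the binary state of unit i; the bias terms b_i are included for completeness even though the lecture doesn't mention them here). The energy is linear in the weights, and its derivative with respect to one weight is just minus the product of the two activities that weight connects:

\[ E(\mathbf{v},\mathbf{h}) = -\sum_{i<j} s_i s_j w_{ij} - \sum_i s_i b_i, \qquad \frac{\partial E}{\partial w_{ij}} = -\, s_i s_j \]

and the Boltzmann distribution it defines, including the probability of a visible vector that the lecture is about to examine, is

\[ p(\mathbf{v},\mathbf{h}) = \frac{e^{-E(\mathbf{v},\mathbf{h})}}{\sum_{\mathbf{u},\mathbf{g}} e^{-E(\mathbf{u},\mathbf{g})}}, \qquad p(\mathbf{v}) = \frac{\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}}{\sum_{\mathbf{u},\mathbf{g}} e^{-E(\mathbf{u},\mathbf{g})}} \]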
But let's look at it in more detail. The equation for the probability of a visible vector is that it's a sum, over all hidden vectors, of e to the minus the energy of that visible and hidden vector together, normalized by the same quantity summed over all visible vectors.

So if you look at the top term, what the first term in the learning rule is doing is decreasing the energy of terms in that sum that are already large, and it finds those terms by settling to thermal equilibrium with the vector v clamped, so that it can find an h that goes nicely with v, that is, gives a nice low energy with v. Having sampled those vectors h, it then changes the weights to make that energy even lower.

The second phase in the learning, the negative phase, is doing the same thing, but for the partition function, that is, the normalizing term on the bottom line. It's finding global configurations, combinations of visible and hidden states, that give low energy and are therefore large contributors to the partition function. And having found those global configurations, it tries to raise their energy so that they contribute less. So the first term is making the top line big, and the second term is making the bottom line small.

Now, in order to run this learning rule, you need to collect those statistics. You need to collect what we call the positive statistics, the ones you get when you have data clamped on the visible units, and also the negative statistics, the ones you get when you don't have data clamped and that you're going to use for unlearning. An inefficient way to collect these statistics was suggested by me and Terry Sejnowski in 1983. The idea is that in the positive phase you clamp a data vector on the visible units, you set the hidden units to random binary states, and then you keep updating the hidden units in the network, one unit at a time, until the network reaches thermal equilibrium at a temperature of one. (We actually did that by starting at a high temperature and reducing it, but that's not the main point here.) Then, once you reach thermal equilibrium, you sample how often two units i and j are on together, so you're measuring the correlation of i and j with that visible vector clamped. You then repeat that over all the visible vectors, so that the correlation you're sampling is averaged over all the data.
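Here is a minimal sketch of that statistics collection in code. This is my own illustration, not the 1983 implementation: it assumes binary 0/1 units, a symmetric weight matrix W with zero diagonal, a bias vector b, temperature one, and a fixed number of update sweeps standing in for reaching thermal equilibrium. The same routine, called with nothing clamped, is reused for the negative phase described next.

```python
import numpy as np

def gibbs_sweep(s, W, b, free_idx, rng):
    """One sweep of single-unit stochastic updates at temperature one."""
    for i in free_idx:
        z = W[i] @ s + b[i]                  # total input to unit i from the other units
        p_on = 1.0 / (1.0 + np.exp(-z))      # logistic probability of turning on
        s[i] = float(rng.random() < p_on)
    return s

def equilibrium_correlations(W, b, clamp=None, n_burn=50, n_samples=20, rng=None):
    """Estimate <s_i s_j> at (approximate) thermal equilibrium.
    `clamp` maps unit indices to fixed binary values (positive phase);
    clamp=None leaves every unit free (negative phase)."""
    rng = rng or np.random.default_rng()
    n = W.shape[0]
    clamp = clamp or {}
    free_idx = [i for i in range(n) if i not in clamp]
    s = rng.integers(0, 2, size=n).astype(float)     # random binary starting state
    for i, value in clamp.items():
        s[i] = value
    for _ in range(n_burn):                          # settle (stand-in for equilibrium)
        gibbs_sweep(s, W, b, free_idx, rng)
    corr = np.zeros((n, n))
    for _ in range(n_samples):                       # then sample how often pairs are on together
        gibbs_sweep(s, W, b, free_idx, rng)
        corr += np.outer(s, s)
    return corr / n_samples
```

For the positive phase you would call this once per training vector, clamping that vector onto the visible units, and average the returned correlation matrices over the training set.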
Then, in the negative phase, you don't clamp anything; the network is free from external interference. So you set all of the units, both visible and hidden, to random binary states, and then you update the units, one at a time, until the network reaches thermal equilibrium at a temperature of one, just like you did in the positive phase. And again, you sample the correlation of every pair of units i and j, and you repeat that many times.

Now, it's very difficult to know how many times you need to repeat it, but certainly in the negative phase you expect the energy landscape to have many different minima that are fairly separated and have about the same energy. The reason you expect that is that we're going to be using Boltzmann machines to do things like model a set of images, and you expect there to be reasonable images, all of which have about the same energy, and then very unreasonable images, which have much higher energy. So you expect a small fraction of the space to be these low-energy states, and a very large fraction of the space to be these bad high-energy states.

If you have multiple modes, it's very unclear how many times you need to repeat this process to be able to sample those modes.
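Putting the two phases together, one step of the learning rule might look like the sketch below, continuing the code above. The names `data`, `visible_idx`, and the learning rate `epsilon` are illustrative, and in practice the negative phase would need many separate runs to stand a chance of visiting the different low-energy modes just described.

```python
def boltzmann_learning_step(W, b, data, visible_idx, epsilon=0.01, rng=None):
    """One weight update: epsilon * (<s_i s_j>_data - <s_i s_j>_model)."""
    rng = rng or np.random.default_rng()
    n = W.shape[0]
    # Positive phase: average the clamped equilibrium correlations over the training set.
    pos = np.zeros((n, n))
    for v in data:
        clamp = {unit: v[k] for k, unit in enumerate(visible_idx)}
        pos += equilibrium_correlations(W, b, clamp=clamp, rng=rng)
    pos /= len(data)
    # Negative phase: equilibrium correlations with nothing clamped (the unlearning term).
    neg = equilibrium_correlations(W, b, clamp=None, rng=rng)
    dW = epsilon * (pos - neg)
    np.fill_diagonal(dW, 0.0)      # no self-connections
    W += dW                        # dW is symmetric, so W stays symmetric
    return W
```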