In this video, I'm going to explain how a Boltzmann machine models a set of binary data vectors. I'm going to start by explaining why we might want to model a set of binary data vectors, and what we could do with such a model if we had it. And then I'm going to show how the probabilities assigned to binary data vectors are determined by the weights in a Boltzmann machine.

Stochastic Hopfield nets with hidden units, which we also call Boltzmann machines, are good at modelling binary data. So, given a set of binary training vectors, they can use the hidden units to fit a model that assigns a probability to every possible binary vector. There are several reasons why you might like to be able to do that. If, for example, you had several distributions of binary vectors, you might like to look at a new binary vector and decide which distribution it came from. So, you might have different kinds of documents, and you might represent a document by a number of binary features, each of which says whether there is more than zero occurrences of a particular word in that document. For different kinds of documents, you would expect different frequencies for the different words, and maybe you'll see different correlations between words. And so you could use a set of hidden units to model the distribution for each kind of document. And then you could assign a test document to the appropriate class, by seeing which class of document is most likely to have produced that binary vector.

You could also use Boltzmann machines for monitoring complex systems to detect unusual behavior. Suppose, for example, that you have a nuclear power station, and all of the dials are binary. So you get a whole bunch of binary numbers that tell you something about the state of the power station. What you'd like to do is notice that it's in an unusual state, a state that's not like states you've seen before. And you don't want to use supervised learning for that, because you really don't want to have any examples of states that cause it to blow up. You'd rather be able to detect that it's going into such a state without ever having seen such a state before. And you could do that by building a model of the normal states and noticing that this state is different from the normal states.

If you have models of several different distributions,
you can compute the posterior probability that a particular distribution produced the observed data by using Bayes' theorem. So, given the observed data, the probability that it came from model i, under the assumption that it came from one of your models, is the probability that model i would have produced that data, divided by the same quantity summed over all the models.

Now I want to talk about two ways of producing models of data, in particular binary vectors. The most natural way to think about generating a binary vector is to first generate the states of some latent variables, and then use the latent variables to generate the binary vector. So in a causal model, we use two sequential steps. These are the latent variables, or hidden units, and we first pick the states of the latent variables from their prior distribution. Often in a causal model these will be independent in the prior, so their probability of turning on, if they were binary latent variables, would just depend on some bias that each one of them has. Then, once we've picked a state for those, we use them to generate the states of the visible units via the weighted connections of the model. So this is a kind of neural network causal generative model. It uses logistic units, and it uses biases for the hidden units and weights on the connections between hidden and visible units to assign a probability to every possible visible vector. The probability of generating a particular vector v is just the sum, over all possible hidden states, of the probability of generating that hidden state times the probability of generating v given that you've already generated that hidden state. So that's a causal model; factor analysis, for example, is a causal model using continuous variables. And it's probably the most natural way to think about generating data. In fact, some people, when they say generative model, mean a causal model like this.
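As a minimal written-out sketch of the two formulas just described (assuming, as the wording above suggests, equal prior probabilities over the models):

```latex
% Posterior probability that model i produced the observed data,
% under the assumption that it came from one of the models (equal priors assumed):
p(\mathrm{Model}\ i \mid \mathrm{data}) =
  \frac{p(\mathrm{data} \mid \mathrm{Model}\ i)}
       {\sum_{j} p(\mathrm{data} \mid \mathrm{Model}\ j)}

% Probability a causal generative model assigns to a visible vector v:
% first pick a hidden configuration h from its prior, then generate v from h.
p(v) = \sum_{h} p(h)\, p(v \mid h)
```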
But a Boltzmann machine is a completely different kind of model. It's an energy-based model, and in this kind of model you don't generate data causally. It's not a causal generative model. Instead, everything is defined in terms of the energies of joint configurations of visible and hidden units. There are two ways of relating the energy of a joint configuration to its probability. You can simply define the probability of a joint configuration of the visible and hidden variables to be proportional to e to the negative energy of that joint configuration. Or you can define it procedurally, by saying we're going to define the probability as the probability of finding the network in that state after we've been updating all the stochastic binary units for long enough that we've reached thermal equilibrium. The good news is that those two definitions agree.

The energy of a joint configuration of the visible and hidden units has five terms in it. I've written the negative energy to save having to put lots of minus signs. So the negative energy of the joint configuration (v, h), that's with vector v on the visible units and h on the hidden units, has bias terms, where v_i is the binary state of the ith unit in vector v, and b_k is the bias of the kth unit, in this case a hidden unit. So those are the first two terms. Then there are the visible-visible interactions, and to avoid counting each of those interactions twice, we just say we're going to use indices i and j and make sure that i is always less than j. That avoids counting the interaction of something with itself, and also avoids counting pairs twice, so we don't have to put a half in front. Then there are the visible-hidden interactions, where w_ik is a weight on a visible-hidden connection. And then there are the hidden-hidden interactions.

The way we use the energies to define probabilities is that the probability of a joint configuration over v and h is proportional to e to the minus E(v, h). To make that an equality, we need to normalize the right-hand side by the sum over all possible configurations of the visible and hidden units, and that's what the divisor is. That's often called the partition function; that's what physicists call it. And notice it has exponentially many terms. To get the probability of a configuration of the visible units alone, we have to sum over all possible configurations of the hidden units. So P(v) is the sum, over all possible h, of e to the minus the energy you get with that h, normalized by the partition function.
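Written out, the energy and the probabilities just described look like this. This is a reconstruction from the spoken description; the index conventions (i, j for visible units, k, l for hidden units) are assumptions.

```latex
% Negative energy of a joint configuration (v, h): two bias terms, then
% visible-visible, visible-hidden, and hidden-hidden interaction terms.
-E(v, h) = \sum_{i} v_i b_i + \sum_{k} h_k b_k
         + \sum_{i<j} v_i v_j w_{ij}
         + \sum_{i,k} v_i h_k w_{ik}
         + \sum_{k<l} h_k h_l w_{kl}

% Probability of a joint configuration, normalized by the partition function:
p(v, h) = \frac{e^{-E(v, h)}}{\sum_{u, g} e^{-E(u, g)}}

% Probability of a visible configuration alone: sum over the hidden configurations.
p(v) = \frac{\sum_{h} e^{-E(v, h)}}{\sum_{u, g} e^{-E(u, g)}}
```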
I want to give you an example of how we compute the probabilities of the different visible vectors, because that'll give you a good feel for what's involved. It's all very well to see the equations, but I find that I understand it much better when I've worked through the computation. So let's take a network with two hidden units and two visible units. We'll ignore biases, so we've just got three weights here. To keep things simple, I'm not going to connect visible units to each other.

The first thing we do is write down all possible states of the visible units. I've put them in different colors, and I'm going to write each state four times, because for each state of the visible units there are four possible states of the hidden units that could go with it. So that gives us sixteen possible joint configurations. Now, for each of those joint configurations, we compute its negative energy, minus E. So if you look at the first line, where all of the units are on, the negative energy will be +2 - 1 + 1, which is +2. And we do this for all sixteen possible joint configurations. We then take the negative energies and we exponentiate them, and that gives us unnormalized probabilities. So these are the unnormalized probabilities of the configurations; their probabilities are proportional to these. If we add all those up we get 39.7, and then we divide everything by 39.7 to get the probabilities of the joint configurations. There they all are. Now, if we want the probability of a particular visible configuration, we have to sum over all the hidden configurations that could go with it, and so we add up the numbers in each block. And now we've computed the probability of each possible visible vector in a Boltzmann machine that has these three weights in it.
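Here is a short sketch of that enumeration in code. The lecture doesn't say explicitly which connection each of the three weights sits on; the wiring below (v1–h1 = +2, h1–h2 = −1, v2–h2 = +1) is an assumption, but it does reproduce the 39.7 normalizer mentioned above.

```python
import itertools
import numpy as np

# Assumed wiring for the toy example: two visible units (v1, v2),
# two hidden units (h1, h2), no biases, three weights.
W_V1_H1 = 2.0
W_H1_H2 = -1.0
W_V2_H2 = 1.0

def neg_energy(v1, v2, h1, h2):
    # -E(v, h): one term per connection, weight times the two binary states.
    return W_V1_H1 * v1 * h1 + W_H1_H2 * h1 * h2 + W_V2_H2 * v2 * h2

# All 16 joint configurations, their unnormalized probabilities exp(-E),
# and the partition function Z (the sum of the unnormalized probabilities).
configs = list(itertools.product([0, 1], repeat=4))        # (v1, v2, h1, h2)
unnormalized = {c: np.exp(neg_energy(*c)) for c in configs}
Z = sum(unnormalized.values())                              # about 39.7
joint = {c: p / Z for c, p in unnormalized.items()}

# Probability of each visible vector: sum the joint probabilities over the
# four hidden configurations that could go with it.
print(f"Z = {Z:.1f}")
for v1, v2 in itertools.product([0, 1], repeat=2):
    p_v = sum(joint[(v1, v2, h1, h2)]
              for h1, h2 in itertools.product([0, 1], repeat=2))
    print(f"p(v1={v1}, v2={v2}) = {p_v:.3f}")
```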
Now let's ask how we get a sample from the model when the network's bigger than that. Obviously, in the network we just computed we can figure out the probability of everything, because it's small. But when the network's big, we can't do these exponentially large computations. So if there are more than a few hidden units, we can't actually compute that partition function; there are too many terms in it. But we can use Markov chain Monte Carlo to get samples from the model, by starting from a random global configuration and then picking units at random and updating them stochastically based on their energy gaps, those energy gaps being determined by the states of all the other units in the network. If we keep doing that until the Markov chain reaches its stationary distribution, then we have a sample from the model. And the probability of that sample is related to its energy by the Boltzmann distribution: that is, the probability of the sample is proportional to e to the minus the energy.

What about getting a sample from the posterior distribution over hidden configurations, given a data vector? It turns out we're going to need that for learning. The number of possible hidden configurations is again exponential, so again we use Markov chain Monte Carlo. And it's just the same as getting a sample from the model, except that we keep the visible units clamped to the data vector we're interested in, so we only update the hidden units. The reason we need to get samples from the posterior distribution, given a data vector, is that we might want to know a good explanation for the observed data, and we might want to base our actions on that good explanation. But we also need it for learning.
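As a rough sketch of what that sampling procedure could look like in code (the function signature, the clamping interface, and the sign convention for the energy gap are assumptions, not something given in the lecture):

```python
import numpy as np

def sample_boltzmann(weights, biases, n_steps, clamped=None, rng=None):
    """Markov chain Monte Carlo (Gibbs) sampling sketch for a Boltzmann machine.

    weights: symmetric (n, n) array of connection weights with a zero diagonal.
    biases:  length-n array of unit biases.
    clamped: optional {unit_index: 0 or 1} dict that holds the visible units
             fixed to a data vector, for sampling the posterior over hidden units.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(biases)
    state = rng.integers(0, 2, size=n)            # random global configuration
    if clamped:
        for i, v in clamped.items():
            state[i] = v
    for _ in range(n_steps):
        i = rng.integers(n)                       # pick a unit at random
        if clamped and i in clamped:
            continue                              # clamped units are never updated
        # Energy gap for unit i: how much the energy drops if the unit turns on,
        # determined by the states of all the other units.
        energy_gap = biases[i] + weights[i] @ state
        p_on = 1.0 / (1.0 + np.exp(-energy_gap))  # logistic of the energy gap
        state[i] = int(rng.random() < p_on)
    # After enough updates the chain is close to its stationary distribution,
    # and `state` is approximately a sample from the model (or from the
    # posterior over the hidden units, if the visible units were clamped).
    return state
```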