In this video I'm going to talk about some advanced material. It's not really appropriate for a first course on neural networks, but I know that some of you are particularly interested in the origins of deep learning, and the content of this video is mathematically very pretty, so I couldn't resist putting it in. The insight that stacking up restricted Boltzmann machines gives you something like a sigmoid belief net can actually be seen without doing any math, just by noticing that a restricted Boltzmann machine is actually the same thing as an infinitely deep sigmoid belief net with shared weights. Once again, weight sharing leads to something very interesting.

I'm now going to describe a very interesting explanation of why layer-by-layer learning works. It depends on the fact that there is an equivalence between restricted Boltzmann machines, which are undirected networks with symmetric connections, and infinitely deep directed networks in which every layer uses the same weight matrix. This equivalence also gives insight into why contrastive divergence learning works. So an RBM is really just an infinitely deep sigmoid belief net with a lot of shared weights, and the Markov chain that we run when we want to sample from an RBM can be viewed as exactly the same thing as a sigmoid belief net.

So here's the picture. We have a very deep sigmoid belief net, in fact infinitely deep, and we use the same weights at every layer. We have to have all the V layers being the same size as each other, and all the H layers being the same size as each other, but V and H can be different sizes. The distribution generated by this very deep network with replicated weights is exactly the equilibrium distribution that you get by alternating between sampling from P(v|h) and P(h|v), where both P(v|h) and P(h|v) are defined by the same weight matrix W. And that's exactly what you do when you take a restricted Boltzmann machine and run a Markov chain to get a sample from its equilibrium distribution. So a top-down pass starting from infinitely high up in this directed net is exactly equivalent to letting a restricted Boltzmann machine settle to equilibrium, and they define the same distribution: the sample you get at V0 if you run this infinite directed net would be an equilibrium sample from the equivalent RBM.
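To make the equivalence concrete, here is a minimal NumPy sketch of that alternating chain. The conventions are my assumptions, not code from the lecture: W has shape (number of hidden units, number of visible units), b_v and b_h are bias vectors, a generative (top-down) step multiplies by W, and the other direction multiplies by W transpose.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p, rng):
    """Sample binary states with the given probabilities."""
    return (rng.random(p.shape) < p).astype(np.float64)

def rbm_gibbs_chain(W, b_v, b_h, n_steps, rng):
    """Alternate sampling from P(h|v) and P(v|h), both defined by the
    same weight matrix W. Each alternation corresponds to one pair of
    layers in the infinite directed net with tied weights."""
    v = sample_bernoulli(np.full(b_v.shape, 0.5), rng)     # arbitrary start
    for _ in range(n_steps):
        h = sample_bernoulli(sigmoid(v @ W.T + b_h), rng)  # P(h|v)
        v = sample_bernoulli(sigmoid(h @ W + b_v), rng)    # P(v|h)
    return v  # approaches an equilibrium sample (the state at V0) for large n_steps
```

For example, rbm_gibbs_chain(W, b_v, b_h, 1000, np.random.default_rng(0)) would give an approximate equilibrium sample; reading the last few iterations top-down is the same computation as a pass through the infinite directed net.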
Now let's look at inference in an infinitely deep sigmoid belief net. In inference, we start at V0 and then we have to infer the state of H0. Normally this would be a difficult thing to do because of explaining away. If, for example, hidden units k and j both had big positive weights to visible unit i, then we would expect that when we observe that i is on, k and j become anti-correlated in the posterior distribution. That's explaining away.

However, in this net, k and j are completely independent of one another when we do inference given V0. So the inference is trivial: we just multiply V0 by the transpose of W, put whatever we get through the logistic sigmoid, and then sample, and that gives us binary states for the units in H0. But the question is, how could they possibly be independent, given explaining away? The answer is that the model above H0 implements what I call a complementary prior: a prior distribution over H0 that exactly cancels out the correlations created by explaining away. So for the example shown, the prior will implement positive correlations between k and j, explaining away will cause negative correlations, and those will exactly cancel.

So what's really going on is that when we multiply V0 by the transpose of the weights, we're not just computing the likelihood term. We're computing the product of a likelihood term and a prior term, and that's what you need to do to get the posterior. It normally comes as a big surprise to people that when you multiply by W transpose, what you compute is the product of the likelihood and the prior, which is the posterior.

So what's happening in this net is that the complementary prior implemented by all the stuff above H0 exactly cancels out explaining away, and that makes inference very simple. And that's true at every layer of this net, so we can do inference for every layer and get an unbiased sample at each layer. We start by multiplying V0 by W transpose; then, once we've computed the binary state of H0, we multiply that by W, put it through the logistic sigmoid and sample, and that gives us a binary state for V1, and so on all the way up. So generating from this model is equivalent to running the alternating Markov chain of a restricted Boltzmann machine to equilibrium.
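Here is a minimal sketch of that bottom-up inference pass, reusing the sigmoid and sample_bernoulli helpers assumed above. Because the complementary prior makes each posterior factorial, every layer is just a matrix multiply, a squash, and a sample.

```python
def infer_up(v0, W, b_v, b_h, n_pairs, rng):
    """Trivial factorial inference in the infinite directed net:
    V -> H steps multiply by W.T, H -> V steps multiply by W.
    Returns sampled binary states [v0, h0, v1, h1, ...]."""
    states = [v0]
    v = v0
    for _ in range(n_pairs):
        h = sample_bernoulli(sigmoid(v @ W.T + b_h), rng)  # e.g. H0 from V0
        v = sample_bernoulli(sigmoid(h @ W + b_v), rng)    # e.g. V1 from H0
        states.extend([h, v])
    return states
```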
Performing inference in this model is exactly the same process as generation, but in the opposite direction. This is a very special kind of sigmoid belief net in which inference is as easy as generation. So here I've shown the generative weights that define the model, and also their transposes, which are the weights we use for inference.

Now what I want to show is how we get the Boltzmann machine learning algorithm out of the learning algorithm for directed sigmoid belief nets. The learning rule for a sigmoid belief net says that we should first get a sample from the posterior; that's what Sj and Si are, samples from the posterior distribution. Then we should change a generative weight in proportion to the product of the presynaptic activity Sj and the difference between the postsynaptic activity Si and Pi, the probability of turning on i given all the binary states in the layer above, the layer that Sj is in.

Now if we ask how we compute Pi, something very interesting happens. If you look at inference in the network on the right, we first infer a binary state for H0. Once we've chosen that binary state, we then infer a binary state for V1 by multiplying H0 by W, putting the result through the logistic, and then sampling. So if you think about how Si1 was generated, it was a sample from what we get if we put H0 through the weight matrix W and then through the logistic. And that's exactly what we'd have to do in order to compute Pi0: we'd have to take the binary activities in H0 and, going downwards now through the green weights W, compute the probability of turning on unit i given the binary states of its parents. So the point is that the process that goes from H0 to V1 is identical to the process that goes from H0 to V0, and so Si1 is an unbiased sample of Pi0. That means we can substitute it into the learning rule.

So we end up with a learning rule that looks like this. Because we have replicated weights, each of these lines is the term in the learning rule that comes from one of those green weight matrices. For the first green weight matrix, the learning rule is the presynaptic state Sj0 times the difference between the postsynaptic state Si0 and the probability that the binary states in H0 would turn on Si, which we could call Pi0; and a sample with exactly that probability is Si1.
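Written out in symbols (this is the standard form of the maximum likelihood rule for a sigmoid belief net, stated here for clarity rather than copied from the slides, with biases omitted), the rule for the bottom weight matrix is

$$\Delta w_{ij} \;\propto\; s_j^0\,\big(s_i^0 - p_i^0\big), \qquad p_i^0 = \sigma\!\Big(\textstyle\sum_j w_{ij}\, s_j^0\Big),$$

where $s_j^0$ is the sampled binary state of hidden unit $j$ in H0, $s_i^0$ is the sampled state of visible unit $i$ in V0, and $\sigma$ is the logistic function. The observation above is that $s_i^1$ is an unbiased sample of $p_i^0$.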
And so an unbiased estimate of the derivative can be obtained by plugging Si1 into that first line of the learning rule. Similarly, for the second weight matrix, the learning rule is Si1 times the difference between Sj0 and Pj0, and an unbiased estimate of Pj0 is Sj1. So that gives an unbiased estimate of the learning rule for the second weight matrix. And if you just keep going for all the weight matrices, you get an infinite series in which all the terms except the very first and the very last cancel out. So you end up with the Boltzmann machine learning rule, which is just Sj0 times Si0 minus Sj-infinity times Si-infinity.
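To see the cancellation explicitly, here is the series written out (again a reconstruction in standard notation, not copied from the slide). The terms contributed by successive tied weight matrices are

$$\Delta w_{ij} \;\propto\; s_j^0\big(s_i^0 - s_i^1\big) \;+\; s_i^1\big(s_j^0 - s_j^1\big) \;+\; s_j^1\big(s_i^1 - s_i^2\big) \;+\; s_i^2\big(s_j^1 - s_j^2\big) \;+\;\cdots$$

Every cross term such as $s_j^0 s_i^1$ appears once with a minus sign and once with a plus sign, so the sum telescopes to

$$\Delta w_{ij} \;\propto\; s_j^0 s_i^0 \;-\; s_j^\infty s_i^\infty,$$

which is exactly the Boltzmann machine learning rule.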
So let's go back and look at how we would learn an infinitely deep sigmoid belief net. We would start by making all the weight matrices the same: we tie all the weight matrices together, and we learn using those tied weights. Now that's exactly equivalent to learning a restricted Boltzmann machine. The diagram on the right and the diagram on the left are identical; we can think of the symmetric arrow in the diagram on the left as just a convenient shorthand for an infinite directed net with tied weights.

So we first learn that restricted Boltzmann machine. We ought to learn it using maximum likelihood learning, but actually we're just going to use contrastive divergence learning; we're going to take a shortcut. Once we've learned the first restricted Boltzmann machine, we freeze the bottom-level weights. We freeze the generative weights that define the model, and we also freeze the weights we're going to use for inference to be the transpose of those generative weights. We keep all the other weights tied together, but now we allow them to be different from the weights in the bottom layer. Learning those remaining tied weights is exactly equivalent to learning another restricted Boltzmann machine: namely, a restricted Boltzmann machine with H0 as its visible units and V1 as its hidden units, where the data is the aggregated posterior over H0.

That is, if we want to sample a data vector to train this next machine, we put in a real data vector V0, do inference through those frozen weights, get a binary vector at H0, and treat that as data for training the next restricted Boltzmann machine. And we can go up for as many layers as we like. When we get fed up, we just end up with a restricted Boltzmann machine at the top, which is equivalent to saying that all the weights in the infinite directed net above that point are still tied together, while the weights below have all become different.

Now, the explanation of why the inference procedure was correct involved the idea of a complementary prior created by the weights in the layers above. Of course, when we change the weights in the layers above but leave the bottom layer of weights fixed, the prior created by those changed weights is no longer exactly complementary. So our inference procedure, using the frozen weights in the bottom layer, is no longer exactly correct. The good news is that it's nearly always very close to correct, and even with the incorrect inference procedure we still get a variational bound on the log probability of the data.

The higher layers have changed because they've learned a prior for the bottom hidden layer that's closer to the aggregated posterior distribution, and that makes the model better. So changing the higher weights makes the inference we're doing at the bottom hidden layer incorrect, but it gives us a better model. And if you look at those two effects, you can prove that the improvement in the variational bound you get from having a better model is always greater than the loss you get from the inference being slightly incorrect. So in terms of this variational bound, you win when you learn the weights in the higher layers, assuming that you do it with correct maximum likelihood learning.
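As a sketch of the data-generation step in this greedy procedure (the function name is my own illustrative choice, using the same assumed NumPy conventions and helpers as before): push real data vectors through the frozen inference weights to get binary H0 vectors, which become the training data for the next restricted Boltzmann machine.

```python
def aggregated_posterior_samples(data, W_frozen, b_h_frozen, rng):
    """Turn real data vectors V0 into training data for the next RBM:
    one inference pass through the frozen weights (the transpose of the
    frozen generative weights), then sample binary H0 states."""
    return sample_bernoulli(sigmoid(data @ W_frozen.T + b_h_frozen), rng)
```

Stacking then just repeats the recipe: train an RBM on the current data, freeze its weights, call this to get the next layer's data, and train the next RBM on that.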
So now let's go back to what's happening in contrastive divergence learning. We have the infinite net on the right and a restricted Boltzmann machine on the left, and they're equivalent. If we were to do maximum likelihood learning for the restricted Boltzmann machine, it would be maximum likelihood learning for the infinite sigmoid belief net. But what we're going to do is cut things off: we're going to ignore the small derivatives for the weights in the higher layers of the infinite sigmoid belief net. So we cut it off where that dotted red line is.

Now if we look at the derivatives we get, they have two terms. The first term comes from the bottom layer of weights; we've seen that before, the derivative for the bottom layer of weights is just the first line here. The second term comes from the next layer of weights; that's this line here. We need to compute the activities in H1 in order to compute the Sj1 in that second line, but we're not actually computing derivatives for the third layer of weights. And when we take those first two terms and combine them, we get exactly the learning rule for one-step contrastive divergence.

So what's going on in contrastive divergence is that we're combining the weight derivatives for the lower layers and ignoring the weight derivatives in the higher layers. The question is, why can we get away with ignoring those higher derivatives? When the weights are small, the Markov chain mixes very fast; if the weights are zero, it mixes in one step. And if the Markov chain mixes fast, the higher layers will be close to the equilibrium distribution, that is, they will have forgotten what the input was at the bottom layer.

And now we have a nice property: if the higher layers are sampled from the equilibrium distribution, we know that the derivatives of the log probability of the data with respect to the weights must average out to zero. That's because the current weights in the model are a perfect model of the equilibrium distribution: the equilibrium distribution is generated using those weights, and if you want to generate samples from the equilibrium distribution, those are the best possible weights you could have. So we know the derivatives there are zero.

As the weights get larger, we might have to run more iterations of contrastive divergence, which corresponds to taking into account more layers of that infinite sigmoid belief net. That allows contrastive divergence to continue to be a good approximation to maximum likelihood, and if we're trying to learn a density model, that makes a lot of sense: as the weights grow, you run CD for more and more steps.
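Here is a minimal CD-k update in the same assumed conventions as the earlier sketches; it is an illustration, not the lecture's code. With k = 1 the update keeps only the two terms just described, and larger k corresponds to keeping more layers of the infinite net.

```python
def contrastive_divergence_update(W, b_v, b_h, v0, k, lr, rng):
    """One CD-k update on a single binary data vector v0 (in place).
    The lecture's derivation uses sampled binary states throughout;
    in practice, probabilities are often used for the hidden units
    to reduce sampling noise."""
    h0 = sample_bernoulli(sigmoid(v0 @ W.T + b_h), rng)    # Sj0
    v, h = v0, h0
    for _ in range(k):                                     # run the chain k steps
        v = sample_bernoulli(sigmoid(h @ W + b_v), rng)    # Si1, Si2, ...
        h = sample_bernoulli(sigmoid(v @ W.T + b_h), rng)  # Sj1, Sj2, ...
    W += lr * (np.outer(h0, v0) - np.outer(h, v))          # Sj0*Si0 - Sjk*Sik
    b_v += lr * (v0 - v)
    b_h += lr * (h0 - h)
    return W, b_v, b_h
```

Greedy stacking just applies this update repeatedly to the current layer's data, then uses aggregated_posterior_samples from the earlier sketch to produce the data for the next restricted Boltzmann machine.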
If there's a statistician around, you can give them a guarantee that in the limit you'll run CD for infinitely many steps, and then you have an asymptotic convergence result, which is the thing that keeps statisticians happy. Of course it's completely irrelevant, because you'll never reach a point like that.

There is, however, an interesting point here. If our purpose in using CD is to build a stack of restricted Boltzmann machines that learn multiple layers of features, it turns out that we don't need a good approximation to maximum likelihood. For learning multiple layers of features, CD1 is just fine. In fact, it's probably better than doing maximum likelihood.