In this video I'm going to talk about some advanced material. It's not really appropriate for a first course on neural networks, but I know that some of you are particularly interested in the origins of deep learning, and the content of this video is mathematically very pretty, so I couldn't resist putting it in. The insight that stacking up restricted Boltzmann machines gives you something like a sigmoid belief net can actually be seen without doing any math, just by noticing that a restricted Boltzmann machine is actually the same thing as an infinitely deep sigmoid belief net with shared weights. Once again, weight sharing leads to something very interesting.

I'm now going to describe a very interesting explanation of why layer-by-layer learning works. It depends on the fact that there is an equivalence between restricted Boltzmann machines, which are undirected networks with symmetric connections, and infinitely deep directed networks in which every layer uses the same weight matrix. This equivalence also gives insight into why contrastive divergence learning works. So an RBM is really just an infinitely deep sigmoid belief net with a lot of shared weights, and the Markov chain that we run when we want to sample from an RBM can be viewed as exactly the same thing as a sigmoid belief net.

So here's the picture. We have a very deep sigmoid belief net, in fact infinitely deep, and we use the same weights at every layer. We have to have all the V layers being the same size as each other, and all the H layers being the same size as each other, but V and H can be different sizes. The distribution generated by this very deep network with replicated weights is exactly the equilibrium distribution that you get by alternating between sampling from P(v|h) and P(h|v), where both P(v|h) and P(h|v) are defined by the same weight matrix W. And that's exactly what you do when you take a restricted Boltzmann machine and run a Markov chain to get a sample from its equilibrium distribution. So a top-down pass starting from infinitely high up in this directed net is exactly equivalent to letting a restricted Boltzmann machine settle to equilibrium, and they define the same distribution: the sample you get at V0 if you run this infinite directed net would be an equilibrium sample from the equivalent RBM.
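To make the equivalence concrete, here is a minimal NumPy sketch of that alternating chain. The conventions are my assumptions, not code from the lecture: W has shape (number of hidden units, number of visible units), b_v and b_h are bias vectors, a generative (top-down) step multiplies by W, and the other direction multiplies by W transpose.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p, rng):
    """Sample binary states with the given probabilities."""
    return (rng.random(p.shape) < p).astype(np.float64)

def rbm_gibbs_chain(W, b_v, b_h, n_steps, rng):
    """Alternate sampling from P(h|v) and P(v|h), both defined by the
    same weight matrix W. Each alternation corresponds to one pair of
    layers in the infinite directed net with tied weights."""
    v = sample_bernoulli(np.full(b_v.shape, 0.5), rng)     # arbitrary start
    for _ in range(n_steps):
        h = sample_bernoulli(sigmoid(v @ W.T + b_h), rng)  # P(h|v)
        v = sample_bernoulli(sigmoid(h @ W + b_v), rng)    # P(v|h)
    return v  # approaches an equilibrium sample (the state at V0) for large n_steps
```

For example, rbm_gibbs_chain(W, b_v, b_h, 1000, np.random.default_rng(0)) would give an approximate equilibrium sample; reading the last few iterations top-down is the same computation as a pass through the infinite directed net.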
Now let's look at inference in an infinitely deep sigmoid belief net. In inference, we start at V0 and then we have to infer the state of H0. Normally this would be a difficult thing to do because of explaining away. If, for example, hidden units k and j both had big positive weights to visible unit i, then we would expect that when we observe that i is on, k and j become anti-correlated in the posterior distribution. That's explaining away.

However, in this net, k and j are completely independent of one another when we do inference given V0. So the inference is trivial: we just multiply V0 by the transpose of W, put whatever we get through the logistic sigmoid, and then sample, and that gives us binary states for the units in H0. But the question is, how could they possibly be independent, given explaining away? The answer is that the model above H0 implements what I call a complementary prior: a prior distribution over H0 that exactly cancels out the correlations created by explaining away. So for the example shown, the prior will implement positive correlations between k and j, explaining away will cause negative correlations, and those will exactly cancel.

So what's really going on is that when we multiply V0 by the transpose of the weights, we're not just computing the likelihood term. We're computing the product of a likelihood term and a prior term, and that's what you need to do to get the posterior. It normally comes as a big surprise to people that when you multiply by W transpose, what you compute is the product of the likelihood and the prior, which is the posterior.

So what's happening in this net is that the complementary prior implemented by all the stuff above H0 exactly cancels out explaining away, and that makes inference very simple. And that's true at every layer of this net, so we can do inference for every layer and get an unbiased sample at each layer. We start by multiplying V0 by W transpose; then, once we've computed the binary state of H0, we multiply that by W, put it through the logistic sigmoid and sample, and that gives us a binary state for V1, and so on all the way up. So generating from this model is equivalent to running the alternating Markov chain of a restricted Boltzmann machine to equilibrium.
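Here is a minimal sketch of that bottom-up inference pass, reusing the sigmoid and sample_bernoulli helpers assumed above. Because the complementary prior makes each posterior factorial, every layer is just a matrix multiply, a squash, and a sample.

```python
def infer_up(v0, W, b_v, b_h, n_pairs, rng):
    """Trivial factorial inference in the infinite directed net:
    V -> H steps multiply by W.T, H -> V steps multiply by W.
    Returns sampled binary states [v0, h0, v1, h1, ...]."""
    states = [v0]
    v = v0
    for _ in range(n_pairs):
        h = sample_bernoulli(sigmoid(v @ W.T + b_h), rng)  # e.g. H0 from V0
        v = sample_bernoulli(sigmoid(h @ W + b_v), rng)    # e.g. V1 from H0
        states.extend([h, v])
    return states
```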
Performing inference in this model is exactly the same process as generation, but in the opposite direction. This is a very special kind of sigmoid belief net in which inference is as easy as generation. So here I've shown the generative weights that define the model, and also their transposes, which are the weights we use for inference.

Now what I want to show is how we get the Boltzmann machine learning algorithm out of the learning algorithm for directed sigmoid belief nets. The learning rule for a sigmoid belief net says that we should first get a sample from the posterior; that's what Sj and Si are, samples from the posterior distribution. Then we should change a generative weight in proportion to the product of the presynaptic activity Sj and the difference between the postsynaptic activity Si and Pi, the probability of turning on i given all the binary states in the layer above, the layer that Sj is in.

Now if we ask how we compute Pi, something very interesting happens. If you look at inference in the network on the right, we first infer a binary state for H0. Once we've chosen that binary state, we then infer a binary state for V1 by multiplying H0 by W, putting the result through the logistic, and then sampling. So if you think about how Si1 was generated, it was a sample from what we get if we put H0 through the weight matrix W and then through the logistic. And that's exactly what we'd have to do in order to compute Pi0: we'd have to take the binary activities in H0 and, going downwards now through the green weights W, compute the probability of turning on unit i given the binary states of its parents. So the point is that the process that goes from H0 to V1 is identical to the process that goes from H0 to V0, and so Si1 is an unbiased sample of Pi0. That means we can substitute it into the learning rule.

So we end up with a learning rule that looks like this. Because we have replicated weights, each of these lines is the term in the learning rule that comes from one of those green weight matrices. For the first green weight matrix, the learning rule is the presynaptic state Sj0 times the difference between the postsynaptic state Si0 and the probability that the binary states in H0 would turn on Si, which we could call Pi0; and a sample with exactly that probability is Si1.
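Written out in symbols (this is the standard form of the maximum likelihood rule for a sigmoid belief net, stated here for clarity rather than copied from the slides, with biases omitted), the rule for the bottom weight matrix is

$$\Delta w_{ij} \;\propto\; s_j^0\,\big(s_i^0 - p_i^0\big), \qquad p_i^0 = \sigma\!\Big(\textstyle\sum_j w_{ij}\, s_j^0\Big),$$

where $s_j^0$ is the sampled binary state of hidden unit $j$ in H0, $s_i^0$ is the sampled state of visible unit $i$ in V0, and $\sigma$ is the logistic function. The observation above is that $s_i^1$ is an unbiased sample of $p_i^0$.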
And so an unbiased estimate of the derivative can be obtained by plugging Si1 into that first line of the learning rule. Similarly, for the second weight matrix, the learning rule is Si1 times the difference between Sj0 and Pj0, and an unbiased estimate of Pj0 is Sj1. So that gives an unbiased estimate of the learning rule for the second weight matrix. And if you just keep going for all the weight matrices, you get an infinite series in which all the terms except the very first and the very last cancel out. So you end up with the Boltzmann machine learning rule, which is just Sj0 times Si0 minus Sj-infinity times Si-infinity.
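To see the cancellation explicitly, here is the series written out (again a reconstruction in standard notation, not copied from the slide). The terms contributed by successive tied weight matrices are

$$\Delta w_{ij} \;\propto\; s_j^0\big(s_i^0 - s_i^1\big) \;+\; s_i^1\big(s_j^0 - s_j^1\big) \;+\; s_j^1\big(s_i^1 - s_i^2\big) \;+\; s_i^2\big(s_j^1 - s_j^2\big) \;+\;\cdots$$

Every cross term such as $s_j^0 s_i^1$ appears once with a minus sign and once with a plus sign, so the sum telescopes to

$$\Delta w_{ij} \;\propto\; s_j^0 s_i^0 \;-\; s_j^\infty s_i^\infty,$$

which is exactly the Boltzmann machine learning rule.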
So let's go back and look at how we would learn an infinitely deep sigmoid belief net. We would start by making all the weight matrices the same: we tie all the weight matrices together, and we learn using those tied weights. Now that's exactly equivalent to learning a restricted Boltzmann machine. The diagram on the right and the diagram on the left are identical; we can think of the symmetric arrow in the diagram on the left as just a convenient shorthand for an infinite directed net with tied weights.

So we first learn that restricted Boltzmann machine. We ought to learn it using maximum likelihood learning, but actually we're just going to use contrastive divergence learning; we're going to take a shortcut. Once we've learned the first restricted Boltzmann machine, we freeze the bottom-level weights. We freeze the generative weights that define the model, and we also freeze the weights we're going to use for inference to be the transpose of those generative weights. We keep all the other weights tied together, but now we allow them to be different from the weights in the bottom layer. Learning those remaining tied weights is exactly equivalent to learning another restricted Boltzmann machine: namely, a restricted Boltzmann machine with H0 as its visible units and V1 as its hidden units, where the data is the aggregated posterior over H0.

That is, if we want to sample a data vector to train this next machine, we put in a real data vector V0, do inference through those frozen weights, get a binary vector at H0, and treat that as data for training the next restricted Boltzmann machine. And we can go up for as many layers as we like. When we get fed up, we just end up with a restricted Boltzmann machine at the top, which is equivalent to saying that all the weights in the infinite directed net above that point are still tied together, while the weights below have all become different.

Now, the explanation of why the inference procedure was correct involved the idea of a complementary prior created by the weights in the layers above. Of course, when we change the weights in the layers above but leave the bottom layer of weights fixed, the prior created by those changed weights is no longer exactly complementary. So our inference procedure, using the frozen weights in the bottom layer, is no longer exactly correct. The good news is that it's nearly always very close to correct, and even with the incorrect inference procedure we still get a variational bound on the log probability of the data.

The higher layers have changed because they've learned a prior for the bottom hidden layer that's closer to the aggregated posterior distribution, and that makes the model better. So changing the higher weights makes the inference we're doing at the bottom hidden layer incorrect, but it gives us a better model. And if you look at those two effects, you can prove that the improvement in the variational bound you get from having a better model is always greater than the loss you get from the inference being slightly incorrect. So in terms of this variational bound, you win when you learn the weights in the higher layers, assuming that you do it with correct maximum likelihood learning.
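As a sketch of the data-generation step in this greedy procedure (the function name is my own illustrative choice, using the same assumed NumPy conventions and helpers as before): push real data vectors through the frozen inference weights to get binary H0 vectors, which become the training data for the next restricted Boltzmann machine.

```python
def aggregated_posterior_samples(data, W_frozen, b_h_frozen, rng):
    """Turn real data vectors V0 into training data for the next RBM:
    one inference pass through the frozen weights (the transpose of the
    frozen generative weights), then sample binary H0 states."""
    return sample_bernoulli(sigmoid(data @ W_frozen.T + b_h_frozen), rng)
```

Stacking then just repeats the recipe: train an RBM on the current data, freeze its weights, call this to get the next layer's data, and train the next RBM on that.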
So now let's go back to what's happening in contrastive divergence learning. We have the infinite net on the right and a restricted Boltzmann machine on the left, and they're equivalent. If we were to do maximum likelihood learning for the restricted Boltzmann machine, it would be maximum likelihood learning for the infinite sigmoid belief net. But what we're going to do is cut things off: we're going to ignore the small derivatives for the weights in the higher layers of the infinite sigmoid belief net. So we cut it off where that dotted red line is.

Now if we look at the derivatives we get, they have two terms. The first term comes from the bottom layer of weights; we've seen that before, the derivative for the bottom layer of weights is just the first line here. The second term comes from the next layer of weights; that's this line here. We need to compute the activities in H1 in order to compute the Sj1 in that second line, but we're not actually computing derivatives for the third layer of weights. And when we take those first two terms and combine them, we get exactly the learning rule for one-step contrastive divergence.

So what's going on in contrastive divergence is that we're combining the weight derivatives for the lower layers and ignoring the weight derivatives in the higher layers. The question is, why can we get away with ignoring those higher derivatives? When the weights are small, the Markov chain mixes very fast; if the weights are zero, it mixes in one step. And if the Markov chain mixes fast, the higher layers will be close to the equilibrium distribution, that is, they will have forgotten what the input was at the bottom layer.

And now we have a nice property: if the higher layers are sampled from the equilibrium distribution, we know that the derivatives of the log probability of the data with respect to the weights must average out to zero. That's because the current weights in the model are a perfect model of the equilibrium distribution: the equilibrium distribution is generated using those weights, and if you want to generate samples from the equilibrium distribution, those are the best possible weights you could have. So we know the derivatives there are zero.

As the weights get larger, we might have to run more iterations of contrastive divergence, which corresponds to taking into account more layers of that infinite sigmoid belief net. That allows contrastive divergence to continue to be a good approximation to maximum likelihood, and if we're trying to learn a density model, that makes a lot of sense: as the weights grow, you run CD for more and more steps.
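Here is a minimal CD-k update in the same assumed conventions as the earlier sketches; it is an illustration, not the lecture's code. With k = 1 the update keeps only the two terms just described, and larger k corresponds to keeping more layers of the infinite net.

```python
def contrastive_divergence_update(W, b_v, b_h, v0, k, lr, rng):
    """One CD-k update on a single binary data vector v0 (in place).
    The lecture's derivation uses sampled binary states throughout;
    in practice, probabilities are often used for the hidden units
    to reduce sampling noise."""
    h0 = sample_bernoulli(sigmoid(v0 @ W.T + b_h), rng)    # Sj0
    v, h = v0, h0
    for _ in range(k):                                     # run the chain k steps
        v = sample_bernoulli(sigmoid(h @ W + b_v), rng)    # Si1, Si2, ...
        h = sample_bernoulli(sigmoid(v @ W.T + b_h), rng)  # Sj1, Sj2, ...
    W += lr * (np.outer(h0, v0) - np.outer(h, v))          # Sj0*Si0 - Sjk*Sik
    b_v += lr * (v0 - v)
    b_h += lr * (h0 - h)
    return W, b_v, b_h
```

Greedy stacking just applies this update repeatedly to the current layer's data, then uses aggregated_posterior_samples from the earlier sketch to produce the data for the next restricted Boltzmann machine.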
If there's a statistician around, you can give them a guarantee that in the limit you'll run CD for infinitely many steps, and then you have an asymptotic convergence result, which is the thing that keeps statisticians happy. Of course it's completely irrelevant, because you'll never reach a point like that.

There is, however, an interesting point here. If our purpose in using CD is to build a stack of restricted Boltzmann machines that learn multiple layers of features, it turns out that we don't need a good approximation to maximum likelihood. For learning multiple layers of features, CD1 is just fine. In fact, it's probably better than doing maximum likelihood.