In this video, I'll talk about a different way of learning sigmoid belief nets. This method arose in an unexpected way: I stopped working on sigmoid belief nets and went back to Boltzmann machines, and discovered that restricted Boltzmann machines could actually be learned fairly efficiently. Given that a restricted Boltzmann machine can efficiently learn a layer of nonlinear features, it was tempting to take those features, treat them as data, and apply another restricted Boltzmann machine to model the correlations between those features. One can continue like this, stacking one Boltzmann machine on top of the next to learn many layers of nonlinear features. This eventually led to a big resurgence of interest in deep neural nets.

The issue then arose: once you have stacked up lots of restricted Boltzmann machines, each of which was learned by modeling the patterns of feature activities produced by the previous Boltzmann machine, do you just have a set of separate restricted Boltzmann machines, or can they all be combined into one model? Anybody sensible would expect that if you combined a set of restricted Boltzmann machines into one model, what you'd get would be a multilayer Boltzmann machine. However, a brilliant graduate student of mine called Yee-Whye Teh figured out that that's not what you get. You actually get something that looks much more like a sigmoid belief net. This was a big surprise. It was very surprising to me that we'd actually solved the problem of how to learn deep sigmoid belief nets by giving up on it and focusing instead on learning undirected models like Boltzmann machines.

Using the efficient learning algorithm for restricted Boltzmann machines, it's easy to train a layer of features that receive input directly from the pixels. We can then treat the patterns of activation of those feature detectors as if they were pixels, and learn another layer of features in a second hidden layer. We can repeat this as many times as we like, with each new layer of features modeling the correlated activity of the features in the layer below.
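To make that greedy, layer-by-layer procedure concrete, here is a minimal NumPy sketch. It assumes a tiny binary RBM trained with one step of contrastive divergence (CD-1) and full-batch updates; the class, function names, and hyperparameters are illustrative assumptions, not the exact recipe from the lecture.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    # A tiny binary RBM; W[i, j] connects visible unit i to hidden unit j.
    def __init__(self, n_visible, n_hidden, lr=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)          # visible biases
        self.b_h = np.zeros(n_hidden)           # hidden biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_update(self, v0):
        # Positive phase: hidden probabilities and a binary sample given the data.
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        # Negative phase: one step of alternating Gibbs sampling.
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        # Update using the difference of pairwise correlations.
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)

def train_stack(data, layer_sizes, epochs=10):
    # Greedy stacking: train an RBM, then treat its hidden activities as data
    # for the next RBM, and repeat for each layer size in turn.
    rbms, inputs = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(inputs.shape[1], n_hidden)
        for _ in range(epochs):
            rbm.cd1_update(inputs)
        rbms.append(rbm)
        probs = rbm.hidden_probs(inputs)
        inputs = (rbm.rng.random(probs.shape) < probs).astype(float)
    return rbms

For example, train_stack(binary_images, [500, 500]) would learn two stacked layers of 500 features each, where binary_images is a float array of 0/1 pixel values.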
It can be proved that each time we add another layer of features, we improve a variational lower bound on the log probability that some combined model would generate the data. The proof is actually complicated, and it only applies if you do everything just right, which you don't do in practice. But the proof is very reassuring, because it suggests that something sensible is going on when you stack up restricted Boltzmann machines like this. The proof is based on a neat equivalence between a restricted Boltzmann machine and an infinitely deep belief net.

So here's a picture of what happens when you learn two restricted Boltzmann machines, one on top of the other, and then combine them to make one overall model, which I call a deep belief net. First we learn one Boltzmann machine with its own weights. Once that's been trained, we take the hidden activity patterns of that Boltzmann machine when it's looking at data, and we treat each hidden activity pattern as data for training a second Boltzmann machine. So we just copy the binary states to the second Boltzmann machine, and then we learn another Boltzmann machine.

One interesting thing about this is that if we start the second Boltzmann machine off with W2 being the transpose of W1, and with as many hidden units in h2 as there are units in v, then the second Boltzmann machine will already be a pretty good model of h1, because it's just the first model upside down. A restricted Boltzmann machine doesn't really care which layer you call visible and which you call hidden; it's just a bipartite graph that learns to model the data it is given.

After we've learned those two Boltzmann machines, we compose them to form a single model, and the single model looks like this. Its top two layers are just the same as the top restricted Boltzmann machine, so that's an undirected model with symmetric connections. But its bottom two layers are a directed model, like a sigmoid belief net. So what we've done is take the symmetric connections between v and h1, throw away the up-going part, and keep just the down-going part. Why we do that is quite complicated, and it will be explained in video 13F. The resulting combined model is clearly not a Boltzmann machine, because its bottom layer of connections is not symmetric. It's a graphical model that we call a deep belief net, where the lower layers are just like a sigmoid belief net and the top two layers form a restricted Boltzmann machine. So it's a kind of hybrid model.
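A minimal sketch of that composition, assuming the weights are plain NumPy arrays; the sizes are illustrative, and the convention of which axis indexes which layer is an arbitrary choice made here, not something from the lecture.

import numpy as np

rng = np.random.default_rng(0)
n_v, n_h1 = 784, 500                         # illustrative sizes: 28x28 pixels, 500 features

# First RBM: one symmetric weight matrix between v and h1.
# Convention: W1[i, j] connects hidden unit i in h1 to visible unit j in v,
# so top-down generation uses W1 and bottom-up recognition uses W1.T.
W1 = 0.01 * rng.standard_normal((n_h1, n_v))
# ... train W1 with contrastive divergence on the pixel data ...

# Second RBM, started "upside down": its visible layer is h1, and if h2 has as
# many units as v, then W2 = W1.T is just the first RBM with visible and hidden
# swapped, so it already models h1 reasonably well before any further training.
n_h2 = n_v
W2 = W1.T.copy()                              # shape (n_h2, n_h1)
# ... then train W2 with contrastive divergence on sampled h1 patterns ...

# Composing them into a deep belief net: the top two layers keep the symmetric
# weights W2, the bottom pair keeps only the down-going use of W1 as generative
# weights, and W1.T is kept separately as recognition weights for inference only.
dbn = {"top_rbm": W2, "generative": [W1], "recognition": [W1.T]}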
If we do this with three Boltzmann machines stacked up, we get a hybrid model that looks like this: the top two layers again form a restricted Boltzmann machine, and the layers below are directed layers, like in a sigmoid belief net.

To generate data from this model, the correct procedure is, first of all, to go backwards and forwards between h2 and h3 to reach equilibrium in that top-level restricted Boltzmann machine. This involves alternating Gibbs sampling, where you update all of the units in h3 in parallel, then update all of the units in h2 in parallel, then go back and update all of the units in h3 in parallel, and so on, going backwards and forwards like that for a long time until you have an equilibrium sample from the top-level restricted Boltzmann machine. So the top-level restricted Boltzmann machine defines the prior distribution over h2. Once you've done that, you simply go once from h2 to h1 using the generative connections W2. Then, whatever binary pattern you get in h1, you go once more to get generated data, using the weights W1. So we're performing a top-down pass from h2 to get the states of all the other layers, just like in a sigmoid belief net. The bottom-up connections, shown in red at the lower levels, are not part of the generative model. They are the transposes of the corresponding weights, the transpose of W1 and the transpose of W2, and they will be used for inference, but they're not part of the generative model.
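Here is a minimal sketch of that generative procedure, assuming the weights and biases are plain NumPy arrays, with each weight matrix written so that it maps a layer down to the layer below (recognition would use the transposes); the number of Gibbs steps and the random initial state of the chain are illustrative choices.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def generate_from_dbn(W1, W2, W3, b_v, b_h1, b_h2, b_h3, n_gibbs=1000, rng=None):
    # W3 is the symmetric weight matrix of the top-level RBM between h2 and h3;
    # W2 (h2 -> h1) and W1 (h1 -> v) are the directed, top-down generative weights.
    rng = rng or np.random.default_rng()
    # Start the top-level RBM's Markov chain from a random binary h2 state.
    h2 = sample(np.full(b_h2.shape, 0.5), rng)
    # Alternating Gibbs sampling between h2 and h3; this chain is what defines
    # the prior distribution over h2.
    for _ in range(n_gibbs):
        h3 = sample(sigmoid(h2 @ W3.T + b_h3), rng)
        h2 = sample(sigmoid(h3 @ W3 + b_h2), rng)
    # A single top-down pass through the directed layers, as in a sigmoid belief net.
    h1 = sample(sigmoid(h2 @ W2 + b_h1), rng)
    v = sample(sigmoid(h1 @ W1 + b_v), rng)
    return v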
Now, before I explain why stacking up Boltzmann machines is a good idea, I need to sort out what it means to average two factorial distributions. It may surprise you to know that if I average two factorial distributions, I do not get a factorial distribution. What I mean by averaging here is taking a mixture of the distributions: you first pick one of the two at random, and then you generate from whichever one you picked.

Suppose we have an RBM with four hidden units and we give it a visible vector v1. Given this visible vector, the posterior distribution over those four hidden units is factorial. Let's suppose that in this distribution the first and second units each have a probability of 0.9 of turning on, and the last two each have a probability of 0.1 of turning on. What it means for this to be factorial is that, for example, the probability that the first two units will both be on in a sample from this distribution is exactly 0.81. Now suppose we have a different visible vector v2, and the posterior distribution over the same four hidden units is now 0.1, 0.1, 0.9, 0.9, which I chose just to make the math easy. If we average those two distributions, the mean probability of each hidden unit being on is indeed the average of the means for the two distributions. So the means are 0.5, 0.5, 0.5, 0.5, but what you get is not the factorial distribution defined by those four probabilities. To see that, consider the binary vector (1, 1, 0, 0) over the hidden units. In the posterior for v1, that vector has a probability of 0.9 × 0.9 × (1 − 0.1) × (1 − 0.1) = 0.9^4, which is about 0.66. In the posterior for v2, this vector is extremely unlikely: it has a probability of 0.1^4, which is 1 in 10,000. If we average those two probabilities for that particular vector, we get a probability of about 0.33, and that's much bigger than the probability assigned to the vector (1, 1, 0, 0) by a factorial distribution with means of 0.5, which is 0.5^4, or about 0.06. So the point of all this is that when you average two factorial posteriors, you get a mixture distribution that is not factorial.
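A quick numeric check of that argument, using the same probabilities; the helper function is just an illustrative way of computing the probability of a binary vector under a factorial distribution.

import numpy as np

def factorial_prob(binary_vec, means):
    # Probability of a binary vector under a factorial (independent-unit) distribution.
    binary_vec, means = np.asarray(binary_vec), np.asarray(means)
    return float(np.prod(np.where(binary_vec == 1, means, 1.0 - means)))

post_v1 = [0.9, 0.9, 0.1, 0.1]       # posterior over the 4 hidden units given v1
post_v2 = [0.1, 0.1, 0.9, 0.9]       # posterior given v2
vec = [1, 1, 0, 0]

p1 = factorial_prob(vec, post_v1)                  # 0.9**4 = 0.6561
p2 = factorial_prob(vec, post_v2)                  # 0.1**4 = 0.0001
mixture = 0.5 * (p1 + p2)                          # about 0.33
flat = factorial_prob(vec, [0.5, 0.5, 0.5, 0.5])   # 0.5**4 = 0.0625
print(mixture, flat)   # the mixture gives (1,1,0,0) far more mass than the factorial fit of the means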
Now, let's look at why the greedy learning works; that is, why it's a good idea to learn one restricted Boltzmann machine and then learn a second restricted Boltzmann machine that models the patterns of activity in the hidden units of the first one. The weights of the bottom-level restricted Boltzmann machine actually define four different distributions, and of course they define them in a consistent way. The first distribution is the probability of the visible units given the hidden units, and the second is the probability of the hidden units given the visible units. Those are the two distributions we use for running the alternating Markov chain that updates the visibles given the hiddens and then updates the hiddens given the visibles. If we run that chain long enough, we get a sample from the joint distribution of v and h, so the weights clearly also define the joint distribution. They define the joint distribution more directly, too, in terms of e to the minus the energy, but for nets with a large number of units we can't compute that. If you take the joint distribution p(v, h) and just ignore v, you get a distribution over h: that's the prior distribution over h defined by this restricted Boltzmann machine. Similarly, if we ignore h, we have the prior distribution over v defined by the restricted Boltzmann machine.

Now we're going to pick a rather surprising pair of distributions from those four. We're going to define the probability that the restricted Boltzmann machine assigns to a visible vector v as the sum over all hidden vectors of the probability it assigns to h times the probability of v given h. This seems like a silly thing to do, because defining p(h) is just as hard as defining p(v). Nevertheless, we're going to define p(v) that way. Now, if we leave p(v|h) alone but learn a better model of p(h), that is, learn some new parameters that give a better model of p(h) and substitute that in place of the old model of p(h), we will actually improve our model of v. What we mean by a better model of p(h) is a prior over h that fits the aggregated posterior better, where the aggregated posterior is the average, over all vectors in the training set, of the posterior distribution over h. So what we're going to do is use our first RBM to produce this aggregated posterior, and then use our second RBM to build a better model of the aggregated posterior than the first RBM has. And if we start the second RBM off as the first one upside down, it will start with the same model of the aggregated posterior as the first RBM, and then, if we change the weights, we can only make things better. So that's an explanation of what's happening when we stack up RBMs.
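Written out, the decomposition and the bound the lecture refers to look roughly like this, where q(h \mid v) is the first RBM's factorial posterior and H(\cdot) is its entropy; this is a sketch of the standard argument, not the full proof.

p(v) = \sum_h p(h)\, p(v \mid h)

\log p(v) \;\ge\; \sum_h q(h \mid v)\,\big[ \log p(h) + \log p(v \mid h) \big] \;+\; H\big(q(h \mid v)\big)

With p(v \mid h) and q(h \mid v) held fixed, the only term the second RBM can change is the average of \log p(h) under q(h \mid v); averaged over the training set, improving that term is exactly fitting the prior over h to the aggregated posterior, which is why a better model of the aggregated posterior can only raise the bound.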
Once we've learned a stack of Boltzmann machines and combined them into a deep belief net, we can then fine-tune the whole composite model using a variation of the wake-sleep algorithm. So we first learn many layers of features by stacking up RBMs, and then we fine-tune both the bottom-up recognition weights and the top-down generative weights to get a better generative model. We can do this in three stages.

First, we do a stochastic bottom-up pass, and we adjust the top-down generative weights of the lower layers to be good at reconstructing the feature activities in the layer below. That's just as in the standard wake-sleep algorithm.

Then, in the top-level RBM, we go backwards and forwards a few times, sampling the hiddens of that RBM, then the visibles, then the hiddens again, and so on, just like the learning algorithm for RBMs. Having done a few iterations of that, we do contrastive divergence learning: we update the weights of the RBM using the difference between the correlations when activity first arrived at that RBM and the correlations after a few iterations in that RBM.

Then, in the third stage, we take the visible units of that top-level RBM, which are its lower layer of units, and starting there we do a top-down stochastic pass using the directed lower connections, which form a sigmoid belief net. Having generated some data from that sigmoid belief net, we adjust the bottom-up recognition weights to be good at reconstructing the feature activities in the layer above. That's just the sleep phase of the wake-sleep algorithm.

The difference from the standard wake-sleep algorithm is that the top-level RBM acts as a much better prior over the top layer than just a layer of units that are assumed to be independent, which is what you get in a sigmoid belief net. Also, rather than generating data by sampling from the prior, we look at a training case, go up to the top-level RBM, and just run a few iterations before we generate data.
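Here is a minimal single-example sketch of those three stages, assuming a small deep belief net with just one directed layer (v to h1) below a top-level RBM between h1 and h2; the function name, the in-place updates, and the learning rate and number of Gibbs steps are illustrative assumptions, and the real model has more layers.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def contrastive_wake_sleep_step(v, G1, R1, W_top, b_v, b_h1, b_h2,
                                lr=0.01, cd_steps=3, rng=None):
    # G1: top-down generative weights (h1 -> v), R1: bottom-up recognition
    # weights (v -> h1), W_top: symmetric weights of the top-level RBM (h1 -- h2).
    rng = rng or np.random.default_rng()

    # Stage 1: stochastic bottom-up (wake) pass; adjust the generative weights
    # to be good at reconstructing the layer below.
    h1 = sample(sigmoid(v @ R1 + b_h1), rng)
    v_recon = sigmoid(h1 @ G1 + b_v)
    G1 += lr * np.outer(h1, v - v_recon)
    b_v += lr * (v - v_recon)

    # Stage 2: a few alternating Gibbs steps in the top-level RBM, then a
    # contrastive-divergence update from the difference of correlations.
    h2_0 = sample(sigmoid(h1 @ W_top + b_h2), rng)
    h1_k, h2_k = h1, h2_0
    for _ in range(cd_steps):
        h1_k = sample(sigmoid(h2_k @ W_top.T + b_h1), rng)
        h2_k = sample(sigmoid(h1_k @ W_top + b_h2), rng)
    W_top += lr * (np.outer(h1, h2_0) - np.outer(h1_k, h2_k))

    # Stage 3: top-down (sleep) pass starting from the top RBM's lower layer;
    # adjust the recognition weights to reconstruct the layer above.
    v_gen = sample(sigmoid(h1_k @ G1 + b_v), rng)
    h1_recon = sigmoid(v_gen @ R1 + b_h1)
    R1 += lr * np.outer(v_gen, h1_k - h1_recon)
    b_h1 += lr * (h1_k - h1_recon)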
So now let's look at an example where we first learn some RBMs by stacking them up, then do contrastive wake-sleep to fine-tune the result, and then look at how good it is as a generative model and also at recognizing things.

First of all, we use 500 binary hidden units to learn to model all ten digit classes in images of 28 by 28 pixels. We learn that RBM without knowing what the labels are, so it's unsupervised learning. We then take the patterns of activity that those 500 hidden units have when they're looking at data, treat those patterns of activity as data, and learn another RBM that also has 500 hidden units. Those two hidden layers are learned without knowing what the labels are. Once we've done that, we actually tell it the labels: we add a big top layer and we give it the ten labels. You can think of this as concatenating those ten labels with the 500 units that represent features, except that the ten label units are really one softmax unit. We then train that top-level RBM to model the concatenation of the softmax unit for the ten labels with the 500 feature activities produced by the two layers below.

Once we've trained the top-level RBM, we can fine-tune the whole system using contrastive wake-sleep, and then we have a very good generative model. That's the model I showed you in the introductory video. So if you go back and find the introduction video for this course, you'll see what happens when we run that model: you'll see how good it is at recognition, and you'll also see that it's very good at generation. In that introductory video, I promised you that I would eventually explain how it worked, and I think you've now seen enough to know what's going on when this model is learned.
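For concreteness, here is a minimal sketch of how the training vectors for that top-level RBM can be assembled, with the ten label units behaving as a single softmax (one-hot) group; the function name and the use of binary feature samples rather than probabilities are illustrative assumptions.

import numpy as np

def top_rbm_input(second_layer_features, label, n_classes=10):
    # Concatenate a one-hot (softmax) label with the 500 feature activities
    # from the second hidden layer to form one training vector for the top-level RBM.
    one_hot = np.zeros(n_classes)
    one_hot[label] = 1.0                 # exactly one of the ten label units is on
    return np.concatenate([one_hot, second_layer_features])

# Example: a digit "3" whose image produced these (hypothetical) 500 feature activities.
rng = np.random.default_rng(0)
features = (rng.random(500) < 0.5).astype(float)
v_top = top_rbm_input(features, label=3)   # a 510-dimensional visible vector for the top RBM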