In this video, I'm going to explain how a Boltzmann machine models a set of binary data vectors. I'm going to start by explaining why we might want to model a set of binary data vectors, and what we could do with such a model if we had it. And then I'm going to show how the probabilities assigned to binary data vectors are determined by the weights in a Boltzmann machine.

Stochastic Hopfield nets with hidden units, which we also call Boltzmann machines, are good at modelling binary data. So, given a set of binary training vectors, they can use the hidden units to fit a model that assigns a probability to every possible binary vector. There are several reasons why you might like to be able to do that. If, for example, you had several distributions of binary vectors, you might like to look at a new binary vector and decide which distribution it came from. So, you might have different kinds of documents, and you might represent a document by a number of binary features, each of which says whether there is more than zero occurrences of a particular word in that document. For different kinds of documents, you would expect different frequencies for the different words, and maybe you'll see different correlations between words. And so you could use a set of hidden units to model the distribution for each kind of document. And then you could assign a test document to the appropriate class, by seeing which class of document is most likely to have produced that binary vector.

You could also use Boltzmann machines for monitoring complex systems to detect unusual behavior. Suppose, for example, that you have a nuclear power station, and all of the dials are binary. So you get a whole bunch of binary numbers that tell you something about the state of the power station. What you'd like to do is notice that it's in an unusual state, a state that's not like states you've seen before. And you don't want to use supervised learning for that, because you really don't want to have any examples of states that cause it to blow up. You'd rather be able to detect that it's going into such a state without ever having seen such a state before. And you could do that by building a model of the normal states and noticing that this state is different from the normal states.

If you have models of several different distributions,
you can compute the posterior probability that a particular distribution produced the observed data by using Bayes' theorem. So, given the observed data, the probability that it came from model i, under the assumption that it came from one of your models, is the probability that model i would have produced that data, divided by the same quantity summed over all the models.

Now I want to talk about two ways of producing models of data, in particular binary vectors. The most natural way to think about generating a binary vector is to first generate the states of some latent variables, and then use the latent variables to generate the binary vector. So in a causal model, we use two sequential steps. These are the latent variables, or hidden units, and we first pick the states of the latent variables from their prior distribution. Often in a causal model these will be independent in the prior, so their probability of turning on, if they were binary latent variables, would just depend on some bias that each one of them has. Then, once we've picked a state for those, we use them to generate the states of the visible units via the weighted connections of the model. So this is a kind of neural network causal generative model. It uses logistic units, and it uses biases for the hidden units and weights on the connections between hidden and visible units to assign a probability to every possible visible vector. The probability of generating a particular vector v is just the sum, over all possible hidden states, of the probability of generating that hidden state times the probability of generating v given that you've already generated that hidden state. So that's a causal model; factor analysis, for example, is a causal model using continuous variables. And it's probably the most natural way to think about generating data. In fact, some people, when they say generative model, mean a causal model like this.
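As a minimal written-out sketch of the two formulas just described (assuming, as the wording above suggests, equal prior probabilities over the models):

```latex
% Posterior probability that model i produced the observed data,
% under the assumption that it came from one of the models (equal priors assumed):
p(\mathrm{Model}\ i \mid \mathrm{data}) =
  \frac{p(\mathrm{data} \mid \mathrm{Model}\ i)}
       {\sum_{j} p(\mathrm{data} \mid \mathrm{Model}\ j)}

% Probability a causal generative model assigns to a visible vector v:
% first pick a hidden configuration h from its prior, then generate v from h.
p(v) = \sum_{h} p(h)\, p(v \mid h)
```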
But a Boltzmann machine is a completely different kind of model. It's an energy-based model, and in this kind of model you don't generate data causally. It's not a causal generative model. Instead, everything is defined in terms of the energies of joint configurations of visible and hidden units. There are two ways of relating the energy of a joint configuration to its probability. You can simply define the probability of a joint configuration of the visible and hidden variables to be proportional to e to the negative energy of that joint configuration. Or you can define it procedurally, by saying we're going to define the probability as the probability of finding the network in that state after we've been updating all the stochastic binary units for long enough that we've reached thermal equilibrium. The good news is that those two definitions agree.

The energy of a joint configuration of the visible and hidden units has five terms in it. I've written the negative energy to save having to put lots of minus signs. So the negative energy of the joint configuration (v, h), that's with vector v on the visible units and h on the hidden units, has bias terms, where v_i is the binary state of the ith unit in vector v, and b_k is the bias of the kth unit, in this case a hidden unit. So those are the first two terms. Then there are the visible-visible interactions, and to avoid counting each of those interactions twice, we just say we're going to use indices i and j and make sure that i is always less than j. That avoids counting the interaction of something with itself, and also avoids counting pairs twice, so we don't have to put a half in front. Then there are the visible-hidden interactions, where w_ik is a weight on a visible-hidden connection. And then there are the hidden-hidden interactions.

The way we use the energies to define probabilities is that the probability of a joint configuration over v and h is proportional to e to the minus E(v, h). To make that an equality, we need to normalize the right-hand side by the sum over all possible configurations of the visible and hidden units, and that's what the divisor is. That's often called the partition function; that's what physicists call it. And notice it has exponentially many terms. To get the probability of a configuration of the visible units alone, we have to sum over all possible configurations of the hidden units. So P(v) is the sum, over all possible h, of e to the minus the energy you get with that h, normalized by the partition function.
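Written out, the energy and the probabilities just described look like this. This is a reconstruction from the spoken description; the index conventions (i, j for visible units, k, l for hidden units) are assumptions.

```latex
% Negative energy of a joint configuration (v, h): two bias terms, then
% visible-visible, visible-hidden, and hidden-hidden interaction terms.
-E(v, h) = \sum_{i} v_i b_i + \sum_{k} h_k b_k
         + \sum_{i<j} v_i v_j w_{ij}
         + \sum_{i,k} v_i h_k w_{ik}
         + \sum_{k<l} h_k h_l w_{kl}

% Probability of a joint configuration, normalized by the partition function:
p(v, h) = \frac{e^{-E(v, h)}}{\sum_{u, g} e^{-E(u, g)}}

% Probability of a visible configuration alone: sum over the hidden configurations.
p(v) = \frac{\sum_{h} e^{-E(v, h)}}{\sum_{u, g} e^{-E(u, g)}}
```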
I want to give you an example of how we compute the probabilities of the different visible vectors, because that'll give you a good feel for what's involved. It's all very well to see the equations, but I find that I understand it much better when I've worked through the computation. So let's take a network with two hidden units and two visible units. We'll ignore biases, so we've just got three weights here. To keep things simple, I'm not going to connect visible units to each other.

The first thing we do is write down all possible states of the visible units. I've put them in different colors, and I'm going to write each state four times, because for each state of the visible units there are four possible states of the hidden units that could go with it. So that gives us sixteen possible joint configurations. Now, for each of those joint configurations, we compute its negative energy, minus E. So if you look at the first line, where all of the units are on, the negative energy will be +2 - 1 + 1, which is +2. And we do this for all sixteen possible joint configurations. We then take the negative energies and we exponentiate them, and that gives us unnormalized probabilities. So these are the unnormalized probabilities of the configurations; their probabilities are proportional to these. If we add all those up we get 39.7, and then we divide everything by 39.7 to get the probabilities of the joint configurations. There they all are. Now, if we want the probability of a particular visible configuration, we have to sum over all the hidden configurations that could go with it, and so we add up the numbers in each block. And now we've computed the probability of each possible visible vector in a Boltzmann machine that has these three weights in it.
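Here is a short sketch of that enumeration in code. The lecture doesn't say explicitly which connection each of the three weights sits on; the wiring below (v1–h1 = +2, h1–h2 = −1, v2–h2 = +1) is an assumption, but it does reproduce the 39.7 normalizer mentioned above.

```python
import itertools
import numpy as np

# Assumed wiring for the toy example: two visible units (v1, v2),
# two hidden units (h1, h2), no biases, three weights.
W_V1_H1 = 2.0
W_H1_H2 = -1.0
W_V2_H2 = 1.0

def neg_energy(v1, v2, h1, h2):
    # -E(v, h): one term per connection, weight times the two binary states.
    return W_V1_H1 * v1 * h1 + W_H1_H2 * h1 * h2 + W_V2_H2 * v2 * h2

# All 16 joint configurations, their unnormalized probabilities exp(-E),
# and the partition function Z (the sum of the unnormalized probabilities).
configs = list(itertools.product([0, 1], repeat=4))        # (v1, v2, h1, h2)
unnormalized = {c: np.exp(neg_energy(*c)) for c in configs}
Z = sum(unnormalized.values())                              # about 39.7
joint = {c: p / Z for c, p in unnormalized.items()}

# Probability of each visible vector: sum the joint probabilities over the
# four hidden configurations that could go with it.
print(f"Z = {Z:.1f}")
for v1, v2 in itertools.product([0, 1], repeat=2):
    p_v = sum(joint[(v1, v2, h1, h2)]
              for h1, h2 in itertools.product([0, 1], repeat=2))
    print(f"p(v1={v1}, v2={v2}) = {p_v:.3f}")
```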
Now let's ask how we get a sample from the model when the network's bigger than that. Obviously, in the network we just computed we can figure out the probability of everything, because it's small. But when the network's big, we can't do these exponentially large computations. So if there are more than a few hidden units, we can't actually compute that partition function; there are too many terms in it. But we can use Markov chain Monte Carlo to get samples from the model, by starting from a random global configuration and then picking units at random and updating them stochastically based on their energy gaps, those energy gaps being determined by the states of all the other units in the network. If we keep doing that until the Markov chain reaches its stationary distribution, then we have a sample from the model. And the probability of that sample is related to its energy by the Boltzmann distribution: that is, the probability of the sample is proportional to e to the minus the energy.

What about getting a sample from the posterior distribution over hidden configurations, given a data vector? It turns out we're going to need that for learning. The number of possible hidden configurations is again exponential, so again we use Markov chain Monte Carlo. And it's just the same as getting a sample from the model, except that we keep the visible units clamped to the data vector we're interested in, so we only update the hidden units. The reason we need to get samples from the posterior distribution, given a data vector, is that we might want to know a good explanation for the observed data, and we might want to base our actions on that good explanation. But we also need it for learning.
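As a rough sketch of what that sampling procedure could look like in code (the function signature, the clamping interface, and the sign convention for the energy gap are assumptions, not something given in the lecture):

```python
import numpy as np

def sample_boltzmann(weights, biases, n_steps, clamped=None, rng=None):
    """Markov chain Monte Carlo (Gibbs) sampling sketch for a Boltzmann machine.

    weights: symmetric (n, n) array of connection weights with a zero diagonal.
    biases:  length-n array of unit biases.
    clamped: optional {unit_index: 0 or 1} dict that holds the visible units
             fixed to a data vector, for sampling the posterior over hidden units.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(biases)
    state = rng.integers(0, 2, size=n)            # random global configuration
    if clamped:
        for i, v in clamped.items():
            state[i] = v
    for _ in range(n_steps):
        i = rng.integers(n)                       # pick a unit at random
        if clamped and i in clamped:
            continue                              # clamped units are never updated
        # Energy gap for unit i: how much the energy drops if the unit turns on,
        # determined by the states of all the other units.
        energy_gap = biases[i] + weights[i] @ state
        p_on = 1.0 / (1.0 + np.exp(-energy_gap))  # logistic of the energy gap
        state[i] = int(rng.random() < p_on)
    # After enough updates the chain is close to its stationary distribution,
    # and `state` is approximately a sample from the model (or from the
    # posterior over the hidden units, if the visible units were clamped).
    return state
```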