In this video, I'll talk about why it's difficult to learn sigmoid belief nets. And then in the following two videos, I'll describe two different methods we discovered that allow us to do the learning.

The good news about learning in sigmoid belief nets is that, unlike Boltzmann machines, we don't need two different phases. We just need what in a Boltzmann machine would be the positive phase. That's because sigmoid belief nets are what are called locally normalized models, so we don't have to deal with a partition function or its derivatives.

Another piece of good news about sigmoid belief nets is that if we could get unbiased samples from the posterior distribution over the hidden units, given the data vector, then learning would be easy. That is, we could follow the gradient specified by maximum likelihood learning in a mini-batch stochastic kind of way. The problem is that it's hard to get unbiased samples from the posterior distribution over the hidden units. This is largely due to a phenomenon that Judea Pearl calls explaining away, and I'll explain explaining away in this video because it's important to understand it.

Now, I'm going to talk about why it's difficult to learn sigmoid belief nets. As we've seen, it's easy to generate an unbiased sample once you've done the learning. That is, once we've decided on the weights in the network, we can easily see the kinds of things the network believes in by generating samples from the model. This is done top-down, one layer at a time, and it's easy because it's a causal model.

However, even if we know the weights, it's hard to infer the posterior distribution over hidden causes when we observe the visible effects. The reason for this is that the number of possible patterns of hidden causes is exponential in the number of hidden nodes. It's hard even to get a sample from the posterior, which is what we need if we're going to do stochastic gradient descent.

So, given this difficulty in sampling from the posterior, it's hard to see how we can learn sigmoid belief nets with millions of parameters, which is what we'd like to do. This is a very different regime from the one normally used with graphical models. There, people have interpretable models and they're trying to learn dozens or maybe hundreds of parameters; they're not typically trying to learn millions of parameters.
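To make the top-down generation concrete, here is a minimal sketch of ancestral sampling from a sigmoid belief net. This isn't from the lecture: the layer layout, the function names, and the convention that weights[l] connects a layer to the one below it are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ancestral_sample(weights, biases, rng):
    """Generate one unbiased sample from a sigmoid belief net, top-down.

    biases[0] are the biases of the top layer (which has no parents);
    weights[l] and biases[l + 1] define each layer given the layer above.
    Because the model is causal (directed), sampling is a single top-down
    pass, one layer at a time.
    """
    # The top layer depends only on its own biases.
    p = sigmoid(biases[0])
    states = [(rng.random(p.shape) < p).astype(float)]
    # Each lower layer is sampled given the already-sampled layer above.
    for W, b in zip(weights, biases[1:]):
        p = sigmoid(states[-1] @ W + b)
        states.append((rng.random(p.shape) < p).astype(float))
    return states  # states[-1] is the generated data vector at the leaves

# Example: a net with 3 top units, 4 middle units, 6 visible units.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 6))]
biases = [np.zeros(3), np.zeros(4), np.zeros(6)]
sample = ancestral_sample(weights, biases, rng)
```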
Now, before I go into ways in which we can try to get samples from the posterior distribution, I just want to tell you what the learning rule would be if we could get those samples. If we can get an unbiased sample from the posterior distribution over hidden states, given the observed data, then learning is easy.

So here's part of a sigmoid belief net, and we're going to suppose that for every node we have a binary value. For node j, that binary value is s_j, and the vector of those binary values is a global configuration, which is the sample from the posterior distribution.

In order to do maximum likelihood learning, all we have to do is maximize the log probability that the inferred binary state of unit i would be generated from the inferred binary states of its parents. So the learning rule is local and simple. The probability p_i that the parents of i would turn i on is given by a logistic function applied to a weighted sum of the binary states of the parents. What we need to do is make that probability similar to the actually observed binary value of i. Although I'm not going to derive it here, the maximum likelihood learning rule for the weight w_ji is simply to change it in proportion to the state of j times the difference between the binary state of i and the probability that the binary states of i's parents would turn it on: delta w_ji is proportional to s_j (s_i - p_i).

To summarize: if we have an assignment of binary states to all the hidden nodes, then it's easy to do maximum likelihood learning in our typical stochastic way, where we sample from the posterior, update the weights based on that sample, and average that update over a mini-batch of samples.
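Here is a minimal sketch of that update in Python, for one layer of parents and children given a posterior sample; the function name and learning rate are assumptions for illustration.

```python
import numpy as np

def sgd_step_given_posterior_sample(W, b, s_parents, s_children, lr=0.1):
    """One maximum-likelihood update for a sigmoid belief net layer,
    given binary states sampled from the posterior.

    W[j, i] is the weight from parent j to child i. The rule is
    delta w_ji = lr * s_j * (s_i - p_i), where p_i is the probability
    that i's parents would turn i on.
    """
    p = 1.0 / (1.0 + np.exp(-(s_parents @ W + b)))  # logistic of parent input
    W += lr * np.outer(s_parents, s_children - p)   # local, simple update
    b += lr * (s_children - p)                      # same rule for the biases
    return W, b
```

In practice this update would be averaged over a mini-batch of posterior samples, as described above.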
So, let's go back to the issue of why it's hard to sample from the posterior. The reason it's hard to get an unbiased sample from the posterior over the hidden nodes, given an observed data vector at the leaf nodes, is a phenomenon called explaining away. If you look at this little sigmoid belief net here, it has two hidden causes and one observed effect. And if you look at the biases, you'll see that the observed effect, the house jumping, is very unlikely to happen unless one of those causes is true. But if one of those causes has happened, the twenty cancels the minus twenty, and the house will jump with a probability of a half. Each of the causes is itself rather unlikely, but not nearly as unlikely as the house spontaneously jumping.

So if you see the house jump, one plausible explanation is that a truck hit the house. A different plausible explanation is that there was an earthquake. Each of those has a probability of about e to the minus ten, whereas the house jumping spontaneously has a probability of about e to the minus twenty.

However, if you assume both hidden causes, that has a probability of e to the minus twenty, so that's extremely unlikely, even if the house did jump. So assuming there was an earthquake reduces the probability that the house jumped because a truck hit it, and we get an anti-correlation between the two hidden causes when we've observed the house jumping. Notice that in the model itself, in the prior for the model, these two hidden causes are quite independent.

So if the house jumps, it's basically even chances that it was because of the truck or because of the earthquake. The posterior actually looks something like this. There are four possible patterns of hidden causes, given that the house jumped. Two of them are extremely unlikely: namely, that the truck hit the house and there was an earthquake, or that neither of those things happened. The other two combinations are equally probable, and you'll notice they form an exclusive or: we have two likely patterns of causes which are just the opposites of each other. That's explaining away.
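To check the explaining-away arithmetic numerically, here is a short enumeration of the exact posterior for the truck/earthquake net. The specific biases (minus ten on each cause, minus twenty on the house) and weights (plus twenty from each cause) are the values implied by the numbers above, but treat them as an assumption about the slide.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

b_truck, b_quake, b_house = -10.0, -10.0, -20.0  # biases (assumed values)
w_truck, w_quake = 20.0, 20.0                    # cause -> house weights

def joint(t, e):
    """p(truck = t, earthquake = e, house jumps = 1)."""
    p_t = sigmoid(b_truck) if t else 1.0 - sigmoid(b_truck)
    p_e = sigmoid(b_quake) if e else 1.0 - sigmoid(b_quake)
    p_h = sigmoid(b_house + t * w_truck + e * w_quake)  # house did jump
    return p_t * p_e * p_h

joints = {te: joint(*te) for te in itertools.product((0, 1), repeat=2)}
z = sum(joints.values())
for (t, e), p in joints.items():
    print(f"truck={t} quake={e}: posterior {p / z:.6f}")
# Nearly all the mass lands on (1,0) and (0,1): the exclusive-or pattern.
```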
Now that we've understood explaining away, let's go back to the issue of learning a deep sigmoid belief net. We're going to have multiple layers of hidden variables, and they're going to give rise to some data in our causal model. We want to learn the weights, W, between the first layer of hidden variables and the data. So let's see what it takes to learn W.

First of all, the posterior distribution over the first layer of hidden variables is not going to be factorial: they're not independent in the posterior, and that's because of explaining away. So even if we just had that one layer of hidden variables, once we've seen the data, they wouldn't be independent of one another. But because we have higher layers of hidden variables, they're not even independent in the prior. The hidden variables in the layers above create the prior, and that prior itself will cause correlations between the hidden variables in the first layer.

To learn W, we need to know the posterior in the first hidden layer, or at least an approximation to it. And even if we're only approximating it, we need to know all of the weights in the higher layers in order to compute that prior term. In fact, it's even worse than that, because to compute that prior term we need to integrate out all the hidden variables in the higher layers. That is, we need to consider all possible patterns of activity in those higher layers and combine them all to compute the prior that the higher levels create for the first hidden layer. Computing that prior is a very complicated thing.

So these three problems suggest that it's going to be extremely difficult to learn those weights W. In particular, we're not going to be able to learn them without doing a lot of work in the higher layers to compute the prior.

So now we're going to consider some methods for learning deep belief nets. The first one is the Monte Carlo method used by Radford Neal. That Monte Carlo method basically does all the work. That is, if we go back to the previous slide, it considers patterns of activity over all of the hidden variables, and it runs a Markov chain that takes a long time to settle down, given the data vector. Once it has settled down to thermal equilibrium, you get a sample from the posterior, but it's a lot of work. So in large, deep belief nets, this method is pretty slow.
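The lecture doesn't spell out Neal's procedure, but the basic ingredient of such a Monte Carlo method is a Markov chain, for example Gibbs sampling over the hidden states with the data vector clamped. Here is an illustrative sketch for the simplest case of a single hidden layer; the names and the single-layer restriction are my simplifications, not Neal's method as published.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_lik(v, a):
    # log p(v | total input a) for factorial logistic visible units:
    # log sigma(a) = -logaddexp(0, -a); log(1 - sigma(a)) = -logaddexp(0, a)
    return -(v * np.logaddexp(0, -a) + (1 - v) * np.logaddexp(0, a)).sum()

def gibbs_sweep(h, v, W, b_h, b_v, rng):
    """One Gibbs sweep over the hidden units of a one-hidden-layer
    sigmoid belief net, with the visible vector v clamped.

    W[j, i] connects hidden unit j to visible unit i. Each h_j is
    resampled from its exact conditional given v and the current states
    of the other hidden units; repeated sweeps form a Markov chain whose
    equilibrium distribution is the true posterior p(h | v).
    """
    for j in range(len(h)):
        a0 = b_v + h @ W - h[j] * W[j]  # input to visibles with h_j = 0
        a1 = a0 + W[j]                  # input to visibles with h_j = 1
        log_odds = b_h[j] + log_lik(v, a1) - log_lik(v, a0)
        h[j] = float(rng.random() < sigmoid(log_odds))
    return h
```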
In the 1990s, people developed much faster methods for learning deep belief nets, which we call variational methods. In fact, this is where variational methods came from, at least in the artificial intelligence community. Variational methods give up on getting unbiased samples from the posterior, and they content themselves with getting approximate samples, that is, samples from some other distribution that approximates the posterior.

Now, as we saw before, if we have samples from the posterior, maximum likelihood learning is simple. If we have samples from some other distribution, we could still use the maximum likelihood learning rule, but it's not clear what will happen. On the face of it, crazy things might happen if we're using the wrong distribution to get our samples; there doesn't seem to be any guarantee that things will improve. In fact, there is a guarantee that something will improve. It's not the log probability that the model would generate the data, but it is related to that: it's a lower bound on that log probability. And by pushing up the lower bound, we can usually push up the log probability.
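The lecture doesn't write the bound out, but the standard variational lower bound it refers to, for any approximating distribution q(h) over hidden configurations h given visible vector v, is:

```latex
\log p(v)
= \underbrace{\sum_{h} q(h)\,\log\frac{p(v,h)}{q(h)}}_{\text{lower bound}}
+ \underbrace{\mathrm{KL}\big(q(h)\,\|\,p(h \mid v)\big)}_{\ge\, 0}
\;\ge\; \sum_{h} q(h)\,\log\frac{p(v,h)}{q(h)}
```

Because the KL term is non-negative, increasing the first sum pushes up a lower bound on log p(v), and the bound is tight exactly when q matches the true posterior.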