1
00:00:00,000 --> 00:00:05,320
In this video, I will introduce Restricted
Boltzmann Machines.

2
00:00:05,660 --> 00:00:11,245
These have a much simplified architecture
in which there are no connections between

3
00:00:11,245 --> 00:00:14,519
hidden units.
This makes it very easy to get the

4
00:00:14,519 --> 00:00:19,750
equilibrium distribution of the hidden
units if the visible units are given.

5
00:00:19,750 --> 00:00:23,379
That is, once you've clamped the
datavector on the visible units,

6
00:00:23,379 --> 00:00:28,125
The equilibrium distribution of the hidden
units can be computed exactly in one step

7
00:00:28,125 --> 00:00:32,535
because they're all independent of one
another, given the states of the visible

8
00:00:32,535 --> 00:00:35,104
units.
The proper Boltzmann machine learning

9
00:00:35,104 --> 00:00:38,510
algorithm is still slow for a restricted
Boltzmann machine.

10
00:00:38,510 --> 00:00:43,050
But in 1998, I discovered a very
surprising shortcut that leads to the

11
00:00:43,050 --> 00:00:46,923
first efficient learning algorithm for
Boltzmann machines.

12
00:00:46,923 --> 00:00:52,532
Even though this algorithm has theoretical
problems, it works quite well in practice.

13
00:00:52,532 --> 00:00:56,940
And it led to a revival of interest in
Boltzmann machine learning.

14
00:00:57,220 --> 00:01:02,298
In a restricted Boltzmann machine, we
restrict the connectivity of the network

15
00:01:02,298 --> 00:01:05,600
in order to make both inference and
learning easier.

16
00:01:06,000 --> 00:01:11,403
So, it only has one layer of hidden units
and there's no connections between the

17
00:01:11,403 --> 00:01:14,983
hidden units.
There's also no connections between the

18
00:01:14,983 --> 00:01:18,697
visible units.
So, the architecture looks like that, it's

19
00:01:18,697 --> 00:01:22,007
what computer scientists call a bipartite
graph.

20
00:01:22,007 --> 00:01:26,060
There's two pieces, and within each piece,
there's no connections.

21
00:01:27,560 --> 00:01:33,401
The good thing about an RBM is that if you
clamp a datavector in the visible units,

22
00:01:33,401 --> 00:01:36,600
you can reach thermal equilibrium in one
step.

23
00:01:37,260 --> 00:01:43,382
That means with a datavector clamped, we
can quickly compute the expected value of

24
00:01:43,382 --> 00:01:48,472
vihj because we can compute the exact
probability with each j will turn on, and

25
00:01:48,472 --> 00:01:53,120
that is independent of all the other units
in the hidden layer.

26
00:01:55,420 --> 00:02:00,234
The probability that j will turn on is
just the logistic function of the input

27
00:02:00,234 --> 00:02:05,049
that it gets from the visible units and
quite independent of what other hidden

28
00:02:05,049 --> 00:02:08,584
units are doing.
So, we can compute that probability all in

29
00:02:08,584 --> 00:02:15,720
parallel and that's a tremendous win.
If you want to make a good model of a set

30
00:02:15,720 --> 00:02:20,443
of binary vectors, then the right
algorithm to use for a restricted

31
00:02:20,443 --> 00:02:26,083
Boltzmann machine is one introduced by
Tieleman in 2008 that's based on earlier

32
00:02:26,083 --> 00:02:31,924
work by Neal.
In the positive phase, you clamp the

33
00:02:31,924 --> 00:02:36,784
datavector on the visible units.
You then compute the exact value of the

34
00:02:36,784 --> 00:02:41,568
expectation vihj for all pairs of
invisible in the hidden unit.

35
00:02:41,568 --> 00:02:46,580
And you could do that cuz vi is fixed, and
you can compute vj exactly.

36
00:02:47,660 --> 00:02:53,170
And then, for every connected pair of
units, you average the expected value of

37
00:02:53,170 --> 00:02:56,240
vihj over all the data vectors in the mini
batch.

38
00:02:58,280 --> 00:03:04,238
For the negative phase, you keep a set of
fantasy particles that is global

39
00:03:04,238 --> 00:03:09,090
configurations.
And then, you update each fantasy particle

40
00:03:09,090 --> 00:03:12,365
a few times by using alternating parallel
updates.

41
00:03:12,365 --> 00:03:17,474
So, after each weight update, you update
the fantasy particles a little bit and

42
00:03:17,474 --> 00:03:20,880
that should bring them back to close to
equilibrium.

43
00:03:21,820 --> 00:03:26,736
And then, for every connected pair of
units, you average vihj over all the

44
00:03:26,736 --> 00:03:30,777
fantasy particles, and that gives you your
negative statistics.

45
00:03:30,777 --> 00:03:36,097
This algorithm actually works very well,
and allows RBMs to build good density

46
00:03:36,097 --> 00:03:43,618
models or sets of binary vectors.
Now, I am going to go on to our learning

47
00:03:43,618 --> 00:03:48,520
algorithm that is not as good at building
density model but is much faster.

48
00:03:48,860 --> 00:03:53,901
So, I'm going to start with a picture of
an inefficient learning algorithm for

49
00:03:53,901 --> 00:04:00,163
restrictive Boltzmann machines.
We're going to start by clamping a

50
00:04:00,163 --> 00:04:04,954
datavector on the visible units, and we're
going to call that time t0.

51
00:04:04,954 --> 00:04:10,532
So, we're going to use times now, not to
denote weight updates, but to denote steps

52
00:04:10,532 --> 00:04:15,304
in a Markov chain.
Given that visible vector, we now update

53
00:04:15,304 --> 00:04:19,353
the hidden units.
So, we choose binary states for the hidden

54
00:04:19,353 --> 00:04:24,911
units and we measure the expected value,
vihj, for all pairs of visible and binary

55
00:04:24,911 --> 00:04:29,440
units that are connected.
And I'll call that vihj zero to indicate

56
00:04:29,440 --> 00:04:34,518
that it's measured at time zero,
With the hidden units being determined by

57
00:04:34,518 --> 00:04:38,224
the visible units.
And, of course, we can update all the

58
00:04:38,224 --> 00:04:43,385
hidden units in parallel.
We then use the hidden vector to update

59
00:04:43,385 --> 00:04:48,544
all the visible units in parallel, and
again we update all the hidden units in

60
00:04:48,544 --> 00:04:52,009
parallel.
So, the visible vector t1 = one, we'll

61
00:04:52,009 --> 00:04:55,935
call a reconstruction, or a one-step
reconstruction,

62
00:04:55,935 --> 00:05:00,401
And we can keep going with the alternating
chain that way,

63
00:05:00,401 --> 00:05:03,865
Updating visible units, and then hidden
units,

64
00:05:03,865 --> 00:05:09,820
Each set being updated in parallel.
And after we've gone for a long time,

65
00:05:10,320 --> 00:05:15,832
We'll get to some state of the visible
units, or I'll call t infinity to indicate

66
00:05:15,832 --> 00:05:21,345
it needs to be a long time and the system
will be at thermal equilibrium, and now,

67
00:05:21,345 --> 00:05:26,654
we can measure the correlation of vi and
hj after the chains run for a long time

68
00:05:26,654 --> 00:05:32,497
and I'll call that vihj infinity.
And the visible state we have after a long

69
00:05:32,497 --> 00:05:37,221
time, I'll call it fantasy.
So now, the learning rule is simply, we

70
00:05:37,221 --> 00:05:42,725
change Wij by the learning rate times the
difference between vihj at time zero and

71
00:05:42,725 --> 00:05:46,818
vihj at time infinity.
And, of course, the problem with this

72
00:05:46,818 --> 00:05:52,322
algorithm is that we have to run this
chain for a long time before it reaches

73
00:05:52,322 --> 00:05:56,485
thermal equilibrium.
And if we don't run it for long enough,

74
00:05:56,485 --> 00:06:02,321
the learning may go wrong.
In fact, that last statement is very

75
00:06:02,321 --> 00:06:07,064
misleading.
It turns out that even if we only run the

76
00:06:07,064 --> 00:06:11,360
chain for a short time, the learning still
works.

77
00:06:13,280 --> 00:06:20,498
So, here's the very surprising shortcut.
You just run the chain up, down, and up

78
00:06:20,498 --> 00:06:23,550
again.
So, from the data, you generate a hidden

79
00:06:23,550 --> 00:06:26,329
state, from that.
You generate a reconstruction, and from

80
00:06:26,329 --> 00:06:31,567
that, you generate another hidden state.
And you may have a statistics once you've

81
00:06:31,567 --> 00:06:35,907
done that.
So, instead of using the statistics

82
00:06:35,907 --> 00:06:42,636
measured at equilibrium, we're using the
statistics measured after doing one full

83
00:06:42,636 --> 00:06:49,321
update of the Markov chain.
The learning rule is, and the same as

84
00:06:49,321 --> 00:06:54,190
before, except this much quicker to
compute, and this is clearly is not doing

85
00:06:54,190 --> 00:06:59,659
maximum likelihood learning because the
term we are using for negative statistics

86
00:07:00,259 --> 00:07:04,301
is wrong.
But the learning, nevertheless, works

87
00:07:04,301 --> 00:07:08,025
quite well.
Next week, we'll understand a bit more

88
00:07:08,025 --> 00:07:11,840
about why it works well.
But for now, we'll just see that it does.

89
00:07:14,820 --> 00:07:18,360
So, the obvious question is why does
actual cut work at all?

90
00:07:18,700 --> 00:07:24,439
And here's the reasoning.
If we start the chain at the data, the

91
00:07:24,439 --> 00:07:29,175
Markov chain will wander away from the
data and towards its equilibrium

92
00:07:29,175 --> 00:07:32,595
distribution.
That is towards things that is initial

93
00:07:32,595 --> 00:07:39,205
weights like more than the data.
We can see what direction it's wandering

94
00:07:39,205 --> 00:07:43,657
in after only a few steps.
And if we know the initial weights aren't

95
00:07:43,657 --> 00:07:47,066
very good, it's a waste of time to go all
the way to equilibrium.

96
00:07:47,066 --> 00:07:51,381
We know how to change them to stop it
wandering away from the data without going

97
00:07:51,381 --> 00:07:57,751
all the way to equilibrium.
All we need to do is lower the probability

98
00:07:57,751 --> 00:08:02,779
of the reconstructions of confabulations
as a psychologist would call them, it

99
00:08:02,779 --> 00:08:07,420
produces after one full step, and then,
raise the probability of the data.

100
00:08:08,300 --> 00:08:11,180
That will stop it wandering away from the
data.

101
00:08:12,340 --> 00:08:17,178
Once the data and the places it goes to
after one full step have the same

102
00:08:17,178 --> 00:08:24,120
distribution, then the learning will stop.
So, here's a picture of what's going on.

103
00:08:25,460 --> 00:08:29,880
Here's the energy surface in the space of
global configurations.

104
00:08:31,140 --> 00:08:36,589
Here's a data point on the energy surface,
and by data point, I mean, both the

105
00:08:36,589 --> 00:08:42,541
visible vector and the particular hidden
vector that we got by stochastic updating

106
00:08:42,541 --> 00:08:46,289
the hidden units.
So, that hidden vector is a function of

107
00:08:46,289 --> 00:08:50,967
what the data point is.
So, starting at that data point, we run

108
00:08:50,967 --> 00:08:56,397
the Markov chain for one full step to get
a new visible vector and the hidden vector

109
00:08:56,397 --> 00:08:59,974
that goes with it.
So, a reconstruction of the data point

110
00:08:59,974 --> 00:09:03,680
plus the hidden vector that goes with that
reconstruction.

111
00:09:05,780 --> 00:09:11,795
We then change the weights to pull the
energy down at the data point, and pull to

112
00:09:11,795 --> 00:09:17,032
the energy up the reconstruction.
And the effect of that would be to make

113
00:09:17,032 --> 00:09:20,667
the surface look like this.
And you'll notice we're beginning to

114
00:09:20,667 --> 00:09:26,025
construct an energy minimum at the data.
You'll also notice that far away from the

115
00:09:26,025 --> 00:09:29,260
data, things have stayed pretty much as
they were before.

116
00:09:32,000 --> 00:09:38,466
So, this shortcut of only doing one full
step to get the reconstruction fails for

117
00:09:38,466 --> 00:09:45,375
places that are far away from the data.
We need to worry about regions of the

118
00:09:45,375 --> 00:09:50,380
data-space that the model likes but which
are very far from any data point.

119
00:09:51,000 --> 00:09:56,614
These low energy holes cause the
normalization term to be big, and we can't

120
00:09:56,614 --> 00:10:02,086
sense them if we use the shortcut.
If we use persistent particles, where we

121
00:10:02,086 --> 00:10:07,234
remembered their states, and after each
update, we updated them a few more times,

122
00:10:07,234 --> 00:10:10,102
then they would eventually find these
holes.

123
00:10:10,102 --> 00:10:15,120
They'd move into the holes, and the
learning would cause the holes to fill up.

124
00:10:17,220 --> 00:10:22,464
A good compromise between speed and
correctness is to start with small weights

125
00:10:22,464 --> 00:10:27,310
and to use CD1, that is contrust
divergence with one full step to get the

126
00:10:27,310 --> 00:10:32,157
negative data.
Once the weights have grown a bit, the

127
00:10:32,157 --> 00:10:36,780
Markov chain is mixing more slowly, and
now we can use CD3.

128
00:10:37,480 --> 00:10:42,300
Once the weights have grown more, we can
use CD5, or nine, or ten.

129
00:10:42,900 --> 00:10:48,317
So, by increasing the number of steps as
the weights grow, we can keep the learning

130
00:10:48,317 --> 00:10:53,669
working reasonably well, even though the
mixing rate of the Markov chain is going