In this video I'm going to describe how to use an RBM to model real-valued data. The idea is that we make the visible units, instead of being binary stochastic units, be linear units with Gaussian noise. When we do this we get problems with learning, and it turns out a good solution to those problems is to then make the hidden units be rectified linear units. With linear Gaussian units for the visibles and rectified linear units for the hiddens, it's quite easy to learn a restricted Boltzmann machine that makes a good model of real-valued data.

We first used restricted Boltzmann machines on images of handwritten digits. For those images, intermediate intensities caused by a pixel being only partially inked can be modelled quite well by probabilities, that is, numbers between zero and one that are the probability of a logistic unit being on. So we treat partially inked pixels as having a probability of being inked. This is incorrect, but it works quite well.

However, it won't work for real images. In a real image the intensity of a pixel is almost always almost exactly the average of its neighbours. So it has a very high probability of being very close to that average and a very small probability of being a little further away. You can't achieve that with a logistic unit: mean-field logistic units are unable to represent things like "the intensity is very probably 69, but very unlikely to be 71 or 67". So we need some other kind of unit.

The obvious thing to use is a linear unit with Gaussian noise, so we model pixels as Gaussian variables. We can still use alternating Gibbs sampling to run the Markov chain required for contrastive divergence learning, but we need to use a much smaller learning rate, otherwise the learning tends to blow up.

The energy function looks like this. The first term on the right-hand side is a kind of parabolic containment function; it stops things blowing up. The term in that sum contributed by the i-th visible unit is parabolic in shape: a parabola with its minimum at the bias of the i-th unit, and as v_i departs from that value we add energy quadratically. So that term tries to keep the i-th visible unit close to b_i. The interaction term between the visible and the hidden units looks like this.
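For reference, here is a reconstruction of the energy function being described, written in the standard Gaussian-binary RBM form (visible units v_i with biases b_i and standard deviations sigma_i, hidden units h_j with biases b_j, weights w_ij). It is pieced together from the description above rather than copied from the slide:

```latex
E(\mathbf{v},\mathbf{h}) =
    \sum_{i \in \mathrm{vis}} \frac{(v_i - b_i)^2}{2\sigma_i^2}
  \;-\; \sum_{j \in \mathrm{hid}} b_j h_j
  \;-\; \sum_{i,j} \frac{v_i}{\sigma_i}\, h_j\, w_{ij}
```

The first sum is the parabolic containment term and the last sum is the visible-hidden interaction term discussed next.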
If you differentiate that interaction term with respect to v_i, you get a constant: the sum over all j of h_j w_ij, divided by sigma_i. So that term, with its constant gradient, looks like this. When you add together that top-down contribution to the energy, which is linear, and the parabolic containment function, you get a parabolic function, but with the mean shifted away from b_i. How much it's shifted depends on the slope of that blue line. So the effect of the hidden units is just to push the mean to one side.

It's easy to write down an energy function like this, and it's easy to take derivatives of it. But when we try learning with it, we often get problems. There were a lot of reports in the literature that people could not get these Gaussian-binary RBMs to work, and it is indeed extremely hard to learn tight variances for the visible units. It took us a long time to figure out why it's so hard to learn those visible variances. This picture helps.

Consider the effect that visible unit i has on hidden unit j. When visible unit i has a small standard deviation sigma_i, that has the effect of exaggerating the bottom-up weights, because we need to measure the activity of i in units of its standard deviation: when the standard deviation is small, we multiply the weight by a lot. If you look at the top-down effect of j on i, that is multiplied by sigma_i. So when the standard deviation of visible unit i is very small, the bottom-up effects get exaggerated and the top-down effects get attenuated. The result is a conflict: either the bottom-up effects are much too big or the top-down effects are much too small. The hidden units then tend to saturate and be firmly on or off all the time, and this messes up learning.

The solution is to have many more hidden units than visible units. That allows small weights between the visible and hidden units to have big top-down effects, because there are so many hidden units. But of course, we really need the number of active hidden units to change as that standard deviation sigma_i gets smaller. On the next slide we'll see how we can achieve that.
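To make the scaling conflict concrete, here is a minimal NumPy sketch of one contrastive-divergence step (CD-1) for a Gaussian-binary RBM; the array names, shapes, and the single Gibbs step are my own illustrative choices, not code from the course:

```python
import numpy as np

rng = np.random.default_rng(0)

n_vis, n_hid = 16, 64           # many more hidden units than visible units
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_vis = np.zeros(n_vis)         # visible biases b_i
b_hid = np.zeros(n_hid)         # hidden biases b_j
sigma = np.full(n_vis, 0.1)     # small visible standard deviations

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v_data, lr=1e-4):  # note the much smaller learning rate
    # Bottom-up: visible activity is measured in units of sigma_i, so a small
    # sigma_i exaggerates the effective bottom-up input v_i * w_ij / sigma_i.
    p_h_data = sigmoid(b_hid + (v_data / sigma) @ W)
    h_sample = (rng.random(n_hid) < p_h_data).astype(float)

    # Top-down: the reconstruction mean is b_i shifted by sigma_i * sum_j h_j w_ij,
    # so the same small sigma_i attenuates the top-down effect.
    v_recon = b_vis + sigma * (W @ h_sample) + sigma * rng.standard_normal(n_vis)
    p_h_recon = sigmoid(b_hid + (v_recon / sigma) @ W)

    # CD-1 weight update: <(v/sigma) h>_data - <(v/sigma) h>_reconstruction
    return W + lr * (np.outer(v_data / sigma, p_h_data)
                     - np.outer(v_recon / sigma, p_h_recon))

W = cd1_step(rng.standard_normal(n_vis))   # one update on one real-valued "image"
```

With sigma at 0.1, the bottom-up input is scaled up tenfold while the top-down shift of the reconstruction mean is scaled down tenfold, which is exactly the conflict described above.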
I'm going to introduce stepped sigmoid units. The idea is that we make many copies of each stochastic binary hidden unit. All the copies have the same weights and the same learned bias b, but in addition to that adaptive bias they each have a different fixed offset to the bias. The first copy has an offset of -0.5, the second has an offset of -1.5, the third has an offset of -2.5, and so on. If you have a whole family of sigmoid units like that, with the bias differing by one between neighbouring members of the family, the response curve looks like this: if the total input is very low, none of them are turned on; as it increases, the number that get turned on increases linearly. This means that as the standard deviation on the previous slide gets smaller, the number of copies of each hidden unit that get turned on gets bigger, and we achieve just the effect we wanted: we get more top-down effect to drive the visible units that have small standard deviations.

Now, it's quite expensive to use a big population of binary stochastic units with offset biases, because for each one of them we need to put the total input through the logistic function. But we can make some fast approximations which work just as well. The sum of the activities of a whole bunch of sigmoid units with offset biases, shown in that summation, is approximately equal to log(1 + e^x), and that in turn is approximately equal to the maximum of zero and x; we can add some noise to x if we want. So the first term in the equation looks like this, the second term looks like that, and you can see that the sum of all those sigmoids in the first term gives a curve like that. We can approximate that curve by a linear threshold unit that has a value of zero unless its input is above threshold, in which case its value increases linearly with its input.

Contrastive divergence learning works well for the sum of a bunch of stochastic logistic units with offset biases; in that case you get a noise variance that's equal to the logistic function of the output of that sum. Alternatively, we can use that green curve and use rectified linear units. They're much faster to compute, because you don't need to go through the logistic function many times, and contrastive divergence works just fine with those.
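Here is a small NumPy check, purely my own illustration, that the sum of logistic units with bias offsets of -0.5, -1.5, -2.5, ... is close to log(1 + e^x), which in turn is close to max(0, x):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)                 # total input to the unit

# Sum over many copies of the unit, each with a fixed bias offset -0.5, -1.5, -2.5, ...
offsets = -(np.arange(100) + 0.5)
stepped = sigmoid(x[:, None] + offsets).sum(axis=1)

softplus = np.log1p(np.exp(x))                 # log(1 + e^x)
relu = np.maximum(0.0, x)                      # max(0, x)

print(np.max(np.abs(stepped - softplus)))      # tiny: the two curves nearly coincide
print(np.max(np.abs(softplus - relu)))         # about log(2), the gap at x = 0
```

The second approximation is what lets us replace the whole population of copies with a single rectified linear unit.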
One nice property of rectified linear units is that if they have a bias of zero, they exhibit scale equivariance. This is a very nice property to have for images. What scale equivariance means is that if you take an image x and multiply all the pixel intensities by a scalar a, then the representation of ax in the rectified linear units is just a times the representation of x. In other words, when we scale up all the intensities in the image, we scale up the activities of all the hidden units, but all the ratios stay the same. Rectified linear units aren't fully linear, because if you add together two images, the representation you get is not the sum of the representations of the two images taken separately.

This property of scale equivariance is quite similar to the property of translational equivariance that convolutional nets have. If we ignore pooling for now, then in a convolutional net, if we shift an image and look at the representation, the representation of the shifted image is just a shifted version of the representation of the unshifted image. So in a convolutional net without pooling, translations of the input just flow through the layers of the net without really affecting anything; the representation at every layer is just translated.
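A tiny NumPy check of the scale-equivariance claim, using made-up random weights rather than anything from the course: with zero biases, rectifying a scaled input gives exactly the scaled representation, but rectifying a sum of inputs does not give the sum of the representations.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 64))     # weights from 64 "pixels" to 8 rectified linear units
relu = lambda z: np.maximum(0.0, z)  # zero bias, so the representation is relu(W @ image)

x = rng.standard_normal(64)          # an image
y = rng.standard_normal(64)          # another image
a = 3.7                              # a positive scale factor

# Scale equivariance: the representation of a*x is a times the representation of x.
print(np.allclose(relu(W @ (a * x)), a * relu(W @ x)))              # True

# Not fully linear: the representation of x + y is not the sum of the representations.
print(np.allclose(relu(W @ (x + y)), relu(W @ x) + relu(W @ y)))    # almost always False
```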