In this video I'm going to describe how 
to use an RBM to model real-valued data. 
The idea is that we make the visible 
units, instead of being binary stochastic 
units, be linear units with Gaussian 
noise. 
When we do this, we get problems with 
learning. 
And it turns out a good solution to those 
problems is to then make the hidden units 
be rectified linear units. 
With linear Gaussian units for the 
visible, and rectified linear units for 
the hiddens, it's quite easy to learn a 
restricted Boltzmann machine that makes a 
good model of real-valued data. 
We first used restricted Boltzmann 
machines with the images of handwritten 
digits. 
For those images, intermediate 
intensities caused by a pixel being only 
partially inked can be modelled quite 
well by probabilities, that is, numbers 
between zero and one that are actually 
the probability of a logistic unit being 
on. So we treat partially inked pixels as 
having a probability of being inked. 
This is incorrect but it works quite 
well. 
However it won't work for real images. 
In a real image the intensity of a pixel 
is almost always, almost exactly the 
average of its neighbors. 
So it's got a very high probability of 
being very close to that average and a 
very small probability of being a little 
further away. 
And you can't achieve that with a 
logistic unit. 
Mean field logistic units are unable to 
represent things like: the intensity is 
0.69, but very unlikely to be 0.71 or 
0.67. So 
we need some other kind of unit. 
The obvious thing to use is a linear unit 
with Gaussian noise. 
So we model pixels as Gaussian variables. 
We can still use alternating Gibbs 
sampling to run the Markov chain 
required for contrastive divergence 
learning. 
But we need to use a much smaller 
learning rate, otherwise it will tend to 
blow up. 
The equation looks like this. 
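The slide with the equation is not reproduced in this transcript. A form that is consistent with the description that follows (a quadratic term for each visible unit, plus an interaction term whose gradient with respect to each visible value is constant) is the usual energy of an RBM with Gaussian visible units and binary hidden units. The symbols are assumptions made here for illustration: b_i and b_j are the visible and hidden biases, w_ij the weights, and sigma_i the standard deviation of visible unit i.

```latex
E(\mathbf{v}, \mathbf{h})
  = \sum_{i \in \mathrm{vis}} \frac{(v_i - b_i)^2}{2\sigma_i^2}
  - \sum_{j \in \mathrm{hid}} b_j h_j
  - \sum_{i,j} \frac{v_i}{\sigma_i} \, h_j \, w_{ij}
```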
The first term on the right-hand side is 
a kind of parabolic containment function. 
It stops things blowing up. 
So the term in that sum contributed by 
the i-th visible unit is parabolic in 
shape. 
It looks like this. 
It's a parabola with its minimum at the 
bias of the i-th unit, b_i. 
And as the i-th unit departs from that 
value, we add energy quadratically. 
So that tries to keep the i-th visible 
unit close to b_i. 
The interaction term between the visible 
and the hidden units looks like this. 
And if you differentiate that with 
respect to v_i, you can see that you 
get a constant. 
It's minus the sum over all j of 
h_j w_ij, divided by sigma_i. 
So that term, with its constant gradient, 
looks like this. 
And when you add together that top-down 
contribution to the energy, which is 
linear, and the parabolic containment 
function, you get a parabolic function, 
but with the mean shifted away from b_i. 
And how much it's shifted depends on the 
slope of that blue line. 
So the effect of the hidden units is just 
to push the mean to one side. 
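A minimal sketch of what this implies for sampling. It assumes the energy for visible unit i is a parabola (v_i - b_i)^2 / (2 sigma_i^2) plus the linear term -(v_i / sigma_i) * sum_j h_j w_ij. Setting the derivative to zero puts the minimum at b_i + sigma_i * sum_j h_j w_ij, so in an alternating Gibbs step each visible unit can be sampled from a Gaussian around that shifted mean. The function names, the tiny weight matrix, and all the numbers are made up for illustration.

```python
import math
import random

def logistic(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def visible_unit_energy(v_i, b_i, sigma_i, top_down):
    """Energy terms involving visible unit i, with top_down = sum_j h_j * w_ij."""
    containment = (v_i - b_i) ** 2 / (2.0 * sigma_i ** 2)  # parabola centred on b_i
    interaction = -(v_i / sigma_i) * top_down              # linear in v_i
    return containment + interaction

def shifted_mean(i, h, W, b_vis, sigma):
    """Minimum of the summed parabola: b_i pushed aside by the hidden units."""
    top_down = sum(h[j] * W[i][j] for j in range(len(h)))
    return b_vis[i] + sigma[i] * top_down

def sample_hidden(v, W, b_hid, sigma, rng):
    """Sample binary hidden states given real-valued visible values."""
    h = []
    for j in range(len(b_hid)):
        # Visible activities are measured in units of their standard deviation.
        total = b_hid[j] + sum(v[i] / sigma[i] * W[i][j] for i in range(len(v)))
        h.append(1 if rng.random() < logistic(total) else 0)
    return h

def sample_visible(h, W, b_vis, sigma, rng):
    """Sample each visible unit from a Gaussian around its shifted mean."""
    return [rng.gauss(shifted_mean(i, h, W, b_vis, sigma), sigma[i])
            for i in range(len(b_vis))]

# One step of alternating Gibbs sampling on a tiny, made-up model.
rng = random.Random(0)
W = [[0.5, -0.3], [0.2, 0.4], [-0.1, 0.6]]   # 3 visible units, 2 hidden units
b_vis, b_hid, sigma = [0.0, 1.0, -1.0], [0.0, 0.0], [1.0, 0.5, 2.0]
v0 = [0.3, 1.2, -0.8]
h0 = sample_hidden(v0, W, b_hid, sigma, rng)
v1 = sample_visible(h0, W, b_vis, sigma, rng)
```

In this sketch sigma is held fixed rather than learned, which sidesteps the difficulty with learning the variances.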
It's easy to write down an energy 
function like this. 
And it's easy to take derivatives of it. 
But when we try learning with it, we 
often get problems. 
There were a lot of reports in the 
literature that people could not get 
these Gaussian-binary RBMs to work. 
And it is indeed extremely hard to learn 
tight variances for the visible units. 
It took us a long time to figure out why 
it's so hard to learn those visible 
variances. 
This picture helps. 
Consider the effect that visible 
unit i has on hidden unit j. 
When visible unit i has a small standard 
deviation sigma_i, that has the effect of 
exaggerating the bottom-up weights. 
That's because we need to measure the 
activity of i in units of its standard 
deviation. 
So when the standard deviation is small, 
we need to multiply the weight by a lot. 
If you look at the top-down effect of j 
on i, that's multiplied by sigma_i. 
So when the standard deviation of a 
visible unit i is very small, the 
bottom-up effects get exaggerated and the 
top-down effects get attenuated. 
The result is that we have a conflict 
where either we have bottom up effects 
that are much too big or top down effects 
that are much too small. 
And the result is that the hidden units 
tend to saturate and be firmly on or off 
all the time, and this will mess up 
learning. 
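The conflict can be seen numerically in a small sketch. It assumes the bottom-up input to hidden unit j is b_j + sum_i (v_i / sigma_i) w_ij and the top-down shift of visible unit i's mean is sigma_i * sum_j h_j w_ij. The weights, states, and function names are made up for illustration, and the same visible values are reused for every sigma so that only the scaling changes.

```python
import math

def logistic(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def bottom_up_input(v, w_to_j, sigma, b_j=0.0):
    """Total input to hidden unit j: each v_i is divided by sigma_i."""
    return b_j + sum(v_i / s_i * w for v_i, s_i, w in zip(v, sigma, w_to_j))

def top_down_shift(h, w_from_i, sigma_i):
    """Shift of visible unit i's mean: the hidden input is multiplied by sigma_i."""
    return sigma_i * sum(h_j * w for h_j, w in zip(h, w_from_i))

v = [0.3, -0.2, 0.25]        # made-up visible values
w_to_j = [0.5, -0.5, 0.5]    # made-up weights into hidden unit j
h = [1, 1, 0]                # made-up hidden states
w_from_i = [0.5, 0.5, 0.5]   # made-up weights out of visible unit i

for s in (1.0, 0.1, 0.01):
    x = bottom_up_input(v, w_to_j, [s] * len(v))
    print(f"sigma={s}: bottom-up input={x:.3f}, p(h_j=1)={logistic(x):.4f}, "
          f"top-down shift={top_down_shift(h, w_from_i, s):.3f}")
```

As sigma shrinks, the same weights drive the hidden unit towards saturation while moving the visible mean less and less.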
So the solution is to have many more 
hidden units than visible units. 
That allows small weights between the 
visible and hidden units to have big top 
down effects, because there are so many 
hidden units. 
But of course, we really need the number 
of hidden units to change as that 
standard deviation sigma_i gets smaller. 
And on the next slide, we'll see how we 
can achieve that. 
I'm going to introduce stepped sigmoid 
units. 
The idea is we make many copies of each 
stochastic binary hidden unit. 
All the copies have the same weights and 
the same learned bias b. But in 
addition to that adaptive bias b, they 
have a fixed offset to the bias. 
The first unit has an offset of -0.5. The 
second unit has an offset of -1.5. The 
third one has an offset of -2.5, 
and so on. 
If you have a whole family of sigmoid 
units like that, with the bias changed by 
one between neighbouring members of the 
family, the response curve looks like 
this. 
If the total input is very low, none 
of them are turned on. 
As it increases, the number that get 
turned on increases linearly. 
This means that as the standard deviation 
on the previous slide gets smaller, the 
number of copies of each hidden unit that 
get turned on gets bigger, and we achieve 
just the effect we wanted, which is that 
we get more top-down effect to drive these 
visible units that have small standard 
deviations. 
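A rough sketch of this effect. It assumes the copies of one hidden unit all see the same total input x = (v_i / sigma_i) * w_ij, with fixed bias offsets of -0.5, -1.5, -2.5, and so on. The function name, the number of copies, and the values of v_i and w_ij are made up for illustration.

```python
import math

def logistic(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def expected_copies_on(x, n_copies=200):
    """Expected number of copies that are on when all share total input x.

    Copy n (n = 1, 2, 3, ...) has a fixed bias offset of -(n - 0.5),
    that is -0.5, -1.5, -2.5, and so on.
    """
    return sum(logistic(x - (n - 0.5)) for n in range(1, n_copies + 1))

v_i, w_ij = 0.5, 1.0   # made-up visible value and weight
for sigma_i in (1.0, 0.5, 0.1, 0.01):
    x = v_i / sigma_i * w_ij
    print(f"sigma_i={sigma_i}: input={x:.1f}, "
          f"expected copies on={expected_copies_on(x):.2f}")
```

Once the input is well above zero, each extra unit of input turns on roughly one more copy, so a smaller sigma_i recruits more copies to push on the visible unit.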
Now it's quite expensive to use a big 
population of binary stochastic units 
with offset biases, because for each one 
of them, we need to put the total input 
through the logistic function, but we can 
make some fast approximations which work 
just as well. 
So the sum of the activities of a whole 
bunch of sigmoid units with offset 
biases, which is shown in that 
summation, is approximately equal to log 
of one plus e to the x, and that in turn 
is approximately equal to the maximum of 
nought and x. 
And we can add some noise to the X if we 
want. 
So the first term in the equation looks 
like this. 
The second term looks like that. 
And you can see that the sum of all those 
sigmoids in the first term will be a 
curve like that. 
And we can approximate that by a linear 
threshold unit that has a value of zero 
unless it's above threshold, in which 
case its value increases 
linearly with its input. 
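The two approximations can be compared numerically. This is a sketch that assumes offsets of -0.5, -1.5, -2.5, and so on, and truncates the family at a made-up 200 copies; the function names are illustrative.

```python
import math

def logistic(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def stepped_sigmoid_sum(x, n_copies=200):
    """Sum of logistic units sharing input x, with offsets -0.5, -1.5, -2.5, ..."""
    return sum(logistic(x - (n - 0.5)) for n in range(1, n_copies + 1))

def softplus(x):
    """log(1 + e^x), written so that large |x| does not overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def rectified_linear(x):
    """max(0, x): zero below threshold, linear above it."""
    return max(0.0, x)

for x in (-4.0, -1.0, 0.0, 1.0, 4.0, 10.0):
    print(f"x={x:5.1f}  sum of sigmoids={stepped_sigmoid_sum(x):7.4f}  "
          f"log(1+e^x)={softplus(x):7.4f}  max(0,x)={rectified_linear(x):7.4f}")
```

The sum and log(1 + e^x) stay close everywhere, while max(0, x) differs from them mostly near x = 0, by at most log 2.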
Contrastive divergence learning works 
well for the sum of a bunch of stochastic 
logistic units with offset biases. 
And in that case, you get a noise 
variance that's equal to the logistic 
function of the output of that sum. 
Alternatively we can use that green curve 
and use rectified linear units. 
They're much faster to compute because 
you don't need to go through the logistic 
many times. 
And contrastive divergence works just fine 
with those. 
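To get a feel for the noise in the summed stochastic logistic units mentioned above, here is a small sketch that treats the copies as independent on/off units, so their variances p(1 - p) simply add. The offsets of -0.5, -1.5, -2.5, ..., the 200 copies, and the function names are assumptions for illustration. In this sketch the variance of the count comes out numerically close to the logistic function evaluated at the shared input x.

```python
import math
import random

def logistic(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def count_mean_and_variance(x, n_copies=200):
    """Exact mean and variance of the number of copies that are on."""
    mean = 0.0
    var = 0.0
    for n in range(1, n_copies + 1):
        p = logistic(x - (n - 0.5))
        mean += p
        var += p * (1.0 - p)   # variance of one independent on/off unit
    return mean, var

def sample_count(x, rng, n_copies=200):
    """Draw every copy stochastically and count how many turned on."""
    return sum(1 for n in range(1, n_copies + 1)
               if rng.random() < logistic(x - (n - 0.5)))

rng = random.Random(0)
for x in (-3.0, 0.0, 3.0, 10.0):
    mean, var = count_mean_and_variance(x)
    print(f"x={x:5.1f}  mean={mean:7.4f}  variance={var:.4f}  "
          f"logistic(x)={logistic(x):.4f}  one sample={sample_count(x, rng)}")
```

A rectified linear unit replaces all of this with a single max(0, x), optionally with noise added to x, which is why it is so much cheaper.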
One nice property of rectified linear 
units is that if they have a bias of 
zero, they exhibit scale equivariance. 
This is a very nice property to have for 
images. 
What scale equivariance means is that if 
you take an image x and you multiply all 
the pixel intensities by a scalar a, 
then the representation of ax in the 
rectified linear units would be just a 
times the representation of x. 
In other words, when we scale up all the 
intensities in the image, we scale up the 
activities of all the hidden units but 
all the ratios stay the same. 
Rectified linear units aren't fully 
linear because if you add together two 
images, the representation you get is not 
the sum of the representations of each 
image separately. 
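Both properties are easy to check on a toy layer. This sketch assumes rectified linear hidden units with zero biases and a positive scalar a; the weight matrix and the "images" are made-up three-pixel examples.

```python
def relu_layer(x, W):
    """Rectified linear hidden activities with zero biases:
    h_j = max(0, sum_i x_i * W[i][j])."""
    n_hidden = len(W[0])
    return [max(0.0, sum(x[i] * W[i][j] for i in range(len(x))))
            for j in range(n_hidden)]

W = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]   # 3 pixels, 2 hidden units

# Scale equivariance: scaling the image scales the representation.
x = [1.0, 2.0, -1.0]
a = 2.5
rep_x = relu_layer(x, W)
rep_ax = relu_layer([a * xi for xi in x], W)
print(rep_ax, [a * r for r in rep_x])

# Not fully linear: the representation of a sum of two images
# is not the sum of their representations.
x1, x2 = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]
rep_sum = relu_layer([p + q for p, q in zip(x1, x2)], W)
sum_rep = [p + q for p, q in zip(relu_layer(x1, W), relu_layer(x2, W))]
print(rep_sum, sum_rep)
```

With a non-zero bias the first check would fail, because the bias does not scale with the image.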
This property of scale equivariance is 
quite similar to the property of 
translational equivariance that 
convolutional nets have. 
So if we ignore the pooling for now, in a 
convolutional net, if we shift an image 
and look at the representation, 
the representation of a shifted image is 
just a shifted version of the 
representation of the unshifted image. 
So in a convolutional net without 
pooling, translations of the input just 
flow through the layers of the net 
without really affecting anything. 
The representation at every layer is just 
translated.
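A minimal sketch of this, under simplifying assumptions: a one-dimensional "image", wrap-around at the borders so that shifts are exact, two layers of shared kernels followed by rectification, and no pooling. All names and numbers are made up for illustration.

```python
def circular_conv1d(signal, kernel):
    """Slide one shared kernel over a 1-D signal, wrapping around at the ends."""
    n = len(signal)
    return [sum(kernel[k] * signal[(t + k) % n] for k in range(len(kernel)))
            for t in range(n)]

def relu(values):
    """Rectify every element."""
    return [max(0.0, v) for v in values]

def shift(signal, s):
    """Circularly shift a signal to the right by s positions."""
    n = len(signal)
    return [signal[(t - s) % n] for t in range(n)]

def two_layer_net(signal, k1, k2):
    """Two convolutional layers with rectification and no pooling."""
    hidden = relu(circular_conv1d(signal, k1))
    return relu(circular_conv1d(hidden, k2))

image = [0.0, 1.0, 3.0, -2.0, 0.5, 0.0, 4.0, -1.0]
k1, k2 = [1.0, -1.0, 0.5], [0.5, 0.25]

shifted_then_net = two_layer_net(shift(image, 3), k1, k2)
net_then_shifted = shift(two_layer_net(image, k1, k2), 3)
print(shifted_then_net == net_then_shifted)
```

Shifting the input and then running the net gives the same result as running the net and then shifting its output.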