In this video, I'm going to show a simple example of a restricted Boltzmann machine learning a model of images of handwritten twos. After it's learned the model, we'll look at how good it is at reconstructing twos, and we'll look at what happens if we give it a different kind of digit and ask it to reconstruct that. We'll also look at the weights we get if we train a considerably larger restricted Boltzmann machine on all of the digit classes. It learns a wide variety of features which, between them, are very good at reconstructing all the different classes of digits, and which also form quite a good model of those digit classes. That is, if you give it a binary vector that is an image of a handwritten digit, the model will be able to find low-energy states compatible with that image, and if you give it an image that's a long way from being an image of a handwritten digit, the model will not be able to find low-energy states compatible with that image.

I'm now going to show how a relatively simple RBM can learn to build a model of images of the digit two. The images are sixteen pixels by sixteen pixels, and the machine has 50 binary hidden units that are going to learn to become interesting feature detectors. When it's presented with a data case, the first thing it does is use the weights on the connections from the pixels to the features to activate the features. That is, for each of the binary hidden neurons, it makes a stochastic decision about whether that neuron should adopt the state one or zero. It then uses the binary pattern of hidden activations to reconstruct the data: for each pixel, it makes a stochastic binary decision about whether that pixel should be a one or a zero. It then reactivates the binary feature detectors, using the reconstruction rather than the data to activate them.

The weights are changed by incrementing the weight between an active pixel and an active feature detector when the network is looking at data; that lowers the energy of the global configuration formed by the data and whatever hidden pattern went with it. The weight between an active pixel and an active feature detector is decremented when the network is looking at a reconstruction; that raises the energy of the reconstruction. Near the beginning of learning, when the weights are random, the reconstruction will almost certainly have lower energy than the data, because the reconstruction is what the network likes to produce on the visible units given the hidden pattern of activity, and it likes to produce patterns that have low energy according to its energy function. You can think of what learning does as changing the weights so that the data has low energy and the reconstructions of the data generally have higher energy. A small code sketch of this update appears just below.

So let's start with some small random weights for the 50 feature detectors. Each of these squares shows you the weights on the connections from a particular feature detector to the pixels. The small random weights are there to break symmetry, though because the updates are stochastic we don't really need that. After seeing a few hundred examples of digits, and adjusting the weights a few times, the weights are beginning to form patterns. If we keep going, you can see that many of the feature detectors are detecting the pattern of a whole two; they're fairly global feature detectors.
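Here is a minimal sketch of the update just described, for a binary RBM with 256 pixels and 50 feature detectors. It assumes NumPy; the names (cd1_update, W, b_vis, b_hid) are mine rather than anything from the lecture, and this is one plausible way to write a CD-1 step, not the code actually used for the demo.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr=0.01):
    """One contrastive-divergence (CD-1) step for a binary RBM.

    v0    : (256,) binary pixel vector (a 16x16 image, flattened)
    W     : (256, 50) pixel-to-feature weights
    b_vis : (256,) visible biases
    b_hid : (50,)  hidden biases
    """
    # Activate the 50 feature detectors from the data: each hidden unit
    # makes a stochastic decision to adopt the state one or zero.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (np.random.random(p_h0.shape) < p_h0).astype(float)

    # Use the binary pattern of hidden activations to reconstruct the data:
    # each pixel makes a stochastic binary decision.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    v1 = (np.random.random(p_v1.shape) < p_v1).astype(float)

    # Reactivate the feature detectors, driven by the reconstruction
    # rather than the data.
    p_h1 = sigmoid(v1 @ W + b_hid)
    h1 = (np.random.random(p_h1.shape) < p_h1).astype(float)

    # Increment weights between active pixels and active features on the
    # data (lowering its energy); decrement them on the reconstruction
    # (raising its energy). Using the probabilities instead of the binary
    # states here is a common variant that reduces sampling noise.
    W     += lr * (np.outer(v0, h0) - np.outer(v1, h1))
    b_vis += lr * (v0 - v1)
    b_hid += lr * (h0 - h1)
    return W, b_vis, b_hid
```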
As training continues, those feature detectors get stronger and stronger, and then some of them begin to localize; they get more and more local until we reach the final weights, where you can see that each neuron has become a different feature detector and most of the feature detectors are fairly local. If you look at the feature detector in the red box, for example, it's detecting the top of a two. It's happy when the top of a two is where its white pixels are and there's nothing where its black pixels are, so it's representing where the top of the two is.

Once we've learned the model, we can look at how well it reconstructs digits, giving it test digits that it hasn't seen before. We'll start by giving it a test example of a two, and its reconstruction is pretty faithful to the test example. It's slightly blurry: the test example has a hook at the top, and that's been blurred out in the reconstruction, but it's a pretty good reconstruction. The more interesting thing we can do is give it a test example from a different digit class. If we give it an example of a three to reconstruct, what it reconstructs actually looks more like a two than like a three. All of the feature detectors it's learned are good for representing twos, but it doesn't have feature detectors for things like the cusp in the middle of the three. So it ends up reconstructing something that obeys the regularities of a two better than it obeys the regularities of a three. In fact, the network tries to see everything as a two. (A short sketch of this up-down reconstruction pass appears at the end of this section.)

Here are some feature detectors that were learned in the first hidden layer of a model that uses 500 hidden units to model all ten digit classes, trained for a long time with contrastive divergence. It has a big variety of feature detectors. The one in the blue box is obviously going to be useful for detecting things like eights. The one in the red box is not what you'd expect to see: it likes to see pixels on very near the bottom of the image, and it really doesn't like to see pixels on in a row that's 21 pixels above the bottom. What's going on here is that the data is normalized, so the digits can't have a height of greater than twenty pixels, and that means that if there's a pixel on where those big positive weights are, there can't possibly be a pixel on where those negative weights are. So this detector is picking up on a long-range regularity that was introduced by the way we normalized the data. Here's another one that's doing the same thing for the fact that the data can't be wider than twenty pixels. The feature detector in the green box is very interesting: it detects where the bottom of a vertical stroke comes, and it will detect it in a number of different positions and then refuse to detect it in the intermediate positions. So it's very like one of the least significant digits in a binary number: as you increase the magnitude of the number, it goes on again and off again, on again and off again. It shows that the network is developing quite complex ways of representing where things are.
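As a companion to the CD-1 sketch above, here is roughly what the up-down reconstruction pass looks like: drive the feature detectors with a test image, then let them reconstruct the pixels. It reuses the hypothetical sigmoid, W, b_vis, and b_hid from the earlier sketch, and it returns pixel probabilities rather than sampled binary values, which is one simple way to get the slightly blurry reconstructions described above.

```python
def reconstruct(v_test, W, b_vis, b_hid):
    """One up-down pass through a trained binary RBM.

    Activates the binary feature detectors from a test image, then uses
    them to reconstruct the pixels. Returns pixel probabilities, which
    gives the blurry-looking reconstructions described in the lecture.
    """
    p_h = sigmoid(v_test @ W + b_hid)                       # feature probabilities
    h = (np.random.random(p_h.shape) < p_h).astype(float)   # stochastic binary features
    return sigmoid(h @ W.T + b_vis)                         # reconstructed pixel probabilities
```

Passing an image of a three through a model trained only on twos gives back something two-like, because the only feature detectors available to do the reconstructing are ones tuned to twos.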