In this video, I'm going to describe how
to make full Bayesian learning practical
for neural networks that have thousands,
and perhaps even millions of weights.
The technique that's used is a Monte Carlo
method, which seems very odd the first
time you hear about it.
We use a random number generator to move
around the space of weight vectors in a
random way,
But with a bias towards going downhill in
our cost function.
If we do this right, we get a beautiful
property, which is that we sample weight
vectors in proportion to their probability
in the posterior distribution.
And that means by sampling a lot of weight
factors, we can get a good approximation
to the full Bayesian method.
The number of grid points is exponential
in the number of parameters.
So we can't make a grid for more than a
few parameters.
This is enough data so that most of the
parameter vectors are very unlikely.
Only a tiny fraction of the group points,
will make a significant contribution to
the predictions.
So may be you can just focus on evaluating
this tiny fraction if we can find it.
An idea that makes Bayesian learning
feasible is that it might be good enough
just to sample weight vectors according to
their posterior probabilities.
So if you look at this equation, the
probability that we assigned to a test
output, given the input for the test case
and the training data, is the sum over all
points in weight space of the posterity
probability of that point in weight space
given the training data, times the
probability distribution for the test
values that we predict given that point in
weight space W I, and given the test
input.
Now instead of adding up all the terms in
that sum, we could just sample terms from
that sum.
What we do is we sample the weight vectors
in proportion to that probability.
So either we sample them or we don't.
So they'll get a weight of one or zero.
But the probability of getting a one.
That is, the probability being sampled,
will be their posterior probability.
So that will give us the correct expected
value for the right hand side.
It'll have noise due to the sampling but
it'll have the correct expected value.
So here's a picture of what happens in
standard back propagation.
On the right I've drawn the weight space.
Which of course is very high dimensional
and unbounded.
And this is a very bad picture of, but
it's the best I can do.
In this white space, I've drawn some
contours which are meant to be contours of
equal values of our cost function.
And the way back propagation is normally
used, is we start with some small value of
the weights, and then we follow the
gradient.
We move downhill in our cost function, in
the direction that increases the
log-likelihood, plus the log-prior, summed
over all training guesses.
Eventually, we'll either end up at a local
minimum or we'll get stuck on a plateau,
Or we'll just move so slowly that we run
out of patience.
But the main point of this picture, is
that we follow a path from an initial
point to some final, single point.
Now if we're using a sampling method, what
we could do, we start at the same place as
we did before, but each time we update the
weights.
We add a bit of Gaussian noise so we're
just turning around.
The weight vector will never settle down
then.
It'll keep on moving around.
It'll wander over the space, but always
preferring low cost regions.
That is, it'll tend to go downhill if it
can.
An important question is whether we can
say anything about how often the weights
will visit each point in that space.
So the red dots are meant to be samples we
took of the weights as we wandered around
the space.
And the idea is, we might save the weights
after every 10,000 steps.
And if you look at those red dots, a few
of them are in high cost regions, because
those regions are quite big.
The deepest minimum has the most red dots,
and other minima also have red dots.
The dots aren't right at the bottom of the
minima, because they're noisy samples.
If we add that Gaussian noise in just the
right way, there's a wonderful property of
Markov chain Monte Carlo.
It's an amazing fact.
The weight vectors, if we wandered around
for long enough, will be unbiased samples
from the true posterior distribution
overweight factors.
That is, those red dots we saw in the
previous slide will be sampled from the
posterior, where weight vectors are a
highly probable under the posterior, a
much more likely to be represented by a
red dot than weight factor that is highly
improbable.
This is called Markov Chain Monte Carlo,
and makes it feasible to use Bayesian
learning with thousands of parameters.
The method I suggested of adding some
Gaussian noise is called the [UNKNOW
method.
And it's not the most efficient method.
There's more sophisticated methods that
are more efficient,
And what I mean by more efficient is, they
don't need to wander around the weight
space for so long before you can start
taking those red samples.
Full Bayesian learning can actually be
done with mini batches.
When we compute the gradient of the cost
function on a random mini batch, we're
gonna get an unbiased estimate but with
sampling noise.
And the idea is to use that sampling noise
to provide the noise that the marked up
chained Monte Carlo method needs.
It's a very clever idea.
Recently, Welling and his collaborators
made it work nicely, so they could fairly
efficiently get samples from the post area
distribution over weights using mini-batch
methods.
This should make it possible to use full
Bayesian learning for much larger networks
where you have to train them with
mini-batch to have any hope of ever
finishing training them.