In this video, I'm going to describe how to make full Bayesian learning practical for neural networks that have thousands, and perhaps even millions, of weights. The technique that's used is a Monte Carlo method, which seems very odd the first time you hear about it. We use a random number generator to move around the space of weight vectors in a random way, but with a bias towards going downhill in our cost function. If we do this right, we get a beautiful property, which is that we sample weight vectors in proportion to their probability in the posterior distribution. And that means that by sampling a lot of weight vectors, we can get a good approximation to the full Bayesian method.
If we put a grid over the parameter space, the number of grid points is exponential in the number of parameters, so we can't make a grid for more than a few parameters. If there is enough data that most of the parameter vectors are very unlikely, only a tiny fraction of the grid points will make a significant contribution to the predictions. So maybe we can just focus on evaluating this tiny fraction, if we can find it.
An idea that makes Bayesian learning feasible is that it might be good enough just to sample weight vectors according to their posterior probabilities. So if you look at this equation, the probability that we assign to a test output, given the input for the test case and the training data, is the sum over all points in weight space of the posterior probability of that point in weight space given the training data, times the probability distribution for the test output that we predict given that point in weight space, W_i, and given the test input. Now, instead of adding up all the terms in that sum, we could just sample terms from that sum. What we do is sample the weight vectors in proportion to their posterior probability. So either we sample them or we don't, and they'll get a weight of one or zero. But the probability of getting a one, that is, the probability of being sampled, will be their posterior probability.
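To make the idea of sampling terms from that sum concrete, here is a minimal sketch on a made-up problem with a single weight, small enough that the full grid sum can still be computed for comparison. The toy model (output = w times input plus Gaussian noise), the Gaussian prior, the data values, and all function names are assumptions for illustration, not something from the lecture.

```python
import math
import random

def grid_posterior(grid, data, prior_std=2.0, noise_std=1.0):
    """Normalized posterior p(W_i | D) over grid points for the toy model y = w * x + noise."""
    log_post = []
    for w in grid:
        log_prior = -0.5 * (w / prior_std) ** 2
        log_lik = sum(-0.5 * ((y - w * x) / noise_std) ** 2 for x, y in data)
        log_post.append(log_prior + log_lik)
    top = max(log_post)  # subtract the max before exponentiating, for numerical stability
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def output_density(t, x, w, noise_std=1.0):
    """p(t | x, W_i): Gaussian density of output t around the prediction w * x."""
    z = (t - w * x) / noise_std
    return math.exp(-0.5 * z * z) / (noise_std * math.sqrt(2.0 * math.pi))

def full_sum(t, x, grid, post):
    """Add up every term: sum over i of p(W_i | D) * p(t | x, W_i)."""
    return sum(p * output_density(t, x, w) for w, p in zip(grid, post))

def sampled_sum(t, x, grid, post, n_samples, rng):
    """Draw grid points (with replacement) with probability p(W_i | D), then average p(t | x, W_i)."""
    picks = rng.choices(grid, weights=post, k=n_samples)
    return sum(output_density(t, x, w) for w in picks) / n_samples

rng = random.Random(0)
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]     # made-up (input, output) training pairs
grid = [-3.0 + 0.01 * i for i in range(601)]     # one weight, 601 grid points
post = grid_posterior(grid, data)
exact = full_sum(2.0, 2.0, grid, post)
approx = sampled_sum(2.0, 2.0, grid, post, n_samples=5000, rng=rng)
print(exact, approx)
```

The sampled value is noisy, but it fluctuates around the full sum. With one weight the grid is cheap, so sampling buys nothing here. The point is that the sampled version never needs the whole grid, only a way to draw weight vectors with the right probabilities.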
So that will give us the correct expected value for the right-hand side. It'll have noise due to the sampling, but it'll have the correct expected value.
So here's a picture of what happens in standard back propagation. On the right I've drawn the weight space, which of course is very high dimensional and unbounded. This is a very bad picture of it, but it's the best I can do. In this weight space, I've drawn some contours, which are meant to be contours of equal value of our cost function. And the way back propagation is normally used is that we start with some small values of the weights, and then we follow the gradient. We move downhill in our cost function, in the direction that increases the log-likelihood plus the log-prior, summed over all training cases. Eventually, we'll either end up at a local minimum, or we'll get stuck on a plateau, or we'll just move so slowly that we run out of patience. But the main point of this picture is that we follow a path from an initial point to some final, single point.
Now, if we're using a sampling method, what we could do is start at the same place as we did before, but each time we update the weights, we add a bit of Gaussian noise, so the weights get jostled around. The weight vector will never settle down then. It'll keep on moving around. It'll wander over the space, but always preferring low-cost regions. That is, it'll tend to go downhill if it can. An important question is whether we can say anything about how often the weights will visit each point in that space.
So the red dots are meant to be samples we took of the weights as we wandered around the space. And the idea is, we might save the weights after every 10,000 steps. And if you look at those red dots, a few of them are in high-cost regions, because those regions are quite big. The deepest minimum has the most red dots, and other minima also have red dots. The dots aren't right at the bottom of the minima, because they're noisy samples.
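Here is a minimal sketch of that wandering behaviour, assuming a hypothetical one-dimensional cost with two minima in place of a real network's cost function. The cost itself, the step size, the noise scale (variance equal to the step size, a common scaling for this kind of sampler), and the choice to save every 100 steps instead of every 10,000 are all assumptions chosen to keep the toy example fast.

```python
import math
import random

def cost(w):
    """Hypothetical cost with two minima; the one near w = -1 is deeper."""
    return (w * w - 1.0) ** 2 + 0.3 * w

def cost_gradient(w):
    """Derivative of the hypothetical cost above."""
    return 4.0 * w * (w * w - 1.0) + 0.3

def noisy_descent(n_steps, step_size, save_every, burn_in, rng, w=0.1):
    """Take downhill gradient steps with added Gaussian noise, saving w every `save_every` steps."""
    saved = []
    noise_std = math.sqrt(step_size)  # assumed noise scale, tied to the step size
    for step in range(n_steps):
        w += -0.5 * step_size * cost_gradient(w) + noise_std * rng.gauss(0.0, 1.0)
        if step >= burn_in and step % save_every == 0:
            saved.append(w)  # one "red dot"
    return saved

rng = random.Random(0)
samples = noisy_descent(n_steps=400_000, step_size=0.01, save_every=100,
                        burn_in=10_000, rng=rng)
frac_deep = sum(1 for w in samples if w < 0.0) / len(samples)  # share of dots on the deeper side
mean_cost = sum(cost(w) for w in samples) / len(samples)
print(frac_deep, mean_cost)
```

In this toy run the weight never settles. It keeps crossing between the two minima, with more of the saved samples on the deeper side, and the average cost of the saved samples sits above the bottom of either minimum.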
If we add that Gaussian noise in just the right way, there's a wonderful property of Markov chain Monte Carlo. It's an amazing fact. The weight vectors, if we wander around for long enough, will be unbiased samples from the true posterior distribution over weight vectors. That is, those red dots we saw in the previous slide will be samples from the posterior, where a weight vector that is highly probable under the posterior is much more likely to be represented by a red dot than a weight vector that is highly improbable. This is called Markov chain Monte Carlo, and it makes it feasible to use Bayesian learning with thousands of parameters.
The method I suggested, of adding some Gaussian noise, is called the Langevin method, and it's not the most efficient method. There are more sophisticated methods that are more efficient. What I mean by more efficient is that they don't need to wander around the weight space for so long before you can start taking those red samples.
Full Bayesian learning can actually be done with mini-batches. When we compute the gradient of the cost function on a random mini-batch, we're going to get an unbiased estimate, but with sampling noise. And the idea is to use that sampling noise to provide the noise that the Markov chain Monte Carlo method needs. It's a very clever idea. Recently, Welling and his collaborators made it work nicely, so they could fairly efficiently get samples from the posterior distribution over weights using mini-batch methods. This should make it possible to use full Bayesian learning for much larger networks, where you have to train them with mini-batches to have any hope of ever finishing training them.
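As a rough illustration of sampling with mini-batch gradients, here is a minimal sketch on a made-up one-weight regression problem, chosen because its posterior mean can be computed exactly for comparison. This is not a reproduction of what Welling and his collaborators did: the sketch still injects Gaussian noise explicitly on top of the mini-batch noise and uses a constant step size, and both are simplifying assumptions. The model, prior, data, and function names are hypothetical.

```python
import math
import random

def make_data(n_points, true_w, rng):
    """Made-up regression data: y = true_w * x + unit-variance Gaussian noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n_points)]
    return [(x, true_w * x + rng.gauss(0.0, 1.0)) for x in xs]

def minibatch_gradient(w, batch, n_total, prior_std):
    """Unbiased mini-batch estimate of the gradient of log-prior plus log-likelihood."""
    grad_log_prior = -w / prior_std ** 2
    grad_log_lik = sum(x * (y - w * x) for x, y in batch)
    # Rescale so the mini-batch sum stands in for the sum over all training cases.
    return grad_log_prior + (n_total / len(batch)) * grad_log_lik

def minibatch_sampler(data, n_steps, step_size, batch_size, burn_in, rng, prior_std=10.0):
    """Noisy gradient steps on random mini-batches; returns the weights visited after burn-in."""
    w, saved = 0.0, []
    for step in range(n_steps):
        batch = rng.sample(data, batch_size)
        grad = minibatch_gradient(w, batch, len(data), prior_std)
        # Assumption: explicit Gaussian noise with variance equal to the step size.
        w += 0.5 * step_size * grad + math.sqrt(step_size) * rng.gauss(0.0, 1.0)
        if step >= burn_in:
            saved.append(w)
    return saved

rng = random.Random(0)
data = make_data(200, true_w=1.5, rng=rng)
samples = minibatch_sampler(data, n_steps=50_000, step_size=0.001,
                            batch_size=20, burn_in=1_000, rng=rng)
sample_mean = sum(samples) / len(samples)
# Exact posterior mean for this toy problem (prior_std = 10, unit noise), for comparison only.
exact_mean = sum(x * y for x, y in data) / (1.0 / 10.0 ** 2 + sum(x * x for x, y in data))
print(sample_mean, exact_mean)
```

With a constant step size, the mini-batch noise makes the samples spread out somewhat more than the true posterior does, so a sketch like this is best read as showing the shape of the idea and not as a calibrated sampler.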