In this video, I'm going to return to the idea of full Bayesian learning and explain a little bit more about how it works. And then in the following video, I'm going to show how it can be made practical. In full Bayesian learning, we don't try and find a single best setting of the parameters. Instead, we try and find the full posterior distribution over all possible settings. That is, for every possible setting, we want a posterior probability density, and we want all those densities to integrate to one. It's extremely computationally intensive to compute this for all but the simplest models. So, in the example earlier, we did it for a biased coin, which just has one parameter: how biased it is. But in general, for a neural net, it's impossible.

After we've computed the posterior distribution across all possible settings of the parameters, we can then make predictions by letting each different setting of the parameters make its own prediction, and then averaging all those predictions together, weighted by their posterior probabilities. This is also very computationally intensive. The advantage of doing this is that with the full Bayesian approach, we can use complicated models even when we don't have much data.

So, there's a very interesting philosophical point here. We're now used to the idea of overfitting when you fit a complicated model to a small amount of data. But that's basically just a result of not bothering to get the full posterior distribution over the parameters. Frequentists would say, if you don't have much data, you should use a simple model. And that's true, but it's only true if you assume that fitting a model means finding the single best setting of the parameters. If you find the full posterior distribution, that gets rid of overfitting. If there's very little data, the full posterior distribution will typically give you very vague predictions, because many different settings of the parameters that make very different predictions will have significant posterior probability. As you get more data, the posterior will get more and more focused on a few settings of the parameters, and the posterior predictions will get much sharper.

So, here's a classic example of overfitting. We've got six data points and we've fitted a fifth-order polynomial, so it should go exactly through the data, which it more or less does. We've also fitted a straight line, which only has two degrees of freedom. So, which model do you believe? The model that has six coefficients and fits the data almost perfectly, or the model that only has two coefficients and doesn't fit the data all that well? It's obvious that the complicated model fits better, but you don't believe it. It's not economical, and it also makes silly predictions. If you look at the blue arrow, if that's the input value and you're trying to predict the output value, the red curve will predict a value that's lower than any of the observed data points, which seems crazy, whereas the green line will predict a sensible value.

But everything changes if, instead of fitting one fifth-order polynomial, we start with a reasonable prior over fifth-order polynomials, for example, that the coefficients shouldn't be too big, and then compute the full posterior distribution over fifth-order polynomials. I've shown you a sample from this distribution in the picture, where a thicker line means higher posterior probability.
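As a rough sketch of what drawing such curves from the posterior could look like, here is a minimal example, assuming a Gaussian prior on the coefficients (so big coefficients are unlikely) and Gaussian observation noise. The six data points, the noise level, and the prior width are invented for illustration; they are not the ones from the picture.

```python
import numpy as np

# Six invented data points; a fifth-order polynomial has six coefficients.
rng = np.random.default_rng(0)
x = np.array([-1.0, -0.6, -0.2, 0.2, 0.6, 1.0])
t = np.array([-0.9, -0.1, 0.3, 0.2, 0.4, 1.1])

Phi = np.vander(x, 6, increasing=True)   # design matrix: 1, x, x^2, ..., x^5
sigma_noise = 0.2                        # assumed Gaussian observation noise
sigma_prior = 1.0                        # prior width: coefficients shouldn't be too big

# With these assumptions the posterior over the coefficients is Gaussian, N(mean, cov).
A = Phi.T @ Phi / sigma_noise**2 + np.eye(6) / sigma_prior**2
cov = np.linalg.inv(A)
mean = cov @ Phi.T @ t / sigma_noise**2

# Draw some polynomials from the posterior and look at their predictions at one
# test input: the mean is the averaged prediction, the spread shows how vague it is.
samples = rng.multivariate_normal(mean, cov, size=20)
phi_test = np.vander(np.array([0.8]), 6, increasing=True)[0]
preds = samples @ phi_test
print(preds.mean(), preds.std())
```

With only six points the sampled curves disagree a lot, so the spread is large; with more data the samples, and hence the predictions, tighten up, which is exactly the behaviour described above.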
Looking at those sampled curves in the picture, you will see that some of the thin ones miss a few of the data points by quite a lot, but nevertheless they're quite close to most of the data points. Now we get much vaguer, but much more sensible, predictions. So, where the blue arrow is, you'll see the different models predict very different things, while, on average, they make a prediction quite close to the prediction made by the green line. From a Bayesian perspective, there's no reason why the amount of data you collect should influence your prior beliefs about the complexity of the model. A true Bayesian would say, you have prior beliefs about how complicated things might be, and just because you haven't collected any data yet doesn't mean you should think things are much simpler.

So, we can approximate full Bayesian learning in a neural net if the neural net has very few parameters. The idea is that we put a grid over the parameter space, so each parameter is only allowed a few alternative values, and then we take the cross product of all those values for all the parameters. That gives us a number of grid points in the parameter space. At each of those grid points, we can see how well our model predicts the data, that is, if we're doing supervised learning, how well the model predicts the target outputs. And we can say that the posterior probability of that grid point is proportional to the product of how well it predicts the data and how likely it is under the prior, with the whole thing normalized so that the posterior probabilities over all the grid points add up to one. This is still very expensive, but notice it has some attractive features. There's no gradient descent involved, and there are no local optimum issues. We're not following a path in the space, we're just evaluating a set of points in the space.

Once we've decided on the posterior probability to assign to each grid point, we then use them all to make predictions on the test data. That's also expensive, but when there isn't much data, it'll work much better than maximum likelihood or maximum a posteriori. So, the way we predict the test output, given the test input, is to say that the probability of the test output given the test input is the sum over all the grid points of the probability of that grid point, given the data and our prior, times the probability of getting that test output given the input and that grid point. In other words, we have to take into account the fact that we might add noise to the output of the net before producing the test answer.

So, here's a picture of full Bayesian learning. We have a little net here that has four weights and two biases. If we allow nine possible values for each of those weights and biases, there will be nine to the sixth grid points in the parameter space. That's a big number, but we can cope with it. For each of those grid points, we compute the probability of the observed outputs on all the training cases. We multiply that by the prior for the grid point, which might depend on the values of the weights, for example. And then we renormalize to get the posterior probability over all the grid points. Then we make predictions using all of those grid points, weighting each of their predictions by its posterior probability.
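The prediction rule described in words above can be written out as follows, using my own notation rather than the slide's: let $W_g$ be the parameter vector at grid point $g$ and $D$ the training data.

\[
p(W_g \mid D) \;=\; \frac{p(D \mid W_g)\,p(W_g)}{\sum_{g'} p(D \mid W_{g'})\,p(W_{g'})},
\qquad
p(t^{\text{test}} \mid x^{\text{test}}, D) \;=\; \sum_{g} p(W_g \mid D)\; p(t^{\text{test}} \mid x^{\text{test}}, W_g)
\]

The first expression is the renormalization over the grid; the second is the posterior-weighted average of each grid point's prediction, where $p(t^{\text{test}} \mid x^{\text{test}}, W_g)$ includes any noise the model adds to the net's output.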
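And here is a rough end-to-end sketch of the grid recipe in code. The architecture is only my guess at a net with four weights and two biases (one input, two logistic hidden units, and a linear output), and the training data, noise model, prior width, and grid range are all invented, so this is not the network or data from the slide; it just follows the same steps: enumerate the grid, score each point by likelihood times prior, renormalize, and average the predictions.

```python
import numpy as np
from itertools import product

# Hypothetical tiny net with four weights and two biases:
# one input -> two logistic hidden units (2 weights + 2 biases) -> linear output (2 weights).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 8)                          # invented training inputs
t = np.sin(2.0 * x) + 0.1 * rng.standard_normal(8)     # invented noisy targets

grid_values = np.linspace(-3.0, 3.0, 9)                # nine allowed values per parameter
grid = np.array(list(product(grid_values, repeat=6)))  # cross product: 9**6 = 531,441 grid points

def forward(params, inputs):
    """Output of the net at every grid point (rows of params) for every input."""
    w1, w2, b1, b2, v1, v2 = params.T
    h1 = 1.0 / (1.0 + np.exp(-(np.outer(w1, inputs) + b1[:, None])))   # hidden unit 1
    h2 = 1.0 / (1.0 + np.exp(-(np.outer(w2, inputs) + b2[:, None])))   # hidden unit 2
    return v1[:, None] * h1 + v2[:, None] * h2          # shape (n_grid, n_inputs)

sigma_noise, sigma_prior = 0.2, 1.0                     # assumed noise and prior widths
y = forward(grid, x)
log_lik = -0.5 * np.sum((y - t) ** 2, axis=1) / sigma_noise**2   # how well each grid point fits
log_prior = -0.5 * np.sum(grid ** 2, axis=1) / sigma_prior**2    # prior prefers small weights
log_post = log_lik + log_prior
log_post -= log_post.max()                              # subtract the max to avoid underflow
posterior = np.exp(log_post)
posterior /= posterior.sum()                            # normalize: probabilities sum to one

# Predict at a test input by letting every grid point make its own prediction
# and averaging them, weighted by posterior probability.
x_test = np.array([0.5])
prediction = np.sum(posterior * forward(grid, x_test)[:, 0])
print(prediction)
```

Working in log space before normalizing avoids numerical underflow when the likelihoods are tiny, and the half-million grid points make it clear why this only works when the net has very few parameters.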