In this video, I'm going to talk about the Bayesian interpretation of weight penalties. In the full Bayesian approach, we try to compute the posterior probability of every possible setting of the parameters of a model. But there's a much reduced form of the Bayesian approach, where we simply say, I'm going to look for the single set of parameters that is the best compromise between fitting my prior beliefs about what the parameters should be like, and fitting the data I've observed. This is called Maximum alpha Posteriori learning and it gives us a nice explanation of what's really going on, when we use weight decay to control the capacity of a model. I'm now going to talk a bit about what's really going on, when we minimize the squared error during supervised maximum likelihood learning. Finding the weight vector that minimizes the squared residuals, that is the differences between the target value and the value predicted by the net, Is equivalent to finding a weight vector that maximizes the log probability density of the correct answer. In order to see this equivalence, we have to assume that the correct answer is produced by adding Gaussian noise to the output of the neural net. So the idea is, we make a prediction by first running the neural net on the input to get the output. And then adding some Gaussain noise. And then we ask, what's the probability that when we do that, we get the correct answer? So the model's output is the center of a Gaussian and what we're interested in is having the target value of high probability under that Gaussian because the probability producing the value t, given that the network gives an output of y is just the probability density of t under a Gaussian centered at y. So the math looks like this: let's suppose that the output of the neural net on training case c is yc and this output is produced by applying the weights W to the input c. The probability that we'll get the correct target value when we add Gaussian noise to that output yc is given by a Gaussian centered on yc. So we're interested in the probability density of the target value, under a Gaussian centered at the output of the neural net. And on the right here, we have that Gaussian distribution, with mean yc. We also have to assume some variance, and that variance will be important later. If we now take logs and put in a minus sign, We see that the negative log probability density of the target value tc given that the network outputs yc, is a constant that comes from the normalizing term of the Gaussian plus the log of that exponential with the minus sign in which is just (tc2 - yc)^2 divided by twice the variance of the Gaussian. So what you see is that, if our cost function is the negative log probability of getting the right answer, that turns into minimizing a squared distance. It's helpful to know that whenever you see a squared error being minimized, you can make a probabilistic interpretation of what's going on, and in that probabilistic interpretation, you'll be maximizing the log probability under a Gausian. So the proper Bayesian approach, is to find the full posterior distribution over all possible weight vectors. If there's more than a handful of weights, that's hopelessly difficult when you have a non-linear net. Bayesians have a lot of ways of approximating this distribution, often using Monte Carlo methods. But for the time being, let's try and do something simpler. Let's just try to find the most probable weight vector. So the single setting of the weights that's most probable given the prior knowledge we have and given the data. So what we're going to try and do is find an optimal value of W by starting with some random weight vector, and then adjusting it in the direction that improves the probability of that weight factor given the data. It will only be a local optimum. Now, it's going to be easier to work in the log domain than in the probability domain. So if we want to minimize a cost, we better use negative log props. Just an aside about why we maximize sums of log probabilities, or minimize sums of negative log probs, What we really want to do is maximize the probability of the data, which is maximizing the product of the probabilities of producing all the target values that we observed on all the different training cases. If we assume that the output errors on different cases are independent, we can write that down as the product over all the training cases, of the probability of producing the target value, tc, given the weights. That is the product of the probability of producing tc, given the output that we're going to get from out network, if we give it input c and it has weights W. The log functions monotonic, and so it can't change where the maxima are. So instead of maximizing a product of probabilities, we can maximize the sums of log probabilities, and that typically works much better on a computer. It's much more stable. So we maximize the log probability of the data given the weights, which is simply maximizing the sum over all training cases of the log probability of the output for that training case, given the input and given the weights. In Maximum a Posteriori learning, we're trying to find the set of weights that optimizes the trade off between fitting our prior and fitting the data. So that's base theorem. If we take negative logs to get a cost, we get that the negative log of the probability of the weights given the data, is the negative log of the prior term, and the negative log of the data term, and an extra term. So, that last extra term, is an integral overall possible weight vectors. And so that doesn't affect W. So we can ignore it when we're optimizing W. The term that depends on the data is the negative log probability is given W, and that's our normal error term. And then term that only depends on W is the negative log probability of W under its prior. Maximizing the log probability of a weight is related to minimizing a squared distance, in just the same way as maximizing the log probability of producing correct target value is related to minimizing the square distance. So minimizing the squared weights is equivalent to maximizing the log probability of the weights under a zero-mean Gaussian prior. So here's a Gaussian. It's got a mean zero, and we want to maximize the probability of the weights, or the log probability of the weights. And to do that, we obviously want w to be close to the mean zero. The equation for the Gaussian is just like this, where the mean is zero so we don't have to put it in. And the log probability of w is then the squared weights scaled by twice the variance, plus a constant that comes from the normalizing term of the Gaussian. And isn't affected when we change w. So finally we can get to the basing interpretation of weight decay or weight penalties. We're trying to minimize the negative log probability of the weights given in the data and that involves minimizing a term that depends on by the turn the weights namely how will we shift the targets and determine that depends only on the weights. Is derived from the log probability of the data given the weights, which if we assume Gaussian noise is added to the output of the model to make the prediction, then that log probability is the squared distance between the output of the net on the target value scaled by twice the variance of that Gaussian noise. Similarly, if we assume we have a Gaussian prior for the weights, the log probability of a weight under the prior is the squared value of that weight scaled by twice the variance of the Gaussian prior. So now let's take that equation and multiply through by two sigma squared D. So, we got a new cost function. And the first term, when we multiply through turns into simply the sum of all training cases of the squared difference between the output of the net and the target. That's the squared error that we typically minimize in the neural net. The second term now becomes, the ratio of two variances times the sum of the squares of the weights. And so what you see is, the ratio of those two variances is exactly the weight penalty. So we initially thought of weight penalties as just a number you make up to try and make things work better. Where you fix the value of the weight penalty by using a validation set. But now we see that if we make this Gaussian interpretation where we have a Gaussian prior and we have a Gaussian model of the relation of the output of the net to the target, Then the weight penalty is determined by the variances of those Gaussians. It's just the ratio of those variances. It's not an arbitrary thing at all within this framework.