In this video, I'm going to talk about the Bayesian interpretation of weight penalties. In the full Bayesian approach, we try to compute the posterior probability of every possible setting of the parameters of a model. But there's a much reduced form of the Bayesian approach, where we simply say: I'm going to look for the single set of parameters that is the best compromise between fitting my prior beliefs about what the parameters should be like, and fitting the data I've observed. This is called Maximum a Posteriori learning, and it gives us a nice explanation of what's really going on when we use weight decay to control the capacity of a model.

I'm now going to talk a bit about what's really going on when we minimize the squared error during supervised maximum likelihood learning. Finding the weight vector that minimizes the squared residuals, that is, the differences between the target values and the values predicted by the net, is equivalent to finding a weight vector that maximizes the log probability density of the correct answer. In order to see this equivalence, we have to assume that the correct answer is produced by adding Gaussian noise to the output of the neural net. So the idea is, we make a prediction by first running the neural net on the input to get the output, and then adding some Gaussian noise. And then we ask: what's the probability that when we do that, we get the correct answer?

So the model's output is the center of a Gaussian, and what we're interested in is having the target value get high probability density under that Gaussian, because the probability of producing the value t, given that the network gives an output of y, is just the probability density of t under a Gaussian centered at y.

So the math looks like this. Let's suppose that the output of the neural net on training case c is y_c, and this output is produced by applying the weights W to the input for case c. The probability that we'll get the correct target value when we add Gaussian noise to that output y_c is given by a Gaussian centered on y_c. So we're interested in the probability density of the target value under a Gaussian centered at the output of the neural net. And on the right here, we have that Gaussian distribution, with mean y_c. We also have to assume some variance, and that variance will be important later.
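As a sketch of the density just described, written out in LaTeX (here σ² stands for the assumed noise variance, which the lecture has not yet named at this point):

```latex
% Density of the target t_c under Gaussian noise added to the net's output y_c,
% with assumed noise variance \sigma^2:
p(t_c \mid y_c) = \frac{1}{\sqrt{2\pi\sigma^2}}
                  \exp\!\left(-\frac{(t_c - y_c)^2}{2\sigma^2}\right)
```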
If we now take logs and put in a minus sign, we see that the negative log probability density of the target value t_c, given that the network outputs y_c, is a constant that comes from the normalizing term of the Gaussian, plus the log of that exponential with the minus sign in it, which is just (t_c - y_c)^2 divided by twice the variance of the Gaussian. So what you see is that if our cost function is the negative log probability of getting the right answer, that turns into minimizing a squared distance. It's helpful to know that whenever you see a squared error being minimized, you can make a probabilistic interpretation of what's going on, and in that probabilistic interpretation you'll be maximizing the log probability under a Gaussian.

So the proper Bayesian approach is to find the full posterior distribution over all possible weight vectors. If there's more than a handful of weights, that's hopelessly difficult when you have a non-linear net. Bayesians have a lot of ways of approximating this distribution, often using Monte Carlo methods. But for the time being, let's try and do something simpler. Let's just try to find the most probable weight vector, the single setting of the weights that's most probable given the prior knowledge we have and given the data. So what we're going to try and do is find an optimal value of W by starting with some random weight vector, and then adjusting it in the direction that improves the probability of that weight vector given the data. It will only be a local optimum.

Now, it's going to be easier to work in the log domain than in the probability domain. So if we want to minimize a cost, we'd better use negative log probabilities.

Just an aside about why we maximize sums of log probabilities, or minimize sums of negative log probabilities. What we really want to do is maximize the probability of the data, which means maximizing the product of the probabilities of producing all the target values that we observed on all the different training cases. If we assume that the output errors on different cases are independent, we can write that down as the product over all training cases of the probability of producing the target value t_c, given the weights. That is, the product of the probabilities of producing t_c given the output that we're going to get from our network if we give it the input for case c and it has weights W.
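A sketch of those two steps in LaTeX (same assumed notation as above; writing D for the whole set of observed targets is my notation, not the lecture's):

```latex
% Negative log density of one target under the Gaussian noise model:
-\log p(t_c \mid y_c) = \tfrac{1}{2}\log\!\left(2\pi\sigma^2\right)
                        + \frac{(t_c - y_c)^2}{2\sigma^2}

% Assuming independent errors across training cases, the likelihood factorizes:
p(D \mid W) = \prod_c p\big(t_c \mid y_c(W)\big)
```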
The log function is monotonic, so it can't change where the maxima are. So instead of maximizing a product of probabilities, we can maximize a sum of log probabilities, and that typically works much better on a computer; it's much more stable. So we maximize the log probability of the data given the weights, which is simply maximizing the sum over all training cases of the log probability of the output for that training case, given the input and given the weights.

In Maximum a Posteriori learning, we're trying to find the set of weights that optimizes the trade-off between fitting our prior and fitting the data. So that's Bayes' theorem. If we take negative logs to get a cost, we get that the negative log of the probability of the weights given the data is the negative log of the prior term, plus the negative log of the data term, plus an extra term. That last extra term is an integral over all possible weight vectors, so it doesn't depend on W, and we can ignore it when we're optimizing W. The term that depends on the data is the negative log probability of the data given W, and that's our normal error term. And the term that only depends on W is the negative log probability of W under its prior.

Maximizing the log probability of a weight is related to minimizing a squared distance, in just the same way that maximizing the log probability of producing the correct target value is related to minimizing a squared distance. So minimizing the squared weights is equivalent to maximizing the log probability of the weights under a zero-mean Gaussian prior. So here's a Gaussian. It's got a mean of zero, and we want to maximize the probability of the weights, or the log probability of the weights, and to do that we obviously want w to be close to the mean of zero. The equation for the Gaussian is just like this, where the mean is zero so we don't have to put it in. And the negative log probability of w is then the squared weight scaled by twice the variance, plus a constant that comes from the normalizing term of the Gaussian and isn't affected when we change w.

So finally we can get to the Bayesian interpretation of weight decay, or weight penalties.
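A hedged sketch of the equations being described here, with σ_W² standing in for the variance of the Gaussian weight prior (the symbol is my choice; the lecture just says "the variance of the prior"):

```latex
% Bayes' theorem written as costs (negative logs); the last term is constant in W:
-\log p(W \mid D) = -\log p(D \mid W) \;-\; \log p(W) \;+\; \log p(D)

% Zero-mean Gaussian prior on a single weight w, with variance \sigma_W^2:
p(w) = \frac{1}{\sqrt{2\pi\sigma_W^2}}\exp\!\left(-\frac{w^2}{2\sigma_W^2}\right),
\qquad
-\log p(w) = \frac{w^2}{2\sigma_W^2} + \text{const}
```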
We're trying to minimize the negative log probability of the weights given the data, and that involves minimizing the sum of two terms: a term that depends on both the weights and the data, namely how well the net's outputs fit the targets, and a term that depends only on the weights.

The first term is derived from the log probability of the data given the weights. If we assume Gaussian noise is added to the output of the model to make the prediction, then the negative log probability of a target is the squared distance between the output of the net and the target value, scaled by twice the variance of that Gaussian noise. Similarly, if we assume we have a Gaussian prior for the weights, the negative log probability of a weight under the prior is the squared value of that weight, scaled by twice the variance of the Gaussian prior.

So now let's take that equation and multiply through by 2σ_D², twice the variance of the noise. We get a new cost function. The first term, when we multiply through, turns into simply the sum over all training cases of the squared difference between the output of the net and the target. That's the squared error we typically minimize in a neural net. The second term becomes the ratio of the two variances times the sum of the squares of the weights. And so what you see is that the ratio of those two variances is exactly the weight penalty.

So we initially thought of weight penalties as just a number you make up to try and make things work better, where you fix the value of the weight penalty by using a validation set. But now we see that if we make this Gaussian interpretation, where we have a Gaussian prior and a Gaussian model of the relation of the output of the net to the target, then the weight penalty is determined by the variances of those Gaussians. It's just the ratio of those variances. It's not an arbitrary thing at all within this framework.
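A sketch of the final cost just described (σ_D² is the variance of the Gaussian noise around the net's output, σ_W² the variance of the Gaussian weight prior, both notational assumptions on my part):

```latex
% Multiplying the MAP cost through by 2\sigma_D^2 gives squared error plus a weight
% penalty whose coefficient is the ratio of the two variances:
E = \sum_c \big(y_c - t_c\big)^2 \;+\; \frac{\sigma_D^2}{\sigma_W^2}\sum_i w_i^2
```

Under this reading, the weight-decay coefficient that is usually tuned on a validation set corresponds to the ratio σ_D²/σ_W².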