In this video, I'm going to describe a method developed by David MacKay in the 1990s for determining the weight penalties to use in a neural network without using a validation set. It's based on the idea that we can interpret weight penalties as doing MAP estimation, so that the magnitude of the weight penalty is related to the tightness of the prior distribution over the weights. MacKay showed that we can empirically fit both the weight penalties and the assumed noise in the output of the neural net, to get a method for fitting weight penalties that does not require a validation set and therefore allows us to have different weight penalties for different subsets of the connections in a neural network, something that would be very expensive to do using validation sets. MacKay went on to win competitions using this kind of method.

I'm now going to describe a simple and practical method developed by David MacKay for making use of the fact that we can interpret weight penalties as the ratio of two variances. After we've learned a model to minimize squared error, we can find the best value for the output noise variance, and that best value is found by simply using the variance of the residual errors.

We can also estimate the variance in the Gaussian prior over the weights. We have to start with some guess about what this variance should be. Then we do some learning, and then we use a very dirty trick called empirical Bayes: we set the variance of our prior to be the variance of the weights the model learned, because that's the variance that makes those weights most likely. This really violates a lot of the presuppositions of the Bayesian approach; we're using the data to decide what our prior beliefs are. So, once we've learned the weights, we fit a zero-mean Gaussian to the one-dimensional distribution of the learned weights, and then we take the variance of that Gaussian and use it as our prior variance.

One nice thing about this is that we can treat different subsets of the weights differently. For example, we can learn different variances for the weights in different layers. We don't need a validation set, so we can use all of the non-test data for training. And because we don't need validation sets to determine the weight penalties in different layers, we can actually have many different weight penalties, something that would be very hard to do with validation sets.
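To make that empirical Bayes step concrete, here is a minimal sketch in Python (not MacKay's own code) of how the two variances might be re-estimated after training, assuming we already have the residual errors on the training data and the learned weights grouped by layer; the function and variable names are illustrative.

```python
import numpy as np

def empirical_bayes_variances(residuals, weights_by_layer):
    """Sketch of the empirical Bayes re-estimation step.

    residuals        : array of (target - prediction) values on the training data
    weights_by_layer : dict mapping a layer name to its flat array of learned weights
    """
    # The best-fitting output noise variance is just the variance of the residual errors.
    noise_var = np.mean(np.square(residuals))

    # Fit a zero-mean Gaussian to each subset of weights (e.g. each layer):
    # its variance is the mean squared weight, and it becomes that subset's prior variance.
    prior_vars = {name: np.mean(np.square(w)) for name, w in weights_by_layer.items()}

    # Each subset's weight penalty is the ratio of noise variance to prior variance,
    # so different layers can end up with different weight penalties.
    weight_costs = {name: noise_var / v for name, v in prior_vars.items()}
    return noise_var, prior_vars, weight_costs
```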
So, here's MacKay's method. You start by guessing the noise variance and the weight prior variance; actually, all you really have to guess is their ratio. Then you do some gradient descent learning, trying to improve the weights. Then you reset the noise variance to be the variance of the residual errors, and you reset the weight prior variance to be the variance of the actually learned weights. And then you go back around this loop again. This actually works quite well in practice, and MacKay won several competitions this way.
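Putting the pieces together, the loop might look something like the following sketch. It assumes a hypothetical `train` function (not part of MacKay's code) that runs some gradient descent on a squared-error objective plus the current weight penalty and returns the learned weights and the training-set residuals.

```python
import numpy as np

def mackay_loop(train, n_outer_loops=10, init_weight_cost=0.1):
    """Sketch of the outer loop: alternate learning with re-estimating the variances."""
    # All that matters for learning is the ratio noise_var / prior_var,
    # so we only need an initial guess at the weight cost itself.
    weight_cost = init_weight_cost

    for _ in range(n_outer_loops):
        # Do some gradient descent learning with the current weight penalty.
        weights, residuals = train(weight_cost)

        # Reset the noise variance to the variance of the residual errors.
        noise_var = np.mean(np.square(residuals))

        # Reset the weight-prior variance by fitting a zero-mean Gaussian
        # to the learned weights (its variance is the mean squared weight).
        prior_var = np.mean(np.square(weights))

        # The new weight penalty is again the ratio of the two variances.
        weight_cost = noise_var / prior_var

    return weights, weight_cost
```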