In this video, I'm going to describe how to make full Bayesian learning practical for neural networks that have thousands, and perhaps even millions, of weights. The technique that's used is a Monte Carlo method, which seems very odd the first time you hear about it. We use a random number generator to move around the space of weight vectors in a random way, but with a bias towards going downhill in our cost function. If we do this right, we get a beautiful property, which is that we sample weight vectors in proportion to their probability in the posterior distribution. And that means that by sampling a lot of weight vectors, we can get a good approximation to the full Bayesian method.

The number of grid points is exponential in the number of parameters, so we can't make a grid for more than a few parameters. If there is enough data to make most of the parameter vectors very unlikely, only a tiny fraction of the grid points will make a significant contribution to the predictions. So maybe we can just focus on evaluating this tiny fraction, if we can find it. An idea that makes Bayesian learning feasible is that it might be good enough just to sample weight vectors according to their posterior probabilities.

So if you look at this equation, the probability that we assign to a test output, given the input for the test case and the training data, is the sum over all points in weight space of the posterior probability of that point in weight space given the training data, times the probability distribution for the test values that we predict given that point in weight space, W_i, and given the test input. Now instead of adding up all the terms in that sum, we could just sample terms from that sum. What we do is we sample the weight vectors in proportion to that probability. So either we sample them or we don't, so they'll get a weight of one or zero, but the probability of getting a one, that is, the probability of being sampled, will be their posterior probability. So that will give us the correct expected value for the right-hand side. It'll have noise due to the sampling, but it'll have the correct expected value.
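The equation being referred to can be written out explicitly; below is a reconstruction of it, together with the sampling approximation just described. Here x and y are the test input and output, D is the training data, W_i is a point in weight space, W^(s) is a sampled weight vector, and S, the number of samples, is just a placeholder.

```latex
% Full Bayesian prediction: sum over all points in weight space
p(y \mid x, D) \;=\; \sum_{i} \, p(W_i \mid D)\; p(y \mid x, W_i)

% Monte Carlo approximation: average over S weight vectors sampled
% from the posterior, W^{(s)} \sim p(W \mid D)
p(y \mid x, D) \;\approx\; \frac{1}{S} \sum_{s=1}^{S} p\!\left(y \mid x, W^{(s)}\right)
```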
So here's a picture of what happens in standard backpropagation. On the right I've drawn the weight space, which of course is very high-dimensional and unbounded; this is a very bad picture of it, but it's the best I can do. In this weight space, I've drawn some contours which are meant to be contours of equal values of our cost function. The way backpropagation is normally used is that we start with some small values of the weights, and then we follow the gradient: we move downhill in our cost function, in the direction that increases the log-likelihood plus the log-prior, summed over all training cases. Eventually, we'll either end up at a local minimum, or we'll get stuck on a plateau, or we'll just move so slowly that we run out of patience. But the main point of this picture is that we follow a path from an initial point to some final, single point.

Now if we're using a sampling method, what we could do is start at the same place as we did before, but each time we update the weights, we add a bit of Gaussian noise. The weight vector will never settle down then; it'll keep on moving around. It'll wander over the space, but always preferring low-cost regions. That is, it'll tend to go downhill if it can. An important question is whether we can say anything about how often the weights will visit each point in that space.

So the red dots are meant to be samples we took of the weights as we wandered around the space. The idea is, we might save the weights after every 10,000 steps. And if you look at those red dots, a few of them are in high-cost regions, because those regions are quite big. The deepest minimum has the most red dots, and the other minima also have red dots. The dots aren't right at the bottom of the minima, because they're noisy samples.

If we add that Gaussian noise in just the right way, there's a wonderful property of Markov chain Monte Carlo. It's an amazing fact: if we wander around for long enough, the weight vectors will be unbiased samples from the true posterior distribution over weight vectors. That is, those red dots we saw in the previous slide will be sampled from the posterior: a weight vector that is highly probable under the posterior is much more likely to be represented by a red dot than a weight vector that is highly improbable. This is called Markov chain Monte Carlo, and it makes it feasible to use Bayesian learning with thousands of parameters.

The method I suggested of adding some Gaussian noise is called the Langevin method, and it's not the most efficient method.
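To make that kind of noisy update concrete, here is a minimal sketch in Python. It assumes a toy linear model with squared error and weight decay as a stand-in for a real network's cost; the function names, the step size eps, the noise scale, and save_every are illustrative choices, not anything specified in the lecture.

```python
import numpy as np

def neg_log_posterior_grad(w, X, y):
    """Gradient of the cost (negative log-likelihood plus negative log-prior).
    Toy stand-in: a linear model with squared error and a Gaussian prior on w,
    instead of backpropagation through a real network."""
    return X.T @ (X @ w - y) + 0.1 * w        # data term + weight-decay (prior) term

def langevin_samples(w0, X, y, eps=1e-3, n_steps=200_000, save_every=10_000):
    """Noisy gradient descent: each update moves a little downhill on the cost
    and then adds Gaussian noise, so the weight vector never settles down but
    keeps wandering, preferring low-cost regions."""
    w = w0.copy()
    samples = []                              # the "red dots"
    rng = np.random.default_rng(0)
    for step in range(1, n_steps + 1):
        w = w - eps * neg_log_posterior_grad(w, X, y)
        w = w + np.sqrt(2 * eps) * rng.standard_normal(w.shape)
        if step % save_every == 0:            # e.g. save the weights every 10,000 steps
            samples.append(w.copy())
    return samples
```

The noise standard deviation is tied to the step size here, sqrt(2 * eps), which is the usual way to make the wandering weight vector visit regions of weight space roughly in proportion to their posterior probability when the steps are small; calling langevin_samples then returns a spread of saved weight vectors rather than a single final point.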
There are more sophisticated methods that are more efficient. What I mean by more efficient is that they don't need to wander around weight space for as long before you can start taking those red samples.

Full Bayesian learning can actually be done with mini-batches. When we compute the gradient of the cost function on a random mini-batch, we're going to get an unbiased estimate, but with sampling noise. And the idea is to use that sampling noise to provide the noise that the Markov chain Monte Carlo method needs. It's a very clever idea.

Recently, Welling and his collaborators made it work nicely, so they could fairly efficiently get samples from the posterior distribution over weights using mini-batch methods. This should make it possible to use full Bayesian learning for much larger networks, where you have to train them with mini-batches to have any hope of ever finishing training them.
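As a rough illustration of how mini-batch gradient noise plus a little injected noise can play that role, here is a sketch that reuses the toy model from the earlier sketch. The (N / batch_size) scaling makes the mini-batch gradient an unbiased estimate of the full gradient; the decreasing step-size schedule and all of the constants are illustrative assumptions, not the actual method of Welling and his collaborators.

```python
import numpy as np

def sgld_samples(w0, X, y, eps0=1e-4, n_steps=100_000, batch_size=100, save_every=5_000):
    """Mini-batch sketch: estimate the gradient of the cost on a random
    mini-batch (scaled up to the full data set), take a small noisy step,
    and slowly shrink the step size. The mini-batch sampling noise plus the
    injected Gaussian noise supplies the randomness the Markov chain needs."""
    N = X.shape[0]
    w = w0.copy()
    samples = []
    rng = np.random.default_rng(0)
    for step in range(1, n_steps + 1):
        eps = eps0 / (1.0 + step) ** 0.55                 # slowly decreasing step size
        idx = rng.choice(N, size=batch_size, replace=False)
        grad = (N / batch_size) * (X[idx].T @ (X[idx] @ w - y[idx])) + 0.1 * w
        w = w - eps * grad
        w = w + np.sqrt(2 * eps) * rng.standard_normal(w.shape)
        if step % save_every == 0:
            samples.append(w.copy())
    return samples
```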