In this video, I'm going to return to the idea of full Bayesian learning and explain a little bit more about how it works. And then in the following video, I'm going to show how it can be made practical.

In full Bayesian learning, we don't try to find a single best setting of the parameters. Instead, we try to find the full posterior distribution over all possible settings. That is, for every possible setting, we want a posterior probability density, and we want all those densities to add up to one. It's extremely computationally intensive to compute this for all but the simplest models. So, in the example earlier, we did it for a biased coin, which has just one parameter: how biased it is. But in general, for a neural net, it's impossible.

After we've computed the posterior distribution across all possible settings of the parameters, we can then make predictions by letting each different setting of the parameters make its own prediction, and then averaging all those predictions together, weighted by their posterior probabilities. This is also very computationally intensive. The advantage of doing this is that if we use the full Bayesian approach, we can use complicated models even when we don't have much data.

So there's a very interesting philosophical point here. We're now used to the idea of overfitting when you fit a complicated model to a small amount of data. But that's basically just a result of not bothering to get the full posterior distribution over the parameters. So frequentists would say, if you don't have much data, you should use a simple model. And that's true, but it's only true if you assume that fitting a model means finding the single best setting of the parameters. If you find the full posterior distribution, that gets rid of overfitting. If there's very little data, the full posterior distribution will typically give you very vague predictions, because many different settings of the parameters that make very different predictions will have significant posterior probability. As you get more data, the posterior probability will get more and more focused on a few settings of the parameters, and the posterior predictions will get much sharper.
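To make this concrete, here is a minimal sketch of full Bayesian learning for the biased-coin case mentioned above: a grid over the single bias parameter, a uniform prior, and a made-up count of heads. The prior, the grid resolution, and the data are illustrative assumptions, not details from the lecture.

```python
import numpy as np

# Minimal sketch of full Bayesian learning for a biased coin.
# The single parameter is p = probability of heads; we place a grid
# over (0, 1) and assume a uniform prior (both choices are illustrative).
p_grid = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(p_grid) / len(p_grid)

# Suppose we observed 2 heads in 3 flips (made-up data).
heads, flips = 2, 3
likelihood = p_grid**heads * (1 - p_grid)**(flips - heads)

# Posterior over every setting of the parameter, normalized to sum to one.
posterior = likelihood * prior
posterior /= posterior.sum()

# Predict the next flip by letting every setting make its own prediction
# and averaging, weighted by posterior probability.
p_next_head = np.sum(posterior * p_grid)
print(p_next_head)
```

The predicted probability of heads comes out close to 0.6 rather than the maximum-likelihood 2/3, because settings of the bias other than the single best-fitting one still carry posterior weight.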
So, here's a classic example of overfitting. We've got six data points, and we fitted a fifth-order polynomial, so it should go exactly through the data, which it more or less does. We also fitted a straight line, which only has two degrees of freedom.

So which model do you believe? The model that has six coefficients and fits the data almost perfectly, or the model that only has two coefficients and doesn't fit the data all that well? It's obvious that the complicated model fits better, but you don't believe it. It's not economical, and it also makes silly predictions. So if you look at the blue arrow, if that's the input value and you're trying to predict the output value, the red curve will predict a value that's lower than any of the observed data points, which seems crazy, whereas the green line will predict a sensible value.

But everything changes if, instead of fitting one fifth-order polynomial, we start with a reasonable prior over fifth-order polynomials, for example, that the coefficients shouldn't be too big, and then we compute the full posterior distribution over fifth-order polynomials. I've shown you a sample from this distribution in the picture, where a thicker line means higher probability under the posterior. You'll see that some of those thin curves miss a few of the data points by quite a lot, but nevertheless they're quite close to most of the data points. Now we get much vaguer, but much more sensible, predictions. So, where the blue arrow is, you'll see the different models predict very different things, while on average they make a prediction quite close to the prediction made by the green line.

From a Bayesian perspective, there's no reason why the amount of data you collect should influence your prior beliefs about the complexity of the model. A true Bayesian would say, you have prior beliefs about how complicated things might be, and just because you haven't collected any data yet, it doesn't mean you think things are much simpler.
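Here is a minimal sketch of what the posterior over fifth-order polynomials could look like in code, assuming a zero-mean Gaussian prior on the coefficients (which keeps them small) and Gaussian observation noise. The toy data, the noise level, and the prior precision are all illustrative assumptions.

```python
import numpy as np

# Minimal sketch: posterior over fifth-order polynomials fitted to six points,
# assuming a zero-mean Gaussian prior on the coefficients and Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 6)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(6)   # six illustrative data points

degree, alpha, noise_var = 5, 1.0, 0.01            # prior precision alpha, noise variance
Phi = np.vander(x, degree + 1, increasing=True)    # polynomial basis, shape (6, 6)

# Conjugate Gaussian posterior over the coefficient vector w:
#   cov = (alpha*I + Phi^T Phi / noise_var)^-1,  mean = cov @ Phi^T y / noise_var
post_cov = np.linalg.inv(alpha * np.eye(degree + 1) + Phi.T @ Phi / noise_var)
post_mean = post_cov @ Phi.T @ y / noise_var

# Draw a few polynomials from the posterior; each one is a plausible fit,
# and averaging their predictions gives the vaguer but sensible Bayesian answer.
samples = rng.multivariate_normal(post_mean, post_cov, size=5)
x_test = 0.3                                       # e.g. where the blue arrow points
phi_test = np.vander([x_test], degree + 1, increasing=True)[0]
print(samples @ phi_test)        # individual predictions differ quite a lot
print(post_mean @ phi_test)      # posterior-mean prediction
```

Each sampled coefficient vector corresponds to one of the thin curves in the picture: individually they disagree at the blue arrow, but their average is a sensible prediction.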
So, we can approximate full Bayesian learning in a neural net if the neural net has very few parameters. The idea is that we put a grid over the parameter space, so each parameter is only allowed a few alternative values, and then we take the cross product of all those values for all the parameters. Now we get a set of grid points in the parameter space. At each of those grid points, we can see how well our model predicts the data; that is, if we're doing supervised learning, how well the model predicts the target outputs. And we can say that the posterior probability of that grid point is the product of how well it predicts the data and how likely it is under the prior, with the whole thing normalized so that the posterior probabilities add up to one.

This is still very expensive, but notice it has some attractive features. There's no gradient descent involved, and there are no local optimum issues. We're not following a path in the space; we're just evaluating a set of points in the space. Once we've decided on the posterior probability to assign to each grid point, we then use them all to make predictions on the test data. That's also expensive, but when there isn't much data, it will work much better than maximum likelihood or maximum a posteriori.

So, the way we predict the test output, given the test input, is we say: the probability of the test output given the test input is the sum over all grid points of the probability of that grid point, given the training data and given our prior, times the probability that we would get that test output, given the input and given that grid point. In other words, we have to take into account the fact that we might add noise to the output of the net before producing the test answer.

So, here's a picture of full Bayesian learning. We have a little net here that has four weights and two biases. If we allow nine possible values for each of those weights and biases, there would be nine to the sixth grid points in the parameter space. It's a big number, but we can cope with it. For each of those grid points, we compute the probability of the observed outputs on all the training cases. We multiply by the prior for the grid point, which might depend on the values of the weights, for example. And then we re-normalize to get the posterior probability over all the grid points. Then we make predictions using those grid points, but weight each of their predictions by its posterior probability.
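To make the grid recipe concrete, here is a minimal sketch using a single linear neuron with three parameters instead of the six-parameter net in the picture, so the grid has nine to the third rather than nine to the sixth points. The Gaussian prior, the Gaussian noise model, and the toy data are all illustrative assumptions.

```python
import itertools
import numpy as np

# Minimal sketch of grid-based full Bayesian learning.
# Toy model: a single linear neuron y = w1*x1 + w2*x2 + b with Gaussian output
# noise. Three parameters with nine allowed values each gives 9**3 grid points
# (the lecture's net has six parameters, hence 9**6; same idea, bigger grid).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))                 # five made-up training cases
y = X @ np.array([1.0, -2.0]) + 0.5 + 0.1 * rng.standard_normal(5)

values = np.linspace(-3, 3, 9)                  # nine allowed values per parameter
noise_var, prior_var = 0.1, 4.0                 # assumed noise and prior variances

grid = np.array(list(itertools.product(values, repeat=3)))   # shape (9**3, 3)

def log_likelihood(params):
    """How well this grid point predicts the training targets."""
    w1, w2, b = params
    pred = X @ np.array([w1, w2]) + b
    return -0.5 * np.sum((y - pred) ** 2) / noise_var

# Unnormalized log posterior = log likelihood + log prior at each grid point.
log_post = np.array([log_likelihood(p) for p in grid])
log_post += -0.5 * np.sum(grid ** 2, axis=1) / prior_var
post = np.exp(log_post - log_post.max())
post /= post.sum()                              # normalize so it sums to one

# Predict a test output: every grid point makes its own prediction, and we
# average them weighted by posterior probability.
x_test = np.array([0.5, -1.0])
preds = grid[:, :2] @ x_test + grid[:, 2]
print(np.sum(post * preds))                     # full Bayesian prediction
print(grid[np.argmax(post)])                    # for comparison: the best single grid point
```

Notice that the grid is simply evaluated once, with no gradient descent and no path through parameter space, which is exactly why there are no local-optimum issues.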