In this video, I'm going to return to the idea of full Bayesian learning and explain a little bit more about how it works. And then in the following video, I'm going to show how it can be made practical.

In full Bayesian learning, we don't try to find a single best setting of the parameters. Instead, we try to find the full posterior distribution over all possible settings. That is, for every possible setting, we want a posterior probability density, and we want all those densities to add up to one. It's extremely computationally intensive to compute this for all but the simplest models. So, in the example earlier, we did it for a biased coin, which has just one parameter: how biased it is. But in general, for a neural net, it's impossible.

After we've computed the posterior distribution across all possible settings of the parameters, we can then make predictions by letting each different setting of the parameters make its own prediction, and then averaging all those predictions together, weighted by their posterior probabilities. This is also very computationally intensive. The advantage of doing this is that if we use the full Bayesian approach, we can use complicated models even when we don't have much data.

So there's a very interesting philosophical point here. We're now used to the idea of overfitting when you fit a complicated model to a small amount of data. But that's basically just a result of not bothering to get the full posterior distribution over the parameters. So frequentists would say, if you don't have much data, you should use a simple model. And that's true, but it's only true if you assume that fitting a model means finding the single best setting of the parameters. If you find the full posterior distribution, that gets rid of overfitting. If there's very little data, the full posterior distribution will typically give you very vague predictions, because many different settings of the parameters that make very different predictions will have significant posterior probability. As you get more data, the posterior probability will get more and more focused on a few settings of the parameters, and the posterior predictions will get much sharper.
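To make this concrete, here is a minimal sketch of full Bayesian learning for the biased-coin case mentioned above: a grid over the single bias parameter, a uniform prior, and a made-up count of heads. The prior, the grid resolution, and the data are illustrative assumptions, not details from the lecture.

```python
import numpy as np

# Minimal sketch of full Bayesian learning for a biased coin.
# The single parameter is p = probability of heads; we place a grid
# over (0, 1) and assume a uniform prior (both choices are illustrative).
p_grid = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(p_grid) / len(p_grid)

# Suppose we observed 2 heads in 3 flips (made-up data).
heads, flips = 2, 3
likelihood = p_grid**heads * (1 - p_grid)**(flips - heads)

# Posterior over every setting of the parameter, normalized to sum to one.
posterior = likelihood * prior
posterior /= posterior.sum()

# Predict the next flip by letting every setting make its own prediction
# and averaging, weighted by posterior probability.
p_next_head = np.sum(posterior * p_grid)
print(p_next_head)
```

The predicted probability of heads comes out close to 0.6 rather than the maximum-likelihood 2/3, because settings of the bias other than the single best-fitting one still carry posterior weight.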
So, here's a classic example of overfitting. We've got six data points, and we fitted a fifth-order polynomial, so it should go exactly through the data, which it more or less does. We also fitted a straight line, which only has two degrees of freedom.

So which model do you believe? The model that has six coefficients and fits the data almost perfectly, or the model that only has two coefficients and doesn't fit the data all that well? It's obvious that the complicated model fits better, but you don't believe it. It's not economical, and it also makes silly predictions. So if you look at the blue arrow, if that's the input value and you're trying to predict the output value, the red curve will predict a value that's lower than any of the observed data points, which seems crazy, whereas the green line will predict a sensible value.

But everything changes if, instead of fitting one fifth-order polynomial, we start with a reasonable prior over fifth-order polynomials, for example, that the coefficients shouldn't be too big, and then we compute the full posterior distribution over fifth-order polynomials. I've shown you a sample from this distribution in the picture, where a thicker line means higher probability under the posterior. You'll see that some of those thin curves miss a few of the data points by quite a lot, but nevertheless they're quite close to most of the data points. Now we get much vaguer, but much more sensible, predictions. So, where the blue arrow is, you'll see the different models predict very different things, while on average they make a prediction quite close to the prediction made by the green line.

From a Bayesian perspective, there's no reason why the amount of data you collect should influence your prior beliefs about the complexity of the model. A true Bayesian would say, you have prior beliefs about how complicated things might be, and just because you haven't collected any data yet, it doesn't mean you think things are much simpler.
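Here is a minimal sketch of what the posterior over fifth-order polynomials could look like in code, assuming a zero-mean Gaussian prior on the coefficients (which keeps them small) and Gaussian observation noise. The toy data, the noise level, and the prior precision are all illustrative assumptions.

```python
import numpy as np

# Minimal sketch: posterior over fifth-order polynomials fitted to six points,
# assuming a zero-mean Gaussian prior on the coefficients and Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 6)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(6)   # six illustrative data points

degree, alpha, noise_var = 5, 1.0, 0.01            # prior precision alpha, noise variance
Phi = np.vander(x, degree + 1, increasing=True)    # polynomial basis, shape (6, 6)

# Conjugate Gaussian posterior over the coefficient vector w:
#   cov = (alpha*I + Phi^T Phi / noise_var)^-1,  mean = cov @ Phi^T y / noise_var
post_cov = np.linalg.inv(alpha * np.eye(degree + 1) + Phi.T @ Phi / noise_var)
post_mean = post_cov @ Phi.T @ y / noise_var

# Draw a few polynomials from the posterior; each one is a plausible fit,
# and averaging their predictions gives the vaguer but sensible Bayesian answer.
samples = rng.multivariate_normal(post_mean, post_cov, size=5)
x_test = 0.3                                       # e.g. where the blue arrow points
phi_test = np.vander([x_test], degree + 1, increasing=True)[0]
print(samples @ phi_test)        # individual predictions differ quite a lot
print(post_mean @ phi_test)      # posterior-mean prediction
```

Each sampled coefficient vector corresponds to one of the thin curves in the picture: individually they disagree at the blue arrow, but their average is a sensible prediction.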
So, we can approximate full Bayesian learning in a neural net if the neural net has very few parameters. The idea is that we put a grid over the parameter space, so each parameter is only allowed a few alternative values, and then we take the cross product of all those values for all the parameters. Now we get a set of grid points in the parameter space. At each of those grid points, we can see how well our model predicts the data; that is, if we're doing supervised learning, how well the model predicts the target outputs. And we can say that the posterior probability of that grid point is the product of how well it predicts the data and how likely it is under the prior, with the whole thing normalized so that the posterior probabilities add up to one.

This is still very expensive, but notice it has some attractive features. There's no gradient descent involved, and there are no local optimum issues. We're not following a path in the space; we're just evaluating a set of points in the space. Once we've decided on the posterior probability to assign to each grid point, we then use them all to make predictions on the test data. That's also expensive, but when there isn't much data, it will work much better than maximum likelihood or maximum a posteriori.

So, the way we predict the test output, given the test input, is we say: the probability of the test output given the test input is the sum over all grid points of the probability of that grid point, given the training data and given our prior, times the probability that we would get that test output, given the input and given that grid point. In other words, we have to take into account the fact that we might add noise to the output of the net before producing the test answer.

So, here's a picture of full Bayesian learning. We have a little net here that has four weights and two biases. If we allow nine possible values for each of those weights and biases, there would be nine to the sixth grid points in the parameter space. It's a big number, but we can cope with it. For each of those grid points, we compute the probability of the observed outputs on all the training cases. We multiply by the prior for the grid point, which might depend on the values of the weights, for example. And then we re-normalize to get the posterior probability over all the grid points. Then we make predictions using those grid points, but weight each of their predictions by its posterior probability.
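To make the grid recipe concrete, here is a minimal sketch using a single linear neuron with three parameters instead of the six-parameter net in the picture, so the grid has nine to the third rather than nine to the sixth points. The Gaussian prior, the Gaussian noise model, and the toy data are all illustrative assumptions.

```python
import itertools
import numpy as np

# Minimal sketch of grid-based full Bayesian learning.
# Toy model: a single linear neuron y = w1*x1 + w2*x2 + b with Gaussian output
# noise. Three parameters with nine allowed values each gives 9**3 grid points
# (the lecture's net has six parameters, hence 9**6; same idea, bigger grid).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))                 # five made-up training cases
y = X @ np.array([1.0, -2.0]) + 0.5 + 0.1 * rng.standard_normal(5)

values = np.linspace(-3, 3, 9)                  # nine allowed values per parameter
noise_var, prior_var = 0.1, 4.0                 # assumed noise and prior variances

grid = np.array(list(itertools.product(values, repeat=3)))   # shape (9**3, 3)

def log_likelihood(params):
    """How well this grid point predicts the training targets."""
    w1, w2, b = params
    pred = X @ np.array([w1, w2]) + b
    return -0.5 * np.sum((y - pred) ** 2) / noise_var

# Unnormalized log posterior = log likelihood + log prior at each grid point.
log_post = np.array([log_likelihood(p) for p in grid])
log_post += -0.5 * np.sum(grid ** 2, axis=1) / prior_var
post = np.exp(log_post - log_post.max())
post /= post.sum()                              # normalize so it sums to one

# Predict a test output: every grid point makes its own prediction, and we
# average them weighted by posterior probability.
x_test = np.array([0.5, -1.0])
preds = grid[:, :2] @ x_test + grid[:, 2]
print(np.sum(post * preds))                     # full Bayesian prediction
print(grid[np.argmax(post)])                    # for comparison: the best single grid point
```

Notice that the grid is simply evaluated once, with no gradient descent and no path through parameter space, which is exactly why there are no local-optimum issues.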