In this video, I'm going to describe how to make full Bayesian learning practical for neural networks that have thousands, and perhaps even millions, of weights. The technique that's used is a Monte Carlo method, which seems very odd the first time you hear about it. We use a random number generator to move around the space of weight vectors in a random way, but with a bias towards going downhill in our cost function. If we do this right, we get a beautiful property, which is that we sample weight vectors in proportion to their probability in the posterior distribution. And that means that by sampling a lot of weight vectors, we can get a good approximation to the full Bayesian method.

The number of grid points is exponential in the number of parameters, so we can't make a grid for more than a few parameters. If there is enough data to make most of the parameter vectors very unlikely, only a tiny fraction of the grid points will make a significant contribution to the predictions. So maybe we can just focus on evaluating this tiny fraction, if we can find it. An idea that makes Bayesian learning feasible is that it might be good enough just to sample weight vectors according to their posterior probabilities.

So if you look at this equation, the probability that we assign to a test output, given the input for the test case and the training data, is the sum over all points in weight space of the posterior probability of that point in weight space given the training data, times the probability distribution for the test values that we predict given that point in weight space, W_i, and given the test input. Now instead of adding up all the terms in that sum, we could just sample terms from that sum. What we do is we sample the weight vectors in proportion to that probability. So either we sample them or we don't, so they'll get a weight of one or zero, but the probability of getting a one, that is, the probability of being sampled, will be their posterior probability. So that will give us the correct expected value for the right-hand side. It'll have noise due to the sampling, but it'll have the correct expected value.
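The equation being referred to can be written out explicitly; below is a reconstruction of it, together with the sampling approximation just described. Here x and y are the test input and output, D is the training data, W_i is a point in weight space, W^(s) is a sampled weight vector, and S, the number of samples, is just a placeholder.

```latex
% Full Bayesian prediction: sum over all points in weight space
p(y \mid x, D) \;=\; \sum_{i} \, p(W_i \mid D)\; p(y \mid x, W_i)

% Monte Carlo approximation: average over S weight vectors sampled
% from the posterior, W^{(s)} \sim p(W \mid D)
p(y \mid x, D) \;\approx\; \frac{1}{S} \sum_{s=1}^{S} p\!\left(y \mid x, W^{(s)}\right)
```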
So here's a picture of what happens in standard backpropagation. On the right I've drawn the weight space, which of course is very high-dimensional and unbounded; this is a very bad picture of it, but it's the best I can do. In this weight space, I've drawn some contours which are meant to be contours of equal values of our cost function. The way backpropagation is normally used is that we start with some small values of the weights, and then we follow the gradient: we move downhill in our cost function, in the direction that increases the log-likelihood plus the log-prior, summed over all training cases. Eventually, we'll either end up at a local minimum, or we'll get stuck on a plateau, or we'll just move so slowly that we run out of patience. But the main point of this picture is that we follow a path from an initial point to some final, single point.

Now if we're using a sampling method, what we could do is start at the same place as we did before, but each time we update the weights, we add a bit of Gaussian noise. The weight vector will never settle down then; it'll keep on moving around. It'll wander over the space, but always preferring low-cost regions. That is, it'll tend to go downhill if it can. An important question is whether we can say anything about how often the weights will visit each point in that space.

So the red dots are meant to be samples we took of the weights as we wandered around the space. The idea is, we might save the weights after every 10,000 steps. And if you look at those red dots, a few of them are in high-cost regions, because those regions are quite big. The deepest minimum has the most red dots, and the other minima also have red dots. The dots aren't right at the bottom of the minima, because they're noisy samples.

If we add that Gaussian noise in just the right way, there's a wonderful property of Markov chain Monte Carlo. It's an amazing fact: if we wander around for long enough, the weight vectors will be unbiased samples from the true posterior distribution over weight vectors. That is, those red dots we saw in the previous slide will be sampled from the posterior: a weight vector that is highly probable under the posterior is much more likely to be represented by a red dot than a weight vector that is highly improbable. This is called Markov chain Monte Carlo, and it makes it feasible to use Bayesian learning with thousands of parameters.

The method I suggested of adding some Gaussian noise is called the Langevin method, and it's not the most efficient method.
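To make that kind of noisy update concrete, here is a minimal sketch in Python. It assumes a toy linear model with squared error and weight decay as a stand-in for a real network's cost; the function names, the step size eps, the noise scale, and save_every are illustrative choices, not anything specified in the lecture.

```python
import numpy as np

def neg_log_posterior_grad(w, X, y):
    """Gradient of the cost (negative log-likelihood plus negative log-prior).
    Toy stand-in: a linear model with squared error and a Gaussian prior on w,
    instead of backpropagation through a real network."""
    return X.T @ (X @ w - y) + 0.1 * w        # data term + weight-decay (prior) term

def langevin_samples(w0, X, y, eps=1e-3, n_steps=200_000, save_every=10_000):
    """Noisy gradient descent: each update moves a little downhill on the cost
    and then adds Gaussian noise, so the weight vector never settles down but
    keeps wandering, preferring low-cost regions."""
    w = w0.copy()
    samples = []                              # the "red dots"
    rng = np.random.default_rng(0)
    for step in range(1, n_steps + 1):
        w = w - eps * neg_log_posterior_grad(w, X, y)
        w = w + np.sqrt(2 * eps) * rng.standard_normal(w.shape)
        if step % save_every == 0:            # e.g. save the weights every 10,000 steps
            samples.append(w.copy())
    return samples
```

The noise standard deviation is tied to the step size here, sqrt(2 * eps), which is the usual way to make the wandering weight vector visit regions of weight space roughly in proportion to their posterior probability when the steps are small; calling langevin_samples then returns a spread of saved weight vectors rather than a single final point.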
There are more sophisticated methods that are more efficient. What I mean by more efficient is that they don't need to wander around weight space for as long before you can start taking those red samples.

Full Bayesian learning can actually be done with mini-batches. When we compute the gradient of the cost function on a random mini-batch, we're going to get an unbiased estimate, but with sampling noise. And the idea is to use that sampling noise to provide the noise that the Markov chain Monte Carlo method needs. It's a very clever idea.

Recently, Welling and his collaborators made it work nicely, so they could fairly efficiently get samples from the posterior distribution over weights using mini-batch methods. This should make it possible to use full Bayesian learning for much larger networks, where you have to train them with mini-batches to have any hope of ever finishing training them.
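As a rough illustration of how mini-batch gradient noise plus a little injected noise can play that role, here is a sketch that reuses the toy model from the earlier sketch. The (N / batch_size) scaling makes the mini-batch gradient an unbiased estimate of the full gradient; the decreasing step-size schedule and all of the constants are illustrative assumptions, not the actual method of Welling and his collaborators.

```python
import numpy as np

def sgld_samples(w0, X, y, eps0=1e-4, n_steps=100_000, batch_size=100, save_every=5_000):
    """Mini-batch sketch: estimate the gradient of the cost on a random
    mini-batch (scaled up to the full data set), take a small noisy step,
    and slowly shrink the step size. The mini-batch sampling noise plus the
    injected Gaussian noise supplies the randomness the Markov chain needs."""
    N = X.shape[0]
    w = w0.copy()
    samples = []
    rng = np.random.default_rng(0)
    for step in range(1, n_steps + 1):
        eps = eps0 / (1.0 + step) ** 0.55                 # slowly decreasing step size
        idx = rng.choice(N, size=batch_size, replace=False)
        grad = (N / batch_size) * (X[idx].T @ (X[idx] @ w - y[idx])) + 0.1 * w
        w = w - eps * grad
        w = w + np.sqrt(2 * eps) * rng.standard_normal(w.shape)
        if step % save_every == 0:
            samples.append(w.copy())
    return samples
```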