Hi, and welcome to week four of practical Bayesian methods. This week we're going to cover Markov chain Monte Carlo, which is kind of a silver bullet of probabilistic programming, because it allows you to train, or to do inference in, almost any model without too much trouble. We'll discuss some extensions of this technique, that is, how to make it faster in some particular problems by exploiting the structure of the problem, and we'll also discuss some limitations, like the speed of the method.

So, let's start the discussion of Markov chain Monte Carlo with the Monte Carlo part. Monte Carlo originated in the Manhattan Project, which gave us the atomic bomb, and it was a quick and dirty answer to the question of what to do until the mathematicians arrive. The idea of Monte Carlo simulation is this: if you have a complicated model of some problem and you want to predict something about it, then instead of thinking carefully, writing down equations and solving them, you may just simulate your model lots and lots of times and then use, for example, the average of the simulations as some kind of smooth prediction of what will happen. Probably the reason this was invented in the Manhattan Project is that it was one of the first really huge scientific projects where there were already lots of integrals to approximate and lots of these kinds of simulations to run, and, second of all, where computers already existed in a reasonably useful state, so we could actually simulate things not mentally but automatically. The Monte Carlo algorithm turned out to be pretty efficient, and it made it onto the list of the top 10 algorithms of the 20th century.

The name Monte Carlo comes from the city of Monte Carlo, which is famous for its casinos, and probably also because, you know, everything in the Manhattan Project had to have a code name.

So let's see a small example of what I'm talking about: Monte Carlo applied to a really simple problem. Say you want to approximate the value of pi, and you know that pi is the area of the unit circle, a circle with radius one.
So we can, kind of, use this information to approximate it. The idea is that if you consider the unit square from zero to one on both axes, then the quarter of the circle which falls inside this square has exactly the area pi divided by four; it's just one-fourth of the circle. And this area equals an integral of an indicator function which says, at each point, whether that point is inside the circle or not. So if we integrate this function over the square, we'll get the area. How can we compute this integral? It's hard. But we can throw lots and lots of points into this unit square, look at the fraction of those points that fall into the quarter of the circle, and use that fraction as an approximation to what the integral really is.

You can see that this may not be the best method to approximate the value of pi, because here, for example, you spend something like 3,000 samples: you sample 3,000 points, and pi is approximated as 3.1, which is not the best accuracy you can get with this much computation. But anyway, the method is really simple; it's like two lines of code in almost any programming language. And this is the general theme of Monte Carlo simulations. Monte Carlo usually gives you something which is really easy to program, which is applicable to lots and lots of problems, and which scales very well with parallelization. So if you have, say, 100 computers, you can really easily parallelize this thing to be 100 times more efficient, which is not the case for many other algorithms that are much harder to parallelize. On the other hand, the Monte Carlo method can be slow compared to other alternatives. Usually, if you sit down carefully for a weekend, write down lots of equations and think, you may come up with a better way to solve your problem than just running a Monte Carlo simulation. Sometimes not, but often it's just a quick and dirty approach which gives you a first approximation. But anyway, it's very useful.
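Here is a minimal sketch of the estimator just described, in Python (assuming NumPy is available; the 3,000-point sample size simply matches the example above):

    import numpy as np

    n = 3000                                    # number of random points, as in the example above
    points = np.random.rand(n, 2)               # uniform samples in the unit square [0, 1] x [0, 1]
    inside = (points ** 2).sum(axis=1) <= 1.0   # indicator: does the point land in the quarter circle?
    pi_estimate = 4.0 * inside.mean()           # the fraction inside approximates pi / 4
    print(pi_estimate)                          # roughly 3.1 with this few samples

With only a few thousand points the estimate typically fluctuates in the second decimal place, which is consistent with the 3.1 figure mentioned above; the error shrinks roughly as one over the square root of the number of samples.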
To be a little bit more general about this family of techniques, let's say that we want to approximate the expected value of some function f with respect to some distribution p(x). To do that, we sample some N points from this distribution p(x) and use the average of f(x_i) over these samples as the approximation. So here we are using the sample average instead of the expected value. This estimate has lots of nice theoretical properties: it's unbiased, and if you sample enough points, you get closer and closer to the actual true answer, at least with high probability, and one can also say something about how fast you will converge to it.

So, this is Monte Carlo. And you may ask a question like: why do we care? Why do we need to approximate expected values? Well, in probabilistic programming there are a few reasons to do that; let's consider two of them.

First of all, suppose you want to do full Bayesian inference, which we covered a little bit in week one. Then, instead of having some parametric model and finding the best value of its parameters, you may want to treat the parameters as latent variables. And then, for a new object x whose label y you want to predict, given a training data set X_train and Y_train, you may want to just integrate out this latent variable, like this. So imagine that, for example, x is an image and you want to predict whether it's an image of a cat or a dog. Then p(y | x, w) may be, for example, a neural network with weights w, which takes as input the image x and outputs a distribution over labels y. And then, instead of using just one neural network with some kind of optimal or best set of weights w, you consider all possible neural networks with this architecture. So for each possible value of the weights w, you take the corresponding neural network, pass your image through it, write down its answer p(y), and then average all these predictions with weights given by the posterior distribution over the weights, p(w | X_train, Y_train). So it's like an infinitely large ensemble of neural networks, where the weights of the ensemble are given by the posterior distribution p(w | X_train, Y_train).
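Written out, the two approximations just described look like this (a sketch in LaTeX-style notation, with N samples and w_i denoting weights drawn from the posterior):

    \mathbb{E}_{p(x)}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i), \qquad x_i \sim p(x)

    p(y \mid x, X_{\text{train}}, Y_{\text{train}})
      = \int p(y \mid x, w)\, p(w \mid X_{\text{train}}, Y_{\text{train}})\, dw
      \approx \frac{1}{N} \sum_{i=1}^{N} p(y \mid x, w_i), \qquad w_i \sim p(w \mid X_{\text{train}}, Y_{\text{train}})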
And this p(w | X_train, Y_train) gives you some idea about how well particular weights w would behave as the weights of your neural network. So this value will be higher for reasonable weights and lower for weights that don't make sense. This way, you're doing full Bayesian inference, and you get some of the benefits of probabilistic modeling; for example, you can estimate the uncertainty of your predictions. Basically, instead of just a fixed value of the weights, you have a distribution over the weights.

And this thing is, of course, intractable. If you have a complicated function like a neural network here, you can't compute this integral analytically, so you have to do some approximation. One way to do it is to notice that this thing is just an expected value of the neural network output: we are computing the expected value of the neural network output with respect to some distribution over the weights. And so we can approximate it with Monte Carlo, if we can sample from p(w | X_train, Y_train). So this is one example where Monte Carlo can be useful for approximating an expected value in probabilistic modeling.

And you may ask: how can we find this posterior distribution over the weights, p(w | X_train, Y_train)? Well, the posterior distribution is proportional to the joint distribution: the likelihood p(Y_train | X_train, w) times the prior p(w), divided by some normalization constant. The numerator here is easy, because you defined the model yourself; you could have, for example, defined the prior p(w) to be normal, and the likelihood is just the output of your neural network. So you can compute the numerator for any values of y, x and w. But the denominator, the normalization constant Z, is much harder: it contains an integral over the weights, so you can't even compute it. So it's kind of a general situation where you know the distribution from which you want to sample, but only up to a normalization constant. Right?
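In formulas, the setup from this part looks roughly like this (a sketch; \hat{p} is just a name for the unnormalized density):

    p(w \mid X_{\text{train}}, Y_{\text{train}})
      = \frac{p(Y_{\text{train}} \mid X_{\text{train}}, w)\, p(w)}{Z},
    \qquad Z = \int p(Y_{\text{train}} \mid X_{\text{train}}, w)\, p(w)\, dw

    \hat{p}(w) = p(Y_{\text{train}} \mid X_{\text{train}}, w)\, p(w)

The unnormalized density \hat{p}(w) is easy to evaluate pointwise, while the constant Z requires an intractable integral over all weights.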
Another example where this kind of expected-value approximation can be useful in probabilistic modeling is the M-step of the expectation maximization algorithm, which we covered in week two. On the E-step of expectation maximization, we found the posterior distribution of the latent variable t given the data and the parameters. And on the M-step, we want to maximize, with respect to the parameters theta, the expected value of the logarithm of the joint probability, where the expectation is taken with respect to this posterior distribution q. So here again, if the posterior distribution q is too hard to work with and we can't compute this integral analytically, we can do a few things. First of all, we can restrict q to lie in some simple class, like factorized distributions; that's what we did in the previous week, week three, on variational inference. So we approximate the posterior and then compute the integral analytically for the simplified posterior. Or we can use the exact posterior, but sample a set of latent-variable values from it, approximate the expected value by the average over these samples, and then maximize this approximation instead of the true expected value (this version is written out in the sketch below).

So these are two examples of where you may need expected values in probabilistic modeling and how Monte Carlo simulation can help you there. In the following videos, we are going to use the following formulation of this problem: we want to learn how to sample from a complicated probability distribution which we know only up to a normalization constant. And we will always write p(x), meaning that x is the thing we want to sample. So in the first example, full Bayesian inference, we had p(w) given some data, and in the second example we had the distribution of the latent variables, again given some data. But in all these cases we will just write p(x), meaning the distribution we want to sample from.
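For reference, the Monte Carlo version of the M-step mentioned above can be sketched as follows (using the week-two notation, with t_i denoting latent values sampled from the E-step posterior):

    \text{E-step:}\quad q(t) = p(t \mid X, \theta^{\text{old}})

    \text{M-step:}\quad \theta^{\text{new}}
      = \arg\max_{\theta}\; \mathbb{E}_{q(t)}\big[\log p(X, t \mid \theta)\big]
      \approx \arg\max_{\theta}\; \frac{1}{N} \sum_{i=1}^{N} \log p(X, t_i \mid \theta), \qquad t_i \sim q(t)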