[MUSIC]

In the previous videos we familiarized ourselves with the Monte Carlo part. So now let's talk about the Markov chain part, which is a really cool idea for generating samples from complicated multi-dimensional distributions efficiently.

So what is a Markov chain? Imagine a frog sitting on a water lily, living in discrete time. At each timestamp, each second for example, it makes a decision whether to stay on its current lily or to jump to the other one. And its decisions are predefined in a probabilistic way. If it is currently on the left water lily, it will stay there with probability 0.3, and it will move with probability 0.7. And if it is on the right one, it will stay or move with probability 0.5 each.

So we can say that we have a dynamic system which has a state, and the state shows us where the frog is now. This dynamic system evolves over time, its state changes, and the rules by which the state changes are as follows: the probability of transitioning from the left state to the left state is 0.3, and so on. So you can define the behavior of your dynamic system with these transition probabilities.

Now we can simulate our dynamic system. We can see that, for example, at timestamp 1 the frog was on the left, then maybe it jumped to the right at timestamp 2, then maybe it stayed on the right, then went to the left, and then, for example, to the right. And we can write down the path through the state space which our dynamic system made: left, right, right, left, right.

This thing is called a Markov chain because the next state depends only on the current state. The frog doesn't care where it was ten time steps ago; the only thing it cares about is where it is now. To make its decision, it just needs to know the current state.
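To make this concrete, here is a minimal sketch in Python (NumPy and the function name `simulate` are my assumptions; the transition probabilities are the ones from the example) that simulates the frog's walk and prints one path through the state space:

```python
import numpy as np

# Transition probabilities from the example.
# Rows = current state, columns = next state; order is [left, right].
T = np.array([[0.3, 0.7],   # from left:  stay with 0.3, jump with 0.7
              [0.5, 0.5]])  # from right: jump with 0.5, stay with 0.5

rng = np.random.default_rng(0)

def simulate(n_steps, start=0):
    """Simulate the frog for n_steps and return the visited states."""
    path = [start]
    for _ in range(n_steps):
        # The next state depends only on the current one (Markov property).
        path.append(rng.choice(2, p=T[path[-1]]))
    return path

states = ["left", "right"]
print([states[s] for s in simulate(5)])  # e.g. ['left', 'right', 'right', ...]
```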
Okay, so we can ask ourselves a question: where will the frog be after 10 timestamps, for example? Let's say we started at timestamp 1, and the frog was on the left. So with probability 1 it was on the left, and with probability 0 it was on the right. What is the probability that it will be on the left or on the right at timestamp 2? Well, it's 0.3 and 0.7, right?

Okay, and what about timestamp 3? It's trickier, because we don't know where the frog was before. But using the law of total probability we can check all the possibilities for the previous state and use them to compute the next probabilities. The law of total probability tells us that the probability of the state at timestamp 3 equals the probability of that state given that at the second timestamp the frog was on the left, times the probability that it was on the left at the second timestamp, plus the same thing given that it was on the right at the second timestamp. And we know all these values, because the conditional probability is just the transition probability, for example the probability that the frog moves from left to right, and the probability of x2 is what we have just computed in the previous line of the table.

So we can write it down. The probability of being on the left at the third timestamp is

p(x3 = left) = p(x3 = left | x2 = left) p(x2 = left) + p(x3 = left | x2 = right) p(x2 = right) = 0.3 · 0.3 + 0.5 · 0.7 = 0.44

Here the first 0.3 is the probability of transitioning from left to left, the second 0.3 comes from the previous line of the table, 0.5 is the transition probability of moving from right to left, and 0.7 is the prior probability of being on the right at the previous timestamp. And we can write the same thing for being in the right state at timestamp 3, so we get the numbers 0.44 and 0.56.

We can continue this table for as long as we want, and we'll find that it converges to numbers around 0.42 and 0.58. So after 1 million steps, the probability that the frog is on the left is almost exactly 0.42, and the probability that it is on the right is almost exactly 0.58.
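This table is just a vector–matrix product applied once per timestamp. Here is a minimal sketch (reusing the T matrix assumed above) that prints the table and shows it converging:

```python
import numpy as np

T = np.array([[0.3, 0.7],
              [0.5, 0.5]])

p = np.array([1.0, 0.0])  # timestamp 1: the frog is on the left for sure
for t in range(1, 11):
    print(f"t={t:2d}: p(left)={p[0]:.4f}, p(right)={p[1]:.4f}")
    # Law of total probability: p_next(x') = sum_x p(x) * T(x -> x')
    p = p @ T

# Prints 1.0/0.0, then 0.3/0.7, then 0.44/0.56, and converges towards
# the exact stationary values 5/12 ≈ 0.4167 and 7/12 ≈ 0.5833.
```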
Which basically means that if we simulate our frog for long enough and then write down where it ended up on the one millionth step, we will get a sample from this discrete distribution over the values. So with probability 0.42 we'll get left, and with probability 0.58 we'll get right.

So let's look at how it works. Say we simulate our frog for 1 million steps, and the last state was, for example, left. We write down this left, and then simulate another path through the state space. So we begin the simulation again, and this time the last state was, for example, right. We write it down too, and by repeating this several times, we get a sample from the distribution of the millionth step of our dynamic system, given that the frog started on the left. And this sample will be close to the probabilities 0.42 and 0.58. Even here we can see that we have two out of five lefts and three out of five rights, which are close to 0.42 and 0.58.

Of course, this is a really lousy way to generate samples from a distribution with just two values; there is a much simpler way, which we discussed in the previous video. But note that this approach, building a dynamic system, simulating it, and seeing where it ends up, works even in much more complicated cases.

So what if you have, for example, ten lilies? Or 1 billion lilies? If you want to sample from your distribution in the naive way, and your variable has a billion possible values, you'll have to spend an amount of time proportional to 1 billion to generate one sample. If instead you use this idea of simulating some dynamic system and then writing down its last position, and if you do not that many steps, like 1,000, and hope that everything has converged to the true distribution, then you can get a random number from this complicated distribution without the trouble of the naive approach.

Or maybe the frog can be continuous and can jump from any point on a 2D plane to any other point. In this case, you will be able to generate samples from a continuous distribution by simulating your dynamic system. So you can still sample efficiently, even with a continuous dynamic system, by just following the frog's path on the 2D plane and writing down its last position.
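The same trick in the continuous case can be illustrated with a one-dimensional chain (this particular chain, an AR(1) process, is my choice of illustration, not one from the lecture): each step sets x to 0.5·x plus standard normal noise, and the chain converges to the Gaussian N(0, 4/3) no matter where it starts.

```python
import numpy as np

rng = np.random.default_rng(1)

def last_position(n_steps, start=100.0):
    """A continuous-state Markov chain: x <- 0.5 * x + N(0, 1) noise.
    Its stationary distribution is N(0, 1 / (1 - 0.5**2)) = N(0, 4/3)."""
    x = start                      # even a far-off start gets forgotten
    for _ in range(n_steps):
        x = 0.5 * x + rng.normal()
    return x

samples = [last_position(1000) for _ in range(5000)]
print("mean     ≈", np.mean(samples))  # ≈ 0.0
print("variance ≈", np.var(samples))   # ≈ 1.33
```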
So the idea of Markov chain simulation applied to generating samples of random values is as follows. We have a distribution we want to sample from; it may be continuous, it may be discrete. We build a Markov chain that converges to this distribution. Then we start from some initialization x0, and we simulate our Markov chain for long enough, for example for 1,000 iterations. So we start with x0, then we generate x1 by using the transition probability, and so on. And we hope that after 1,000 steps the chain has converged to the true distribution p(x), because we built our Markov chain this way. Then, after these 1,000 steps, we throw away the first 1,000 states and start writing down the states that follow, because they will all be from the distribution p(x), or at least close to it, if our Markov chain has converged close enough.

Okay, so this is the general idea. In the following videos we'll discuss how to build a Markov chain with this property, that is, a Markov chain that converges to the distribution you want to sample from.

But first, let's discuss a little whether a Markov chain converges anywhere at all; in which cases it does converge, and in which it doesn't. The first observation here is that a Markov chain doesn't have to converge to anything. For example, if you take a dynamic system like this one, the frog always swaps the lily at each timestamp. And if you look at the table, it will look like this: the probability of being on the left at the first timestamp was 1, and the probability of being on the right was 0. Then the probabilities swap, so the frog will certainly be on the right at the next timestamp, certainly on the left at the one after that, and so on. This table will never converge; it will swap the probabilities forever. So this chain doesn't work for us; we want something that converges to some distribution.
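The oscillation is easy to see numerically. A minimal sketch of this "always swap" chain, using the same table-iteration idea as before:

```python
import numpy as np

# The "always swap" chain: the frog never stays on its current lily.
T_swap = np.array([[0.0, 1.0],
                   [1.0, 0.0]])

p = np.array([1.0, 0.0])
for t in range(1, 7):
    print(f"t={t}: {p}")  # oscillates [1, 0], [0, 1], [1, 0], ... forever
    p = p @ T_swap
```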
The question of whether a given Markov chain converges to something is actually pretty well studied. But first of all, we need a definition of what it means, in mathematical terms, for a Markov chain to converge to something. This is something called a stationary distribution.

A distribution π is called stationary for a Markov chain with transition probabilities T if the following equation holds:

π(x') = Σ_x T(x → x') π(x)

It basically tells us that if you start from the distribution π and marginalize out the current position x, then you get the distribution over the position at the next timestamp, and this has to be π again. So if you sample a state from π, and then make just one transition, one timestamp of the Markov chain, then the distribution you get after that is, again, π. This means that if we encounter this π at any point of our sampling, we will stay at it; we will not change this distribution.

And there is a nice theorem that tells us that if the transition probability is non-zero for any two states, so if we can jump from any state to any other one with non-zero probability, then there exists a unique stationary π, and we will converge to this π from any starting point. So if we build a Markov chain that has non-zero transition probabilities, and whose stationary distribution is the one we want to sample from, then no matter where we start, if we simulate our Markov chain long enough, we will eventually get samples from the desired distribution π.
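To close the loop on the frog example, here is a small check (NumPy assumed) that computes the stationary π directly, as the left eigenvector of T with eigenvalue 1, and verifies that one more transition leaves it unchanged:

```python
import numpy as np

T = np.array([[0.3, 0.7],
              [0.5, 0.5]])

# A stationary pi satisfies pi @ T = pi, i.e. pi is a left eigenvector
# of T with eigenvalue 1, normalized so its entries sum to 1.
vals, vecs = np.linalg.eig(T.T)
pi = np.real(vecs[:, np.argmax(np.isclose(vals, 1.0))])
pi = pi / pi.sum()

print("stationary pi:", pi)      # [0.4167, 0.5833] = [5/12, 7/12]
print("pi @ T       :", pi @ T)  # the same vector: one step changes nothing
```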