1 00:00:00,000 --> 00:00:04,222 [MUSIC] 2 00:00:04,222 --> 00:00:09,017 Let's see how Metropolis-Hastings can work in this simple, one-dimensional case. 3 00:00:09,017 --> 00:00:13,462 Again, it's not that useful in one dimension, but 4 00:00:13,462 --> 00:00:17,716 it's chosen just for illustration purposes. 5 00:00:17,716 --> 00:00:23,932 So let's say we want to sample from this bimodal distribution, the blue curve. 6 00:00:23,932 --> 00:00:28,735 And let's say we start with the orange point, which is at the position 0 and 7 00:00:28,735 --> 00:00:32,429 at the iteration 0, so it's our initialization. 8 00:00:32,429 --> 00:00:36,017 Let's see how the iterations work here. 9 00:00:36,017 --> 00:00:39,755 First of all, let's choose the proposal distribution Q. 10 00:00:39,755 --> 00:00:42,195 And to make everything simpler, 11 00:00:42,195 --> 00:00:47,261 let's just use a standard normal centered around the current point. 12 00:00:47,261 --> 00:00:51,563 So at each iteration, the Markov chain Q will propose to move 13 00:00:51,563 --> 00:00:56,307 somewhere around the current point with some small variance, 1. 14 00:00:56,307 --> 00:00:58,837 And for example, at the first iteration, 15 00:00:58,837 --> 00:01:03,331 the Markov chain may propose this point, the red one. 16 00:01:03,331 --> 00:01:07,906 So now we have to compute the acceptance probability, so 17 00:01:07,906 --> 00:01:11,446 whether we want to accept this point or not. 18 00:01:11,446 --> 00:01:16,943 This is the definition of the acceptance probability we have just proved 19 00:01:16,943 --> 00:01:22,093 in the previous video, but note that in this case, in the ratio we have, 20 00:01:22,093 --> 00:01:27,667 the Q(x -> x') and the Q(x' -> x) are just the same. 21 00:01:27,667 --> 00:01:32,571 It is the property of our current proposal distribution Q 22 00:01:32,571 --> 00:01:36,642 that it doesn't depend on the order of arguments.
23 00:01:36,642 --> 00:01:40,713 Which means that we can just cancel this Q, and what is left is just 24 00:01:40,713 --> 00:01:45,026 the ratio of the densities at the new point and at the previous one. 25 00:01:45,026 --> 00:01:50,330 Which kind of makes sense, it just says that if the new point is more 26 00:01:50,330 --> 00:01:57,079 probable than the previous one, we'll definitely accept it with probability 1. 27 00:01:57,079 --> 00:02:02,069 If it's not the case, then we'll think, maybe we'll accept it, maybe 28 00:02:02,069 --> 00:02:06,861 not, depending on how much less probable the new point is than the previous one. 29 00:02:06,861 --> 00:02:10,742 So in this particular case, for this particular red dot proposal, 30 00:02:10,742 --> 00:02:14,093 the new point is much more probable than the previous one. 31 00:02:14,093 --> 00:02:19,187 It's almost four times more probable, which means that the probability 32 00:02:19,187 --> 00:02:23,496 of acceptance here is 1 and we'll definitely keep this point. 33 00:02:23,496 --> 00:02:27,176 So at the end of iteration 1 we move to this new point. 34 00:02:27,176 --> 00:02:29,377 Okay, what about iteration 2? 35 00:02:29,377 --> 00:02:32,104 Let's sample again a point from the proposal, and 36 00:02:32,104 --> 00:02:34,708 we'll be somewhere around here, for example. 37 00:02:34,708 --> 00:02:39,068 Again, computing the acceptance probability, 38 00:02:39,068 --> 00:02:43,977 here we moved a little bit to a higher density region. 39 00:02:43,977 --> 00:02:48,945 It's not that much higher, but since it's higher, we will definitely accept 40 00:02:48,945 --> 00:02:53,421 this point, so with probability 1 we're keeping this point, okay.
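The accept-or-reject step described above can be sketched in a few lines of Python. The bimodal target here is an assumed stand-in for the blue curve in the video (a mixture of two unit-variance Gaussians at ±2), not the exact density on screen; since the Gaussian proposal is symmetric, the Q terms cancel and the acceptance ratio is just p(x') / p(x):

```python
import math
import random

def target(x):
    # Unnormalized bimodal density: a mixture of two Gaussians
    # (an assumed stand-in for the blue curve in the video).
    return math.exp(-0.5 * (x - 2.0) ** 2) + math.exp(-0.5 * (x + 2.0) ** 2)

def metropolis_step(x, step=1.0):
    # Propose from a symmetric Gaussian centered at the current point,
    # so the Q terms cancel and the acceptance ratio is p(x') / p(x).
    x_new = random.gauss(x, step)
    accept_prob = min(1.0, target(x_new) / target(x))
    # Flip a biased coin: accept with probability accept_prob.
    if random.random() < accept_prob:
        return x_new
    return x  # rejected: stay at the current point

random.seed(0)
x = 0.0  # initialization, iteration 0
samples = []
for _ in range(50):
    x = metropolis_step(x)
    samples.append(x)
```

Plotting a histogram of `samples` (with more iterations) should roughly trace the bimodal target, mirroring the plots in the video.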
41 00:02:53,421 --> 00:02:57,890 New proposal; this one is trying to move 42 00:02:57,890 --> 00:03:02,742 to a really improbable region according to the blue curve, so 43 00:03:02,742 --> 00:03:07,022 we'll keep this point with probability 0.13, 44 00:03:07,022 --> 00:03:11,682 and now we're flipping a biased coin, which with probability 45 00:03:11,682 --> 00:03:15,106 0.13 tells us to accept this point and 46 00:03:15,106 --> 00:03:19,043 with probability 0.87 to reject this point. 47 00:03:19,043 --> 00:03:21,421 So we flipped our biased coin, and 48 00:03:21,421 --> 00:03:25,858 it told us, in this particular example, to reject this point. 49 00:03:25,858 --> 00:03:26,837 Okay, why not? 50 00:03:26,837 --> 00:03:31,797 Another proposal, it asks us to move here. 51 00:03:31,797 --> 00:03:34,765 And this point is a bit less probable than the previous one, 52 00:03:34,765 --> 00:03:38,192 and we will keep this point with probability 0.73. 53 00:03:38,192 --> 00:03:41,539 So most probably we'll keep this one. 54 00:03:41,539 --> 00:03:46,959 Again, we are flipping a biased coin, and now it tells us to reject the point. 55 00:03:46,959 --> 00:03:49,157 Well, why not, it happens sometimes. 56 00:03:49,157 --> 00:03:53,568 It was more probable to accept it, but we happened to reject it. 57 00:03:53,568 --> 00:03:59,250 So again we are staying at the same place we were at the previous iteration. 58 00:03:59,250 --> 00:04:05,490 And finally, if you repeat this process for long enough, you will get a plot like this. 59 00:04:05,490 --> 00:04:11,862 So you will move in your sample space, and here we have like 50 iterations, and 60 00:04:11,862 --> 00:04:17,950 sometimes you will stay at the same place and this plot will become flat.
61 00:04:17,950 --> 00:04:20,887 But generally you will move around, and 62 00:04:20,887 --> 00:04:24,832 you can plot a histogram of the visited points, and 63 00:04:24,832 --> 00:04:29,633 it looks kind of close to the Gaussian you want to sample from. 64 00:04:29,633 --> 00:04:33,751 So in this case we didn't get exactly the samples we wanted, 65 00:04:33,751 --> 00:04:37,075 the histogram is not exactly like the Gaussian. 66 00:04:37,075 --> 00:04:41,323 I'm sorry, not the Gaussian, but the blue curve, but it's close, so 67 00:04:41,323 --> 00:04:46,004 it's a reasonable way to sample from the blue curve in this particular case. 68 00:04:46,004 --> 00:04:49,884 Now the question is, what happens if we change our proposal? 69 00:04:49,884 --> 00:04:54,447 So let's say we use a Gaussian with less variance. 70 00:04:54,447 --> 00:04:57,538 So we always propose to move just 71 00:04:57,538 --> 00:05:02,464 with tiny little steps around the previous point. 72 00:05:02,464 --> 00:05:07,925 Well, it kind of works, but in this case it didn't 73 00:05:07,925 --> 00:05:13,403 manage to move outside the low density region where it started. 74 00:05:13,403 --> 00:05:17,598 So in 50 iterations it hasn't converged yet, and 75 00:05:17,598 --> 00:05:22,700 it will definitely converge at some point, but we don't know when. 76 00:05:22,700 --> 00:05:27,540 And this means that it's a much less efficient choice here to use 77 00:05:27,540 --> 00:05:29,102 these small steps. 78 00:05:29,102 --> 00:05:30,697 What about large steps? 79 00:05:30,697 --> 00:05:36,331 Well, if we increase the variance of our proposal distribution Q to be 100, 80 00:05:36,331 --> 00:05:41,126 then we'll get large steps, which is nice in terms of convergence and 81 00:05:41,126 --> 00:05:42,912 uncorrelated samples. 82 00:05:42,912 --> 00:05:46,920 But it will stay at the same place for really long periods.
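The trade-off described above can be checked numerically. This sketch (using the same assumed bimodal target as before, not the exact curve from the video) compares acceptance rates for tiny steps and huge steps; tiny steps are almost always accepted but explore slowly, while huge steps are rejected most of the time:

```python
import math
import random

def target(x):
    # Unnormalized bimodal density (assumed stand-in for the blue curve).
    return math.exp(-0.5 * (x - 2.0) ** 2) + math.exp(-0.5 * (x + 2.0) ** 2)

def acceptance_rate(step, n=2000, x0=0.0, seed=1):
    # Run a random-walk Metropolis chain and count accepted proposals.
    rng = random.Random(seed)
    x, accepted = x0, 0
    for _ in range(n):
        x_new = rng.gauss(x, step)
        if rng.random() < min(1.0, target(x_new) / target(x)):
            x, accepted = x_new, accepted + 1
    return accepted / n

# Tiny steps (std 0.1): the density barely changes, so almost every
# proposal is accepted, but the chain crawls through the space.
rate_small = acceptance_rate(step=0.1)

# Huge steps (std 10, i.e. variance 100): proposals usually land in
# low-density regions and get rejected, so the chain sticks in place.
rate_large = acceptance_rate(step=10.0)
```

Neither extreme is efficient; in practice the proposal scale is tuned so that a moderate fraction of proposals is accepted.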
83 00:05:46,920 --> 00:05:54,655 Because it will often happen that we stay at a nice, highly 84 00:05:54,655 --> 00:06:02,070 probable place, and the Markov chain Q proposes to move far away from it, and we don't like it. 85 00:06:02,070 --> 00:06:03,925 So we stay where we were. 86 00:06:03,925 --> 00:06:07,579 And this means that our actual samples will be kind of correlated, 87 00:06:07,579 --> 00:06:11,364 because we always stay at the same place, and we waste the resources and 88 00:06:11,364 --> 00:06:12,946 capacity of our computers. 89 00:06:12,946 --> 00:06:18,328 I want to share with you one more thing about the Metropolis-Hastings approach, 90 00:06:18,328 --> 00:06:22,102 a really cool perspective on it, which tells us that 91 00:06:22,102 --> 00:06:26,457 Metropolis-Hastings can be considered as a correction scheme. 92 00:06:26,457 --> 00:06:30,361 So if you have a slightly wrong version of your sampling scheme, 93 00:06:30,361 --> 00:06:32,998 you can correct it with Metropolis-Hastings. 94 00:06:32,998 --> 00:06:39,037 Let's look at one example, so recall the Gibbs scheme, the Gibbs sampling. 95 00:06:39,037 --> 00:06:43,650 We used to sample points one dimension at a time, and 96 00:06:43,650 --> 00:06:48,966 the Gibbs scheme is inherently not parallel, so we have to 97 00:06:48,966 --> 00:06:55,613 know all the information from the previous sub-steps to make the next sub-step. 98 00:06:55,613 --> 00:07:00,067 Okay, let's try to make it parallel. [COUGH] 99 00:07:00,067 --> 00:07:04,275 Let's break the Gibbs sampling scheme and 100 00:07:04,275 --> 00:07:09,732 use only the information from the previous iteration. 101 00:07:09,732 --> 00:07:14,880 So the sub-steps will not depend on each other. 102 00:07:14,880 --> 00:07:20,361 This way, if we have a million-dimensional space, and if we have a million computers, 103 00:07:20,361 --> 00:07:25,256 we can do every sub-step in parallel, and we'll be really, really fast.
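The correction idea can be sketched in general form. This is not the parallel-Gibbs proposal itself (that would need a full multivariate model); instead it uses a hypothetical "slightly wrong" sampler, a random walk biased to the right, and corrects it with the full Metropolis-Hastings ratio, in which the Q terms no longer cancel because the proposal is not symmetric:

```python
import math
import random

def target(x):
    # Unnormalized target density, here a standard Gaussian (assumed example).
    return math.exp(-0.5 * x * x)

def proposal_sample(rng, x):
    # A hypothetical "slightly wrong" sampler: a random walk biased to the right.
    return rng.gauss(x + 0.5, 1.0)

def proposal_density(x_from, x_to):
    # Unnormalized density Q(x_from -> x_to) of the biased walk above;
    # normalization constants cancel in the Metropolis-Hastings ratio.
    return math.exp(-0.5 * (x_to - x_from - 0.5) ** 2)

def mh_corrected_step(rng, x):
    # Full Metropolis-Hastings acceptance: Q is not symmetric, so the
    # Q terms stay in the ratio and exactly correct the proposal's bias.
    x_new = proposal_sample(rng, x)
    ratio = (target(x_new) * proposal_density(x_new, x)) / (
        target(x) * proposal_density(x, x_new))
    return x_new if rng.random() < min(1.0, ratio) else x

rng = random.Random(0)
x = 0.0
for _ in range(100):
    x = mh_corrected_step(rng, x)
```

The same recipe applies to the parallel Gibbs proposal: as long as you can evaluate the proposal's density, the acceptance step restores convergence to the true target.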
104 00:07:25,256 --> 00:07:29,784 But the problem is that we broke our scheme, we no longer have any convergence 105 00:07:29,784 --> 00:07:34,269 guarantees that it will converge to the true distribution which we want. 106 00:07:34,269 --> 00:07:35,971 And it will not, actually. 107 00:07:35,971 --> 00:07:38,746 So it will generate samples from some wrong distribution. 108 00:07:38,746 --> 00:07:40,669 What can we do here? 109 00:07:40,669 --> 00:07:46,276 Well, we can use this thing as a proposal distribution for Metropolis-Hastings 110 00:07:46,276 --> 00:07:51,500 and then correct it with the Metropolis-Hastings approach. 111 00:07:51,500 --> 00:07:56,198 And since this particular proposal distribution is not arbitrary, 112 00:07:56,198 --> 00:07:58,225 it's already almost right. 113 00:07:58,225 --> 00:08:01,424 Almost generating your points from your desired distribution. 114 00:08:01,424 --> 00:08:03,614 Then you will not reject too many points. 115 00:08:03,614 --> 00:08:07,440 Because this is already almost the thing that you want to do. 116 00:08:07,440 --> 00:08:11,305 So you will just occasionally reject one point or another. 117 00:08:11,305 --> 00:08:16,247 But generally it may be much, much more efficient overall, 118 00:08:16,247 --> 00:08:19,727 because now you can do Gibbs sub-steps in parallel. 119 00:08:19,727 --> 00:08:21,317 So, to summarize: 120 00:08:21,317 --> 00:08:27,350 the Metropolis-Hastings approach is the rejection sampling idea applied to Markov chains. 121 00:08:27,350 --> 00:08:32,381 And the nice thing about this approach is that it gives you a whole 122 00:08:32,381 --> 00:08:37,422 family of Markov chains and you can choose the one that you like.
123 00:08:37,422 --> 00:08:42,160 Another nice property of this algorithm is that it works for 124 00:08:42,160 --> 00:08:47,901 unnormalized probabilities and densities, just like Gibbs sampling, 125 00:08:47,901 --> 00:08:51,471 and it's also kind of easy to implement. 126 00:08:51,471 --> 00:08:55,081 It is a bit harder than Gibbs sampling maybe, but 127 00:08:55,081 --> 00:08:57,389 anyway, like five lines of code. 128 00:08:57,389 --> 00:09:02,171 And the negative feature is that the samples are, maybe less correlated than 129 00:09:02,171 --> 00:09:04,241 in Gibbs sampling, but still somewhat correlated. 130 00:09:04,241 --> 00:09:08,849 And also, the property that it gives you a whole family of 131 00:09:08,849 --> 00:09:12,505 Markov chains is kind of a two-sided thing. 132 00:09:12,505 --> 00:09:17,107 So first of all, it gives you the flexibility to choose the right thing for 133 00:09:17,107 --> 00:09:18,905 your particular problem. 134 00:09:18,905 --> 00:09:24,628 But second of all, it forces you to think, it forces you to choose something. 135 00:09:24,628 --> 00:09:28,534 And that may be hard in some cases, and harder to automate. 136 00:09:28,534 --> 00:09:32,620 So if something always works, you don't have to think at any point. 137 00:09:32,620 --> 00:09:36,036 You can just automate the whole process. 138 00:09:36,036 --> 00:09:39,717 But to apply this, you usually have to think a little bit and 139 00:09:39,717 --> 00:09:44,246 understand which proposal will make sense for your particular application. 140 00:09:44,246 --> 00:09:54,246 [MUSIC]