[MUSIC] Let's see how Metropolis-Hastings can work in this simple, one-dimensional case. Again, it's not that useful in one dimension, but it's chosen just for illustration purposes. So let's say we want to sample from this bimodal distribution, the blue curve. And let's say we start with the orange point, which is at position 0 at iteration 0, so it's our initialization.

Let's see how an iteration works here. First of all, let's choose the proposal distribution Q. To make everything simpler, let's just use a standard normal centered around the current point. So at each iteration, the Markov chain Q will propose to move somewhere around the current point, with a small variance of 1. For example, at the first iteration the Markov chain may propose this point, the red one. So now we have to compute the acceptance probability: whether we want to accept this point or not.

Recall the definition of the acceptance probability, which we have just derived in the previous video: A(x -> x') = min{1, p(x') Q(x' -> x) / (p(x) Q(x -> x'))}. Note that in this case Q(x -> x') and Q(x' -> x) are just the same: it is a property of our current proposal distribution Q that it doesn't depend on the order of the arguments. This means the Q terms cancel, and what is left is just the ratio of the densities at the new point and at the previous one, A(x -> x') = min{1, p(x') / p(x)}. Which kind of makes sense: it just says that if the new point is more probable than the previous one, we'll definitely accept it, with probability 1. If that's not the case, then we'll think, maybe we'll accept it, maybe not, depending on how much less probable the new point is than the previous one.

So in this particular case, for this particular red-dot proposal, the new point is much more probable than the previous one, almost four times more probable, which means that the acceptance probability here is 1, and we'll definitely keep this point. So iteration 1 ends with moving to this new point.

Okay, what about iteration 2? Let's sample a point from the proposal again; it will land somewhere around here, for example. Again we compute the acceptance probability: here we moved a little bit towards the higher-density region. The density is not that much higher, but since it's higher, we will definitely accept this point, so with probability 1 we keep it.

New proposal: this one is trying to move to a really improbable region according to the blue curve, so we'll keep this point only with probability 0.13. Now we flip a biased coin which tells us to accept this point with probability 0.13 and to reject it with probability 0.87. We flipped our biased coin, and in this particular example it told us to reject this point. Okay, why not?

Another proposal: it asks us to move here. This point is only a little less probable than the current one, and we will keep it with probability 0.73, so most likely we'll keep this one. Again we flip a biased coin, and now it tells us again to reject the point. Well, why not, it happens sometimes. It was more probable to accept it, but we happened to reject it. So again we stay at the same place we were at the previous iteration.

Finally, if you repeat this process for long enough, you will get a plot like this. You will move around your sample space, and here we have about 50 iterations; sometimes you will stay at the same place, and the plot will become flat.
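To make this walkthrough concrete, here is a minimal Python sketch of the same random-walk scheme. It is not the lecturer's code: the bimodal target_density (a mixture of two Gaussian bumps) and all numeric choices are stand-ins for the blue curve.

```python
import numpy as np

def target_density(x):
    # Unnormalized bimodal stand-in for the blue curve: two Gaussian bumps.
    # Normalization is irrelevant: Metropolis-Hastings only uses density ratios.
    return np.exp(-0.5 * (x + 2.0) ** 2) + 0.7 * np.exp(-0.5 * (x - 2.0) ** 2)

def metropolis_hastings(p, n_iters=50, x0=0.0, proposal_std=1.0, seed=0):
    # Random-walk Metropolis with a symmetric Gaussian proposal.
    # Because Q(x -> x') == Q(x' -> x), the acceptance probability
    # reduces to min(1, p(x') / p(x)).
    rng = np.random.default_rng(seed)
    x = x0
    samples = [x]
    for _ in range(n_iters):
        x_new = x + proposal_std * rng.standard_normal()  # propose near x
        accept_prob = min(1.0, p(x_new) / p(x))           # density ratio
        if rng.uniform() < accept_prob:                   # flip the biased coin
            x = x_new                                     # accepted: move there
        samples.append(x)  # on rejection, the old x is recorded again
    return np.array(samples)

samples = metropolis_hastings(target_density, n_iters=50)
print(samples)  # repeated values correspond to the flat parts of the plot
```

A histogram of these samples (with many more iterations) gives exactly the kind of approximation to the blue curve discussed next.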
But generally you will move around, and you can plot a histogram of the generated points, and it looks kind of close to the blue curve you want to sample from. So in this case we didn't get exactly the samples we wanted, the histogram doesn't match the blue curve exactly, but it's close, so it's a reasonable way to sample from the blue curve in this particular case.

Now the question is: what happens if we change our proposal? Let's say we use a Gaussian with less variance, so we always propose to move with tiny little steps around the previous point. Well, it kind of works, but in [INAUDIBLE] situations it doesn't manage to move outside the low-density region where it started. So in 50 iterations it hasn't converged yet; it will definitely converge at some point, but we don't know when. And this means that these small steps are a much less efficient choice here.

What about large steps? Well, if we increase the variance of our proposal distribution Q to 100, then we'll get large steps, which is nice in terms of convergence and uncorrelated samples. But the chain will stay at the same place for really long periods, because it will often happen that we sit at a nice, highly probable place, the Markov chain Q proposes to move far away from it, and we don't like that, so we stay where we were. This means that our actual samples will be quite correlated, because we keep staying at the same place, and we waste the resources and capacity of our computers. (The small experiment below makes this trade-off concrete.)

I want to share with you one more thing about the Metropolis-Hastings approach, a really cool perspective on it: Metropolis-Hastings can be considered as a correction scheme. So if you have a slightly wrong version of your sampling scheme, you can correct it with Metropolis-Hastings.

Let's look at one example; recall the Gibbs scheme, Gibbs sampling. We used to sample points one dimension at a time, and the Gibbs scheme is inherently not parallel: we have to know all the information from the previous sub-steps to make the next sub-step. Okay, let's try to make it parallel. [COUGH] Let's break the Gibbs sampling scheme and use the information from the previous iteration, so the sub-steps will not depend on each other. Then, if we have a million-dimensional space and a million computers, we can do all the sub-steps in parallel, and we'll be really, really fast. But the problem is that we broke our scheme: we no longer have any convergence guarantees that it will converge to the true distribution we want. And it will not, actually; it will generate samples from some wrong distribution.

What can we do here? Well, we can use this thing as a proposal distribution for Metropolis-Hastings and then correct it with the acceptance step from the Metropolis-Hastings approach. And since this particular proposal distribution is not arbitrary, it's already almost right, almost generating points from your desired distribution, you will not reject too many points, because it is already almost the thing that you want. You will just occasionally reject one point or another. But overall it may be much, much more efficient, because now you can do the Gibbs sub-steps in parallel.

So, to summarize. The Metropolis-Hastings approach is the rejection sampling idea applied to Markov chains. And the nice thing about this approach is that it gives you a whole family of Markov chains, and you can choose the one that you like.
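As promised above, here is a small illustrative experiment with the metropolis_hastings sketch from earlier, comparing a tiny, a moderate, and a huge proposal standard deviation. The exact acceptance rates depend on the target, so treat the numbers as indicative only.

```python
# Reuses target_density and metropolis_hastings from the sketch above.
for std in (0.1, 1.0, 100.0):
    s = metropolis_hastings(target_density, n_iters=5000, proposal_std=std)
    accept_rate = np.mean(s[1:] != s[:-1])  # fraction of moves that changed state
    print(f"proposal std {std:>6}: acceptance rate ~ {accept_rate:.2f}")
# Expected pattern: tiny steps accept almost everything but crawl through the
# space (highly correlated samples); huge steps get rejected most of the time,
# so the chain also stays put. A moderate step size balances the two.
```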
Another nice property of this algorithm is that it works for unnormalized probabilities and densities, just like Gibbs sampling, and it's also kind of easy to implement. It is maybe a bit harder than Gibbs sampling, but it's still something like five lines of code. The negative feature is that the samples, while maybe less correlated, are still somewhat correlated. And also, the property that it gives you a whole family of Markov chains is kind of a two-sided thing. On the one hand, it gives you the flexibility to choose the right chain for your particular problem. On the other hand, it forces you to think, forces you to choose something, which may be hard in some cases and harder to automate. If something always works, you never have to think; you can just automate the whole process. But to apply Metropolis-Hastings you usually have to think a little bit and understand which proposal will make sense for your particular application. [MUSIC]