[MUSIC] In the previous videos we familiarised ourselves with the Monte Carlo part. So now let's talk about the Markov chain part, which is actually a really cool idea on how to generate samples from complicated multi-dimensional distributions efficiently.

So what is a Markov chain? Imagine a frog which sits on a water lily. This frog lives in discrete time, so at each time step, at each second for example, it makes a decision whether to stay on its current lily or to jump to the other one. And its decisions are predefined in a probabilistic way: if it's right now on the left water lily, it will stay there with probability 0.3, and it will move with probability 0.7. And if it's on the right one, it will stay or move with probability 0.5 each. So now we can say that we have a dynamic system which has a state, and the state shows us where the frog is now. This dynamic system evolves over time, its state changes, and the rules by which the state changes are as follows: the probability to transition from the left state to the left state is 0.3, and so on. So you can define the behavior of your dynamic system by using these transition probabilities.

So we can simulate our dynamic system. We can see that, for example, at time step 1 the frog was on the left, then maybe it jumped to the right at time step 2, then maybe it stayed on the right, then it went to the left, and then, for example, to the right. And we can write down the path through the state space which our dynamic system made: left, right, right, left, right. This thing is called a Markov chain because the next state depends only on the previous state. The frog doesn't care where it was ten time steps ago; the only thing it cares about is where it is now. To make its decision, it just needs to know what the state is now.

Okay, so we can ask ourselves a question: where will the frog be after 10 time steps, for example? Let's say that we started at time step 1 and the frog was on the left. So with probability 1 it was on the left, and with probability 0 it was on the right. What is the probability that it will be on the left or on the right at time step 2? Well, it's 0.3 and 0.7, right? Okay, and what about time step 3? It's a bit trickier, because we don't know where the frog was before. But using the rule of sum of probabilities we can check all the possibilities for the previous state and use them to compute the next probabilities. So the law of total probability tells us that p(x3) = p(x3 | x2 = left) p(x2 = left) + p(x3 | x2 = right) p(x2 = right). And we know all these values: the conditional probability is just the transition probability, for example the probability that the frog moves from left to right, and the probability of x2 is what we have just computed in the previous line of the table. So we can write it down. The probability of being on the left at the third time step will be 0.3, the probability of transitioning from left to left, times 0.3, the probability of being on the left at the previous time step, taken from the previous line of the table, plus 0.7, the probability of being on the right at the previous time step, times 0.5, the transition probability of moving from right to left. That gives 0.44.
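To make this table update concrete, here is a minimal sketch, not from the lecture, assuming we encode the left lily as state 0 and the right lily as state 1; the names T and p are just placeholders. It stores the frog's transition probabilities in a matrix and propagates the state distribution one step at a time with the law of total probability.

```python
import numpy as np

# Transition matrix of the frog: rows are the current lily, columns the next one.
# State 0 = left lily, state 1 = right lily.
T = np.array([[0.3, 0.7],   # from left:  stay with 0.3, jump right with 0.7
              [0.5, 0.5]])  # from right: jump left with 0.5, stay with 0.5

# At time step 1 the frog is on the left with probability 1.
p = np.array([1.0, 0.0])

# Law of total probability: p_next[j] = sum_i p[i] * T[i, j]
p2 = p @ T
p3 = p2 @ T
print(p2)  # [0.3  0.7 ]  -- distribution at time step 2
print(p3)  # [0.44 0.56] -- distribution at time step 3, as in the text
```

Repeating p = p @ T over and over reproduces the table discussed next.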
And we can write the same thing for being in the right state at time step 3, and we get numbers around 0.44 and 0.56. And we can continue this table for as long as we want, and we'll find out that the table converges to numbers around 0.42 and 0.58. So after 1 million steps, the probability that the frog is on the left is almost exactly 0.42, and the probability that the frog is on the right is almost exactly 0.58.

Which basically means that if we simulate our frog for long enough and then write down where it ended up at the one millionth step, we will get a sample from this discrete distribution over the values: with probability 0.42 we'll get left, and with probability 0.58 we'll get right. So let's look at how it works. Say we simulate our frog for 1 million steps, and the last state was, for example, left, okay? We write down this left and then simulate another path through the state space. So we begin the simulation again, and the last state was, for example, right. We write it down, and by repeating this several times we get samples from the distribution of the millionth step of our dynamic system, given that the frog started on the left. And this thing will be close to the probabilities 0.42 and 0.58. So even here we can see that we have two out of five lefts and three out of five rights, which are close to 0.42 and 0.58.

Of course, this is a really lousy way to generate samples from this particular distribution with just two values; there is a much simpler way, which we have just discussed in the previous video. But note that this approach, build a dynamic system, simulate it, and see where it ends up, works even for more complicated cases. What if you have, for example, ten lilies? Or 1 billion lilies? If you want to sample from your distribution in the naive way, and your variable can take 1 billion values, you'll have to spend an amount of time proportional to 1 billion to generate one sample from it. If instead you use this idea of simulating some dynamic system and writing down the last position, and if you do not that many steps, like 1,000, and hope that everything has converged to the true distribution, then you can get a random number from this complicated distribution without the trouble of the naive approach. Or maybe the frog can be continuous and can jump from any point on a 2D plane to any other point. In this case you will be able to generate samples from a continuous distribution by simulating your dynamic system. So the point is that you can still sample efficiently, even with this continuous dynamic system, by just simulating a path on this 2D plane and writing down the last position.
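Here is a rough sketch of the "simulate many chains and keep only the last state" procedure, again assuming the same two-lily chain; the helper last_state and the step counts are just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same two-state frog chain: 0 = left, 1 = right.
T = np.array([[0.3, 0.7],
              [0.5, 0.5]])

def last_state(n_steps, start=0):
    """Simulate the frog for n_steps and return where it ended up."""
    state = start
    for _ in range(n_steps):
        state = rng.choice(2, p=T[state])
    return state

# Run many independent simulations, recording only the final position.
samples = np.array([last_state(1000) for _ in range(2000)])
print((samples == 0).mean())  # close to 0.42 -- fraction of "left"
print((samples == 1).mean())  # close to 0.58 -- fraction of "right"
```

For two lilies this is of course much slower than sampling directly, but the same loop works unchanged when the state space is huge or continuous.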
So the idea of Markov chain simulation applied to generating samples of random values is as follows. We have a distribution p(x) we want to sample from; it may be continuous, it may be discrete. We build a Markov chain that converges to this distribution, okay? Then we start from some initialization x0, and we simulate our Markov chain for long enough, for example for 1,000 iterations: we start with x0, then we generate x1 by using the transition probabilities, and so on. And then we hope that after 1,000 steps we will have converged to the true distribution p(x), because we built our Markov chain this way. Then, after these 1,000 steps, we throw away the first 1,000 samples, the first 1,000 states, and start writing down the states that come after them, because they will all be from the distribution p(x), or at least close to it, if the Markov chain has converged close enough. Okay, so this is the general idea, and in the following videos we'll discuss how to build a Markov chain with this property, so how to build a Markov chain that converges to the distribution you want to sample from.

Now let's first discuss a little bit whether a Markov chain converges anywhere at all, so in which cases it converges and in which it doesn't. Well, the first observation here is that a Markov chain doesn't have to converge anywhere. For example, if you take a dynamic system where the frog always swaps lilies at each time step, the table will look like this: the probability of being on the left at the first time step is 1 and the probability of being on the right is 0, and then the two values swap, so the frog will certainly be on the right at the next time step, then certainly on the left at the one after that, and so on. This table will never converge; it will always swap the places. So this thing doesn't work for us, we want something that converges to some distribution. And the question of whether or not a given Markov chain converges to something is actually pretty well studied. But first of all, we need a definition of what it means for a Markov chain to converge to something in mathematical terms.

This is called a stationary distribution. A distribution pi is called stationary for a Markov chain with transition probabilities T if the following equation holds: pi(x') = sum over x of T(x -> x') pi(x). It basically tells us that if you start with the distribution pi and marginalize out the current position x, then you get the distribution over the position at the next time step, and this has to be pi again. So if you sample a state from pi and then make just one transition, one time step of the Markov chain, the distribution you get after that is, again, pi. This means that if we encounter this pi at any point of our sampling, then we will stay at it; we will not change this distribution. And there is a nice theorem that tells us that if the transition probability is non-zero for any two states, so if we can jump from any state to any other one with non-zero probability, then there exists a unique pi which is stationary, and we will converge to this pi from any starting point. So if we build a Markov chain that has non-zero transition probabilities and that has as its stationary distribution the one we want to sample from, then no matter where we start, if we simulate our Markov chain long enough, we will eventually get samples from the desired distribution pi.
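To connect this definition back to the frog example, here is a small sketch, assuming the same transition matrix T as above, that finds the stationary pi as a left eigenvector of T with eigenvalue 1 and checks that one more transition leaves it unchanged.

```python
import numpy as np

T = np.array([[0.3, 0.7],
              [0.5, 0.5]])

# A stationary distribution satisfies pi = pi @ T and sums to 1,
# i.e. it is a left eigenvector of T with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(T.T)
pi = np.real(eigvecs[:, np.argmax(np.isclose(eigvals, 1.0))])
pi = pi / pi.sum()

print(pi)      # ~[0.4167 0.5833] -- roughly the 0.42 and 0.58 from the table
print(pi @ T)  # the same vector: one more step of the chain does not change it
```

And since every entry of this T is non-zero, the theorem guarantees that this pi is unique and that the chain converges to it from any starting point.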