[MUSIC]

In the previous videos we familiarized ourselves with the Monte Carlo part. So now let's talk about the Markov chain part, which is a really cool idea for generating samples from complicated multi-dimensional distributions efficiently.

So what is a Markov chain? Imagine a frog sitting on a water lily, living in discrete time. At each timestamp, each second for example, it makes a decision whether to stay on its current lily or to jump to the other one. And its decisions are predefined in a probabilistic way. If it is currently on the left water lily, it will stay there with probability 0.3, and it will move with probability 0.7. And if it is on the right one, it will stay or move with probability 0.5 each.

So we can say that we have a dynamic system which has a state, and the state shows us where the frog is now. This dynamic system evolves over time, its state changes, and the rules by which the state changes are as follows: the probability of transitioning from the left state to the left state is 0.3, and so on. So you can define the behavior of your dynamic system with these transition probabilities.

Now we can simulate our dynamic system. We can see that, for example, at timestamp 1 the frog was on the left, then maybe it jumped to the right at timestamp 2, then maybe it stayed on the right, then went to the left, and then, for example, to the right. And we can write down the path through the state space which our dynamic system made: left, right, right, left, right.

This thing is called a Markov chain because the next state depends only on the current state. The frog doesn't care where it was ten time steps ago; the only thing it cares about is where it is now. To make its decision, it just needs to know the current state.
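To make this concrete, here is a minimal sketch in Python (NumPy and the function name `simulate` are my assumptions; the transition probabilities are the ones from the example) that simulates the frog's walk and prints one path through the state space:

```python
import numpy as np

# Transition probabilities from the example.
# Rows = current state, columns = next state; order is [left, right].
T = np.array([[0.3, 0.7],   # from left:  stay with 0.3, jump with 0.7
              [0.5, 0.5]])  # from right: jump with 0.5, stay with 0.5

rng = np.random.default_rng(0)

def simulate(n_steps, start=0):
    """Simulate the frog for n_steps and return the visited states."""
    path = [start]
    for _ in range(n_steps):
        # The next state depends only on the current one (Markov property).
        path.append(rng.choice(2, p=T[path[-1]]))
    return path

states = ["left", "right"]
print([states[s] for s in simulate(5)])  # e.g. ['left', 'right', 'right', ...]
```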
Okay, so we can ask ourselves a question: where will the frog be after 10 timestamps, for example? Let's say we started at timestamp 1, and the frog was on the left. So with probability 1 it was on the left, and with probability 0 it was on the right. What is the probability that it will be on the left or on the right at timestamp 2? Well, it's 0.3 and 0.7, right?

Okay, and what about timestamp 3? It's trickier, because we don't know where the frog was before. But using the law of total probability we can check all the possibilities for the previous state and use them to compute the next probabilities. The law of total probability tells us that the probability of the state at timestamp 3 equals the probability of that state given that at the second timestamp the frog was on the left, times the probability that it was on the left at the second timestamp, plus the same thing given that it was on the right at the second timestamp. And we know all these values, because the conditional probability is just the transition probability, for example the probability that the frog moves from left to right, and the probability of x2 is what we have just computed in the previous line of the table.

So we can write it down. The probability of being on the left at the third timestamp is

p(x3 = left) = p(x3 = left | x2 = left) p(x2 = left) + p(x3 = left | x2 = right) p(x2 = right) = 0.3 · 0.3 + 0.5 · 0.7 = 0.44

Here the first 0.3 is the probability of transitioning from left to left, the second 0.3 comes from the previous line of the table, 0.5 is the transition probability of moving from right to left, and 0.7 is the prior probability of being on the right at the previous timestamp. And we can write the same thing for being in the right state at timestamp 3, so we get the numbers 0.44 and 0.56.

We can continue this table for as long as we want, and we'll find that it converges to numbers around 0.42 and 0.58. So after 1 million steps, the probability that the frog is on the left is almost exactly 0.42, and the probability that it is on the right is almost exactly 0.58.
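This table is just a vector–matrix product applied once per timestamp. Here is a minimal sketch (reusing the T matrix assumed above) that prints the table and shows it converging:

```python
import numpy as np

T = np.array([[0.3, 0.7],
              [0.5, 0.5]])

p = np.array([1.0, 0.0])  # timestamp 1: the frog is on the left for sure
for t in range(1, 11):
    print(f"t={t:2d}: p(left)={p[0]:.4f}, p(right)={p[1]:.4f}")
    # Law of total probability: p_next(x') = sum_x p(x) * T(x -> x')
    p = p @ T

# Prints 1.0/0.0, then 0.3/0.7, then 0.44/0.56, and converges towards
# the exact stationary values 5/12 ≈ 0.4167 and 7/12 ≈ 0.5833.
```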
Which basically means that if we simulate our frog for long enough and then write down where it ended up on the one millionth step, we will get a sample from this discrete distribution over the values. So with probability 0.42 we'll get left, and with probability 0.58 we'll get right.

So let's look at how it works. Say we simulate our frog for 1 million steps, and the last state was, for example, left. We write down this left, and then simulate another path through the state space. So we begin the simulation again, and this time the last state was, for example, right. We write it down too, and by repeating this several times, we get a sample from the distribution of the millionth step of our dynamic system, given that the frog started on the left. And this sample will be close to the probabilities 0.42 and 0.58. Even here we can see that we have two out of five lefts and three out of five rights, which are close to 0.42 and 0.58.

Of course, this is a really lousy way to generate samples from a distribution with just two values; there is a much simpler way, which we discussed in the previous video. But note that this approach, building a dynamic system, simulating it, and seeing where it ends up, works even in much more complicated cases.

So what if you have, for example, ten lilies? Or 1 billion lilies? If you want to sample from your distribution in the naive way, and your variable has a billion possible values, you'll have to spend an amount of time proportional to 1 billion to generate one sample. If instead you use this idea of simulating some dynamic system and then writing down its last position, and if you do not that many steps, like 1,000, and hope that everything has converged to the true distribution, then you can get a random number from this complicated distribution without the trouble of the naive approach.

Or maybe the frog can be continuous and can jump from any point on a 2D plane to any other point. In this case, you will be able to generate samples from a continuous distribution by simulating your dynamic system. So you can still sample efficiently, even with a continuous dynamic system, by just following the frog's path on the 2D plane and writing down its last position.
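The same trick in the continuous case can be illustrated with a one-dimensional chain (this particular chain, an AR(1) process, is my choice of illustration, not one from the lecture): each step sets x to 0.5·x plus standard normal noise, and the chain converges to the Gaussian N(0, 4/3) no matter where it starts.

```python
import numpy as np

rng = np.random.default_rng(1)

def last_position(n_steps, start=100.0):
    """A continuous-state Markov chain: x <- 0.5 * x + N(0, 1) noise.
    Its stationary distribution is N(0, 1 / (1 - 0.5**2)) = N(0, 4/3)."""
    x = start                      # even a far-off start gets forgotten
    for _ in range(n_steps):
        x = 0.5 * x + rng.normal()
    return x

samples = [last_position(1000) for _ in range(5000)]
print("mean     ≈", np.mean(samples))  # ≈ 0.0
print("variance ≈", np.var(samples))   # ≈ 1.33
```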
So the idea of Markov chain simulation applied to generating samples of random values is as follows. We have a distribution we want to sample from; it may be continuous, it may be discrete. We build a Markov chain that converges to this distribution. Then we start from some initialization x0, and we simulate our Markov chain for long enough, for example for 1,000 iterations. So we start with x0, then we generate x1 by using the transition probability, and so on. And we hope that after 1,000 steps the chain has converged to the true distribution p(x), because we built our Markov chain this way. Then, after these 1,000 steps, we throw away the first 1,000 states and start writing down the states that follow, because they will all be from the distribution p(x), or at least close to it, if our Markov chain has converged close enough.

Okay, so this is the general idea. In the following videos we'll discuss how to build a Markov chain with this property, that is, a Markov chain that converges to the distribution you want to sample from.

But first, let's discuss a little whether a Markov chain converges anywhere at all; in which cases it does converge, and in which it doesn't. The first observation here is that a Markov chain doesn't have to converge to anything. For example, if you take a dynamic system like this one, the frog always swaps the lily at each timestamp. And if you look at the table, it will look like this: the probability of being on the left at the first timestamp was 1, and the probability of being on the right was 0. Then the probabilities swap, so the frog will certainly be on the right at the next timestamp, certainly on the left at the one after that, and so on. This table will never converge; it will swap the probabilities forever. So this chain doesn't work for us; we want something that converges to some distribution.
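The oscillation is easy to see numerically. A minimal sketch of this "always swap" chain, using the same table-iteration idea as before:

```python
import numpy as np

# The "always swap" chain: the frog never stays on its current lily.
T_swap = np.array([[0.0, 1.0],
                   [1.0, 0.0]])

p = np.array([1.0, 0.0])
for t in range(1, 7):
    print(f"t={t}: {p}")  # oscillates [1, 0], [0, 1], [1, 0], ... forever
    p = p @ T_swap
```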
The question of whether a given Markov chain converges to something is actually pretty well studied. But first of all, we need a definition of what it means, in mathematical terms, for a Markov chain to converge to something. This is something called a stationary distribution.

A distribution π is called stationary for a Markov chain with transition probabilities T if the following equation holds:

π(x') = Σ_x T(x → x') π(x)

It basically tells us that if you start from the distribution π and marginalize out the current position x, then you get the distribution over the position at the next timestamp, and this has to be π again. So if you sample a state from π, and then make just one transition, one timestamp of the Markov chain, then the distribution you get after that is, again, π. This means that if we encounter this π at any point of our sampling, we will stay at it; we will not change this distribution.

And there is a nice theorem that tells us that if the transition probability is non-zero for any two states, so if we can jump from any state to any other one with non-zero probability, then there exists a unique stationary π, and we will converge to this π from any starting point. So if we build a Markov chain that has non-zero transition probabilities, and whose stationary distribution is the one we want to sample from, then no matter where we start, if we simulate our Markov chain long enough, we will eventually get samples from the desired distribution π.
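To close the loop on the frog example, here is a small check (NumPy assumed) that computes the stationary π directly, as the left eigenvector of T with eigenvalue 1, and verifies that one more transition leaves it unchanged:

```python
import numpy as np

T = np.array([[0.3, 0.7],
              [0.5, 0.5]])

# A stationary pi satisfies pi @ T = pi, i.e. pi is a left eigenvector
# of T with eigenvalue 1, normalized so its entries sum to 1.
vals, vecs = np.linalg.eig(T.T)
pi = np.real(vecs[:, np.argmax(np.isclose(vals, 1.0))])
pi = pi / pi.sum()

print("stationary pi:", pi)      # [0.4167, 0.5833] = [5/12, 7/12]
print("pi @ T       :", pi @ T)  # the same vector: one step changes nothing
```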