In the previous video, we discussed that it can be nice to build a Markov chain to sample from some distribution. But how do we build such a Markov chain? The simplest method for that is called Gibbs sampling, and it works as follows. Say you have a distribution, a three-dimensional one, which we know up to the normalization constant. So we can compute this probability, or density in the continuous case, at any given point, but only up to normalization.

Gibbs sampling goes like this. First of all, we start with some initial point, for example, all zeros. It doesn't matter where we start. Well, it does matter for some convergence properties and things like that, but it doesn't matter for the sake of correctness. Then, to build the point of the next iteration, we build it one dimension at a time. The first dimension of the next point will be a sample from the conditional distribution of x1, given the values of x2 and x3 which we had from our initialization.

You may ask, how can we sample from this conditional distribution? It's also a hard problem, right? But note that the conditional distribution is proportional to the joint distribution, so we know it, again, up to a normalization constant. And sampling from a one-dimensional distribution is usually much easier than sampling from a multidimensional one, so we assume that we can do it. In some particular practical cases, we can invent a really efficient scheme for sampling analytically from these one-dimensional distributions. In others, we have to use general-purpose schemes like rejection sampling. But either way, one-dimensional sampling is easy.

So let's go to the next sub-step. On the next sub-step, we generate the second dimension of our first iteration's point. It is generated from the conditional distribution of x2, given x1 being the dimension we have just generated, and x3 being the dimension from the initialization point. And finally, we generate the third dimension in the same fashion: from the conditional distribution, given the values we have just generated.
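Here is a minimal runnable sketch of one such sweep in Python. The unnormalized density p_tilde is a toy example of my own (the lecture does not fix a concrete one), and each one-dimensional conditional is handled with a crude rejection sampler on a bounded interval, the general-purpose fallback mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x1, x2, x3):
    # Hypothetical unnormalized density (a Gaussian with correlated x1, x2).
    return np.exp(-(x1**2 + x2**2 + x3**2 + x1 * x2) / 2.0)

def sample_conditional(f, lo=-10.0, hi=10.0):
    # Rejection-sample t from the unnormalized 1-D density f on [lo, hi],
    # using a flat envelope whose height is estimated on a grid.
    m = max(f(t) for t in np.linspace(lo, hi, 1001))
    while True:
        t = rng.uniform(lo, hi)
        if rng.uniform(0.0, m) < f(t):
            return t

def gibbs_sweep(x1, x2, x3):
    # One "big step" of the chain: resample each coordinate from its full
    # conditional, which is proportional to the joint with the others fixed.
    x1 = sample_conditional(lambda t: p_tilde(t, x2, x3))  # x1 ~ p(x1 | x2, x3)
    x2 = sample_conditional(lambda t: p_tilde(x1, t, x3))  # uses the new x1
    x3 = sample_conditional(lambda t: p_tilde(x1, x2, t))  # uses new x1 and x2
    return x1, x2, x3

point = (0.0, 0.0, 0.0)  # the starting point is arbitrary
for _ in range(1000):
    point = gibbs_sweep(*point)
```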
So this was iteration one, and in the general case these three sub-steps together constitute just one big step of our Markov chain. It's the way we transition from x0 to x1. And if you write it down for a general iteration number k, it looks like this: the point of iteration k + 1 is generated one dimension at a time, using conditional distributions which are conditioned on the values we have from the previous iteration and from the current iteration.

Note that this scheme is really not parallel: each sub-step depends on the previous sub-steps having been executed. This means that no matter how many computers we have, if we have a one-million-dimensional space, we will have to sample these dimensions one at a time, which is kind of a downside. But we'll discuss later how we can avoid it.

Anyway, this procedure gives you a way to move from x0 to x1, then to x2, and so on. Now let's prove that this indeed defines a Markov chain which converges to the distribution p, so that if we sample like this long enough, we get samples from the desired distribution. We are going to do it on the blackboard.

So let's prove that Gibbs sampling, with the three sub-steps considered as one big step, indeed gives you a Markov chain that converges to the desired distribution p. What we want to prove is that p of the new point, (x prime, y prime, z prime), equals what we get by starting from the distribution p and then doing one step of the sampling. On the right-hand side we marginalize out the previous point, x, y, and z: we sum the transition probability q, the probability to go from (x, y, z) to (x prime, y prime, z prime), times the probability p. We want to prove that these two things are equal to each other.

Let's look at the right-hand side of this expression. We have a sum, with respect to the previous point, of the transition probability times p. So how probable is it, under Gibbs sampling, to move from (x, y, z) to (x prime, y prime, z prime)? Well, first of all we have to move from x to x prime, and that will be the conditional probability of x prime, given y = y and z = z.
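To make the blackboard statement easier to follow, here it is in symbols, as transcribed from the speech: (x, y, z) is the previous point, primes denote the new one, and we want the stationarity condition

$$p(x', y', z') = \sum_{x, y, z} q(x', y', z' \mid x, y, z)\, p(x, y, z),$$

where one big Gibbs step factorizes into the three sub-steps

$$q(x', y', z' \mid x, y, z) = p(x' \mid y, z)\; p(y' \mid x', z)\; p(z' \mid x', y').$$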
Then, given that we have moved to x prime, what is the probability that we move from y to y prime? It will be the conditional probability of y prime, given that we have already moved to x prime, and started with z. And finally, the third conditional probability: that we move to z prime, given that we have already moved to x prime and y prime.

This whole thing is the transition probability q, and we multiply it by the starting probability, p(x, y, z). We want to prove that it equals p(x prime, y prime, z prime), because in that case we will have proven that p is stationary for our Markov chain, and then it means the Markov chain converges to this distribution p.

First of all, note that this part, p(z prime | x prime, y prime), doesn't depend on x, y, or z at all, so we can move it outside the sum. We get p(z prime | x prime, y prime) times the sum of the two remaining terms, times p(x, y, z). And note that the only part which depends on x is p(x, y, z), which basically means we can push the summation over x inside: we write the sum with respect to only y and z outside, and put the summation with respect to x, over all its values from 1 to the cardinality of x, inside. By the way, note that if our variables x, y, and z were continuous, we would have integrals instead of sums, but the overall logic of the proof would be exactly the same.

Okay, so we have this expression, and these two are equal to each other. Now you can note that in the inner part we have just marginalized out x, so we end up with p(y, z), the marginal distribution. Great. Next, we can multiply this marginal by p(x prime | y, z) to get a joint distribution. So the whole thing equals, again, the first term, times a sum of the following: the term p(y prime | x prime, z), times the product of those two factors, which is the joint distribution p(x prime, y, z). And note that the only part which depends on y is this joint distribution.
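For reference, the blackboard expression at this point of the derivation, with p(z' | x', y') pulled out, x marginalized away, and the product recombined into a joint, is

$$p(z' \mid x', y') \sum_{y, z} p(y' \mid x', z)\, \underbrace{p(x' \mid y, z)\, p(y, z)}_{=\, p(x', y, z)}.$$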
So we can move this summation with respect to y inside as well: we write the summation only with respect to z out here, and the summation with respect to y in there, acting on p(x prime, y, z), which is the product of those two terms, okay? And again, note that when we marginalize out y here, we end up with p(x prime, z).

Once more, we can multiply these two terms together to get a joint distribution. So finally, we have something like this: it's, again, the first term, p(z prime | x prime, y prime), times the summation with respect to z of p(y prime | x prime, z) times the joint distribution p(x prime, z). Their product is just the full joint distribution of the three variables y prime, x prime, and z. And this part, since we marginalize out z, equals p(x prime, y prime), by the definition of marginalization; it's just the sum rule of probabilities.

And finally, by multiplying these two terms together, we end up with what we wanted: this thing equals p(x prime, y prime, z prime). So we're done here. We have just proven that if you start with the distribution p at the point (x, y, z), and then make one step of this Gibbs sampling procedure, one coordinate at a time, you end up with the same distribution p, that is, with p(x prime, y prime, z prime). This basically means that the distribution p is stationary for Gibbs sampling, and that the chain will converge to this distribution from any starting point. And that, in turn, means that if we use Gibbs sampling, it will indeed eventually start to produce samples from the distribution p.
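Assembled in one place, the whole blackboard computation reads

$$\begin{aligned}
\sum_{x, y, z} p(x' \mid y, z)\, p(y' \mid x', z)\, p(z' \mid x', y')\, p(x, y, z)
  &= p(z' \mid x', y') \sum_{y, z} p(y' \mid x', z)\, p(x', y, z) \\
  &= p(z' \mid x', y') \sum_{z} p(y' \mid x', z)\, p(x', z) \\
  &= p(z' \mid x', y') \sum_{z} p(x', y', z) \\
  &= p(z' \mid x', y')\, p(x', y') = p(x', y', z').
\end{aligned}$$

As a quick empirical check, here is a small Python example, entirely my own and not from the lecture: a Gibbs sampler for a standard bivariate normal with correlation rho, a case where both one-dimensional conditionals are known in closed form:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iters=20000, burn_in=1000, seed=0):
    # For a zero-mean, unit-variance bivariate normal with correlation rho,
    # each full conditional is Gaussian: x | y ~ N(rho * y, 1 - rho**2).
    rng = np.random.default_rng(seed)
    x = y = 0.0                       # any starting point works
    std = np.sqrt(1.0 - rho**2)
    out = []
    for t in range(n_iters):
        x = rng.normal(rho * y, std)  # sub-step 1: x ~ p(x | y)
        y = rng.normal(rho * x, std)  # sub-step 2: y ~ p(y | new x)
        if t >= burn_in:              # discard pre-convergence draws
            out.append((x, y))
    return np.array(out)

samples = gibbs_bivariate_normal(rho=0.8)
print(np.corrcoef(samples.T)[0, 1])   # close to 0.8, as the proof predicts
```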