So, we have just proved that the Gibbs sampling scheme indeed gives you a correct way of sampling from the desired distribution: the underlying Markov chain indeed converges to the desired distribution. But let us look at a small demo of how it can work in practice.

So, let's look at this simple two-dimensional distribution, which looks like a Gaussian. Note that we don't actually need Gibbs sampling here. First of all, the distribution is two-dimensional, so the rejection sampling scheme can also be applied because the dimensionality is not large; and also it is Gaussian, so we can sample from it efficiently without any fancy techniques. But this is only for illustrative purposes: Gibbs sampling works even in million-dimensional spaces with complicated distributions.

So let's start with the initialization point (0, 0). On the first substep of the first iteration, Gibbs sampling suggests that you find the conditional distribution of X1 given that X2 is 0, and it will look like this, because if we know that X2 is 0, then X1 should be somewhere around 5 or 6 because of the shape of our Gaussian distribution: it has more mass in that region than in other regions for this particular value of X2. Let's sample a point from that conditional; it can look like this, for example. Okay, so this is the first substep. On the next substep, we have to compute the conditional distribution of X2 given that X1 is the value we have just generated. So X1 is 6 or somewhere around there, and the conditional distribution of X2 will look like this. Again, it is just the slice of our Gaussian distribution at X1 equal to around 6. So let's sample from that and get a point like this. This is the result of iteration one, the next point in our Markov chain.

Note that we have not converged yet. We have not even found the region of space where the density of our distribution is high; we are still somewhere in a low-density region. So we cannot use these samples yet, because they are not from the true distribution. So let us proceed.
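As a reference, here is a minimal Python sketch of this two-dimensional demo. The means, standard deviations, and correlation below are illustrative assumptions, not the exact parameters of the distribution shown in the video; the point is only the structure of the two substeps, which use the closed-form conditionals of a bivariate Gaussian.

```python
import numpy as np

# Assumed (illustrative) parameters of the 2D Gaussian, not the lecture's exact values.
mu = np.array([5.0, 1.0])      # means of X1, X2
sigma = np.array([2.0, 1.0])   # standard deviations of X1, X2
rho = 0.8                      # correlation between X1 and X2

def sample_conditional(i, other_value, rng):
    """Sample coordinate i given the other coordinate, using the
    closed-form conditional of a bivariate Gaussian."""
    j = 1 - i
    cond_mean = mu[i] + rho * sigma[i] / sigma[j] * (other_value - mu[j])
    cond_std = sigma[i] * np.sqrt(1.0 - rho ** 2)
    return rng.normal(cond_mean, cond_std)

def gibbs(n_iters, rng):
    x = np.array([0.0, 0.0])   # initialization point (0, 0), as in the demo
    samples = []
    for _ in range(n_iters):
        x[0] = sample_conditional(0, x[1], rng)  # substep 1: X1 | X2
        x[1] = sample_conditional(1, x[0], rng)  # substep 2: X2 | X1
        samples.append(x.copy())                 # one full iteration done
    return np.array(samples)

chain = gibbs(1000, np.random.default_rng(0))
```

Each iteration performs exactly the two substeps from the demo: resample X1 from its conditional given the current X2, then resample X2 given the freshly drawn X1.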
On the next iteration, we again sample from the conditional distribution of X1 given that X2 is around 1, and this conditional is somewhere here. Okay, let's sample from it and move to this point; this is the end of substep one of the second iteration. On substep two, we will generate a point like this, again from the conditional distribution. So, one more substep, and that is the end of the second iteration. Finally, if we repeat this process for about 10 more iterations, we will get points like these, which already look like samples from the Gaussian.

So, this was a demo of how Gibbs sampling works; let us summarize its properties. First of all, the idea of Gibbs sampling is that, instead of solving one complicated problem, generating samples from a multidimensional distribution which you may know only up to a normalization constant, you solve a sequence of one-dimensional sampling problems. This is very nice and makes everything more efficient. But note that, whereas in the first video this week we discussed schemes where you spend some time, generate one sample, and that is it, here we have a chain of samples. So you cannot use the first sample you generate until the chain has converged: we have to wait for, say, the first five hundred iterations and throw those early samples away to get just one sample that we care about.

Another positive feature of Gibbs sampling is that it is really easy to implement, just a few lines of code in Python. And it is almost universally applicable: you can almost always sample from the one-dimensional conditional distributions, and this way you can sample from the multidimensional one.

The negative feature is that it can give you really correlated samples. If you recall the demo, each sample was quite close to the previous one. So you cannot take large steps in the Gibbs scheme; you are always taking local steps, because you change one dimension at a time. This means that your samples are similar to each other, which in turn means that you are not as efficient in estimating expected values as you would wish.
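To make the "sequence of one-dimensional problems" idea concrete, here is a hedged sketch of the general Gibbs loop in Python. The function sample_conditional is a hypothetical user-supplied routine that draws coordinate i from its one-dimensional conditional given the current values of all the other coordinates; the toy version below is only a placeholder to show the expected signature.

```python
import numpy as np

def run_gibbs(x0, sample_conditional, n_sweeps, rng):
    """Generic Gibbs skeleton: one sweep resamples every coordinate from
    its one-dimensional conditional given the current values of the others."""
    x = np.array(x0, dtype=float)
    chain = []
    for _ in range(n_sweeps):
        for i in range(len(x)):
            x[i] = sample_conditional(i, x, rng)  # 1-D sampling problem for coordinate i
        chain.append(x.copy())
    return np.array(chain)

# Placeholder conditional (independent standard normals), just to show the
# expected signature; a real model would plug in its own conditionals here.
def toy_conditional(i, x, rng):
    return rng.normal(0.0, 1.0)

chain = run_gibbs([0.0] * 5, toy_conditional, 1000, np.random.default_rng(1))
```

Note that the inner loop is inherently sequential: coordinate i is resampled using the values of the coordinates that were just updated, which is why the scheme cannot be trivially parallelized across dimensions.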
So let's say that we waited through the first 1,000 iterations of Gibbs sampling and threw those points away, saying that they were useless for us because the Markov chain had not converged yet. Then, starting from iteration 1,001, we collect samples and write them down to use for our expected value estimation. For example, we may say that we collected 1,000 samples. Well, we may think that we have 1,000 samples, but they are so close to each other that in effect it is as if we had, say, one hundred independent samples. The efficiency of estimating expected values with correlated samples is much lower than the efficiency of estimating the same thing with uncorrelated samples, and this can be a problem. So it is an efficiency problem.

Another negative feature is that, because of the small steps, it can be hard to converge; convergence of Markov chains is sometimes called mixing. You have to do lots of steps to find the high-density region of your probability distribution, and you have to throw all those points away because you have not converged yet and cannot use them. So, because of these small local steps, you have to wait longer for convergence than you would if the steps were large. And finally, as we have already discussed, this method is not parallel: if you want to sample from, say, a one-million-dimensional space, you have to sample each dimension one at a time anyway, which can be really slow.

So in the next video, we will discuss some other schemes for building Markov chains that can be used to mitigate some of these problems.
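As a small aside, here is a sketch of the burn-in and "effective number of samples" bookkeeping described above. The burn-in length and the AR(1)-style effective sample size formula are common rules of thumb, not something prescribed in the lecture.

```python
import numpy as np

def effective_sample_size(x):
    """Rough effective sample size via the AR(1) approximation
    n_eff ~ N * (1 - r1) / (1 + r1), where r1 is the lag-1 autocorrelation.
    A crude diagnostic, not a full autocorrelation-time estimate."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    r1 = np.dot(x[:-1], x[1:]) / np.dot(x, x)
    return len(x) * (1.0 - r1) / (1.0 + r1)

def estimate_expectation(chain, burn_in=1000):
    """Throw away the first `burn_in` iterations (the chain has not mixed
    yet) and average the rest to estimate the expected value."""
    kept = np.asarray(chain, dtype=float)[burn_in:]
    return kept.mean(axis=0)
```

With strongly correlated Gibbs samples, the effective sample size can be far smaller than the number of collected points, which is exactly the efficiency problem mentioned above.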