So, we have just proved that the Gibbs sampling scheme indeed gives you a correct way of sampling from the desired distribution: the underlying Markov chain indeed converges to the desired distribution. But let us look at a small demo of how it can work in practice.

So, let's look at this simple two-dimensional distribution, which looks like a Gaussian. Note that we don't actually need Gibbs sampling here. First of all, the distribution is two-dimensional, so the rejection sampling scheme can also be applied because the dimensionality is not large; and also it is Gaussian, so we can sample from it efficiently without any fancy techniques. But this is only for illustrative purposes: Gibbs sampling works even in million-dimensional spaces with complicated distributions.

So let's start with the initialization point (0, 0). On the first substep of the first iteration, Gibbs sampling suggests that you find the conditional distribution of X1 given that X2 is 0, and it will look like this, because if we know that X2 is 0, then X1 should be somewhere around 5 or 6 because of the shape of our Gaussian distribution: it has more mass in that region than in other regions for this particular value of X2. Let's sample a point from that conditional; it can look like this, for example. Okay, so this is the first substep. On the next substep, we have to compute the conditional distribution of X2 given that X1 is the value we have just generated. So X1 is 6 or somewhere around there, and the conditional distribution of X2 will look like this. Again, it is just the slice of our Gaussian distribution at X1 equal to around 6. So let's sample from that and get a point like this. This is the result of iteration one, the next point in our Markov chain.

Note that we have not converged yet. We have not even found the region of space where the density of our distribution is high; we are still somewhere in a low-density region. So we cannot use these samples yet, because they are not from the true distribution. So let us proceed.
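As a reference, here is a minimal Python sketch of this two-dimensional demo. The means, standard deviations, and correlation below are illustrative assumptions, not the exact parameters of the distribution shown in the video; the point is only the structure of the two substeps, which use the closed-form conditionals of a bivariate Gaussian.

```python
import numpy as np

# Assumed (illustrative) parameters of the 2D Gaussian, not the lecture's exact values.
mu = np.array([5.0, 1.0])      # means of X1, X2
sigma = np.array([2.0, 1.0])   # standard deviations of X1, X2
rho = 0.8                      # correlation between X1 and X2

def sample_conditional(i, other_value, rng):
    """Sample coordinate i given the other coordinate, using the
    closed-form conditional of a bivariate Gaussian."""
    j = 1 - i
    cond_mean = mu[i] + rho * sigma[i] / sigma[j] * (other_value - mu[j])
    cond_std = sigma[i] * np.sqrt(1.0 - rho ** 2)
    return rng.normal(cond_mean, cond_std)

def gibbs(n_iters, rng):
    x = np.array([0.0, 0.0])   # initialization point (0, 0), as in the demo
    samples = []
    for _ in range(n_iters):
        x[0] = sample_conditional(0, x[1], rng)  # substep 1: X1 | X2
        x[1] = sample_conditional(1, x[0], rng)  # substep 2: X2 | X1
        samples.append(x.copy())                 # one full iteration done
    return np.array(samples)

chain = gibbs(1000, np.random.default_rng(0))
```

Each iteration performs exactly the two substeps from the demo: resample X1 from its conditional given the current X2, then resample X2 given the freshly drawn X1.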
On the next iteration, we again sample from the conditional distribution of X1 given that X2 is around 1, and this conditional is somewhere here. Okay, let's sample from it and move to this point; this is the end of substep one of the second iteration. On substep two, we will generate a point like this, again from the conditional distribution. So, one more substep, and that is the end of the second iteration. Finally, if we repeat this process for about 10 more iterations, we will get points like these, which already look like samples from the Gaussian.

So, this was a demo of how Gibbs sampling works; let us summarize its properties. First of all, the idea of Gibbs sampling is that, instead of solving one complicated problem, generating samples from a multidimensional distribution which you may know only up to a normalization constant, you solve a sequence of one-dimensional sampling problems. This is very nice and makes everything more efficient. But note that, whereas in the first video this week we discussed schemes where you spend some time, generate one sample, and that is it, here we have a chain of samples. So you cannot use the first sample you generate until the chain has converged: we have to wait for, say, the first five hundred iterations and throw those early samples away to get just one sample that we care about.

Another positive feature of Gibbs sampling is that it is really easy to implement, just a few lines of code in Python. And it is almost universally applicable: you can almost always sample from the one-dimensional conditional distributions, and this way you can sample from the multidimensional one.

The negative feature is that it can give you really correlated samples. If you recall the demo, each sample was quite close to the previous one. So you cannot take large steps in the Gibbs scheme; you are always taking local steps, because you change one dimension at a time. This means that your samples are similar to each other, which in turn means that you are not as efficient in estimating expected values as you would wish.
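To make the "sequence of one-dimensional problems" idea concrete, here is a hedged sketch of the general Gibbs loop in Python. The function sample_conditional is a hypothetical user-supplied routine that draws coordinate i from its one-dimensional conditional given the current values of all the other coordinates; the toy version below is only a placeholder to show the expected signature.

```python
import numpy as np

def run_gibbs(x0, sample_conditional, n_sweeps, rng):
    """Generic Gibbs skeleton: one sweep resamples every coordinate from
    its one-dimensional conditional given the current values of the others."""
    x = np.array(x0, dtype=float)
    chain = []
    for _ in range(n_sweeps):
        for i in range(len(x)):
            x[i] = sample_conditional(i, x, rng)  # 1-D sampling problem for coordinate i
        chain.append(x.copy())
    return np.array(chain)

# Placeholder conditional (independent standard normals), just to show the
# expected signature; a real model would plug in its own conditionals here.
def toy_conditional(i, x, rng):
    return rng.normal(0.0, 1.0)

chain = run_gibbs([0.0] * 5, toy_conditional, 1000, np.random.default_rng(1))
```

Note that the inner loop is inherently sequential: coordinate i is resampled using the values of the coordinates that were just updated, which is why the scheme cannot be trivially parallelized across dimensions.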
So let's say that we waited through the first 1,000 iterations of Gibbs sampling and threw those points away, saying that they were useless for us because the Markov chain had not converged yet. Then, starting from iteration 1,001, we collect samples and write them down to use for our expected value estimation. For example, we may say that we collected 1,000 samples. Well, we may think that we have 1,000 samples, but they are so close to each other that in effect it is as if we had, say, one hundred independent samples. The efficiency of estimating expected values with correlated samples is much lower than the efficiency of estimating the same thing with uncorrelated samples, and this can be a problem. So it is an efficiency problem.

Another negative feature is that, because of the small steps, it can be hard to converge; convergence of Markov chains is sometimes called mixing. You have to do lots of steps to find the high-density region of your probability distribution, and you have to throw all those points away because you have not converged yet and cannot use them. So, because of these small local steps, you have to wait longer for convergence than you would if the steps were large. And finally, as we have already discussed, this method is not parallel: if you want to sample from, say, a one-million-dimensional space, you have to sample each dimension one at a time anyway, which can be really slow.

So in the next video, we will discuss some other schemes for building Markov chains that can be used to mitigate some of these problems.
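As a small aside, here is a sketch of the burn-in and "effective number of samples" bookkeeping described above. The burn-in length and the AR(1)-style effective sample size formula are common rules of thumb, not something prescribed in the lecture.

```python
import numpy as np

def effective_sample_size(x):
    """Rough effective sample size via the AR(1) approximation
    n_eff ~ N * (1 - r1) / (1 + r1), where r1 is the lag-1 autocorrelation.
    A crude diagnostic, not a full autocorrelation-time estimate."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    r1 = np.dot(x[:-1], x[1:]) / np.dot(x, x)
    return len(x) * (1.0 - r1) / (1.0 + r1)

def estimate_expectation(chain, burn_in=1000):
    """Throw away the first `burn_in` iterations (the chain has not mixed
    yet) and average the rest to estimate the expected value."""
    kept = np.asarray(chain, dtype=float)[burn_in:]
    return kept.mean(axis=0)
```

With strongly correlated Gibbs samples, the effective sample size can be far smaller than the number of collected points, which is exactly the efficiency problem mentioned above.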