In this video, we're going to cover sampling from one-dimensional distributions. Let's start with the simplest case: a one-dimensional discrete distribution. Say we have a random variable that takes the value a1 with probability 0.6, the value a2 with probability 0.1, and the value a3 with probability 0.3. How can we sample from it?

Well, we have to start from something, and it's hard to generate randomness from nothing on a deterministic computer. So we're going to use a function that generates a uniform random number on the interval from 0 to 1. Such a function is included in almost any programming language's standard library, so it's usually there, and we can base our further sampling techniques on it.

Given this function, how can we generate a sample from the discrete distribution? Let's take the unit interval and decompose it into three chunks, where the first chunk has length 0.6, the second chunk has length 0.1, and the third chunk has length 0.3. Now let's generate a random number in this interval.
If the number lands in the first chunk, we say that we have just generated the value a1, and similarly for the other chunks. So let's draw a random number from our uniform function, which corresponds to a point on this interval. Say it turns out to be somewhere around 0.9, for example. That means we have just generated the value a3.

So why does this work? Well, because if you throw a random point onto an interval, the probability that it lands in some chunk is proportional to the length of that chunk. And the length of the first chunk is 0.6, which is exactly the probability with which we wanted a1 to be generated, and so on. So this is indeed a correct procedure for generating random numbers from discrete distributions with a finite number of values. The bottom line here is that it's really easy to generate samples from discrete distributions.

But note that here we had just three values. If we had, for example, a million or a billion values, it would be much harder, because we would have to spend time proportional to a billion to separate the interval into a billion chunks of the appropriate sizes, and then to find in which particular chunk the random value landed.
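This chunking procedure can be sketched in a few lines of Python. A minimal sketch, using the values and probabilities from the example; the linear scan over the chunks is exactly the cost just discussed:

```python
import random

def sample_discrete(values, probs):
    """Sample one value: drop a uniform point on [0, 1) and see
    which chunk (of length probs[i]) it lands in."""
    u = random.random()            # uniform number on [0, 1)
    cumulative = 0.0
    for value, p in zip(values, probs):
        cumulative += p            # right edge of the current chunk
        if u < cumulative:
            return value
    return values[-1]              # guard against floating-point round-off

# The example from the video: a1 w.p. 0.6, a2 w.p. 0.1, a3 w.p. 0.3.
samples = [sample_discrete(["a1", "a2", "a3"], [0.6, 0.1, 0.3])
           for _ in range(100_000)]
print(samples.count("a1") / len(samples))   # should be close to 0.6
```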
But if you have a finite number of discrete outcomes and this number is not that large, say on the order of a thousand, then you can really efficiently and easily generate samples from the distribution.

Okay, let's move on to continuous distributions. Say, to begin with, that we want to sample a point from a Gaussian. How can we do it? Well, one way is to use the central limit theorem, which says that if we have a few independent random variables and we sum them up, we get something approximately Gaussian. So if we draw, say, 12 random numbers from the uniform distribution on the interval from 0 to 1 and then add them all together, we get something with expected value 6, because the expected value of each x_i is 0.5, the center of the interval, and the sum of 12 x_i's lies between 0 and 12, whose center is 6. So we subtract 6 to make the expected value of this distribution 0. Note that this distribution ranges not from minus infinity to plus infinity as a Gaussian does, but from minus 6 to 6. So it's indeed only an approximation; it can't give you an actual Gaussian, but it's a pretty close one.
So although it can never generate, for example, the number 100, the Gaussian itself is also extremely unlikely to give you 100, so we can be okay with losing these highly improbable parts of the sample space.

And you may ask, why 12? Well, just to make the variance of the resulting random variable z equal to 1, as for a standard Gaussian: each uniform on [0, 1] has variance 1/12, so the sum of 12 of them has variance exactly 1. If you used, say, not 12 but 100 values, you would have to divide the sum by something to make its variance 1 again.

Or, if you don't want to use this kind of loose approximation, you can ask your favorite programming language for a function that generates numbers from a Gaussian directly, and it probably has one; it's a pretty standard thing for a language to provide. That is both more efficient and more accurate than the central limit theorem construction.

So, okay, Gaussians are easy. But what if we have a more complicated distribution, and we don't know how to sample from it? Again, as we discussed in the previous video, this can happen if, for example, it is a posterior distribution over some parameters or some latent variables; we may then know it only up to a normalization constant, not exactly.
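As a quick sketch, the sum-of-12-uniforms trick looks like this in Python, alongside the built-in Gaussian sampler mentioned above:

```python
import random

def approx_standard_normal():
    """Central-limit-theorem trick: sum 12 uniforms on [0, 1] and
    subtract 6.  Mean: 12 * 0.5 - 6 = 0; variance: 12 * (1/12) = 1."""
    return sum(random.random() for _ in range(12)) - 6.0

zs = [approx_standard_normal() for _ in range(100_000)]
mean = sum(zs) / len(zs)
var = sum(z * z for z in zs) / len(zs) - mean ** 2
print(mean, var)   # close to 0 and 1; every z is confined to [-6, 6]

# The exact sampler most standard libraries provide, for comparison:
z_exact = random.gauss(0.0, 1.0)
```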
So what do we do in this situation? Well, we know how to sample from Gaussians, so let's try to use that. Let's upper-bound our distribution with some Gaussian times some constant, for example 2 in this example. The constant is needed because it's impossible to upper-bound one normalized distribution with another one everywhere without multiplying it by something.

Okay, so we have a distribution p(x), the blue curve, which we want to sample from, and we have the orange curve, which is 2 times some Gaussian, which we can sample from. How can we generate samples from the blue curve? Well, let's start by generating a sample from the orange curve, which we can do. With high probability we might get a number around 0, because that's where the orange density is high, so there will be quite a few points there. But the blue density is not high there; it's a local minimum. So we have to somehow correct for this. We have to, for example, throw away some of the points that were generated in this region where the actual blue probability is low. So let's do just that: let's reject some of the points generated from the orange curve. And what is the rule for rejecting a point?
Well, first of all, it should be probabilistic, because you can't just reject all points in some region; otherwise, the resulting distribution of your accepted points would have zero probability in that region. So we have to reject points in a probabilistic manner. And the probability of accepting a particular point x-tilde should be proportional to the height of the blue curve relative to the height of the orange curve at that point.

So if at the current point the two curves coincide, that is, they have the same value, you don't want to reject anything; you are happy to accept the point that was just generated. If, in some region, the orange curve is high but the blue curve is 0, then you don't want to accept anything; you reject all the points in that region. And if it's somewhere in between, if for example the ratio of the blue curve to the orange one is around 0.7, then we expect to keep 0.7, that is 70%, of the points sampled in that region.

So we can say that we generate x-tilde from the orange curve, and then we generate its y coordinate uniformly between 0 and the height of the orange curve at that point.
Then we reject the point if it landed in the red region, above the blue curve, and we accept it if it landed below the blue curve.

You may ask, why does this work? Well, if you look at the points generated by this procedure before rejecting anything, they are uniformly distributed under the orange curve. When we apply the rejection rule, we throw away the points that landed above the blue curve and below the orange one, and what is left is points that are uniformly distributed under the blue curve. And it's natural that the x coordinates of these points are distributed exactly as the blue curve suggests.

So that's great: this is a procedure to generate samples from any distribution you can upper-bound with something you can sample from. But how efficient is it? If you want to obtain one accepted point, how many points will you have to generate from the orange curve before you accept one of them? Well, the answer is 1 divided by the constant: one half in our example, and 1/m in the general case. The reason is that this is the ratio between the area under the blue curve, which is 1 because it's a probability distribution.
And the area under the orange curve, which is m, because it's m times a probability distribution. So if we throw points uniformly into some area and then reject some of them, the probability of accepting a point is proportional to the area in which we accept.

And note that we can use this method even if we know our distribution only up to a normalization constant, as usually happens in probabilistic modeling. In this case, if you can still upper-bound your unnormalized distribution, you can say that the constant, m-tilde, simply absorbs the unknown normalization constant. So you don't know the normalization, but you can still upper-bound the distribution with some constant, and everything works.

So, as a summary, this method is called rejection sampling. It's kind of universally applicable: quite often you can upper-bound your distribution and generate samples from it like this. But the downside is that if q and p are too different from each other, then you have to choose a large constant m, and then you will have to reject most of the points.
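Putting the whole procedure together, here is a minimal sketch of rejection sampling in Python. The bimodal unnormalized target p_unnorm is a hypothetical stand-in for the blue curve, chosen just for illustration; the proposal is a zero-mean Gaussian with standard deviation 2 (its tails must be heavier than the target's, otherwise no finite constant works), and the constant 8 was picked with some slack so that it times the proposal bounds the target everywhere:

```python
import math
import random

def p_unnorm(x):
    """Hypothetical unnormalized bimodal target (the 'blue curve')."""
    return math.exp(-0.5 * (x - 1.5) ** 2) + math.exp(-0.5 * (x + 1.5) ** 2)

def q(x):
    """Proposal density (the Gaussian under the 'orange curve'):
    mean 0, standard deviation 2."""
    return math.exp(-x * x / 8.0) / (2.0 * math.sqrt(2.0 * math.pi))

M = 8.0  # the constant m from the video, chosen so p_unnorm <= M * q everywhere

def rejection_sample():
    """Propose x-tilde from q, draw a y coordinate uniformly under the
    orange curve M * q, and accept only if it falls under the blue curve."""
    while True:
        x = random.gauss(0.0, 2.0)          # sample x-tilde from the proposal
        y = random.uniform(0.0, M * q(x))   # uniform height under orange curve
        if y < p_unnorm(x):                 # below the blue curve: accept
            return x

samples = [rejection_sample() for _ in range(20_000)]
```

Note that the normalization constant of p_unnorm is never needed, just as described above; the expected number of proposals per accepted sample is the area under M * q divided by the area under p_unnorm, here 8 / (2 * sqrt(2 * pi)), or roughly 1.6.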
If, for example, m is 100, then to generate just one point from your actual distribution, the blue curve, you will have to generate 100 points, and on average just one of them will be accepted.

Another problem with this approach is that in the multidimensional case it still works, but the constant m will probably be very, very large, typically exponential in the number of dimensions. It's really unlikely that you will be able to upper-bound a complicated distribution in a ten-dimensional space with a constant smaller than, for example, a million. And having a constant of a million means that to generate just one point, you have to wait a million computational cycles, which is a lot. So this is a method for one dimension, or maybe a few dimensions, but not for thousand-dimensional spaces.