Let's now describe the algorithm that produces these random samples, and to begin with, let's just present a standard implementation of Gibbs sampling.

Gibbs sampling treats both our assignment variables and our model parameters in exactly the same manner. Whereas, remember, when we were looking at the EM algorithm, there were different updates for the assignment variables than for the model parameters. What Gibbs sampling does, in its most standard implementation, is simply cycle through all of these assignment variables and model parameters and randomly sample each one from a conditional distribution, where we condition on the previously sampled values of all the other model parameters and assignment variables, and we also condition on our observations. So from iteration to iteration, the values of the assignment variables and model parameters we condition on are going to change, but the observations are always the same set of values: whatever we observed in our data set.

Well, let's look at this in pictures for our LDA model. Let's imagine that at some iteration of our Gibbs sampler, we have the following instantiation of all of our assignment variables and model parameters, which in this case are the topic vocabulary distributions, the document-specific assignments of words to topics, and the document-specific topic proportions.

At the next iteration of our Gibbs sampler, we might consider reassigning all of the assignment variables of words in one document to topics. So these are the z_iw's for some document i, and when we go to resample these variables, we form a conditional distribution in which we fix the values of all of the topic vocabulary distributions as well as the topic proportions in this document.
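In symbols (writing π_i for document i's topic proportions, φ_k for topic k's vocabulary distribution, x for the observed words, and α, γ for the prior hyperparameters; this shorthand is ours and is not spelled out in the lecture), one full sweep of this standard Gibbs sampler, including the parameter updates described later in this section, draws in turn

$$
z_{iw} \sim p(z_{iw} \mid \pi_i, \phi, x_{iw}), \qquad
\pi_i \sim p(\pi_i \mid z_i, \alpha), \qquad
\phi_k \sim p(\phi_k \mid z, x, \gamma),
$$

cycling over every word w in every document i, then every document's proportions, then every topic k, always conditioning on the most recently sampled values of everything else.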
So a question is: what's the form of this conditional distribution? Well, let's look at just one word in this document, say "EEG", and let's assume that this is word w, the wth word in document i. Now let's look at r_iw2, which will be our notation, just like when we were talking about responsibilities in the EM algorithm. This is going to indicate the probability of assigning z_iw = 2, meaning assigning word w in document i to the second topic.

And what is this probability? Well, what's the prior probability that I randomly choose a word in this document and it happens to be from topic two, before I actually look at the value of that word? That's just how prevalent topic two is in this document. So here we have π_i2 being the prior probability that z_iw = 2. And then we multiply by the likelihood of observing the word "EEG" under the second topic. So this will be the probability of "EEG", the actual value of this wth word in the document, given z_iw = 2, that is, given this probability vector right here, which we're going to grab out right here. And what we do to compute this likelihood is simply look at topic two, scroll all the way down until we find the word "EEG", and then read off the probability of that word within this topic; here that probability is probably pretty low.

Then the final thing we do to compute this probability is normalize over all possible assignments. So, summing over all j = 1 to capital K, the total number of topics, we look at π_ij times the probability of "EEG" under an assignment of the wth word in document i to cluster, or rather topic, j.

Okay, so this looks exactly like the responsibilities we saw in EM, except that here, when we look at the prior probability of a given assignment, we look specifically within this document, because we have these document-specific topic prevalences. And we now score our word under a given topic probability vector, whereas for mixtures of Gaussians we were scoring a whole data vector under a given Gaussian distribution.

Okay, so the idea is that, for this specific word, we would compute the responsibility for every possible topic, not just topic two: topic one, two, three, all the way to topic capital K. And then we normalize and look at these normalized numbers, so that this whole vector sums to one. So this represents what's called a probability mass function.
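Written out with the notation above, the conditional probability we just described in words is

$$
r_{iw2} \;=\; \frac{\pi_{i2}\, P(x_{iw} = \text{``EEG''} \mid \phi_2)}{\sum_{j=1}^{K} \pi_{ij}\, P(x_{iw} = \text{``EEG''} \mid \phi_j)},
$$

and likewise r_iwk for every topic k = 1, ..., K; the vector (r_iw1, ..., r_iwK) is exactly the probability mass function just mentioned.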
So it's a distribution over a set of integers, one to capital K, and we just draw a value randomly from this distribution. That value is going to be our assignment of the word "EEG" to a given topic. So perhaps, for example, for this word "EEG", we would assign it to topic one, since that's the topic about science; but of course, through this random set of assignments we're making here, we could have drawn any topic to assign to this specific word.

Okay, we repeat this procedure for every word in this document, and then we can think about resampling the topic proportions for this document, given the set of word assignments that we've just made. So what informs these topic proportions for this document? Do we care about the topic vocabulary distributions? No, we actually don't. All we need are the counts of how many times each topic was used in this document.

But these counts are going to be regularized by our Bayesian prior, because, remember, we discussed before that we can think of the Bayesian prior as introducing a set of what we called pseudo-observations. So we can think of every topic as having a fixed number of observations already in that topic, which bias the distribution away from relying only on the observed counts in this document. So we use both the observed counts in this document, given the sampled set of word assignment variables, and the parameters of our Bayesian prior to form a distribution over these topic proportions, and then we sample the topic proportions from that distribution. The specific form of this distribution, however, is beyond the scope of this course.

The point here is that we can sample these topic proportions, and then we repeat this process for every document in our corpus. So we repeat sampling the word assignment variables and the topic proportions for each document in our entire data set.
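To make these two per-document steps concrete, here is a minimal numpy sketch. It assumes a standard conjugate Dirichlet prior on the topic proportions, whose hyperparameter `alpha` plays the role of the pseudo-observations just described (the exact form of that update is the part the lecture leaves out of scope); the function names and all numbers are illustrative, not from the lecture.

```python
import numpy as np

def resample_word_assignments(doc, pi_i, phi, rng):
    """Draw a new topic assignment for every word in one document.

    doc  : 1-D int array of word ids for document i
    pi_i : length-K vector of this document's topic proportions
    phi  : (K, V) matrix of topic vocabulary distributions
    """
    K = phi.shape[0]
    z_i = np.empty(len(doc), dtype=int)
    for w, word_id in enumerate(doc):
        resp = pi_i * phi[:, word_id]      # prior prevalence times word likelihood
        resp /= resp.sum()                 # normalize: a PMF over the K topics
        z_i[w] = rng.choice(K, p=resp)     # draw the assignment at random
    return z_i

def resample_topic_proportions(z_i, K, alpha, rng):
    """Draw new topic proportions for one document from its topic-usage counts
    plus the prior's pseudo-counts (assuming a conjugate Dirichlet prior)."""
    counts = np.bincount(z_i, minlength=K)
    return rng.dirichlet(alpha + counts)

# Toy usage (all numbers hypothetical): 3 topics, a 10-word vocabulary.
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(10), size=3)
pi_i = np.array([0.5, 0.3, 0.2])
doc = np.array([0, 4, 4, 7, 2])
z_i = resample_word_assignments(doc, pi_i, phi, rng)
pi_i = resample_topic_proportions(z_i, K=3, alpha=np.full(3, 0.1), rng=rng)
```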
Then, having done this, we can turn to the corpus-wide topic vocabulary distributions and reassign those as well. And now, when we go to figure out how probable our words are within a given topic, what informs that? Well, we can simply look at our word assignment variables across the entire corpus. For example, for the word "EEG", we can count how many times the word "EEG" was assigned to topic one, and we can use that information to inform us of how probable "EEG" is under topic one. And we can do this for every word in our vocabulary, for each one of these different topics. But again, these counts of topic usage within the corpus are regularized by the priors placed over the topic vocabulary distributions in our Bayesian framework.

Okay, so in summary, we randomly resample our topic vocabulary distributions, and then the Gibbs sampling algorithm repeats these steps again, and again, and again, and again: resampling our word assignment variables, our document-specific topic proportions, and our corpus-wide topic vocabulary distributions, over and over, until we run out of our computational budget.
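The corpus-wide step can be sketched the same way: tally, for every topic, how many times each vocabulary word was assigned to it across all documents, add the prior's pseudo-counts, and draw each topic's vocabulary distribution. Again this assumes a standard conjugate Dirichlet prior, with an illustrative hyperparameter `gamma` that is not named in the lecture.

```python
import numpy as np

def resample_topic_vocab(docs, z, K, V, gamma, rng):
    """Draw new topic vocabulary distributions for the whole corpus.

    docs : list of 1-D int arrays of word ids, one array per document
    z    : list of 1-D int arrays of topic assignments, matching docs
    """
    counts = np.zeros((K, V))
    for doc, z_i in zip(docs, z):
        # counts[k, v] accumulates how many times word v was assigned to topic k
        np.add.at(counts, (z_i, doc), 1)
    # One Dirichlet draw per topic: observed counts plus pseudo-counts.
    return np.vstack([rng.dirichlet(gamma + counts[k]) for k in range(K)])
```

Here `gamma` can be a scalar (broadcast across the vocabulary) or a length-V vector of pseudo-counts, biasing the drawn distributions away from relying only on the raw corpus counts.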