Well, you might be sitting here thinking: why are we sampling just the word indicators? I don't care about those. Is this just a total waste?

Remember that we discussed before that the things we're typically interested in are the topic vocabulary distributions, for interpretability of the topics present in the corpus, as well as the topic proportions within every document, because those are our compact description of the mixed membership of each document.

So what do we do with the output of this collapsed Gibbs sampler, where all of our samples are just of the word indicators? Well, there are a number of things you can do, and I'm just going to describe one. One thing we can do is look at the assignment of all the words in the corpus that maximizes the joint model probability. It's actually the joint collapsed model probability, where we've integrated over all of the model parameters and just look at the probabilities of these word assignment variables, and of course the probabilities of the words themselves given those assignments.

Then, for this best sample of all the word assignment variables, we can think of doing inference post facto, after running our collapsed sampler, on the topic vocabulary distributions. Because once I've conditioned on a set of topic indicators for every word in my corpus, I can form the conditional distribution on my topic vocabulary distributions. This is exactly the distribution we described at a high level when we talked about our uncollapsed, standard Gibbs sampler. So we could think about sampling these vocabulary distributions, and likewise we can think about doing what's often called document embedding, which is just forming the topic proportion vector for a given document. This embedding takes a document and forms its mixed membership representation, and just like in our uncollapsed, standard Gibbs sampler, we can form the conditional distribution of these topic proportions given just the word assignments in the document we're looking at.
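To make this post-facto step concrete, here is a minimal sketch (not from the lecture) assuming documents represented as lists of word ids, a single saved sample of the word assignments, and symmetric Dirichlet hyperparameters alpha (topic proportions) and gamma (topic vocabulary distributions); all names and default values are illustrative. Conditioned on the assignments, each topic's vocabulary distribution and each document's proportions are Dirichlet, so you can either draw samples from those conditionals or, as below, simply report their posterior means.

```python
import numpy as np

def estimate_topics_and_embeddings(docs, assignments, K, V, alpha=0.1, gamma=0.01):
    """Post-hoc estimates from a single (e.g., the best) sample of word assignments.

    docs[d]        : list of word ids (0..V-1) for document d
    assignments[d] : list of topic indices (0..K-1), one per word in docs[d]
    alpha, gamma   : symmetric Dirichlet hyperparameters (illustrative values)
    """
    n_kv = np.zeros((K, V))            # corpus-wide counts of word v assigned to topic k
    n_dk = np.zeros((len(docs), K))    # counts of topic k within document d
    for d, (words, zs) in enumerate(zip(docs, assignments)):
        for w, z in zip(words, zs):
            n_kv[z, w] += 1
            n_dk[d, z] += 1

    # Conditioned on the assignments, topic k's vocabulary distribution is
    # Dirichlet(n_kv[k] + gamma); we take its posterior mean here, though you
    # could equally draw a sample with np.random.default_rng().dirichlet(...).
    phi = (n_kv + gamma) / (n_kv.sum(axis=1, keepdims=True) + V * gamma)

    # Likewise, document d's topic proportions are Dirichlet(n_dk[d] + alpha);
    # the posterior mean serves as the document embedding.
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)
    return phi, theta
```

Note that `phi` uses counts pooled across the whole corpus, while each row of `theta` uses only the counts from its own document, which mirrors the distinction reiterated next.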
So, just to reiterate: the topic vocabulary distributions are corpus-wide quantities, so we have to look at the assignments made throughout the entire corpus to infer them, but for our document-specific topic proportions we just need to look at the assignments made within that specific document.

Then, finally, you can think about embedding new documents. Say you get a whole collection of new documents, and you've already run your collapsed Gibbs sampler; what do you do with these new documents? Well, the formal thing to do is to completely rerun your sampler with the new documents included: add them in, sample everything for the new documents, and then revisit the documents you've already sampled. But often you really can't do that in practice.

So one thing you could think about doing, which is an approximation procedure, is to fix the topic vocabulary distributions using the procedure described on the previous slides. Then, with our topics fixed, trained on the set of documents we've already looked at, we can embed a new document just by running an uncollapsed Gibbs sampler on that document alone. Because remember, to form the word assignments in a given document and the topic proportions in that document, we only need to condition on the topic vocabulary distributions, not the other documents in the corpus. So we can actually embed each one of these new documents in parallel using this type of procedure.
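The lecture doesn't give code for this, so the following is a minimal sketch of that per-document, uncollapsed Gibbs sampler, assuming a fixed K x V matrix `phi` of topic vocabulary distributions (for example, the output of the earlier sketch) and a symmetric Dirichlet prior `alpha` on the proportions; the function name, iteration count, and hyperparameter value are illustrative.

```python
import numpy as np

def embed_new_document(words, phi, alpha=0.1, n_iters=200, seed=0):
    """Embed a new document with the topic vocabulary distributions phi held fixed.

    words : list of word ids in the new document
    phi   : K x V array of topic vocabulary distributions from the training corpus
    """
    rng = np.random.default_rng(seed)
    K = phi.shape[0]
    theta = rng.dirichlet(np.full(K, alpha))   # initial topic proportions
    z = np.zeros(len(words), dtype=int)        # topic indicator for each word

    for _ in range(n_iters):
        # Step 1: resample each word's topic indicator given theta and the fixed topics
        for i, w in enumerate(words):
            p = theta * phi[:, w]
            z[i] = rng.choice(K, p=p / p.sum())
        # Step 2: resample this document's topic proportions given its assignments
        counts = np.bincount(z, minlength=K)
        theta = rng.dirichlet(alpha + counts)

    return theta   # one (post burn-in) sample of the new document's embedding
```

Because each call touches only that document's words, assignments, and proportions, and never the rest of the corpus, separate new documents can be embedded with independent calls run in parallel, which is the point made above.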