We're now ready to present our latent Dirichlet allocation mixed membership model. Remember, our goal here is to associate with each document a collection of topics present in that document, as well as their relative proportions in the document, that is, how prevalent those topics are.

Recall that in the clustering model we just presented for our bag-of-words representation of a document, we introduced a set of topic-specific vocabulary distributions, assigning probabilities to every word in the vocabulary, specific to each topic, and then the clustering model looked to assign an entire document to a single topic. For that assignment, it would score all of the words in the document under that topic's distribution. The model also introduced a corpus-wide distribution over the prevalence of topics throughout all the documents in the corpus: that is, for any given document we grab, how likely is it to come from each of these topics?

In latent Dirichlet allocation, we introduce this same set of topic-specific vocabulary distributions, but now, when we look at a given document, we look to assign every word to a single topic. Instead of assigning the entire document, every word gets an assignment variable z_iw, indicating, for the w-th word in document i, which topic it's assigned to.

Then, when we go to score a document, we score each word under its assigned topic, so we score all the orange words under the orange topic, blue words under the blue topic, and so on, and that determines how good those assignments are.

Finally, there's one more difference in LDA. Instead of introducing a corpus-wide distribution over topic prevalences, each document has its own distribution over the prevalence of topics in that document. So now, instead of a single pi shared globally throughout the corpus, we have pi_i specific to document i, and this vector represents our desired topic prevalences in that specific document.
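To make this story concrete, here is a minimal sketch in Python of the pieces described above: topic-specific vocabulary distributions, a per-document topic-proportion vector pi_i drawn from a Dirichlet prior (the "Dirichlet" in LDA), and a per-word topic assignment z_iw. The toy vocabulary, number of topics, and hyperparameter value are illustrative assumptions, not anything from the lecture; the scoring helper simply evaluates each word under its assigned topic's distribution, as described.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: a toy vocabulary, 3 topics, and a made-up
# Dirichlet concentration parameter.
vocab = ["baseball", "pitcher", "planet", "orbit", "senate", "vote"]
V = len(vocab)   # vocabulary size
K = 3            # number of topics
alpha = 0.5      # concentration for per-document topic proportions

# Topic-specific vocabulary distributions: one probability vector over
# the vocabulary per topic (each row sums to 1).
topic_word = rng.dirichlet(np.ones(V), size=K)

def generate_document(num_words):
    """Generate one document under the LDA generative sketch."""
    # Per-document topic proportions pi_i (unlike the clustering model's
    # single corpus-wide pi, each document gets its own vector).
    pi_i = rng.dirichlet(alpha * np.ones(K))

    words, assignments = [], []
    for _ in range(num_words):
        # Per-word topic assignment z_iw, drawn from this document's proportions.
        z_iw = rng.choice(K, p=pi_i)
        # The word itself is drawn from that topic's vocabulary distribution.
        w = rng.choice(V, p=topic_word[z_iw])
        words.append(vocab[w])
        assignments.append(z_iw)
    return pi_i, assignments, words

def score_assignments(words, assignments):
    """Log-probability of the words under their assigned topics:
    'orange words under the orange topic, blue words under the blue topic'."""
    return sum(np.log(topic_word[z, vocab.index(w)])
               for w, z in zip(words, assignments))

pi_i, z, doc = generate_document(num_words=10)
print("topic proportions pi_i:", np.round(pi_i, 2))
print("word -> assigned topic:", list(zip(doc, z)))
print("score of these assignments:", round(score_assignments(doc, z), 2))
```

In this sketch, the contrast with the clustering model shows up in two places: the assignment variable z_iw lives inside the per-word loop rather than once per document, and pi_i is sampled fresh for each document rather than fixed once for the whole corpus.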