1
00:00:00,047 --> 00:00:04,317
[MUSIC]

2
00:00:04,317 --> 00:00:08,272
In this module we're going to cover
a really popular probabilistic model for

3
00:00:08,272 --> 00:00:12,298
document analysis called
Latent Dirichlet Allocation or LDA.

4
00:00:12,298 --> 00:00:16,920
And LDA is an example of a class of
methods called Mixed Membership Modeling.

5
00:00:16,920 --> 00:00:20,050
And to start with, let's motivate
the use of mixed membership models

6
00:00:20,050 --> 00:00:22,099
in the context of our document analysis.

7
00:00:23,300 --> 00:00:26,940
So far we've described clustering
models where the goal is to group

8
00:00:26,940 --> 00:00:29,350
related articles into disjoint sets or

9
00:00:29,350 --> 00:00:34,940
clusters where these clusters capture
the topics prevalent in the corpus.

10
00:00:34,940 --> 00:00:39,890
And in this context every document
is assigned to a single topic.

11
00:00:41,100 --> 00:00:45,620
A question though is, is an article really
about just one topic, like science?

12
00:00:46,840 --> 00:00:50,810
In the last module, we talked about
soft assignments capturing uncertainty

13
00:00:50,810 --> 00:00:56,270
in this cluster assignment, but
the clustering model still assumes that

14
00:00:56,270 --> 00:01:00,230
each document is assigned
to a single topic.

15
00:01:00,230 --> 00:01:04,060
We might just have uncertainty in what
that assignment is from the observed data.

16
00:01:05,170 --> 00:01:07,980
To make this more concrete,
let's look at a specific example

17
00:01:07,980 --> 00:01:12,210
where we have an article that's entitled
Modeling the Complex Dynamics and

18
00:01:12,210 --> 00:01:15,040
Changing Correlations of Epileptic Events.

19
00:01:15,040 --> 00:01:21,040
And in this article we see words like
patients, epilepsy, EEG, clinical.

20
00:01:21,040 --> 00:01:23,930
So based on the content of this article,

21
00:01:23,930 --> 00:01:28,294
maybe a clustering model would
group this article with other

22
00:01:28,294 --> 00:01:30,953
articles related to a science topic.

23
00:01:30,953 --> 00:01:33,789
And maybe that's cluster 4, in orange, in
our clustering.

24
00:01:33,789 --> 00:01:38,585
And what this bar chart represents
is simply a one-hot encoding

25
00:01:38,585 --> 00:01:43,323
of our assignment of this article
to cluster 4: zi = 4.

26
00:01:43,323 --> 00:01:51,500
But this article also has words like
asynchronous, automatic, model,

27
00:01:51,500 --> 00:01:56,100
and things like this, which might mean that
it's really an article about technology.

28
00:01:56,100 --> 00:01:59,040
And so maybe we should group it
with other tech articles,

29
00:01:59,040 --> 00:02:02,420
which in this case is cluster 2.

30
00:02:02,420 --> 00:02:05,540
Well, our soft assignments that we
talked about in the last module

31
00:02:05,540 --> 00:02:10,380
capture our uncertainty about whether this
article should be assigned to cluster 2 or

32
00:02:10,380 --> 00:02:14,720
cluster 4, assigned to the science
cluster or the technology cluster.

33
00:02:15,820 --> 00:02:19,460
But maybe what we really want to capture
is the fact that the article's about

34
00:02:19,460 --> 00:02:21,200
science and technology.

35
00:02:22,400 --> 00:02:26,100
That is, "Zi" is really 2 and 4.

36
00:02:26,100 --> 00:02:30,110
And so, I put Zi in quotes because
it's not going to be a single variable

37
00:02:30,110 --> 00:02:34,830
associated with a document to
represent this mixed assignment like

38
00:02:34,830 --> 00:02:37,210
what we saw in our clustering model.

39
00:02:37,210 --> 00:02:41,750
But in essence, what we're saying is
that this document has membership

40
00:02:41,750 --> 00:02:43,650
in both of these clusters, 2 and 4.

41
00:02:43,650 --> 00:02:48,560
And we're going to go through in this
module exactly how we think about

42
00:02:48,560 --> 00:02:52,300
formalizing this type of mixed
membership assignment.

43
00:02:52,300 --> 00:02:57,140
And importantly, the other thing that
we're going to want to capture is not only

44
00:02:57,140 --> 00:03:02,130
which topics are present in this document,
but what's the relative proportion?

45
00:03:02,130 --> 00:03:03,630
How prevalent are these topics?

46
00:03:05,040 --> 00:03:09,692
So this is where mixed membership
models come in because mixed membership

47
00:03:09,692 --> 00:03:14,194
models allow us to associate any given
data point with a set of different

48
00:03:14,194 --> 00:03:18,788
cluster assignments or, in this case,
a document with a set of topics.

49
00:03:18,788 --> 00:03:23,535
Rather than assuming that every document
is associated with just a single cluster

50
00:03:23,535 --> 00:03:27,798
or topic or capturing uncertainty in
that single assignment like the soft

51
00:03:27,798 --> 00:03:30,016
assignments we talked about before.

52
00:03:30,016 --> 00:03:34,269
[MUSIC]