To build up to mixed membership models for documents, it's helpful to first present an alternative clustering model to the mixture of Gaussians model we presented in the last module.

So, just to emphasize, we're going back to our clustering model, where we assume the simpler structure in which every document is assigned to a single topic. So far, when we've looked at our documents, we've represented them with a tf-idf vector; we've taken the tf-idf vectors associated with every document in the corpus and used a mixture of Gaussians to discover some set of clusters in this tf-idf space.

But now we're going to use an alternative representation of a document called the bag-of-words representation, where we simply take all of the words present in our document, throw them into a bag, and shake that bag up, so that the order of the words doesn't matter. Our representation of the document is simply going to be an unordered set of words. I use "set" loosely here, because this set will contain multiple occurrences of a unique word if that word appears multiple times in the document. The multiplicity of the unique words matters here, unlike in standard sets, so this is formally called a multiset.

So now let's present a clustering model for this new document representation. To start, we need to specify the prior probability that a given document is associated with a specific cluster. These topic prevalences are going to be exactly like what we had in our mixture of Gaussians case, where they just represent the corpus-wide prevalence of topics. But now our likelihood term is going to be different: instead of scoring every document under a specific Gaussian, as in the mixture of Gaussians case, we're going to take our document in its bag-of-words representation and score that set of words under a topic-specific probability vector over words. Specifically, every topic is going to be associated with a probability distribution over the words in the vocabulary, and using that, we're able to score the words present in the document, saying how probable they are under that specific topic.
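(As a concrete aside, here's a minimal Python sketch of that idea, using a toy document and made-up topic probabilities that aren't from the lecture: the bag-of-words multiset is built with collections.Counter, and each occurrence of a word contributes log p(word | topic) to the document's score under that topic.)

```python
from collections import Counter
import math

# Bag-of-words: an unordered multiset, so word multiplicity is preserved.
document = "the cat sat on the mat the cat".split()
bag = Counter(document)  # Counter({'the': 3, 'cat': 2, 'sat': 1, 'on': 1, 'mat': 1})

# One topic = a probability distribution over the vocabulary
# (toy numbers that sum to 1; a real topic covers the whole vocabulary).
topic = {"the": 0.4, "cat": 0.3, "sat": 0.1, "on": 0.1, "mat": 0.1}

# Likelihood of the document under this topic: each occurrence of a word
# contributes log p(word | topic), so counts scale the log-probabilities.
log_likelihood = sum(count * math.log(topic[word]) for word, count in bag.items())
print(log_likelihood)
```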
We then do this for every topic, and we choose between topics using both this prior and the likelihood, just like in our mixture of Gaussians example.

So, just to be very clear: for every topic, like ones about science, technology, and sports (even though, of course, we don't have those labels; they're just going to be clusters one, two, and three), we're going to have a probability vector over words in the vocabulary. The way I'm showing them here on this slide, they're ordered by how probable those words are in the topic, from most probable to least probable. Whereas in the previous slide, I was just listing the words that actually appeared in the dataset, or in that specific article.

So now we can compare and contrast our mixture of Gaussians clustering model and the clustering model we just specified. In both of these models, our prior topic probabilities, that is, the probability before we actually look at the content of a document that the document came from a given cluster, are given by these pi_k's, and they're specified in exactly the same way in both cases. But in the mixture of Gaussians case, our documents were represented by tf-idf vectors, or some other vector, it could be a word-count vector, and we scored that vector under each one of the Gaussians. Remember, each cluster was defined by a Gaussian, and you would compute the score of a given data point under each of these Gaussians, and then weigh the prior and likelihood terms to come up with an assignment for a given document. But now every document is represented with the bag-of-words representation, and when we go to score the document, we just look at the probability of each of its words under the topic-specific probability vector over words.
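(To tie the comparison together, here's a hedged sketch, again with made-up numbers and a hypothetical assign_topic helper, of how the prior pi_k combines with the bag-of-words likelihood to produce a hard cluster assignment; the full mixture model would turn these same scores into soft responsibilities instead of taking the argmax.)

```python
from collections import Counter
import math

def assign_topic(words, pi, topics):
    """Score each topic as log pi_k + sum_w count(w) * log p(w | topic k),
    then return the index of the best-scoring topic (a hard assignment)."""
    bag = Counter(words)
    scores = [
        math.log(prior) + sum(c * math.log(word_probs[w]) for w, c in bag.items())
        for prior, word_probs in zip(pi, topics)
    ]
    return max(range(len(scores)), key=scores.__getitem__), scores

# Toy setup: two topics over a tiny shared vocabulary (made-up numbers).
pi = [0.6, 0.4]  # corpus-wide topic prevalences, the pi_k's
topics = [
    {"goal": 0.5, "team": 0.3, "atom": 0.1, "quark": 0.1},  # a "sports"-like topic
    {"goal": 0.1, "team": 0.1, "atom": 0.4, "quark": 0.4},  # a "science"-like topic
]

best, scores = assign_topic("atom quark atom team".split(), pi, topics)
print(best, scores)  # the "science"-like topic wins despite its smaller prior
```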