Having specified the LDA model, we now turn to inference in LDA. Remember that our LDA model introduced a set of topic-specific vocabulary distributions that are shared throughout the entire corpus. Then, for every word in every document, there's an assignment variable of that word to a specific topic. And finally, for every document, there are the topic proportions in that document, that vector pi_i.

So collectively, these represent our model parameters as well as our assignment variables. But remember that in our unsupervised learning task, we just get words from documents. We get a whole bunch of documents, we transform them to our bag-of-words representation, and that's all we have. And somehow, from this, we have to infer all these word assignment variables and all these topic proportions and topic prevalences, just from the observed words. So it actually seems like a really, really challenging task.

So, just to be clear, the input to LDA, or to inference in LDA, is the set of words from a collection of documents in a corpus. And the output is going to be our set of topic-specific vocabulary distributions, shared throughout the corpus, as well as our document-specific word assignments and our document-specific topic proportions.

But before we get to algorithms for performing this inference task, let's first describe how we might interpret the outputs. One thing we can do is examine the coherence of the learned topics. To do this, we can take the distribution over words in the vocabulary for every topic and order those words by how probable they are in that topic. So we look at the most probable words in every topic and see if they form a coherent set. If they do, then post facto we can actually label these topics with things like science, technology, sports, and so on. This provides a qualitative assessment of the topics present in the corpus.

One other thing I want to emphasize, though, is that these topic distributions are not typically sparse vectors. Typically, they place mass on every word in the vocabulary. But if you look at the most probable words, those often form some kind of interpretable set, if your model is performing well, and you're going to explore this in the assignment.
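To make the "most probable words per topic" idea concrete, here is a minimal sketch assuming you already have a learned topic-by-vocabulary probability matrix. The variable names and the random example values are made up for illustration and are not the lecture's or the assignment's code.

import numpy as np

# Hypothetical learned parameters (shapes assumed, not from the lecture):
# topic_word: K x V matrix, where row k is the vocabulary distribution for topic k.
# vocab: list of V words, aligned with the columns of topic_word.
rng = np.random.default_rng(0)
K, V = 3, 10
topic_word = rng.dirichlet(np.ones(V) * 0.1, size=K)   # each row sums to 1
vocab = [f"word{j}" for j in range(V)]

def top_words(topic_word, vocab, k=5):
    """Return the k most probable words (and their probabilities) for each topic."""
    tops = []
    for dist in topic_word:
        idx = np.argsort(dist)[::-1][:k]                # indices of the largest probabilities
        tops.append([(vocab[j], float(dist[j])) for j in idx])
    return tops

for t, words in enumerate(top_words(topic_word, vocab)):
    print(f"topic {t}:", words)

Scanning these top-word lists is exactly the qualitative check described above: if the words in a topic hang together, you can attach a post hoc label like "science" or "sports" to that topic.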
So I just want to emphasize that the words we're showing here in these lists, and the fact that we're only showing a few words, is not the full description of each of these topics. It's really a much more complicated beast.

The other thing we can look at, and the thing that we're often very interested in, are the topic proportions in every document, because this vector can be used to relate documents. So, which other documents have similar types of topics present? That can be used for retrieval tasks, something you'll also look at in your assignment (see the sketch at the end of this section). You can also use this type of topic proportion vector to allocate an article to multiple categories. So imagine you're some news site and you have an article, and you need to put that article into a category. This type of representation actually allows you to put that article into multiple categories, get more viewership for that article, and present it to more people who might be interested in it.

And finally, you can also use these topic proportions to do things like we described before, like learning the preferences of a given user over a set of topics. This type of description that LDA provides, a set of topics and their relative proportions, is a much more descriptive form than the clustering output we talked about before. That's definitely true for the hard assignments, and for the soft assignments as well, where really that just captured uncertainty in the assignment, but not the fact that there is inherently, as specified in the model, a set of possible topics associated with every document. So this lets us do even fancier things in learning user preferences.

And the last thing we haven't described are the word assignment variables. Typically, honestly, we're not actually interested in these; we're not actually interested in whether a specific word in a specific document is associated with a topic related to science. But these assignment variables are going to play a really critical role in inferring the other model parameters, which are typically the things of interest. And this is just like what we saw before. We'll walk through this explicitly in the next section.
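Before moving on, here is a rough sketch of the retrieval idea mentioned above: relating documents through their topic proportion vectors by ranking them with cosine similarity. The doc_topic matrix and its values are hypothetical, purely for illustration; this is one simple choice of similarity, not the lecture's prescribed method.

import numpy as np

# Hypothetical per-document topic proportions (the pi_i vectors), D x K,
# each row summing to 1; the numbers are made up for illustration.
doc_topic = np.array([
    [0.70, 0.20, 0.10],   # mostly topic 0
    [0.65, 0.25, 0.10],   # a similar mix, so it should rank as "related"
    [0.05, 0.15, 0.80],   # mostly topic 2
])

def most_similar(doc_topic, query_idx, top_n=2):
    """Rank the other documents by cosine similarity of their topic proportions."""
    q = doc_topic[query_idx]
    norms = np.linalg.norm(doc_topic, axis=1) * np.linalg.norm(q)
    sims = doc_topic @ q / norms
    order = [i for i in np.argsort(sims)[::-1] if i != query_idx]
    return [(int(i), float(sims[i])) for i in order[:top_n]]

print(most_similar(doc_topic, query_idx=0))
# Document 1 ranks above document 2 for query document 0, since its topic mix is closer.

The same proportion vectors could also drive multi-category assignment, for example by tagging an article with every topic whose proportion exceeds some threshold, which is the "multiple categories" use case described above.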