[MUSIC]

And finally, in our fourth module, we looked at a probabilistic model that provides a more intricate description of our data points, and of the relationships between data points, than our simple clustering representation. In particular, we looked at something called mixed membership modeling, and for a document analysis task this model is called latent Dirichlet allocation.

But before we presented latent Dirichlet allocation, or LDA for short, we presented an alternative document clustering model, where we introduced a set of topic-specific distributions over the words in the vocabulary, where, remember, every topic is a different cluster. Then every document was assigned to a cluster just as before. But in forming that assignment, the score of the document under a cluster was computed by looking at a bag-of-words representation of the document, so just an unordered set of the words that appear in that document, and then scoring those words under that cluster's topic-specific distribution; a minimal sketch of this scoring appears below. And here, just like in the mixture models we described previously, every cluster, or topic in this case, has a specific prevalence in the overall corpus, so this is a distribution over the topics that appear in the entire corpus. In this module, we compared and contrasted this clustering model with the mixture of Gaussians clustering model we presented in the third module.

Then we turned to the LDA model itself, where every word in every document has an assignment variable linking that word to a specific topic. So when we think about scoring a document in LDA, we think of scoring every word under its associated topic, where these topics are defined exactly as in the alternative clustering model we just described: there is a distribution over every word in the vocabulary specific to each topic. But the fact that there is a topic indicator per word, rather than per document, is not the only thing that distinguishes this model from the clustering model we just described. The other difference is that we introduced a topic proportion vector specific to each document, rather than having it represent corpus-wide topic proportions. And this is really one of the key aspects of LDA, because this forms our mixed membership representation of every document.
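To make the bag-of-words scoring in that alternative clustering model concrete, here is a minimal sketch in Python. The names topic_word_probs (a K x V matrix of topic-specific distributions over the vocabulary), topic_prevalence (corpus-wide topic proportions), and the toy numbers are illustrative assumptions, not quantities from the course materials.

```python
import numpy as np

def score_document(word_counts, topic_word_probs, topic_prevalence):
    """Score a bag-of-words document under each topic (cluster).

    word_counts      : length-V vector of word counts for one document
    topic_word_probs : K x V matrix; row k is topic k's distribution over the vocabulary
    topic_prevalence : length-K vector of corpus-wide topic proportions
    Returns the log-score of the document under every topic.
    """
    # log p(topic k) + sum over words of count(w) * log p(w | topic k)
    log_likelihood = word_counts @ np.log(topic_word_probs).T   # shape (K,)
    return np.log(topic_prevalence) + log_likelihood

# Toy example: 3 topics over a 5-word vocabulary (made-up numbers)
topic_word_probs = np.array([[0.5, 0.2, 0.1, 0.1, 0.1],
                             [0.1, 0.1, 0.5, 0.2, 0.1],
                             [0.2, 0.2, 0.2, 0.2, 0.2]])
topic_prevalence = np.array([0.5, 0.3, 0.2])
doc = np.array([4, 1, 0, 0, 2])   # unordered word counts: the bag of words

scores = score_document(doc, topic_word_probs, topic_prevalence)
print("best cluster:", scores.argmax())
```

In LDA, by contrast, the single corpus-wide prevalence vector and the single cluster assignment per document are replaced by per-document topic proportions and per-word topic indicators.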
So a document doesn't belong to just one topic; it belongs to a collection of topics, with different weights on how much membership the document has in each one of these topics. In this module, we described how we can think of these topic proportions as a learned feature representation, which we can use to do things like allocate an article to multiple sections on a news website, relate different articles to one another, or learn a user's preferences over different topics. Likewise, we talked about how we can look at the topic distributions over the vocabulary to describe post facto what these topics are really about.

So these are the types of inferences we can draw from LDA. But the question is, how do we learn this structure from data? Just like in clustering, this is a fully unsupervised task, where we just provide the set of words in a set of documents in the corpus, and somehow from this we want to extract the topic vocabulary distributions and the document topic proportions. And critical to doing this, just like in clustering, is inferring the assignments of the words to specific topics.

In this module, we described that LDA is specified as a Bayesian model, and so we described a Bayesian inference procedure for learning our model parameters as well as these assignment variables. The algorithm we described is called Gibbs sampling. At first, we presented a vanilla version of Gibbs sampling, where we simply iterate between all of these assignment variables and model parameters, randomly reassigning each conditioned on the instantiated values of all the other parameters and variables. So first, we can randomly reassign the topics for every word in a document. Then we can fix these and sample the topic proportion vector for that specific document, and repeat these steps for all documents in the corpus. Then, having fixed those values, we can resample the topic vocabulary distributions; a minimal sketch of one such sweep is included below. But then in the module we described a slightly fancier version of sampling that we can perform in LDA, called collapsed Gibbs sampling.
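As a rough illustration of the vanilla (uncollapsed) Gibbs sampler just described, here is a minimal sketch of one sweep. The arrays theta (document topic proportions), phi (topic vocabulary distributions), and z (per-word topic assignments), along with scalar Dirichlet hyperparameters alpha and gamma, are assumptions made for this sketch; it is not the course's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def vanilla_gibbs_sweep(docs, z, theta, phi, alpha, gamma):
    """One sweep of uncollapsed Gibbs sampling for LDA.

    docs  : list of word-id lists, one per document
    z     : list of integer topic-assignment arrays, aligned with docs
    theta : D x K document-topic proportions
    phi   : K x V topic-vocabulary distributions
    alpha, gamma : scalar Dirichlet hyperparameters
    """
    K, V = phi.shape
    for d, words in enumerate(docs):
        # 1) Resample each word's topic assignment given theta and phi.
        for i, w in enumerate(words):
            p = theta[d] * phi[:, w]                 # doc d's affinity x topic's affinity for word w
            z[d][i] = rng.choice(K, p=p / p.sum())
        # 2) Resample this document's topic proportions given its assignments.
        counts = np.bincount(z[d], minlength=K)
        theta[d] = rng.dirichlet(alpha + counts)
    # 3) Resample each topic's distribution over the vocabulary.
    for k in range(K):
        word_counts = np.zeros(V)
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                if z[d][i] == k:
                    word_counts[w] += 1
        phi[k] = rng.dirichlet(gamma + word_counts)
    return z, theta, phi
```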
In collapsed Gibbs sampling, we analytically integrate out all of these model parameters, the topic vocabulary distributions and the document-specific topic proportions. We then just sequentially sample each indicator variable assigning a given word to a specific topic, conditioned on all the other assignments made in that document and in every other document in the corpus. We went through a derivation of the form of this conditional distribution; specifically, there are two terms. One is how much a given document likes the specific topic, and the other is how much that topic likes the specific word being considered. We said that we multiply those two terms together, renormalize across all possible assignments we could make, and then use that distribution to sample a new topic indicator for that specific word. Then we cycle through all words in the document and all documents in the corpus; a minimal sketch of this resampling step appears at the end of this recap.

Finally, in this module we talked about how we can use the output of Gibbs sampling to do Bayesian inference. Remember, if we're thinking about making predictions in the Bayesian framework, we want to integrate over our uncertainty in what values the model parameters can take. So we talked about how we can take each one of our Gibbs samples, form predictions from that sample, and then average across those samples. Alternatively, and something that's very commonly done in practice, we can just look at the one sample that maximizes what we call the joint model probability and then use that to draw inferences.

So in summary, as you've seen in what is supposed to be just a brief recap, we've covered an enormous number of topics and some very, very advanced concepts. We looked at a bunch of different models as well as a bunch of different algorithms. And through this process we learned some machine learning concepts that are very general and very useful beyond just clustering and retrieval. For example, we talked about distance metrics that apply in many different domains. We talked about approximation algorithms, unsupervised learning tasks, probabilistic modeling, scalability through notions of data parallelism, and finally this idea of Bayesian models and Bayesian inference.
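To tie the collapsed Gibbs sampling step described above to something concrete, here is a minimal sketch of resampling the topic indicator for a single occurrence of a word. The count arrays and hyperparameter names (doc_topic_counts, topic_word_counts, topic_counts, alpha, gamma) are assumptions for illustration; the two terms that get multiplied and renormalized correspond to how much the document likes each topic and how much each topic likes the word.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_word_topic(d, w, old_k, doc_topic_counts, topic_word_counts,
                        topic_counts, alpha, gamma):
    """Collapsed Gibbs step: resample the topic of one occurrence of word w in doc d.

    doc_topic_counts  : D x K counts of word assignments per document
    topic_word_counts : K x V counts of each word assigned to each topic (corpus-wide)
    topic_counts      : length-K totals of words assigned to each topic
    """
    V = topic_word_counts.shape[1]
    # Remove this word's current assignment from the counts.
    doc_topic_counts[d, old_k] -= 1
    topic_word_counts[old_k, w] -= 1
    topic_counts[old_k] -= 1
    # Term 1: how much document d likes each topic (smoothed local counts).
    doc_term = doc_topic_counts[d] + alpha
    # Term 2: how much each topic likes word w (smoothed corpus-wide counts).
    word_term = (topic_word_counts[:, w] + gamma) / (topic_counts + V * gamma)
    # Multiply, renormalize across all possible topic assignments, and sample.
    p = doc_term * word_term
    new_k = rng.choice(len(p), p=p / p.sum())
    # Add the word back in under its newly sampled topic.
    doc_topic_counts[d, new_k] += 1
    topic_word_counts[new_k, w] += 1
    topic_counts[new_k] += 1
    return new_k
```

Cycling this step over every word in every document gives one pass of the collapsed sampler.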
And having gone through this course, you now have a really, really extensive set of tools to go out and tackle quite different problems than the ones we saw in the regression and classification courses.

[MUSIC]