[MUSIC] And finally, in our fourth module, we look at a probabilistic model that provides a more intricate description of our data points, and the relationships between data points, than our simple clustering representation. In particular, we look at something called mixed membership modeling, and for a document analysis task, this model is called Latent Dirichlet allocation. But before presenting Latent Dirichlet allocation, or LDA for short, we presented an alternative document clustering model, where we introduced a set of topic-specific distributions over the words in the vocabulary, where remember every topic is a different cluster. And then every document was assigned to a cluster just as before. But in forming that assignment, the score of the document under the cluster was computed by just looking at a bag-of-words representation of the document, so just an unordered set of the words that appear in that document, and then scoring those words under the cluster-specific topic distribution. And here, just like in the mixture models we described previously, every cluster, or topic in this case, has a specific prevalence in the overall corpus. So this is a distribution over topics that appear in the entire corpus. In this module, we compared and contrasted this clustering model with the mixture of Gaussians clustering model we presented in the third module. And then we turned to the LDA model itself, where every word in every document has an assignment variable linking that word to a specific topic. So when we think about scoring a document in LDA, we think of scoring every word under its associated topic, where these topics are defined exactly like in the alternative clustering model we just described: there's a distribution over every word in the vocabulary, specific to each topic. But the fact that there's a topic indicator per word in the document, rather than per document, is not the only thing that distinguishes this model from the clustering model.
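As a rough sketch of this per-word scoring idea (the toy topics, words, and probabilities below are invented for illustration, not taken from the lecture), a document's log score under a fixed set of assignments just sums the log-probability of each word under its assigned topic:

```python
import math

# Toy topic-specific distributions over a tiny vocabulary (made-up numbers).
topics = {
    "science": {"gene": 0.5, "data": 0.3, "ball": 0.1, "team": 0.1},
    "sports":  {"gene": 0.05, "data": 0.15, "ball": 0.4, "team": 0.4},
}

def score_document(words, assignments, topics):
    """Sum log p(word | assigned topic) over the bag of words."""
    return sum(math.log(topics[z][w]) for w, z in zip(words, assignments))

# One topic indicator per word, as in LDA.
words = ["gene", "data", "ball"]
assignments = ["science", "science", "sports"]
print(score_document(words, assignments, topics))
```

Under the per-document clustering model, all words would share a single cluster indicator; LDA's per-word assignments are what let one document mix topics.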
The other difference is that we introduced this topic proportion vector specific to each document, rather than a single vector of corpus-wide topic proportions. And this is really one of the key aspects of LDA, because this forms our mixed membership representation of every document. So a document doesn't belong to just one topic; it belongs to a collection of topics, with different weights on how much membership the document has in each one of these topics. And in this module, we described how we can think of these topic proportions as a learned feature representation, where we can use it to do things like allocating an article to multiple sections of a news website, relating different articles to one another, or learning a user's preferences over different topics. Likewise, we talked about how we can look at the topic distributions over the vocabulary to describe post facto what these topics are really about. So these are the types of inferences we can draw from LDA. But the question is, how do we learn this structure from data? Just like in clustering, this is a fully unsupervised task, where we just provide the sets of words in a set of documents in the corpus. And somehow from this we want to extract these topic vocabulary distributions and these document topic proportions. And critical to doing this, just like in clustering, is inferring the assignments of the words to specific topics. In this module, we described how LDA is specified as a Bayesian model, and so we described a Bayesian inference procedure for learning our model parameters, as well as these assignment variables. The algorithm we described was called Gibbs sampling. At first we presented a vanilla version of Gibbs sampling, where we simply iterate between all these assignment variables and model parameters.
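A minimal sketch of that vanilla Gibbs loop is below. The tiny corpus, hyperparameter values, and variable names are assumptions made for illustration; a real implementation would maintain counts incrementally rather than recomputing them each sweep.

```python
import numpy as np

rng = np.random.default_rng(0)

docs = [[0, 1, 1, 2], [2, 3, 3, 0]]        # documents as word ids (toy corpus)
V, K, alpha, beta = 4, 2, 0.1, 0.1          # vocab size, topics, Dirichlet priors

# Random initialization of assignments and model parameters.
z = [[rng.integers(K) for _ in doc] for doc in docs]
phi = rng.dirichlet([beta] * V, size=K)              # topic -> word distributions
theta = rng.dirichlet([alpha] * K, size=len(docs))   # doc -> topic proportions

for sweep in range(50):
    # (1) Resample each word's topic given theta and phi.
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            p = theta[d] * phi[:, w]
            z[d][i] = rng.choice(K, p=p / p.sum())
    # (2) Resample each document's topic proportions given its assignments.
    for d, doc in enumerate(docs):
        counts = np.bincount(z[d], minlength=K)
        theta[d] = rng.dirichlet(alpha + counts)
    # (3) Resample each topic's word distribution given all assignments.
    for k in range(K):
        counts = np.zeros(V)
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                if z[d][i] == k:
                    counts[w] += 1
        phi[k] = rng.dirichlet(beta + counts)
```

Each step samples one block of variables conditioned on the current values of all the others, which is exactly the iteration described next.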
That is, we randomly resample each one conditioned on the instantiated values of all the other parameters and variables. So at first, we could think about randomly resampling the topic assignments for every word in a document. Then we can think about fixing these and sampling the topic proportion vector for that specific document, and then repeating these steps for all documents in the corpus. And then, having fixed these values, we can think about resampling the topic vocabulary distributions. But then in the module we described a slightly fancier version of sampling that we can perform in LDA called collapsed Gibbs sampling, where we analytically integrate out all the model parameters, the topic vocabulary distributions and the document-specific topic proportions. And we just sequentially sample each indicator variable assigning a given word to a specific topic, conditioned on all the other assignments made in that document and every other document in the corpus. We went through a derivation of the form of this conditional distribution; specifically, there are two terms. One is how much a given document likes the specific topic, and the other is how much that topic likes the specific word considered. We said that we multiply those two terms together, and then we renormalize across all possible topic assignments that we could make. We then use that distribution to sample a new topic indicator for that specific word, and we cycle through all words in the document and all documents in the corpus. Finally, in this module we talked about how we can use the output of Gibbs sampling to do Bayesian inference. Remember, if we're thinking about doing predictions in the Bayesian framework, we want to integrate over our uncertainty in what values the model parameters can take. So we talked about how we can take each one of our Gibbs samples, form predictions from that sample, and then average across those samples.
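The per-word collapsed Gibbs update just described, a document-topic term times a topic-word term, renormalized over topics, can be sketched as follows. All counts, names, and hyperparameters here are illustrative; a real sampler maintains these counts incrementally as words are reassigned.

```python
import random

def resample_topic(word, doc_topic_counts, topic_word_counts, topic_totals,
                   alpha, beta, vocab_size, rng=random.Random(0)):
    """Sample a new topic for one word, with p(z = k) proportional to
    (how much this document likes topic k) * (how much topic k likes this word)."""
    weights = []
    for k in range(len(doc_topic_counts)):
        doc_term = doc_topic_counts[k] + alpha                # document-topic affinity
        word_term = ((topic_word_counts[k][word] + beta)
                     / (topic_totals[k] + vocab_size * beta)) # topic-word affinity
        weights.append(doc_term * word_term)
    total = sum(weights)
    probs = [w / total for w in weights]   # renormalize over all possible topics
    return rng.choices(range(len(probs)), weights=probs)[0]
```

Cycling this update over every word in every document gives one sweep of the collapsed sampler, with no explicit topic-proportion or topic-vocabulary parameters to resample.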
Alternatively, and something that's very commonly done in practice, we can just look at the one sample that maximizes what we call the joint model probability, and then use that to draw inferences. So in summary, as you've seen in just what is supposed to be a brief recap, we've covered an enormous number of topics and very, very advanced concepts. We looked at a bunch of different models as well as a bunch of different algorithms. And through this process, we learned some machine learning concepts that are very general and very useful beyond the ideas of just clustering and retrieval. So for example, we talked about distance metrics that apply in many different domains. We've talked about approximation algorithms, unsupervised learning tasks, probabilistic modeling, scalability through notions of data parallelism, and finally this idea of Bayesian models and Bayesian inference. And having gone through this course, you now have a really, really extensive set of tools to go out and tackle quite different problems than we saw in the regression and classification courses. [MUSIC]