We're now ready to present our latent Dirichlet allocation (LDA) mixed membership model. Remember, our goal here is to associate with each document the collection of topics present in that document, as well as their relative proportions, that is, how prevalent each of those topics is.

Recall that in the clustering model we just presented for our bag-of-words representation of a document, we introduced a set of topic-specific vocabulary distributions, each assigning a probability to every word in the vocabulary. That clustering model looked to assign an entire document to a single topic, and for this assignment it would score all of the words in the document under that topic's vocabulary distribution. The model also introduced a corpus-wide distribution over the prevalence of topics throughout all the documents in the corpus: that is, for any given document we grab, how likely is it to come from each of these topics?

In latent Dirichlet allocation we introduce the same set of topic-specific vocabulary distributions, but now, when we look at a given document, we assign every word, rather than the entire document, to a single topic. Every word gets an assignment variable z_iw, saying which topic the wth word in document i is assigned to. Then, when we go to score a document, we score each word under its assigned topic: all the orange words under the orange topic, the blue words under the blue topic, and so on. That determines how good those assignments are.

Finally, there's one more difference in LDA. Instead of a corpus-wide distribution over topic prevalences, each document has its own distribution over the prevalence of topics in that document. So instead of a single pi shared globally throughout the corpus, we have pi_i, specific to document i, and this vector represents the topic prevalences in that specific document.
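To make this generative story concrete, here is a minimal sketch in Python/NumPy. The dimensions (K topics, V vocabulary words, n_words per document) and the Dirichlet concentration parameters are illustrative assumptions, not values from the lecture; the structure, though, mirrors what was just described: topic-specific word distributions, a per-document pi_i, a per-word assignment z_iw, and per-word scoring under the assigned topics.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3        # number of topics (assumed for illustration)
V = 10       # vocabulary size (assumed)
n_words = 8  # number of words in document i (assumed)

# Topic-specific vocabulary distributions: one distribution over all
# V vocabulary words per topic (each row sums to 1).
topic_word = rng.dirichlet(np.ones(V) * 0.5, size=K)

# Document-specific topic prevalences pi_i. This replaces the single
# corpus-wide pi of the clustering model.
pi_i = rng.dirichlet(np.ones(K))

# Per-word assignments z_iw: every word, not the whole document,
# gets its own topic, drawn according to pi_i.
z_i = rng.choice(K, size=n_words, p=pi_i)

# Draw each word from its assigned topic's vocabulary distribution.
words = np.array([rng.choice(V, p=topic_word[z]) for z in z_i])

# Scoring: each word is scored under its assigned topic, so the
# document's log-likelihood is a sum of per-word log-probabilities.
log_score = sum(np.log(pi_i[z]) + np.log(topic_word[z, w])
                for z, w in zip(z_i, words))
print("pi_i:", pi_i)
print("assignments z_i:", z_i)
print("log score:", log_score)
```

Note how the score sums over words individually: changing a single word's assignment z_iw changes only that word's term, which is exactly what lets LDA mix multiple topics within one document, unlike the clustering model's single per-document score.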