[MUSIC] In this module we're going to cover
a really popular probabilistic model for document analysis called
Latent Dirichlet Allocation or LDA. And LDA is an example of a class of
methods called Mixed Membership Modeling. And to start with, let's motivate
the use of mixed membership models in the context of our document analysis. So far we've described clustering
models where the goal is to group related articles into disjoint sets or clusters where these clusters capture
the topics prevalent in the corpus. And in this context every document
is assigned to a single topic. A question though is, is an article really
about just one topic, like science? In the last module, we talked about
soft assignments capturing uncertainty in this cluster assignment, but
the clustering model still assumes that each document is assigned
to a single topic. We might just have uncertainty in what
that assignment is from the observed data. To make this more concrete,
let's look at a specific example where we have an article that's entitled
Modeling the Complex Dynamics and Changing Correlations of Epileptic Events. And in this article we see words like
patients, epilepsy, EEG, clinical. So based on the content of this article, maybe a clustering model would
group this article with other articles related to a science topic. And maybe that's cluster 4, in orange, in our
clustering. And what this bar chart represents
is simply a one-hot encoding of our assignment of this article
to cluster 4, zi = 4. But this article also has words like
asynchronous, automatic, model, and things like this, which might mean that
it is really an article about technology. And so maybe we should group it
with other tech articles, which in this case is cluster 2. Well, our soft assignments that we
talked about in the last module capture our uncertainty about whether this
article should be assigned to cluster 2 or cluster 4, assigned to the science
cluster or the technology cluster. But maybe what we really want to capture
is the fact that the article is about both science and technology. That is, "zi" is really 2 and 4. And so, I put zi in quotes because
it's not going to be a single variable associated with a document to
represent this mixed assignment like what we saw in our clustering model. But in essence, what we're saying is
that this document has membership in both of these clusters 2 and 4. And we're going to go through in this
module exactly how we think about formalizing this type of mixed
membership assignment. And importantly, the other thing that
we're going to want to capture is not only which topics are present in this document,
but what's the relative proportion? How prevalent are these topics? So this is where mixed membership
models come in because mixed membership models allow us to associate any given
data point with a set of different cluster assignments or, in this case,
a document with a set of topics, rather than assuming that every document
is associated with just a single cluster or topic, or only capturing uncertainty in
that single assignment like the soft assignments we talked about before. [MUSIC]
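
To make the three representations discussed above concrete, here is a minimal Python sketch. It is only an illustration: the number of clusters (5), the specific weights, and the helper names `one_hot` and `topics_present` are assumptions made for this example and do not come from the lecture. Note that the soft assignment and the mixed membership vector look alike numerically; the difference is in what they mean.

```python
NUM_CLUSTERS = 5  # hypothetical number of clusters/topics, chosen for illustration


def one_hot(cluster, num_clusters=NUM_CLUSTERS):
    """Hard assignment: all weight on a single cluster (clusters are 1-indexed)."""
    return [1.0 if k == cluster else 0.0 for k in range(1, num_clusters + 1)]


def topics_present(weights, threshold=0.0):
    """Return the 1-indexed clusters/topics whose weight exceeds the threshold."""
    return [k for k, w in enumerate(weights, start=1) if w > threshold]


# Hard clustering: the article is assigned to cluster 4 (science), zi = 4.
hard = one_hot(4)

# Soft assignment (made-up numbers): the article still belongs to exactly ONE
# cluster; the weights express our uncertainty about whether it is 2 or 4.
soft = [0.0, 0.45, 0.0, 0.55, 0.0]

# Mixed membership (made-up numbers): the article belongs to BOTH topics, and
# the weights are the relative proportions of each topic within the article.
mixed = [0.0, 0.40, 0.0, 0.60, 0.0]

print(hard)                    # [0.0, 0.0, 0.0, 1.0, 0.0]
print(topics_present(mixed))   # [2, 4]
```

In this sketch, reading `mixed` as "40% technology, 60% science" corresponds to the relative proportions mentioned above, whereas reading `soft` as "55% chance this is the science cluster" corresponds to uncertainty about a single assignment.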