[MUSIC] So this is a standard implementation of a Gibbs sampler, but we can actually do things that are a little bit fancier. For example, we can do something called collapsed Gibbs sampling, and we're going to describe this in the context of our LDA model. The idea here is that we can analytically marginalize over all of the uncertainty in our model parameters and just sample the word assignment variables. So we never have to sample our topic vocabulary distributions or the document-specific topic proportions. We just go through and iterate, sampling these word assignment variables. And that seems pretty cool, because what we've done is dramatically reduce the space this Gibbs sampler is exploring. This can actually lead to much better performance in practice because, remember, we typically have very large vocabularies, and so we have a distribution over the entire vocabulary for each one of our topics. That's a lot of parameters to think about learning. And likewise, for every document in our corpus, we have a set of probabilities over topics being present in that document. So being able to collapse these model parameters away and just look at the assignment variables in the documents can be quite helpful.

So, in pictures, what we're saying is that we can completely collapse away all of these model parameters and just iteratively resample our word assignment variables for every word in each document, and thus every word in the corpus. But there is a caveat here, and that's the fact that we now have to sequentially resample each word's assignment given all the other assignments in the corpus. We didn't discuss this previously, but I want to highlight it now. If you look at the uncollapsed sampler, our standard implementation of Gibbs sampling for LDA, when we looked at a given document, we ended up with a data-parallel problem for resampling the word indicator variables. You could do each word indicator variable completely in parallel. All you needed to do was condition on the topic vocabulary distributions and the document-specific topic proportions, and then everything decoupled. There was no dependence on the assignments to other words in the document or in the corpus. But now, when we don't have these model parameters, what informs the assignment of a given word to a given topic? Well, the other words and the assignments that were made to those words. So we can think of all the other assignment variables in the corpus as a surrogate for the model parameters. We'll discuss this in more detail in the coming slides. But the take-home message here is that we never have to sample our topic vocabulary distributions or the document-specific topic proportions. We just sample these word indicator variables, but we do so sequentially, losing the ability to parallelize across that operation. [MUSIC]
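To make this concrete, here is a minimal sketch of a collapsed Gibbs sweep for LDA in Python. This is not the course's implementation; the function name collapsed_gibbs_lda, the symmetric Dirichlet hyperparameter values for alpha and beta, and the count-table names are all illustrative assumptions. The conditional it samples from is the standard collapsed form, where the topic vocabulary distributions and document-specific topic proportions have been integrated out, so each word's new topic depends only on count tables built from all the other current assignments.

```python
import numpy as np

def collapsed_gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, n_iters=100, seed=0):
    """Sketch of a collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of word ids in [0, V)
    K: number of topics; V: vocabulary size
    alpha, beta: symmetric Dirichlet hyperparameters (illustrative values)
    """
    rng = np.random.default_rng(seed)

    # Count tables summarizing all current assignments -- these act as
    # the surrogate for the collapsed model parameters.
    n_dk = np.zeros((len(docs), K))   # topic counts per document
    n_kw = np.zeros((K, V))           # word counts per topic
    n_k = np.zeros(K)                 # total words assigned to each topic

    # Random initialization of the word assignment variables z.
    z = []
    for d, doc in enumerate(docs):
        z_d = rng.integers(K, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1

    for _ in range(n_iters):
        # Sequential sweep: each word's new topic depends on the current
        # assignments of every other word, so this loop is not data-parallel.
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k_old = z[d][i]
                # Remove this word's current assignment from the counts.
                n_dk[d, k_old] -= 1
                n_kw[k_old, w] -= 1
                n_k[k_old] -= 1

                # Collapsed conditional: (document-topic term) x (topic-word term),
                # with the proportions and vocabulary distributions integrated out.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                p /= p.sum()

                # Resample the assignment and restore the counts.
                k_new = rng.choice(K, p=p)
                z[d][i] = k_new
                n_dk[d, k_new] += 1
                n_kw[k_new, w] += 1
                n_k[k_new] += 1

    return z, n_dk, n_kw
```

Notice that removing a word's current assignment from the counts before computing its conditional is exactly what lets the other assignments stand in for the collapsed parameters, and it is also why the updates have to happen one word at a time rather than in parallel.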