Let's now describe the algorithm that produces these random samples, and to begin with, let's just present a standard implementation of Gibbs sampling. Gibbs sampling treats both our assignment variables and our model parameters in exactly the same manner. Whereas, remember, when we were looking at the EM algorithm, there were different updates when we were looking at our assignment variables versus our model parameters. Okay. And what Gibbs sampling does, in its most standard implementation, is it just cycles through all of these assignment variables and model parameters and randomly samples each one from a conditional distribution, where we're conditioning on the previously sampled values of all the other model parameters and assignment variables, and we're also conditioning on our observations. So from iteration to iteration, what we're conditioning on, the values of those assignment variables and model parameters, is going to change, but the observations are always the same set of values, whatever we observed from our data set. Well, let's look at this in pictures for our LDA model. And let's imagine that at some iteration of our Gibbs sampler, we have the following instantiation of all of our assignment variables and model parameters, which, in this case, are the topic vocabulary distributions, the document-specific assignments of words to topics, and the document-specific topic proportion vectors. Well, at the next iteration of our Gibbs sampler, we might consider reassigning all of the word assignment variables in a document. So these are the z_iw's for some document i, and when we go to resample these variables, we form a conditional distribution where we fix the values of all of the topic vocabulary distributions as well as the topic proportions in this document. And so a question is, what's the form of this conditional distribution? Well, let's look at just one word, let's say EEG, in this document, and let's assume that this is the wth word in document i. And now let's look at r_iw2, which will be our notation, just like when we were talking about responsibilities in the EM algorithm. This is going to indicate the probability of assigning z_iw = 2, meaning word w in document i is assigned to the second topic. And what's this probability? Well, what's the prior probability that I randomly choose a word in this document and it happens to be from topic two, before I actually look at the value of that word? Well, that's just how prevalent topic two is in this document. So here we have pi_i2 being the prior probability of z_iw = 2. And then we're going to multiply by the likelihood of observing the word EEG under the second topic. So this will be the probability of EEG, the actual value of this wth word in the document, given z_iw = 2 and the topic vocabulary distribution for topic two, so we're going to grab that probability vector right here. And what we do to compute this likelihood is we simply look at topic two, scroll all the way down until we find the word EEG, and then we look at the probability of that word, and within this topic the probability is probably pretty low. Then the final thing that we do to compute this responsibility is we have to normalize over all possible assignments. So, summing over all j = 1 to capital K, the total number of topics, we look at pi_ij times the probability of EEG under an assignment of the wth word in document i to topic j.
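To make that responsibility computation concrete, here is a minimal sketch in Python with NumPy. The function and argument names are my own, not from the lecture, and I'm assuming the topic vocabulary distributions are stored as a K-by-V matrix and the document's topic proportions as a length-K vector.

```python
import numpy as np

def sample_word_assignment(word_id, doc_topic_props, topic_word_probs, rng):
    """Resample the topic assignment z_iw for one word, conditioned on the
    current topic proportions of its document and the current topic
    vocabulary distributions.

    word_id          : vocabulary index of the observed word (e.g. "EEG")
    doc_topic_props  : length-K vector pi_i of topic proportions for document i
    topic_word_probs : K x V matrix; row k is topic k's vocabulary distribution
    """
    # Unnormalized responsibility of each topic: prior prevalence of the
    # topic in this document times the likelihood of the word under that topic.
    resp = doc_topic_props * topic_word_probs[:, word_id]

    # Normalize over all K topics so the responsibilities sum to one,
    # giving a probability mass function over the K topics.
    resp = resp / resp.sum()

    # Draw the new assignment from this categorical distribution.
    return rng.choice(len(resp), p=resp)

# Hypothetical usage, assuming pi_i and topics hold the current sampled values:
# rng = np.random.default_rng(0)
# new_topic = sample_word_assignment(word_id=137, doc_topic_props=pi_i,
#                                    topic_word_probs=topics, rng=rng)
```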
Okay, so this looks exactly like the responsibilities we saw in EM, except here, when we're looking at the prior probability of a given assignment, we're looking specifically within this document, because we have these document-specific topic prevalences. And then we're scoring our word under a given topic probability vector, whereas for mixtures of Gaussians we were scoring a whole data vector under a given Gaussian distribution. Okay, the idea is that for this specific word we would compute the responsibility for every possible topic, not just topic two: topic one, two, three, all the way to topic capital K. And then we normalize, and we look at these normalized numbers, so this whole vector sums to one. So this represents what's called a probability mass function, a distribution over a set of integers, one to capital K, and then we just draw a value randomly from this distribution. And that value is going to be our assignment of the word EEG to a given topic. So perhaps, for example, for this word EEG, maybe we would assign it to topic one, since that's the topic about science, but of course, through the random assignments we're making here, we could've drawn any topic for this specific word. Okay, we repeat this procedure for every word in this document, and then we can think about reassigning our topic proportions for this document, given the set of word assignments that we've just made. So what informs these topic proportions for this document? Do we care about the topic vocabulary distributions? No, we actually don't. All we need are the counts of how many times a given topic was used in this document to inform these topic proportions. But these counts are going to be regularized by our Bayesian prior, because remember, we discussed before that we can think of the Bayesian prior as introducing a set of what we called pseudo-observations. So we can think of every topic as having a fixed number of pseudo-observations that bias the distribution away from just using the observed counts in this document. So we use these counts, both the observed counts in this document given the sampled set of word assignment variables as well as the pseudo-counts from our Bayesian prior, to form a distribution over these topic proportions, and then we sample the topic proportions from that distribution. The specific form of this distribution, however, is beyond the scope of this course. But the point here is that we can sample these topic proportions, and then we repeat this process for every document in our corpus. So we repeat sampling the word assignment variables and the topic proportions for each document in our entire data set. Then, having done this, we can turn to the corpus-wide topic vocabulary distributions and reassign those as well. And now, when we go to figure out how probable our words are within a given topic, what informs that? Well, we can simply look at our word assignment variables across the entire corpus and say, for example, for the word EEG, how many times was EEG assigned to topic one, and we can use that information to inform how probable EEG is under topic one. And we can do this for every word in our vocabulary, for each one of these different topics. But again, these counts of topic usage within the corpus are regularized by the priors placed over the topic vocabulary distributions in our Bayesian framework.
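The lecture leaves the exact form of these conditional distributions out of scope, but under the usual Dirichlet priors on topic proportions and topic vocabulary distributions, each conditional is itself a Dirichlet whose parameters are the prior pseudo-counts plus the observed counts. The sketch below illustrates that standard case; the function names, and the assumption of symmetric Dirichlet priors with parameters `alpha` and `gamma`, are mine rather than the lecture's.

```python
import numpy as np

def resample_doc_topic_props(doc_assignments, K, alpha, rng):
    """Resample one document's topic proportions from how often each topic
    was used in that document, regularized by Dirichlet(alpha) pseudo-counts."""
    counts = np.bincount(doc_assignments, minlength=K)
    return rng.dirichlet(alpha + counts)

def resample_topic_word_probs(all_assignments, all_word_ids, K, V, gamma, rng):
    """Resample every topic's vocabulary distribution from corpus-wide counts
    of how often each word was assigned to that topic, regularized by
    Dirichlet(gamma) pseudo-counts.

    all_assignments, all_word_ids : flat integer arrays over every word slot
    in the corpus, giving the sampled topic and the vocabulary id of the word.
    """
    topic_word_counts = np.zeros((K, V))
    np.add.at(topic_word_counts, (all_assignments, all_word_ids), 1.0)
    return np.vstack([rng.dirichlet(gamma + topic_word_counts[k])
                      for k in range(K)])
```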
Okay, so in summary, we're going to randomly resample our topic vocabulary distributions, and then the Gibbs sampling algorithm repeats these steps again and again: resampling our word assignment variables, our document-specific topic proportions, and our corpus-wide topic vocabulary distributions, until we run out of our computational budget.
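Putting the pieces together, one full sweep of this standard, uncollapsed Gibbs sampler might look like the sketch below, reusing the hypothetical helper functions from the earlier snippets. Here `docs` is assumed to be a list of integer arrays of vocabulary ids, one per document, and the whole sweep is simply repeated until the computational budget runs out.

```python
import numpy as np

def gibbs_sweep(docs, z, doc_props, topic_word_probs, K, V, alpha, gamma, rng):
    """One full iteration: resample every word's topic assignment, then each
    document's topic proportions, then the corpus-wide topic distributions."""
    for i, doc in enumerate(docs):
        for w, word_id in enumerate(doc):
            z[i][w] = sample_word_assignment(word_id, doc_props[i],
                                             topic_word_probs, rng)
        doc_props[i] = resample_doc_topic_props(z[i], K, alpha, rng)

    # Pool assignments and word ids across all documents for the topic update.
    all_z = np.concatenate(z)
    all_words = np.concatenate(docs)
    topic_word_probs = resample_topic_word_probs(all_z, all_words, K, V,
                                                 gamma, rng)
    return z, doc_props, topic_word_probs
```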