Let's now describe the algorithm that produces these random samples, and to begin with, let's just present a standard implementation of Gibbs sampling. Gibbs sampling treats both our assignment variables and our model parameters in exactly the same manner. Whereas, remember, when we were looking at the EM algorithm, there were different updates when we were looking at our assignment variables versus our model parameters. Okay. And what Gibbs sampling does, in its most standard implementation, is it just cycles through all of these assignment variables and model parameters and randomly samples each one from a conditional distribution, where we're conditioning on the previously sampled values of all the other model parameters and assignment variables, and we're also conditioning on our observations. So from iteration to iteration, what we're conditioning on, the values of those assignment variables and model parameters, is going to change, but the observations are always the same set of values, whatever we observed from our data set. Well, let's look at this in pictures for our LDA model. And let's imagine that at some iteration of our Gibbs sampler, we have the following instantiation of all of our assignment variables and model parameters, which, in this case, are the topic vocabulary distributions, the document-specific assignments of words to topics, and the document-specific topic proportion vectors. Well, at the next iteration of our Gibbs sampler, we might consider reassigning all of the word assignment variables in a document. So these are the z_iw's for some document i, and when we go to resample these variables, we form a conditional distribution where we fix the values of all of the topic vocabulary distributions as well as the topic proportions in this document. And so a question is, what's the form of this conditional distribution? Well, let's look at just one word, let's say EEG, in this document, and let's assume that this is the wth word in document i. And now let's look at r_iw2, which will be our notation, just like when we were talking about responsibilities in the EM algorithm. This is going to indicate the probability of assigning z_iw = 2, meaning word w in document i is assigned to the second topic. And what's this probability? Well, what's the prior probability that I randomly choose a word in this document and it happens to be from topic two, before I actually look at the value of that word? Well, that's just how prevalent topic two is in this document. So here we have pi_i2 being the prior probability of z_iw = 2. And then we're going to multiply by the likelihood of observing the word EEG under the second topic. So this will be the probability of EEG, the actual value of this wth word in the document, given z_iw = 2 and the topic vocabulary distribution for topic two, so we're going to grab that probability vector right here. And what we do to compute this likelihood is we simply look at topic two, scroll all the way down until we find the word EEG, and then we look at the probability of that word, and within this topic the probability is probably pretty low. Then the final thing that we do to compute this responsibility is we have to normalize over all possible assignments. So, summing over all j = 1 to capital K, the total number of topics, we look at pi_ij times the probability of EEG under an assignment of the wth word in document i to topic j.
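To make that responsibility computation concrete, here is a minimal sketch in Python with NumPy. The function and argument names are my own, not from the lecture, and I'm assuming the topic vocabulary distributions are stored as a K-by-V matrix and the document's topic proportions as a length-K vector.

```python
import numpy as np

def sample_word_assignment(word_id, doc_topic_props, topic_word_probs, rng):
    """Resample the topic assignment z_iw for one word, conditioned on the
    current topic proportions of its document and the current topic
    vocabulary distributions.

    word_id          : vocabulary index of the observed word (e.g. "EEG")
    doc_topic_props  : length-K vector pi_i of topic proportions for document i
    topic_word_probs : K x V matrix; row k is topic k's vocabulary distribution
    """
    # Unnormalized responsibility of each topic: prior prevalence of the
    # topic in this document times the likelihood of the word under that topic.
    resp = doc_topic_props * topic_word_probs[:, word_id]

    # Normalize over all K topics so the responsibilities sum to one,
    # giving a probability mass function over the K topics.
    resp = resp / resp.sum()

    # Draw the new assignment from this categorical distribution.
    return rng.choice(len(resp), p=resp)

# Hypothetical usage, assuming pi_i and topics hold the current sampled values:
# rng = np.random.default_rng(0)
# new_topic = sample_word_assignment(word_id=137, doc_topic_props=pi_i,
#                                    topic_word_probs=topics, rng=rng)
```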
Okay, so this looks exactly like the responsibilities we saw in EM, except here, when we're looking at the prior probability of a given assignment, we're looking specifically within this document, because we have these document-specific topic prevalences. And then we're scoring our word under a given topic probability vector, whereas for mixtures of Gaussians we were scoring a whole data vector under a given Gaussian distribution. Okay, the idea is that for this specific word we would compute the responsibility for every possible topic, not just topic two: topic one, two, three, all the way to topic capital K. And then we normalize, and we look at these normalized numbers, so this whole vector sums to one. So this represents what's called a probability mass function, a distribution over a set of integers, one to capital K, and then we just draw a value randomly from this distribution. And that value is going to be our assignment of the word EEG to a given topic. So perhaps, for example, for this word EEG, maybe we would assign it to topic one, since that's the topic about science, but of course, through the random assignments we're making here, we could've drawn any topic for this specific word. Okay, we repeat this procedure for every word in this document, and then we can think about reassigning our topic proportions for this document, given the set of word assignments that we've just made. So what informs these topic proportions for this document? Do we care about the topic vocabulary distributions? No, we actually don't. All we need are the counts of how many times a given topic was used in this document to inform these topic proportions. But these counts are going to be regularized by our Bayesian prior, because remember, we discussed before that we can think of the Bayesian prior as introducing a set of what we called pseudo-observations. So we can think of every topic as having a fixed number of pseudo-observations that bias the distribution away from just using the observed counts in this document. So we use these counts, both the observed counts in this document given the sampled set of word assignment variables as well as the pseudo-counts from our Bayesian prior, to form a distribution over these topic proportions, and then we sample the topic proportions from that distribution. The specific form of this distribution, however, is beyond the scope of this course. But the point here is that we can sample these topic proportions, and then we repeat this process for every document in our corpus. So we repeat sampling the word assignment variables and the topic proportions for each document in our entire data set. Then, having done this, we can turn to the corpus-wide topic vocabulary distributions and reassign those as well. And now, when we go to figure out how probable our words are within a given topic, what informs that? Well, we can simply look at our word assignment variables across the entire corpus and say, for example, for the word EEG, how many times was EEG assigned to topic one, and we can use that information to inform how probable EEG is under topic one. And we can do this for every word in our vocabulary, for each one of these different topics. But again, these counts of topic usage within the corpus are regularized by the priors placed over the topic vocabulary distributions in our Bayesian framework.
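The lecture leaves the exact form of these conditional distributions out of scope, but under the usual Dirichlet priors on topic proportions and topic vocabulary distributions, each conditional is itself a Dirichlet whose parameters are the prior pseudo-counts plus the observed counts. The sketch below illustrates that standard case; the function names, and the assumption of symmetric Dirichlet priors with parameters `alpha` and `gamma`, are mine rather than the lecture's.

```python
import numpy as np

def resample_doc_topic_props(doc_assignments, K, alpha, rng):
    """Resample one document's topic proportions from how often each topic
    was used in that document, regularized by Dirichlet(alpha) pseudo-counts."""
    counts = np.bincount(doc_assignments, minlength=K)
    return rng.dirichlet(alpha + counts)

def resample_topic_word_probs(all_assignments, all_word_ids, K, V, gamma, rng):
    """Resample every topic's vocabulary distribution from corpus-wide counts
    of how often each word was assigned to that topic, regularized by
    Dirichlet(gamma) pseudo-counts.

    all_assignments, all_word_ids : flat integer arrays over every word slot
    in the corpus, giving the sampled topic and the vocabulary id of the word.
    """
    topic_word_counts = np.zeros((K, V))
    np.add.at(topic_word_counts, (all_assignments, all_word_ids), 1.0)
    return np.vstack([rng.dirichlet(gamma + topic_word_counts[k])
                      for k in range(K)])
```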
Okay, so in summary, we're going to randomly resample our topic vocabulary distributions, and then the Gibbs sampling algorithm repeats these steps again and again: resampling our word assignment variables, our document-specific topic proportions, and our corpus-wide topic vocabulary distributions, until we run out of our computational budget.
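Putting the pieces together, one full sweep of this standard, uncollapsed Gibbs sampler might look like the sketch below, reusing the hypothetical helper functions from the earlier snippets. Here `docs` is assumed to be a list of integer arrays of vocabulary ids, one per document, and the whole sweep is simply repeated until the computational budget runs out.

```python
import numpy as np

def gibbs_sweep(docs, z, doc_props, topic_word_probs, K, V, alpha, gamma, rng):
    """One full iteration: resample every word's topic assignment, then each
    document's topic proportions, then the corpus-wide topic distributions."""
    for i, doc in enumerate(docs):
        for w, word_id in enumerate(doc):
            z[i][w] = sample_word_assignment(word_id, doc_props[i],
                                             topic_word_probs, rng)
        doc_props[i] = resample_doc_topic_props(z[i], K, alpha, rng)

    # Pool assignments and word ids across all documents for the topic update.
    all_z = np.concatenate(z)
    all_words = np.concatenate(docs)
    topic_word_probs = resample_topic_word_probs(all_z, all_words, K, V,
                                                 gamma, rng)
    return z, doc_props, topic_word_probs
```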