[MUSIC] But, as an alternative, we can perform Bayesian inference using a totally different algorithm called Gibbs sampling. In Gibbs sampling, we're going to iteratively provide hard assignments just like we did in k-means, but these hard assignments are going to be drawn randomly from a specific distribution, whereas remember, in k-means we just solved an equation and got a solution. The point here is that as we look across these random samples of our model parameters and assignment variables being produced by Gibbs sampling, we're capturing the uncertainty in the values that these variables can take. And the reason we're presenting Gibbs sampling in contrast to variational methods is the fact that the updates of our Gibbs sampling steps tend to be very intuitive and very straightforward to implement. Okay, so let's look at a couple samples from a Gibbs sampler to get a little intuition about what's going on here. Remember, in our LDA model, here are all the assignment variables and model parameters that we want to do inference over, and this is iteration 10,000 of a Gibbs sampler. The reason I'm looking so far out at iteration 10,000 is the fact that in very large models, like LDA, it can take quite a few iterations before we get out of a potentially bad initialization and start getting some reasonable assignments. So let's imagine this is iteration 10,000, and these are the values that were randomly assigned to each of these model parameters and assignment variables. And then maybe this is sample 10,001, and sample 10,002, and if I just flip back and forth between these three samples a few times, we see how the values change. This is the randomness inherent in this algorithm, and so we can talk about what we expect to happen as we keep sampling over iterations.
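To make the contrast with k-means concrete, here is a minimal Gibbs sampler sketch for a toy two-component Gaussian mixture, not the full LDA sampler. It assumes known unit variances, a uniform prior over assignments, and standard normal priors on the means; all names here are illustrative. The key line is the one where each hard assignment `z[i]` is drawn at random from its conditional distribution, rather than set to the argmax as k-means would do.

```python
import numpy as np

def gibbs_mixture(data, n_iters=500, seed=0):
    """Toy Gibbs sampler: 2-component Gaussian mixture, known unit
    variances, N(0, 1) priors on the means, uniform prior over z.
    (Illustrative assumptions only -- not the LDA sampler.)"""
    rng = np.random.default_rng(seed)
    n = len(data)
    z = rng.integers(0, 2, size=n)       # random initial hard assignments
    mu = rng.normal(0.0, 1.0, size=2)    # random initial component means
    samples = []
    for it in range(n_iters):
        # Resample each assignment from its conditional distribution.
        # Unlike k-means, we DRAW z[i] at random instead of taking argmax.
        for i in range(n):
            logp = -0.5 * (data[i] - mu) ** 2   # per-component log-likelihood
            p = np.exp(logp - logp.max())
            p /= p.sum()
            z[i] = rng.choice(2, p=p)
        # Resample each mean from its conditional posterior
        # (conjugate update: N(0,1) prior, unit observation noise).
        for k in range(2):
            members = data[z == k]
            prec = 1.0 + len(members)           # posterior precision
            mean = members.sum() / prec         # posterior mean
            mu[k] = rng.normal(mean, np.sqrt(1.0 / prec))
        samples.append((z.copy(), mu.copy()))
    return samples
```

Flipping through successive entries of `samples` is exactly the flip-book shown in the lecture: the assignments and means jitter from one iteration to the next, reflecting posterior uncertainty.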
Well, one thing we can look at here is something called the joint model probability. This is like the likelihood that we monitored before, but in a Bayesian model, not only do we have to account for the likelihood of our data given a specific set of model parameters, we also have to look at the probability of those model parameters themselves. Together, these two terms form the joint model probability. And if we look at joint model probability over Gibbs sampling iterations, we tend to get plots that look like the following, where there's this initial burn-in period where we're just moving from our initialization to some reasonable space, and then we wander around. And throughout this whole process, the joint model probability can go up and down, because this is a randomized algorithm. There's no guarantee that we're going to increase our joint probability. This is not an optimization algorithm. And that's actually the point: we're intending to explore the space of possible solutions, rather than converge to a single point. This is very similar to what we saw for stochastic gradient descent, or ascent, where there was randomness too because we were computing a noisy estimate of the gradient, and we showed that our objective function wasn't guaranteed to monotonically increase; it might actually go up and down. But the important thing here is that eventually, meaning after enough iterations, the random samples produced by our Gibbs sampling algorithm allow us to form what are called correct Bayesian estimates. And to provide some examples of what this means, if we're thinking of a predictive task, we said that Bayesians average over uncertainty in the parameters.
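For the toy mixture model used throughout this sketch, the quantity being monitored can be written as a sum of two log terms: the log prior over the parameters plus the log likelihood of the data given the parameters and assignments. This is a hedged illustration under the same toy assumptions as before (unit variances, standard normal priors on the means, uniform prior over assignments, constants dropped); the function name is mine, not from the lecture.

```python
import numpy as np

def joint_log_prob(data, z, mu):
    """Joint model log-probability, up to additive constants, for a toy
    2-component Gaussian mixture: log p(mu) + log p(data | z, mu).
    Assumes unit variances, N(0, 1) priors on means, uniform prior on z."""
    log_prior = -0.5 * np.sum(mu ** 2)            # log p(parameters)
    log_lik = -0.5 * np.sum((data - mu[z]) ** 2)  # log p(data | assignments, parameters)
    return log_prior + log_lik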
Well, what we can do from the output of our Gibbs sampler is look at a given sample, with all its hard assignments of the model parameters and variables, and form our prediction from that set of assignments. Then we can look at our next sample, form our prediction, our next sample, form our prediction. Then we can average over these predictions across the different samples, and naturally, because of how the Gibbs sampler is exploring uncertainty, spending more time in high probability regions and less time in low probability regions, this forms a correct prediction, a correct Bayesian average, over the uncertainty in our parameters. Another thing we can think about doing is grabbing out what I'll call a good solution. And what I mean by a good solution is the set of model parameters and assignment variables that, out of everything we've explored, maximizes this joint model probability. This is an estimate of what we call the maximum a posteriori parameter estimate. So if we monitor joint probability over iterations, we just grab out the hard assignments made at the iteration that maximizes joint probability. [MUSIC]
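Both uses of the sampler output described above can be sketched in a couple of lines. This assumes we have already recorded, for each Gibbs iteration, a per-sample prediction and the joint log-probability of that sample's state; the helper names are mine, for illustration only.

```python
import numpy as np

def posterior_average(predictions, burn_in):
    """Bayesian prediction: average the per-sample predictions,
    discarding the first `burn_in` iterations."""
    return np.mean(predictions[burn_in:], axis=0)

def map_estimate(samples, log_probs):
    """'Good solution': the sampled state whose joint model
    (log-)probability is highest over everything explored."""
    return samples[int(np.argmax(log_probs))]
```

Because the sampler spends more time in high-probability regions, the simple average in `posterior_average` is already weighted by the posterior; no explicit reweighting is needed.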