[MUSIC] But, as an alternative, we can perform Bayesian inference using a totally different algorithm called Gibbs sampling. In Gibbs sampling, we're going to iteratively provide hard assignments just like we did in k-means, but these hard assignments are going to be drawn randomly from a specific distribution, whereas remember, in k-means we just solved an equation and got a solution. The point here is that as we look across these random samples of our model parameters and assignment variables being produced by Gibbs sampling, we're capturing the uncertainty in the values that these variables can take. And the reason we're presenting Gibbs sampling in contrast to variational methods is the fact that the updates of our Gibbs sampling steps tend to be very intuitive and very straightforward to implement. Okay, so let's look at a couple samples from a Gibbs sampler to get a little intuition about what's going on here. Remember, in our LDA model, here are all the assignment variables and model parameters that we want to do inference over, and this is iteration 10,000 of a Gibbs sampler. The reason I'm looking so far out at iteration 10,000 is the fact that in very large models, like LDA, it can take quite a few iterations before we get out of a potentially bad initialization and start getting some reasonable assignments. So let's imagine this is iteration 10,000, and these are the values that were randomly assigned to each of these model parameters and assignment variables. And then maybe this is sample 10,001, and sample 10,002, and if I just flip back and forth between these three samples a few times, we see how the values change. This is the randomness inherent in this algorithm, and so we can talk about what we expect to happen as we keep sampling over iterations.
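To make the contrast with k-means concrete, here is a minimal Gibbs sampler sketch for a toy two-component Gaussian mixture, not the full LDA sampler. It assumes known unit variances, a uniform prior over assignments, and standard normal priors on the means; all names here are illustrative. The key line is the one where each hard assignment `z[i]` is drawn at random from its conditional distribution, rather than set to the argmax as k-means would do.

```python
import numpy as np

def gibbs_mixture(data, n_iters=500, seed=0):
    """Toy Gibbs sampler: 2-component Gaussian mixture, known unit
    variances, N(0, 1) priors on the means, uniform prior over z.
    (Illustrative assumptions only -- not the LDA sampler.)"""
    rng = np.random.default_rng(seed)
    n = len(data)
    z = rng.integers(0, 2, size=n)       # random initial hard assignments
    mu = rng.normal(0.0, 1.0, size=2)    # random initial component means
    samples = []
    for it in range(n_iters):
        # Resample each assignment from its conditional distribution.
        # Unlike k-means, we DRAW z[i] at random instead of taking argmax.
        for i in range(n):
            logp = -0.5 * (data[i] - mu) ** 2   # per-component log-likelihood
            p = np.exp(logp - logp.max())
            p /= p.sum()
            z[i] = rng.choice(2, p=p)
        # Resample each mean from its conditional posterior
        # (conjugate update: N(0,1) prior, unit observation noise).
        for k in range(2):
            members = data[z == k]
            prec = 1.0 + len(members)           # posterior precision
            mean = members.sum() / prec         # posterior mean
            mu[k] = rng.normal(mean, np.sqrt(1.0 / prec))
        samples.append((z.copy(), mu.copy()))
    return samples
```

Flipping through successive entries of `samples` is exactly the flip-book shown in the lecture: the assignments and means jitter from one iteration to the next, reflecting posterior uncertainty.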
Well, one thing we can look at here is something called the joint model probability. This is like the likelihood that we monitored before, but in a Bayesian model, not only do we have to account for the likelihood of our data given a specific set of model parameters, we also have to look at the probability of those model parameters themselves. Together, these two terms form the joint model probability. And if we look at joint model probability over Gibbs sampling iterations, we tend to get plots that look like the following, where there's this initial burn-in period where we're just moving from our initialization to some reasonable space, and then we wander around. And throughout this whole process, the joint model probability can go up and down, because this is a randomized algorithm. There's no guarantee that we're going to increase our joint probability. This is not an optimization algorithm. And that's actually the point: we're intending to explore the space of possible solutions, rather than converge to a single point. This is very similar to what we saw for stochastic gradient descent, or ascent, where there was randomness too because we were computing a noisy estimate of the gradient, and we showed that our objective function wasn't guaranteed to monotonically increase; it might actually go up and down. But the important thing here is that eventually, meaning after enough iterations, the random samples produced by our Gibbs sampling algorithm allow us to form what are called correct Bayesian estimates. And to provide some examples of what this means, if we're thinking of a predictive task, we said that Bayesians average over uncertainty in the parameters.
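For the toy mixture model used throughout this sketch, the quantity being monitored can be written as a sum of two log terms: the log prior over the parameters plus the log likelihood of the data given the parameters and assignments. This is a hedged illustration under the same toy assumptions as before (unit variances, standard normal priors on the means, uniform prior over assignments, constants dropped); the function name is mine, not from the lecture.

```python
import numpy as np

def joint_log_prob(data, z, mu):
    """Joint model log-probability, up to additive constants, for a toy
    2-component Gaussian mixture: log p(mu) + log p(data | z, mu).
    Assumes unit variances, N(0, 1) priors on means, uniform prior on z."""
    log_prior = -0.5 * np.sum(mu ** 2)            # log p(parameters)
    log_lik = -0.5 * np.sum((data - mu[z]) ** 2)  # log p(data | assignments, parameters)
    return log_prior + log_lik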
Well, what we can do from the output of our Gibbs sampler is look at a given sample, with all its hard assignments of the model parameters and variables, and form our prediction from that set of assignments. Then we can look at our next sample, form our prediction, our next sample, form our prediction. Then we can average over these predictions across the different samples, and naturally, because of how the Gibbs sampler is exploring uncertainty, spending more time in high probability regions and less time in low probability regions, this forms a correct prediction, a correct Bayesian average, over the uncertainty in our parameters. Another thing we can think about doing is grabbing out what I'll call a good solution. And what I mean by a good solution is the set of model parameters and assignment variables that, out of everything we've explored, maximizes this joint model probability. This is an estimate of what we call the maximum a posteriori parameter estimate. So if we monitor joint probability over iterations, we just grab out the hard assignments made at the iteration that maximizes joint probability. [MUSIC]
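Both uses of the sampler output described above can be sketched in a couple of lines. This assumes we have already recorded, for each Gibbs iteration, a per-sample prediction and the joint log-probability of that sample's state; the helper names are mine, for illustration only.

```python
import numpy as np

def posterior_average(predictions, burn_in):
    """Bayesian prediction: average the per-sample predictions,
    discarding the first `burn_in` iterations."""
    return np.mean(predictions[burn_in:], axis=0)

def map_estimate(samples, log_probs):
    """'Good solution': the sampled state whose joint model
    (log-)probability is highest over everything explored."""
    return samples[int(np.argmax(log_probs))]
```

Because the sampler spends more time in high-probability regions, the simple average in `posterior_average` is already weighted by the posterior; no explicit reweighting is needed.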