Let's now describe a specific algorithm we can use for inference in our LDA model. The algorithm we're going to cover is called Gibbs sampling.

Recall that for the clustering task, we focused on two different algorithms. One was k-means, which iterated between the following steps, where we make hard assignments of data points to clusters. And we said that we can view k-means as a coordinate descent algorithm, and as such, we can view it as maximizing a given objective function. We then turned to EM for mixtures of Gaussians and showed that we iterate between this E-step and M-step, and again showed that this can be viewed as a coordinate descent algorithm. So in this case, we're also maximizing an objective function, but here we're making soft assignments of data points to clusters.

Well, before we get to LDA, a natural question is what we can do for the alternative clustering model we presented, where we used a bag-of-words representation of our document instead of a tf-idf vector, and then used this alternative clustering model instead of mixtures of Gaussians. Well, in this case, we can derive an EM algorithm just like we did for mixtures of Gaussians. But here, instead of having a Gaussian likelihood of the tf-idf vector, we have what's called a multinomial likelihood of our observed word counts in the document. So specifically, we can look at the document and say, well, maybe there are m sub w counts of word w in this document. So we can think of this as m sub w successes of word w, where the probability of a success is given by the topic-specific probability of that word. So we can use this to compute the likelihood of all of our word counts in this document under a given topic. And so we can think of this alternative clustering model as a mixture of multinomial distributions instead of a mixture of Gaussians. And in this case, we can derive an EM algorithm just like we did for a mixture of Gaussians.

But what can we do for our LDA model? Well, here we could derive an EM-style algorithm to perform maximum likelihood estimation of our model parameters, but this maximum likelihood estimation suffers from the same overfitting problem that we saw for mixtures of Gaussians, where we talked about issues in high-dimensional spaces. But the challenge is even harder here, or the issue of overfitting tends to be even more severe in LDA, just because of the sheer number of parameters that we have to learn. So instead, LDA is actually typically specified as a Bayesian model. And when it's not, it's not actually called LDA; it's called probabilistic latent semantic analysis or probabilistic latent semantic indexing. But we've been talking about everything in the context of LDA, which kind of swept under the rug some of the details of this Bayesian specification. It's a bit too advanced for the purposes of this course, but we're going to go through some of the intuition, some of what we get out of this Bayesian approach, as well as a Bayesian inference algorithm.

So, one thing a Bayesian approach does is account for uncertainty in the parameters when we're making predictions. So, instead of just doing something like computing the maximum likelihood estimate of our parameters, fixing the value of those parameters, and then making predictions, we say, well, the parameter could have been this, or it could have been that, or it could have been that. And we think about averaging over that uncertainty when we're forming our predictions.
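To make this averaging a bit more concrete, here is a rough sketch in generic notation (these symbols are illustrative and not taken from the lecture slides). If theta denotes the model parameters, D the observed data, and x-new a new observation we'd like to predict, the Bayesian posterior predictive distribution is:

```latex
% Posterior predictive: rather than plugging in a single point estimate of theta,
% average the prediction over the posterior distribution on the parameters.
p\big(x^{\text{new}} \mid \mathcal{D}\big)
  = \int p\big(x^{\text{new}} \mid \theta\big)\, p\big(\theta \mid \mathcal{D}\big)\, d\theta
```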
So you can think of this as forming a prediction for each one of these values, and computing a weighted average of those predictions based on how much we think the parameter is each one of these values. And the other thing that a Bayesian approach does is that it naturally regularizes our parameter estimates. So, if we think about maximizing our objective, instead of computing what's called a maximum likelihood estimate, you end up with something called a maximum a posteriori estimate, and this is exactly akin to ideas we've talked about many times in this specialization of regularizing the maximum likelihood estimate.

Okay, well, in the context of this Bayesian specification, the EM algorithm is no longer tractable, because the E-step in particular has this intractable expectation. So instead, what you can do is something that's called variational EM, where you introduce an approximation to handle that really complicated expectation.
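To make these last two points slightly more concrete, here is a rough sketch in generic notation (the symbols are illustrative and not taken from the lecture slides, and the details are beyond the scope of this course). The maximum a posteriori estimate adds the log of the prior to the data log likelihood, which plays the same role as the regularization penalties we've seen earlier in the specialization, and a variational approach replaces the intractable exact computation with a lower bound built from a simpler approximating distribution q:

```latex
% MAP estimation: maximize the log likelihood plus the log prior, i.e. a
% regularized version of the maximum likelihood objective.
\hat{\theta}_{\text{MAP}}
  = \arg\max_{\theta} \Big[ \log p(\mathcal{D} \mid \theta) + \log p(\theta) \Big]

% Variational lower bound: for any distribution q over the hidden quantities h
% (e.g., topic assignments and topic-specific word distributions), Jensen's
% inequality gives a tractable bound on the log evidence, which variational EM
% alternately tightens with respect to q and with respect to the model.
\log p(\mathcal{D})
  \;\ge\; \mathbb{E}_{q}\big[\log p(\mathcal{D}, h)\big]
        - \mathbb{E}_{q}\big[\log q(h)\big]
```

The Gibbs sampling algorithm mentioned at the start of this video takes a different route, drawing samples of the hidden quantities rather than optimizing a bound.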