Let's now describe a specific algorithm we can use for inference in our LDA model. The algorithm we're going to cover is called Gibbs sampling.

Recall that for the clustering task, we focused on two different algorithms. One was k-means, which iterated between the following steps, where we make hard assignments of data points to clusters. And we said that we can view k-means as a coordinate descent algorithm, and as such, we can view it as maximizing a given objective function. We then turned to EM for mixtures of Gaussians and showed that we iterate between this E-step and M-step, and again showed that this can be viewed as a coordinate descent algorithm. So in this case, we're also maximizing an objective function, but here we're making soft assignments of data points to clusters.

Well, before we get to LDA, a natural question is what we can do for the alternative clustering model we presented, where we used a bag-of-words representation of our document instead of a tf-idf vector, and then used this alternative clustering model instead of mixtures of Gaussians. Well, in this case, we can derive an EM algorithm just like we did for mixtures of Gaussians. But here, instead of having a Gaussian likelihood of the tf-idf vector, we have what's called a multinomial likelihood of our observed word counts in the document. So specifically, we can look at the document and say, well, maybe there are m sub w counts of word w in this document. So we can think of this as m sub w successes of word w, where the probability of a success is given by the topic-specific probability of that word. So we can use this to compute the likelihood of all of our word counts in this document under a given topic. And so we can think of this alternative clustering model as a mixture of multinomial distributions instead of a mixture of Gaussians. And in this case, we can derive an EM algorithm just like we did for a mixture of Gaussians.

But what can we do for our LDA model? Well, here we could derive an EM-style algorithm to perform maximum likelihood estimation of our model parameters, but this maximum likelihood estimation suffers from the same overfitting problem that we saw for mixtures of Gaussians, where we talked about issues in high-dimensional spaces. But the challenge is even harder here, or the issue of overfitting tends to be even more severe in LDA, just because of the sheer number of parameters that we have to learn. So instead, LDA is actually typically specified as a Bayesian model. And when it's not, it's not actually called LDA; it's called probabilistic latent semantic analysis or probabilistic latent semantic indexing. But we've been talking about everything in the context of LDA, which kind of swept under the rug some of the details of this Bayesian specification. It's a bit too advanced for the purposes of this course, but we're going to go through some of the intuition, some of what we get out of this Bayesian approach, as well as a Bayesian inference algorithm.

So, one thing a Bayesian approach does is account for uncertainty in the parameters when we're making predictions. So, instead of just doing something like computing the maximum likelihood estimate of our parameters, fixing the value of those parameters, and then making predictions, we say, well, the parameter could have been this, or it could have been that, or it could have been that. And we think about averaging over that uncertainty when we're forming our predictions.
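To make this averaging a bit more concrete, here is a rough sketch in generic notation (these symbols are illustrative and not taken from the lecture slides). If theta denotes the model parameters, D the observed data, and x-new a new observation we'd like to predict, the Bayesian posterior predictive distribution is:

```latex
% Posterior predictive: rather than plugging in a single point estimate of theta,
% average the prediction over the posterior distribution on the parameters.
p\big(x^{\text{new}} \mid \mathcal{D}\big)
  = \int p\big(x^{\text{new}} \mid \theta\big)\, p\big(\theta \mid \mathcal{D}\big)\, d\theta
```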
So you can think of this as forming a prediction for each one of these values, and computing a weighted average of those predictions based on how much we think the parameter is each one of these values. And the other thing that a Bayesian approach does is that it naturally regularizes our parameter estimates. So, if we think about maximizing our objective, instead of computing what's called a maximum likelihood estimate, you end up with something called a maximum a posteriori estimate, and this is exactly akin to ideas we've talked about many times in this specialization of regularizing the maximum likelihood estimate.

Okay, well, in the context of this Bayesian specification, the EM algorithm is no longer tractable, because the E-step in particular has this intractable expectation. So instead, what you can do is something that's called variational EM, where you introduce an approximation to handle that really complicated expectation.
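To make these last two points slightly more concrete, here is a rough sketch in generic notation (the symbols are illustrative and not taken from the lecture slides, and the details are beyond the scope of this course). The maximum a posteriori estimate adds the log of the prior to the data log likelihood, which plays the same role as the regularization penalties we've seen earlier in the specialization, and a variational approach replaces the intractable exact computation with a lower bound built from a simpler approximating distribution q:

```latex
% MAP estimation: maximize the log likelihood plus the log prior, i.e. a
% regularized version of the maximum likelihood objective.
\hat{\theta}_{\text{MAP}}
  = \arg\max_{\theta} \Big[ \log p(\mathcal{D} \mid \theta) + \log p(\theta) \Big]

% Variational lower bound: for any distribution q over the hidden quantities h
% (e.g., topic assignments and topic-specific word distributions), Jensen's
% inequality gives a tractable bound on the log evidence, which variational EM
% alternately tightens with respect to q and with respect to the model.
\log p(\mathcal{D})
  \;\ge\; \mathbb{E}_{q}\big[\log p(\mathcal{D}, h)\big]
        - \mathbb{E}_{q}\big[\log q(h)\big]
```

The Gibbs sampling algorithm mentioned at the start of this video takes a different route, drawing samples of the hidden quantities rather than optimizing a bound.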