[MUSIC] And finally, in our fourth module, we look at a probabilistic model that provides a more intricate description of our data points, and the relationships between data points, than our simple clustering representation. In particular, we look at something called mixed membership modeling, and for a document analysis task, this model is called Latent Dirichlet allocation. But before presenting Latent Dirichlet allocation, or LDA for short, we presented an alternative document clustering model, where we introduced a set of topic-specific distributions over the words in the vocabulary, where remember every topic is a different cluster. And then every document was assigned to a cluster just as before. But in forming that assignment, the score of the document under the cluster was computed by just looking at a bag-of-words representation of the document, so just an unordered set of the words that appear in that document, and then scoring those words under the cluster-specific topic distribution. And here, just like in the mixture models we described previously, every cluster, or topic in this case, has a specific prevalence in the overall corpus. So this is a distribution over topics that appear in the entire corpus. In this module, we compared and contrasted this clustering model with the mixture of Gaussians clustering model we presented in the third module. And then we turned to the LDA model itself, where every word in every document has an assignment variable linking that word to a specific topic. So when we think about scoring a document in LDA, we think of scoring every word under its associated topic, where these topics are defined exactly like in the alternative clustering model we just described: there's a distribution over every word in the vocabulary, specific to each topic. But the fact that there's a topic indicator per word in the document, rather than per document, is not the only thing that distinguishes this model from the clustering model.
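As a rough sketch of this per-word scoring idea (the toy topics, words, and probabilities below are invented for illustration, not taken from the lecture), a document's log score under a fixed set of assignments just sums the log-probability of each word under its assigned topic:

```python
import math

# Toy topic-specific distributions over a tiny vocabulary (made-up numbers).
topics = {
    "science": {"gene": 0.5, "data": 0.3, "ball": 0.1, "team": 0.1},
    "sports":  {"gene": 0.05, "data": 0.15, "ball": 0.4, "team": 0.4},
}

def score_document(words, assignments, topics):
    """Sum log p(word | assigned topic) over the bag of words."""
    return sum(math.log(topics[z][w]) for w, z in zip(words, assignments))

# One topic indicator per word, as in LDA.
words = ["gene", "data", "ball"]
assignments = ["science", "science", "sports"]
print(score_document(words, assignments, topics))
```

Under the per-document clustering model, all words would share a single cluster indicator; LDA's per-word assignments are what let one document mix topics.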
The other difference is that we introduced this topic proportion vector specific to each document, rather than a single vector of corpus-wide topic proportions. And this is really one of the key aspects of LDA, because this forms our mixed membership representation of every document. So a document doesn't belong to just one topic; it belongs to a collection of topics, with different weights on how much membership the document has in each one of these topics. And in this module, we described how we can think of these topic proportions as a learned feature representation, where we can use it to do things like allocating an article to multiple sections of a news website, relating different articles to one another, or learning a user's preferences over different topics. Likewise, we talked about how we can look at the topic distributions over the vocabulary to describe post facto what these topics are really about. So these are the types of inferences we can draw from LDA. But the question is, how do we learn this structure from data? Just like in clustering, this is a fully unsupervised task, where we just provide the sets of words in a set of documents in the corpus. And somehow from this we want to extract these topic vocabulary distributions and these document topic proportions. And critical to doing this, just like in clustering, is inferring the assignments of the words to specific topics. In this module, we described how LDA is specified as a Bayesian model, and so we described a Bayesian inference procedure for learning our model parameters, as well as these assignment variables. The algorithm we described was called Gibbs sampling. At first we presented a vanilla version of Gibbs sampling, where we simply iterate between all these assignment variables and model parameters.
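A minimal sketch of that vanilla Gibbs loop is below. The tiny corpus, hyperparameter values, and variable names are assumptions made for illustration; a real implementation would maintain counts incrementally rather than recomputing them each sweep.

```python
import numpy as np

rng = np.random.default_rng(0)

docs = [[0, 1, 1, 2], [2, 3, 3, 0]]        # documents as word ids (toy corpus)
V, K, alpha, beta = 4, 2, 0.1, 0.1          # vocab size, topics, Dirichlet priors

# Random initialization of assignments and model parameters.
z = [[rng.integers(K) for _ in doc] for doc in docs]
phi = rng.dirichlet([beta] * V, size=K)              # topic -> word distributions
theta = rng.dirichlet([alpha] * K, size=len(docs))   # doc -> topic proportions

for sweep in range(50):
    # (1) Resample each word's topic given theta and phi.
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            p = theta[d] * phi[:, w]
            z[d][i] = rng.choice(K, p=p / p.sum())
    # (2) Resample each document's topic proportions given its assignments.
    for d, doc in enumerate(docs):
        counts = np.bincount(z[d], minlength=K)
        theta[d] = rng.dirichlet(alpha + counts)
    # (3) Resample each topic's word distribution given all assignments.
    for k in range(K):
        counts = np.zeros(V)
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                if z[d][i] == k:
                    counts[w] += 1
        phi[k] = rng.dirichlet(beta + counts)
```

Each step samples one block of variables conditioned on the current values of all the others, which is exactly the iteration described next.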
That is, we randomly resample each one conditioned on the instantiated values of all the other parameters and variables. So at first, we could think about randomly resampling the topic assignments for every word in a document. Then we can think about fixing these and sampling the topic proportion vector for that specific document, and then repeating these steps for all documents in the corpus. And then, having fixed these values, we can think about resampling the topic vocabulary distributions. But then in the module we described a slightly fancier version of sampling that we can perform in LDA called collapsed Gibbs sampling, where we analytically integrate out all the model parameters, the topic vocabulary distributions and the document-specific topic proportions. And we just sequentially sample each indicator variable assigning a given word to a specific topic, conditioned on all the other assignments made in that document and every other document in the corpus. We went through a derivation of the form of this conditional distribution; specifically, there are two terms. One is how much a given document likes the specific topic, and the other is how much that topic likes the specific word considered. We said that we multiply those two terms together, and then we renormalize across all possible topic assignments that we could make. We then use that distribution to sample a new topic indicator for that specific word, and we cycle through all words in the document and all documents in the corpus. Finally, in this module we talked about how we can use the output of Gibbs sampling to do Bayesian inference. Remember, if we're thinking about doing predictions in the Bayesian framework, we want to integrate over our uncertainty in what values the model parameters can take. So we talked about how we can take each one of our Gibbs samples, form predictions from that sample, and then average across those samples.
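The per-word collapsed Gibbs update just described, a document-topic term times a topic-word term, renormalized over topics, can be sketched as follows. All counts, names, and hyperparameters here are illustrative; a real sampler maintains these counts incrementally as words are reassigned.

```python
import random

def resample_topic(word, doc_topic_counts, topic_word_counts, topic_totals,
                   alpha, beta, vocab_size, rng=random.Random(0)):
    """Sample a new topic for one word, with p(z = k) proportional to
    (how much this document likes topic k) * (how much topic k likes this word)."""
    weights = []
    for k in range(len(doc_topic_counts)):
        doc_term = doc_topic_counts[k] + alpha                # document-topic affinity
        word_term = ((topic_word_counts[k][word] + beta)
                     / (topic_totals[k] + vocab_size * beta)) # topic-word affinity
        weights.append(doc_term * word_term)
    total = sum(weights)
    probs = [w / total for w in weights]   # renormalize over all possible topics
    return rng.choices(range(len(probs)), weights=probs)[0]
```

Cycling this update over every word in every document gives one sweep of the collapsed sampler, with no explicit topic-proportion or topic-vocabulary parameters to resample.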
Alternatively, and something that's very commonly done in practice, we can just look at the one sample that maximizes what we call the joint model probability, and then use that to draw inferences. So in summary, as you've seen in just what is supposed to be a brief recap, we've covered an enormous number of topics and very, very advanced concepts. We looked at a bunch of different models as well as a bunch of different algorithms. And through this process, we learned some machine learning concepts that are very general and very useful beyond the ideas of just clustering and retrieval. So for example, we talked about distance metrics that apply in many different domains. We've talked about approximation algorithms, unsupervised learning tasks, probabilistic modeling, scalability through notions of data parallelism, and finally this idea of Bayesian models and Bayesian inference. And having gone through this course, you now have a really, really extensive set of tools to go out and tackle quite different problems than we saw in the regression and classification courses. [MUSIC]