Hi everyone. This week, we have explored a lot of ways to build vector representations for words or for some pieces of text. This lesson is about topic modeling. Topic modeling is an alternative way to build vector representations for your document collections. So, let us start with a brief introduction to the task. You are given a text collection and you want to build some hidden representation. So, you want to say that, okay, there are some topics here, and every document is described by the topics that are discussed in it.

Now, what is a topic? Well, you can imagine that you can describe a topic with some words. For example, a topic such as weather is described with sky, rain, sun, and so on, and a topic such as mathematics is described with some mathematical terms, and probably the two do not even overlap at all. So, you can think about this as soft biclustering. Why soft biclustering? First, it is biclustering because you cluster both words and documents. Second, it is soft because, as you will see, we are going to build probability distributions that softly assign words and documents to topics.

This is the formal way of saying the same thing. You are given a text collection, so you are given the counts of how many times every word occurs in every document. What you need to find is two kinds of probability distributions: first, the probability distribution over words for each topic, and second, the probability distribution over topics for each document. And importantly, this is just the definition of a topic, so you should not think that a topic is something complicated, as it might be in real life or as linguists would define it. For us, for all of this lesson, a topic is just a probability distribution. That's it.

Where do we need this kind of model in real life? Well, actually everywhere, because everywhere you have big collections of documents. It can be news flows, or social media messages, or maybe some data from your domain, for example research papers that you do not want to read, but you want to know that there are some papers about this and that and how they are connected. So, you want a nice overview of the area, built automatically, and topic models can do it for you. Some other applications would be social network analysis or even dialog systems, because you can imagine generating some text. You know how to generate text, right, from the previous week, but now you can make this text generation depend on the topics that you want to mention. There are many, many other applications, for example aggregation of news flows, when you have some news about politics, say, and you want to say that this topic is becoming popular nowadays. One other important application that I want to mention is exploratory search, which means that you want to say: this is a document that I am interested in, could you please find some similar documents and tell me how they are interconnected?

Now, let us do some math. Let us discuss probabilistic latent semantic analysis, PLSA. This is a topic model proposed by Thomas Hofmann in 1999. It is a very basic model that tries to predict words in documents, and it does so with a mixture of topics. So, do you understand what happens in the first equation of this formula? Well, this is the law of total probability. If you just ignore the conditioning on the document d for a moment, you can notice that this is the law of total probability applied here.
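(The slide with the formula is not reproduced in this transcript; for reference, the decomposition being described is the standard PLSA mixture:

$$ p(w \mid d) \;=\; \sum_{t \in T} p(w \mid t, d)\, p(t \mid d) \;=\; \sum_{t \in T} p(w \mid t)\, p(t \mid d), $$

where the first equality is the law of total probability and the second equality is the conditional independence assumption discussed next.)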
Just take a moment to understand this. Now, what about the second equation here? Well, this one is not exact; it is just our assumption. For simplicity, we assume that the probability of a word given the topic does not depend on the document anymore. This is a conditional independence assumption, and it is all that we need to introduce the PLSA model.

Now I just want to give you some intuition about how this works. This is a generative story, a story of how the data is generated by our model. You have some probability distribution over topics for the document, and first you decide what the topic of the next word will be. Then, once you have decided on that, you draw a certain word from the probability distribution for this topic. So, this model assumes that the text is generated not by authors writing it by hand, but by some probabilistic procedure: first we toss a coin and decide which topic comes next, then we toss a coin again and decide what the exact word will be, and we go on like this through the whole text.

Well, this is just one way to think about it. If you do not feel very comfortable with this way, let me offer you another one. This is a matrix way of thinking about the same model. You can imagine that you have some data, which is just word-document co-occurrences. You know how many times each word occurs in each document, so you can compute the probabilities of words in documents: you just normalize those counts, and that's it. Now you need to factorize this data matrix into two matrices of parameters, Phi and Theta. The Phi matrix contains the probability distributions over words for each topic, and the Theta matrix contains the probability distributions over topics for each document. Actually, every column in these matrices is a probability distribution. So, this is just the matrix form of the same formula at the top of the slide: the formula holds for each single element, and therefore it holds for the whole matrices. You will find a small sketch of both views below.

So, this is the introduction of the model, and in the next video we will figure out how to train it. So stay with me.
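(Supplement, not part of the lecture: here is a minimal sketch of the generative story and the matrix view, assuming a toy vocabulary of six words and two topics. All words, topic names, and probability values below are made up purely for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and topics; everything here is invented for illustration.
vocab = ["sky", "rain", "sun", "theorem", "matrix", "proof"]
topics = ["weather", "mathematics"]

# Phi: one column per topic, a distribution over words (each column sums to 1).
phi = np.array([
    [0.40, 0.00],   # sky
    [0.30, 0.00],   # rain
    [0.28, 0.02],   # sun
    [0.01, 0.38],   # theorem
    [0.01, 0.30],   # matrix
    [0.00, 0.30],   # proof
])

# Theta: one column per document, a distribution over topics (each column sums to 1).
theta = np.array([
    [0.9, 0.2],     # weather
    [0.1, 0.8],     # mathematics
])

def generate(doc, n_words=10):
    """Generative story: for each word position, first draw a topic from the
    document's topic distribution, then draw a word from that topic."""
    words = []
    for _ in range(n_words):
        t = rng.choice(len(topics), p=theta[:, doc])
        w = rng.choice(len(vocab), p=phi[:, t])
        words.append(vocab[w])
    return words

print(generate(0))  # document 0 is mostly about weather
print(generate(1))  # document 1 is mostly about mathematics

# Matrix view: the model's p(w | d) for all words and documents at once
# is just the product of the two parameter matrices.
p_w_given_d = phi @ theta       # shape: (number of words, number of documents)
print(p_w_given_d.sum(axis=0))  # each column is a distribution, so it sums to 1
```

Training the model means finding Phi and Theta so that this product is close to the normalized word-document counts; that is the topic of the next video.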