Hi everyone. This week we have explored a lot of ways to build vector representations for words or for some pieces of text. This lesson is about topic modeling. Topic modeling is an alternative way to build vector representations for your document collections.

So, let us start with a brief introduction to the task. You are given a text collection and you want to build some hidden representation. You want to say: okay, there are some topics here, and every document is described by the topics that are discussed in it.

Now, what is a topic? Well, you can imagine that you can describe a topic with some words. For example, a topic such as weather is described by sky, rain, sun, and so on, and a topic such as mathematics is described by some mathematical terms, and probably the two do not even overlap at all.

So, you can think about it as soft biclustering. Why soft biclustering? First, it is biclustering because you cluster both words and documents. Second, it is soft because, as you will see, we are going to build probability distributions that softly assign words and documents to topics.

This is the formal way of saying the same thing. You are given a text collection, so you are given the counts of how many times every word occurs in every document. What you need to find is two kinds of probability distributions: first, the probability distribution over words for each topic, and second, the probability distribution over topics for each document. Importantly, this is just the definition of a topic, so you should not think that a topic is something complicated, as it is in real life or as linguists might describe it. For us, throughout this lesson, a topic is just a probability distribution. That's it.
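The slide with the formal statement is not reproduced in the transcript, so here is a sketch of it in the usual topic-modeling notation; the symbols n_dw, phi and theta are my own naming choices, picked to match the Phi and Theta matrices mentioned later in the lesson.

```latex
% Given: documents d \in D, a vocabulary of words w \in W, topics t \in T,
% and counts n_{dw} = how many times word w occurs in document d.
% Find two families of probability distributions:
\varphi_{wt} = p(w \mid t), \qquad \sum_{w \in W} \varphi_{wt} = 1 \quad \text{(words for each topic)}
\theta_{td}  = p(t \mid d), \qquad \sum_{t \in T} \theta_{td}  = 1 \quad \text{(topics for each document)}
```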
Where do we need this kind of model in real life? Well, actually everywhere, because everywhere you have big collections of documents. It can be news flows or social media messages, or some data from your own domain, for example research papers that you do not want to read, but about which you want to know that there are papers about this and that, and that they are connected in a certain way. So, you want a nice overview of the area to be built automatically, and topic models can do it for you.

Some other applications would be social network analysis or even dialogue systems, because you can imagine generating some text. You know how to generate text, right, from the previous week, but now you can make this text generation depend on the topics that you want to mention. There are many, many other applications, for example aggregation of news flows, when you have some news about politics, say, and you want to say that this topic is becoming popular nowadays. One other important application that I want to mention is exploratory search, which means that you want to say: here is a document that I am interested in, could you please find some similar documents and tell me how they are interconnected?

Now, let us do some math, so let us discuss probabilistic latent semantic analysis, PLSA. This is a topic model proposed by Thomas Hofmann in 1999. It is a very basic model that tries to predict words in documents, and it does so with a mixture of topics:

p(w | d) = Σ_t p(w | t, d) p(t | d) = Σ_t p(w | t) p(t | d).

So, do you understand what happens in the first equality of this formula? Well, this is the law of total probability. If you just do not care about the document d in the formula for now, you can notice that this is the law of total probability applied here. Just take a moment to understand this. Now, what about the second equality? Well, this one is not automatically correct; it is just our assumption. So, just for simplicity, we assume that the probability of a word given the topic does not depend on the document anymore. This is a conditional independence assumption. And this is all that we need to introduce the PLSA model.

Now I just want to give you some intuition about how that works. This is a generative story: a story of how the data is generated by our model. You have some probability distribution over topics for the document, and first you decide what the topic of the next word will be. Then, once you have decided on that, you can draw a certain word from the probability distribution for this topic. So, this model assumes that the text is generated not by authors, not by handwriting, but by some probabilistic procedure. First we toss a coin and decide which topic will come next, then we toss a coin again and decide what the exact word will be, and we go on like this through the whole text.
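To make this generative story concrete, here is a minimal Python sketch; the two topics, the tiny vocabulary, and all of the probabilities are invented purely for illustration, and this is only the assumed generation process, not a training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parameters, made up purely for illustration.
# phi[t] is p(w | t): the distribution over words for topic t.
phi = {
    "weather": {"sky": 0.4, "rain": 0.3, "sun": 0.3, "integral": 0.0},
    "math":    {"sky": 0.0, "rain": 0.0, "sun": 0.0, "integral": 1.0},
}
# theta is p(t | d): the distribution over topics for one document d.
theta = {"weather": 0.7, "math": 0.3}
topics = list(theta)

def generate_document(n_words):
    """Generate n_words tokens: toss a coin for the topic, then a coin for the word."""
    words = []
    for _ in range(n_words):
        t = rng.choice(topics, p=[theta[k] for k in topics])   # pick a topic from theta
        vocab = list(phi[t])
        w = rng.choice(vocab, p=[phi[t][v] for v in vocab])    # pick a word from phi[t]
        words.append(str(w))
    return words

print(generate_document(10))
```

Running it just produces a bag of words whose topical mix follows theta, which is exactly the "toss a coin for the topic, then for the word" procedure described above.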
Well, this is just one way to think about it. If you do not feel very comfortable with this way, I will give you another way of thinking about it. This is a matrix way of thinking about the same model. You can imagine that you have some data, which is just word-document co-occurrences: you know how many times each word occurs in each document. That is why you can compute distributions; you can compute the probabilities of words in documents. You just normalize those counts, and that's it.

Now you need to factorize this real matrix into two matrices of your parameters, Phi and Theta. One matrix, the Phi matrix, contains probability distributions over words, and the Theta matrix contains probability distributions over topics. In fact, every column in these matrices is a probability distribution. So, this is just a matrix form of the same formula at the top of the slide, and you can see that it holds for one element and therefore, obviously, for any element.

So, this is the introduction of the model, and in the next video we will figure out how to train it. So stay with me.
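If you want to see the matrix view from this lesson in code before the next video, here is a minimal numpy sketch; the matrix sizes are arbitrary and Phi and Theta are filled with random column-stochastic values as placeholders, since fitting them to the data is exactly what the next video covers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_topics, n_docs = 1000, 5, 100

# Toy word-document counts n_dw; in reality these come from your collection.
counts = rng.integers(0, 5, size=(n_words, n_docs))

# Normalizing each column gives the empirical p(w | d) for every document.
F = counts / counts.sum(axis=0, keepdims=True)

# Random placeholders for the parameters (training them is the next video):
# Phi   (words  x topics): every column is p(w | t), so each column sums to 1.
# Theta (topics x docs):   every column is p(t | d), so each column sums to 1.
Phi = rng.random((n_words, n_topics))
Phi /= Phi.sum(axis=0, keepdims=True)
Theta = rng.random((n_topics, n_docs))
Theta /= Theta.sum(axis=0, keepdims=True)

# PLSA approximates F by the product Phi @ Theta; element-wise this is
# p(w | d) = sum_t p(w | t) * p(t | d).
model = Phi @ Theta

print(F.shape, model.shape)                 # both (n_words, n_docs)
print(np.allclose(model.sum(axis=0), 1.0))  # columns of the model are distributions too
```

The only point of the sketch is the shape bookkeeping: Phi @ Theta has the same shape as the normalized count matrix, and its columns are again probability distributions over words.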