We're now ready to present our latent Dirichlet allocation mixed membership model. Remember, our goal here is to associate with each document a collection of topics present in that document, as well as their relative proportions in the document, that is, how prevalent those topics are.

Recall that in the clustering model we just presented for our bag-of-words representation of a document, we introduced a set of topic-specific vocabulary distributions, assigning probabilities to every word in the vocabulary, specific to each topic, and then the clustering model looked to assign an entire document to a single topic. For that assignment, it would score all of the words in the document under that topic's distribution. The model also introduced a corpus-wide distribution over the prevalence of topics throughout all the documents in the corpus: that is, for any given document we grab, how likely is it to come from each of these topics?

In latent Dirichlet allocation, we introduce this same set of topic-specific vocabulary distributions, but now, when we look at a given document, we look to assign every word to a single topic. Instead of assigning the entire document, every word gets an assignment variable z_iw, indicating, for the w-th word in document i, which topic it's assigned to.

Then, when we go to score a document, we score each word under its assigned topic, so we score all the orange words under the orange topic, blue words under the blue topic, and so on, and that determines how good those assignments are.

Finally, there's one more difference in LDA. Instead of introducing a corpus-wide distribution over topic prevalences, each document has its own distribution over the prevalence of topics in that document. So now, instead of a single pi shared globally throughout the corpus, we have pi_i specific to document i, and this vector represents our desired topic prevalences in that specific document.
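To make this story concrete, here is a minimal sketch in Python of the pieces described above: topic-specific vocabulary distributions, a per-document topic-proportion vector pi_i drawn from a Dirichlet prior (the "Dirichlet" in LDA), and a per-word topic assignment z_iw. The toy vocabulary, number of topics, and hyperparameter value are illustrative assumptions, not anything from the lecture; the scoring helper simply evaluates each word under its assigned topic's distribution, as described.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: a toy vocabulary, 3 topics, and a made-up
# Dirichlet concentration parameter.
vocab = ["baseball", "pitcher", "planet", "orbit", "senate", "vote"]
V = len(vocab)   # vocabulary size
K = 3            # number of topics
alpha = 0.5      # concentration for per-document topic proportions

# Topic-specific vocabulary distributions: one probability vector over
# the vocabulary per topic (each row sums to 1).
topic_word = rng.dirichlet(np.ones(V), size=K)

def generate_document(num_words):
    """Generate one document under the LDA generative sketch."""
    # Per-document topic proportions pi_i (unlike the clustering model's
    # single corpus-wide pi, each document gets its own vector).
    pi_i = rng.dirichlet(alpha * np.ones(K))

    words, assignments = [], []
    for _ in range(num_words):
        # Per-word topic assignment z_iw, drawn from this document's proportions.
        z_iw = rng.choice(K, p=pi_i)
        # The word itself is drawn from that topic's vocabulary distribution.
        w = rng.choice(V, p=topic_word[z_iw])
        words.append(vocab[w])
        assignments.append(z_iw)
    return pi_i, assignments, words

def score_assignments(words, assignments):
    """Log-probability of the words under their assigned topics:
    'orange words under the orange topic, blue words under the blue topic'."""
    return sum(np.log(topic_word[z, vocab.index(w)])
               for w, z in zip(words, assignments))

pi_i, z, doc = generate_document(num_words=10)
print("topic proportions pi_i:", np.round(pi_i, 2))
print("word -> assigned topic:", list(zip(doc, z)))
print("score of these assignments:", round(score_assignments(doc, z), 2))
```

In this sketch, the contrast with the clustering model shows up in two places: the assignment variable z_iw lives inside the per-word loop rather than once per document, and pi_i is sampled fresh for each document rather than fixed once for the whole corpus.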