1 00:00:00,047 --> 00:00:04,317 [MUSIC] 2 00:00:04,317 --> 00:00:08,272 In this module we're going to cover a really popular probabilistic model for 3 00:00:08,272 --> 00:00:12,298 document analysis called Latent Dirichlet Allocation or LDA. 4 00:00:12,298 --> 00:00:16,920 And LDA is an example of a class of methods called Mixed Membership Modeling. 5 00:00:16,920 --> 00:00:20,050 And to start with, let's motivate the use of mixed membership models 6 00:00:20,050 --> 00:00:22,099 in the context of our document analysis. 7 00:00:23,300 --> 00:00:26,940 So far we've described clustering models where the goal is to group 8 00:00:26,940 --> 00:00:29,350 related articles into disjoint sets or 9 00:00:29,350 --> 00:00:34,940 clusters where these clusters capture the topics prevalent in the corpus. 10 00:00:34,940 --> 00:00:39,890 And in this context every document is assigned to a single topic. 11 00:00:41,100 --> 00:00:45,620 A question though is, is an article really about just one topic, like science? 12 00:00:46,840 --> 00:00:50,810 In the last module, we talked about soft assignments capturing uncertainty 13 00:00:50,810 --> 00:00:56,270 in this cluster assignment, but the clustering model still assumes that 14 00:00:56,270 --> 00:01:00,230 each document is assigned to a single topic. 15 00:01:00,230 --> 00:01:04,060 We might just have uncertainty in what that assignment is from the observed data. 16 00:01:05,170 --> 00:01:07,980 To make this more concrete, let's look at a specific example 17 00:01:07,980 --> 00:01:12,210 where we have an article that's entitled Modeling the Complex Dynamics and 18 00:01:12,210 --> 00:01:15,040 Changing Correlations of Epileptic Events. 19 00:01:15,040 --> 00:01:21,040 And in this article we see words like patients, epilepsy, EEG, clinical. 20 00:01:21,040 --> 00:01:23,930 So based on the content of this article. 
21 00:01:23,930 --> 00:01:28,294 Maybe a clustering model would group this article with other 22 00:01:28,294 --> 00:01:30,953 articles related to a science topic. 23 00:01:30,953 --> 00:01:33,789 And maybe that's Cluster 4 in orange for clustering. 24 00:01:33,789 --> 00:01:38,585 And what this bar chart represents is simply a one-hot encoding 25 00:01:38,585 --> 00:01:43,323 of our assignment of this article to cluster 4, z_i = 4. 26 00:01:43,323 --> 00:01:51,500 But this article also has words like asynchronous, automatic, model, 27 00:01:51,500 --> 00:01:56,100 and things like this which might mean that it is really an article about technology. 28 00:01:56,100 --> 00:01:59,040 And so maybe we should group it with other tech articles 29 00:01:59,040 --> 00:02:02,420 which in this case is cluster 2. 30 00:02:02,420 --> 00:02:05,540 Well, our soft assignments that we talked about in the last module 31 00:02:05,540 --> 00:02:10,380 capture our uncertainty about whether this article should be assigned to cluster 2 or 32 00:02:10,380 --> 00:02:14,720 cluster 4, assigned to the science cluster or the technology cluster. 33 00:02:15,820 --> 00:02:19,460 But maybe what we really want to capture is the fact that the article's about 34 00:02:19,460 --> 00:02:21,200 science and technology. 35 00:02:22,400 --> 00:02:26,100 That is, "z_i" is really 2 and 4. 36 00:02:26,100 --> 00:02:30,110 And so, I put z_i in quotes because it's not going to be a single variable 37 00:02:30,110 --> 00:02:34,830 associated with a document to represent this mixed assignment like 38 00:02:34,830 --> 00:02:37,210 what we saw in our clustering model. 39 00:02:37,210 --> 00:02:41,750 But in essence, what we're saying is that this document has membership 40 00:02:41,750 --> 00:02:43,650 in both of these clusters 2 and 4. 
41 00:02:43,650 --> 00:02:48,560 And we're going to go through in this module exactly how we think about 42 00:02:48,560 --> 00:02:52,300 formulating this type of mixed membership assignment. 43 00:02:52,300 --> 00:02:57,140 And importantly, the other thing that we're going to want to capture is not only 44 00:02:57,140 --> 00:03:02,130 which topics are present in this document, but what's the relative proportion? 45 00:03:02,130 --> 00:03:03,630 How prevalent are these topics? 46 00:03:05,040 --> 00:03:09,692 So this is where mixed membership models come in because mixed membership 47 00:03:09,692 --> 00:03:14,194 models allow us to associate any given data point with a set of different 48 00:03:14,194 --> 00:03:18,788 cluster assignments or, in this case, a document with a set of topics. 49 00:03:18,788 --> 00:03:23,535 Rather than assuming that every document is associated with just a single cluster 50 00:03:23,535 --> 00:03:27,798 or topic or capturing uncertainty in that single assignment like the soft 51 00:03:27,798 --> 00:03:30,016 assignments we talked about before. 52 00:03:30,016 --> 00:03:34,269 [MUSIC]
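To make the three representations from this segment concrete, here is a minimal sketch in Python. It assumes five clusters numbered 1 to 5 (the lecture only names clusters 2 and 4), and every number in it is made up for illustration. The helper names `one_hot` and `topics_present` are hypothetical. Note that the soft assignment and the mixed membership vector can look alike numerically; the difference is what they mean. A soft assignment is uncertainty about one single cluster, while mixed membership says the document really belongs to several topics in the stated proportions.

```python
# Assumption: 5 clusters/topics, numbered 1..5. The lecture does not give the count.
NUM_CLUSTERS = 5


def one_hot(cluster, num_clusters=NUM_CLUSTERS):
    """Hard assignment: all weight on one cluster, i.e. z_i = cluster."""
    return [1.0 if k == cluster else 0.0 for k in range(1, num_clusters + 1)]


# Clustering with a hard assignment: the article goes to cluster 4 only.
hard_assignment = one_hot(4)

# Soft assignment (made-up numbers): the article still has ONE true cluster,
# we are just unsure whether it is cluster 2 or cluster 4.
soft_assignment = [0.0, 0.45, 0.0, 0.55, 0.0]

# Mixed membership (made-up numbers): the article is about BOTH topics,
# and the entries are the relative proportions of each topic in it.
topic_proportions = [0.0, 0.4, 0.0, 0.6, 0.0]


def topics_present(proportions, threshold=0.0):
    """Return the 1-based topic indices whose proportion exceeds the threshold."""
    return [k for k, p in enumerate(proportions, start=1) if p > threshold]


print(hard_assignment)
print(topics_present(topic_proportions))
```

Calling `topics_present(topic_proportions)` answers the first question from the lecture (which topics are present), and the vector entries themselves answer the second (how prevalent each topic is).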