To build up to mixed membership models for documents, it's helpful to first present an alternative clustering model to the mixture of Gaussians model we presented in the last module.

So, just to emphasize, we're going back to our clustering model, where we assume the simpler structure in which every document is assigned to a single topic. So far, when we've looked at our documents, we've represented them with a tf-idf vector; we've taken the tf-idf vectors associated with every document in the corpus and used a mixture of Gaussians to discover some set of clusters in this tf-idf space.

But now we're going to use an alternative representation of a document called the bag-of-words representation, where we simply take all of the words present in our document, throw them into a bag, and shake that bag up, so that the order of the words doesn't matter. Our representation of the document is simply going to be an unordered set of words. I use "set" loosely here, because this set will contain multiple occurrences of a unique word if that word appears multiple times in the document. The multiplicity of the unique words matters here, unlike in standard sets, so this is formally called a multiset.

So now let's present a clustering model for this new document representation. To start, we need to specify the prior probability that a given document is associated with a specific cluster. These topic prevalences are going to be exactly like what we had in our mixture of Gaussians case, where they just represent the corpus-wide prevalence of topics. But now our likelihood term is going to be different: instead of scoring every document under a specific Gaussian, as in the mixture of Gaussians case, we're going to take our document in its bag-of-words representation and score that set of words under a topic-specific probability vector over words. Specifically, every topic is going to be associated with a probability distribution over the words in the vocabulary, and using that, we're able to score the words present in the document, saying how probable they are under that specific topic.
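(As a concrete aside, here's a minimal Python sketch of that idea, using a toy document and made-up topic probabilities that aren't from the lecture: the bag-of-words multiset is built with collections.Counter, and each occurrence of a word contributes log p(word | topic) to the document's score under that topic.)

```python
from collections import Counter
import math

# Bag-of-words: an unordered multiset, so word multiplicity is preserved.
document = "the cat sat on the mat the cat".split()
bag = Counter(document)  # Counter({'the': 3, 'cat': 2, 'sat': 1, 'on': 1, 'mat': 1})

# One topic = a probability distribution over the vocabulary
# (toy numbers that sum to 1; a real topic covers the whole vocabulary).
topic = {"the": 0.4, "cat": 0.3, "sat": 0.1, "on": 0.1, "mat": 0.1}

# Likelihood of the document under this topic: each occurrence of a word
# contributes log p(word | topic), so counts scale the log-probabilities.
log_likelihood = sum(count * math.log(topic[word]) for word, count in bag.items())
print(log_likelihood)
```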
We then do this for every topic, and we choose between topics using both this prior and the likelihood, just like in our mixture of Gaussians example.

So, just to be very clear: for every topic, like ones about science, technology, and sports (even though, of course, we don't have those labels; they're just going to be clusters one, two, and three), we're going to have a probability vector over words in the vocabulary. The way I'm showing them here on this slide, they're ordered by how probable those words are in the topic, from most probable to least probable. Whereas in the previous slide, I was just listing the words that actually appeared in the dataset, or in that specific article.

So now we can compare and contrast our mixture of Gaussians clustering model and the clustering model we just specified. In both of these models, our prior topic probabilities, that is, the probability before we actually look at the content of a document that the document came from a given cluster, are given by these pi_k's, and they're specified in exactly the same way in both cases. But in the mixture of Gaussians case, our documents were represented by tf-idf vectors, or some other vector, it could be a word-count vector, and we scored that vector under each one of the Gaussians. Remember, each cluster was defined by a Gaussian, and you would compute the score of a given data point under each of these Gaussians, and then weigh the prior and likelihood terms to come up with an assignment for a given document. But now every document is represented with the bag-of-words representation, and when we go to score the document, we just look at the probability of each of its words under the topic-specific probability vector over words.
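(To tie the comparison together, here's a hedged sketch, again with made-up numbers and a hypothetical assign_topic helper, of how the prior pi_k combines with the bag-of-words likelihood to produce a hard cluster assignment; the full mixture model would turn these same scores into soft responsibilities instead of taking the argmax.)

```python
from collections import Counter
import math

def assign_topic(words, pi, topics):
    """Score each topic as log pi_k + sum_w count(w) * log p(w | topic k),
    then return the index of the best-scoring topic (a hard assignment)."""
    bag = Counter(words)
    scores = [
        math.log(prior) + sum(c * math.log(word_probs[w]) for w, c in bag.items())
        for prior, word_probs in zip(pi, topics)
    ]
    return max(range(len(scores)), key=scores.__getitem__), scores

# Toy setup: two topics over a tiny shared vocabulary (made-up numbers).
pi = [0.6, 0.4]  # corpus-wide topic prevalences, the pi_k's
topics = [
    {"goal": 0.5, "team": 0.3, "atom": 0.1, "quark": 0.1},  # a "sports"-like topic
    {"goal": 0.1, "team": 0.1, "atom": 0.4, "quark": 0.4},  # a "science"-like topic
]

best, scores = assign_topic("atom quark atom team".split(), pi, topics)
print(best, scores)  # the "science"-like topic wins despite its smaller prior
```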