Well, you might be sitting here thinking: why are we sampling just the word indicators? I don't care about those. Is this just a total waste?

Remember that we discussed before that the things we're typically interested in are the topic vocabulary distributions, for interpretability of the topics present in the corpus, as well as the topic proportions within every document, because those are our compact description of the mixed membership of each document.

So what do we do with the output of this collapsed Gibbs sampler, where all of our samples are just of the word indicators? Well, there are a number of things you can do, and I'm just going to describe one. One thing we can do is look at the assignment of all the words in the corpus that maximizes the joint model probability. It's actually the joint collapsed model probability, where we've integrated over all of the model parameters and just look at the probabilities of these word assignment variables, and of course the probabilities of the words themselves given those assignments.

Then, for this best sample of all the word assignment variables, we can think of doing inference post facto, after running our collapsed sampler, on the topic vocabulary distributions. Because once I've conditioned on a set of topic indicators for every word in my corpus, I can form the conditional distribution on my topic vocabulary distributions. This is exactly the distribution we described at a high level when we talked about our uncollapsed, standard Gibbs sampler. So we could think about sampling these vocabulary distributions, and likewise we can think about doing what's often called document embedding, which is just forming the topic proportion vector for a given document. This embedding takes a document and forms its mixed membership representation, and just like in our uncollapsed, standard Gibbs sampler, we can form the conditional distribution of these topic proportions given just the word assignments in the document we're looking at.
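To make this post-facto step concrete, here is a minimal sketch (not from the lecture) assuming documents represented as lists of word ids, a single saved sample of the word assignments, and symmetric Dirichlet hyperparameters alpha (topic proportions) and gamma (topic vocabulary distributions); all names and default values are illustrative. Conditioned on the assignments, each topic's vocabulary distribution and each document's proportions are Dirichlet, so you can either draw samples from those conditionals or, as below, simply report their posterior means.

```python
import numpy as np

def estimate_topics_and_embeddings(docs, assignments, K, V, alpha=0.1, gamma=0.01):
    """Post-hoc estimates from a single (e.g., the best) sample of word assignments.

    docs[d]        : list of word ids (0..V-1) for document d
    assignments[d] : list of topic indices (0..K-1), one per word in docs[d]
    alpha, gamma   : symmetric Dirichlet hyperparameters (illustrative values)
    """
    n_kv = np.zeros((K, V))            # corpus-wide counts of word v assigned to topic k
    n_dk = np.zeros((len(docs), K))    # counts of topic k within document d
    for d, (words, zs) in enumerate(zip(docs, assignments)):
        for w, z in zip(words, zs):
            n_kv[z, w] += 1
            n_dk[d, z] += 1

    # Conditioned on the assignments, topic k's vocabulary distribution is
    # Dirichlet(n_kv[k] + gamma); we take its posterior mean here, though you
    # could equally draw a sample with np.random.default_rng().dirichlet(...).
    phi = (n_kv + gamma) / (n_kv.sum(axis=1, keepdims=True) + V * gamma)

    # Likewise, document d's topic proportions are Dirichlet(n_dk[d] + alpha);
    # the posterior mean serves as the document embedding.
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)
    return phi, theta
```

Note that `phi` uses counts pooled across the whole corpus, while each row of `theta` uses only the counts from its own document, which mirrors the distinction reiterated next.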
So, just to reiterate: the topic vocabulary distributions are corpus-wide quantities, so we have to look at the assignments made throughout the entire corpus to infer them, but for our document-specific topic proportions we just need to look at the assignments made within that specific document.

Then, finally, you can think about embedding new documents. Say you get a whole collection of new documents, and you've already run your collapsed Gibbs sampler; what do you do with these new documents? Well, the formal thing to do is to completely rerun your sampler with the new documents included: add them in, sample everything for the new documents, and then revisit the documents you've already sampled. But often you really can't do that in practice.

So one thing you could think about doing, which is an approximation procedure, is to fix the topic vocabulary distributions using the procedure described on the previous slides. Then, with our topics fixed, trained on the set of documents we've already looked at, we can embed a new document just by running an uncollapsed Gibbs sampler on that document alone. Because remember, to form the word assignments in a given document and the topic proportions in that document, we only need to condition on the topic vocabulary distributions, not the other documents in the corpus. So we can actually embed each one of these new documents in parallel using this type of procedure.
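The lecture doesn't give code for this, so the following is a minimal sketch of that per-document, uncollapsed Gibbs sampler, assuming a fixed K x V matrix `phi` of topic vocabulary distributions (for example, the output of the earlier sketch) and a symmetric Dirichlet prior `alpha` on the proportions; the function name, iteration count, and hyperparameter value are illustrative.

```python
import numpy as np

def embed_new_document(words, phi, alpha=0.1, n_iters=200, seed=0):
    """Embed a new document with the topic vocabulary distributions phi held fixed.

    words : list of word ids in the new document
    phi   : K x V array of topic vocabulary distributions from the training corpus
    """
    rng = np.random.default_rng(seed)
    K = phi.shape[0]
    theta = rng.dirichlet(np.full(K, alpha))   # initial topic proportions
    z = np.zeros(len(words), dtype=int)        # topic indicator for each word

    for _ in range(n_iters):
        # Step 1: resample each word's topic indicator given theta and the fixed topics
        for i, w in enumerate(words):
            p = theta * phi[:, w]
            z[i] = rng.choice(K, p=p / p.sum())
        # Step 2: resample this document's topic proportions given its assignments
        counts = np.bincount(z, minlength=K)
        theta = rng.dirichlet(alpha + counts)

    return theta   # one (post burn-in) sample of the new document's embedding
```

Because each call touches only that document's words, assignments, and proportions, and never the rest of the corpus, separate new documents can be embedded with independent calls run in parallel, which is the point made above.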