Having specified the LDA model, we now turn to inference in LDA. Remember that our LDA model introduced a set of topic-specific vocabulary distributions that are shared throughout the entire corpus. Then, for every word in every document, there's an assignment variable of that word to a specific topic. And finally, for every document, there are the topic proportions in that document, that vector pi_i.

So collectively, these represent our model parameters as well as our assignment variables. But remember that in our unsupervised learning task, we just get words from documents. We get a whole bunch of documents, we transform them to our bag-of-words representation, and that's all we have. And somehow, from this, we have to infer all these word assignment variables and all these topic proportions and topic prevalences, just from the observed words. So it actually seems like a really, really challenging task.

So, just to be clear, the input to LDA, or to inference in LDA, is the set of words from a collection of documents in a corpus. And the output is going to be our set of topic-specific vocabulary distributions, shared throughout the corpus, as well as our document-specific word assignments and our document-specific topic proportions.

But before we get to algorithms for performing this inference task, let's first describe how we might interpret the outputs. One thing we can do is examine the coherence of the learned topics. To do this, we can take the distribution over words in the vocabulary for every topic and order those words by how probable they are in that topic. So we look at the most probable words in every topic and see if they form a coherent set. If they do, then post facto we can actually label these topics with things like science, technology, sports, and so on. This provides a qualitative assessment of the topics present in the corpus.

One other thing I want to emphasize, though, is that these topic distributions are not typically sparse vectors. Typically, they place mass on every word in the vocabulary. But if you look at the most probable words, those often form some kind of interpretable set, if your model is performing well, and you're going to explore this in the assignment.
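To make the "most probable words per topic" idea concrete, here is a minimal sketch assuming you already have a learned topic-by-vocabulary probability matrix. The variable names and the random example values are made up for illustration and are not the lecture's or the assignment's code.

import numpy as np

# Hypothetical learned parameters (shapes assumed, not from the lecture):
# topic_word: K x V matrix, where row k is the vocabulary distribution for topic k.
# vocab: list of V words, aligned with the columns of topic_word.
rng = np.random.default_rng(0)
K, V = 3, 10
topic_word = rng.dirichlet(np.ones(V) * 0.1, size=K)   # each row sums to 1
vocab = [f"word{j}" for j in range(V)]

def top_words(topic_word, vocab, k=5):
    """Return the k most probable words (and their probabilities) for each topic."""
    tops = []
    for dist in topic_word:
        idx = np.argsort(dist)[::-1][:k]                # indices of the largest probabilities
        tops.append([(vocab[j], float(dist[j])) for j in idx])
    return tops

for t, words in enumerate(top_words(topic_word, vocab)):
    print(f"topic {t}:", words)

Scanning these top-word lists is exactly the qualitative check described above: if the words in a topic hang together, you can attach a post hoc label like "science" or "sports" to that topic.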
So I just want to emphasize that the words we're showing here in these lists, and the fact that we're only showing a few words, is not the full description of each of these topics. It's really a much more complicated beast.

The other thing we can look at, and the thing that we're often very interested in, are the topic proportions in every document, because this vector can be used to relate documents. So, which other documents have similar types of topics present? That can be used for retrieval tasks, something you'll also look at in your assignment (see the sketch at the end of this section). You can also use this type of topic proportion vector to allocate an article to multiple categories. So imagine you're some news site and you have an article, and you need to put that article into a category. This type of representation actually allows you to put that article into multiple categories, get more viewership for that article, and present it to more people who might be interested in it.

And finally, you can also use these topic proportions to do things like we described before, like learning the preferences of a given user over a set of topics. This type of description that LDA provides, a set of topics and their relative proportions, is a much more descriptive form than the clustering output we talked about before. That's definitely true for the hard assignments, and for the soft assignments as well, where really that just captured uncertainty in the assignment, but not the fact that there is inherently, as specified in the model, a set of possible topics associated with every document. So this lets us do even fancier things in learning user preferences.

And the last thing we haven't described are the word assignment variables. Typically, honestly, we're not actually interested in these; we're not actually interested in whether a specific word in a specific document is associated with a topic related to science. But these assignment variables are going to play a really critical role in inferring the other model parameters, which are typically the things of interest. And this is just like what we saw before. We'll walk through this explicitly in the next section.
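Before moving on, here is a rough sketch of the retrieval idea mentioned above: relating documents through their topic proportion vectors by ranking them with cosine similarity. The doc_topic matrix and its values are hypothetical, purely for illustration; this is one simple choice of similarity, not the lecture's prescribed method.

import numpy as np

# Hypothetical per-document topic proportions (the pi_i vectors), D x K,
# each row summing to 1; the numbers are made up for illustration.
doc_topic = np.array([
    [0.70, 0.20, 0.10],   # mostly topic 0
    [0.65, 0.25, 0.10],   # a similar mix, so it should rank as "related"
    [0.05, 0.15, 0.80],   # mostly topic 2
])

def most_similar(doc_topic, query_idx, top_n=2):
    """Rank the other documents by cosine similarity of their topic proportions."""
    q = doc_topic[query_idx]
    norms = np.linalg.norm(doc_topic, axis=1) * np.linalg.norm(q)
    sims = doc_topic @ q / norms
    order = [i for i in np.argsort(sims)[::-1] if i != query_idx]
    return [(int(i), float(sims[i])) for i in order[:top_n]]

print(most_similar(doc_topic, query_idx=0))
# Document 1 ranks above document 2 for query document 0, since its topic mix is closer.

The same proportion vectors could also drive multi-category assignment, for example by tagging an article with every topic whose proportion exceeds some threshold, which is the "multiple categories" use case described above.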