[MUSIC]

And finally, in our fourth module, we looked at a probabilistic model that provides a more intricate description of our data points, and of the relationships between data points, than our simple clustering representation. In particular, we looked at something called mixed membership modeling, and for a document analysis task this model is called latent Dirichlet allocation.

But before we presented latent Dirichlet allocation, or LDA for short, we presented an alternative document clustering model, where we introduced a set of topic-specific distributions over the words in the vocabulary, where, remember, every topic is a different cluster. Then every document was assigned to a cluster just as before. But in forming that assignment, the score of the document under a cluster was computed by looking at a bag-of-words representation of the document, so just an unordered set of the words that appear in that document, and then scoring those words under that cluster's topic-specific distribution; a minimal sketch of this scoring appears below. And here, just like in the mixture models we described previously, every cluster, or topic in this case, has a specific prevalence in the overall corpus, so this is a distribution over the topics that appear in the entire corpus. In this module, we compared and contrasted this clustering model with the mixture of Gaussians clustering model we presented in the third module.

Then we turned to the LDA model itself, where every word in every document has an assignment variable linking that word to a specific topic. So when we think about scoring a document in LDA, we think of scoring every word under its associated topic, where these topics are defined exactly as in the alternative clustering model we just described: there is a distribution over every word in the vocabulary specific to each topic. But the fact that there is a topic indicator per word, rather than per document, is not the only thing that distinguishes this model from the clustering model we just described. The other difference is that we introduced a topic proportion vector specific to each document, rather than having it represent corpus-wide topic proportions. And this is really one of the key aspects of LDA, because this forms our mixed membership representation of every document.
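To make the bag-of-words scoring in that alternative clustering model concrete, here is a minimal sketch in Python. The names topic_word_probs (a K x V matrix of topic-specific distributions over the vocabulary), topic_prevalence (corpus-wide topic proportions), and the toy numbers are illustrative assumptions, not quantities from the course materials.

```python
import numpy as np

def score_document(word_counts, topic_word_probs, topic_prevalence):
    """Score a bag-of-words document under each topic (cluster).

    word_counts      : length-V vector of word counts for one document
    topic_word_probs : K x V matrix; row k is topic k's distribution over the vocabulary
    topic_prevalence : length-K vector of corpus-wide topic proportions
    Returns the log-score of the document under every topic.
    """
    # log p(topic k) + sum over words of count(w) * log p(w | topic k)
    log_likelihood = word_counts @ np.log(topic_word_probs).T   # shape (K,)
    return np.log(topic_prevalence) + log_likelihood

# Toy example: 3 topics over a 5-word vocabulary (made-up numbers)
topic_word_probs = np.array([[0.5, 0.2, 0.1, 0.1, 0.1],
                             [0.1, 0.1, 0.5, 0.2, 0.1],
                             [0.2, 0.2, 0.2, 0.2, 0.2]])
topic_prevalence = np.array([0.5, 0.3, 0.2])
doc = np.array([4, 1, 0, 0, 2])   # unordered word counts: the bag of words

scores = score_document(doc, topic_word_probs, topic_prevalence)
print("best cluster:", scores.argmax())
```

In LDA, by contrast, the single corpus-wide prevalence vector and the single cluster assignment per document are replaced by per-document topic proportions and per-word topic indicators.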
So a document doesn't belong to just one topic; it belongs to a collection of topics, with different weights on how much membership the document has in each one of these topics. In this module, we described how we can think of these topic proportions as a learned feature representation, which we can use to do things like allocate an article to multiple sections on a news website, relate different articles to one another, or learn a user's preferences over different topics. Likewise, we talked about how we can look at the topic distributions over the vocabulary to describe post facto what these topics are really about.

So these are the types of inferences we can draw from LDA. But the question is, how do we learn this structure from data? Just like in clustering, this is a fully unsupervised task, where we just provide the set of words in a set of documents in the corpus, and somehow from this we want to extract the topic vocabulary distributions and the document topic proportions. And critical to doing this, just like in clustering, is inferring the assignments of the words to specific topics.

In this module, we described that LDA is specified as a Bayesian model, and so we described a Bayesian inference procedure for learning our model parameters as well as these assignment variables. The algorithm we described is called Gibbs sampling. At first, we presented a vanilla version of Gibbs sampling, where we simply iterate between all of these assignment variables and model parameters, randomly reassigning each conditioned on the instantiated values of all the other parameters and variables. So first, we can randomly reassign the topics for every word in a document. Then we can fix these and sample the topic proportion vector for that specific document, and repeat these steps for all documents in the corpus. Then, having fixed those values, we can resample the topic vocabulary distributions; a minimal sketch of one such sweep is included below. But then in the module we described a slightly fancier version of sampling that we can perform in LDA, called collapsed Gibbs sampling.
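As a rough illustration of the vanilla (uncollapsed) Gibbs sampler just described, here is a minimal sketch of one sweep. The arrays theta (document topic proportions), phi (topic vocabulary distributions), and z (per-word topic assignments), along with scalar Dirichlet hyperparameters alpha and gamma, are assumptions made for this sketch; it is not the course's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def vanilla_gibbs_sweep(docs, z, theta, phi, alpha, gamma):
    """One sweep of uncollapsed Gibbs sampling for LDA.

    docs  : list of word-id lists, one per document
    z     : list of integer topic-assignment arrays, aligned with docs
    theta : D x K document-topic proportions
    phi   : K x V topic-vocabulary distributions
    alpha, gamma : scalar Dirichlet hyperparameters
    """
    K, V = phi.shape
    for d, words in enumerate(docs):
        # 1) Resample each word's topic assignment given theta and phi.
        for i, w in enumerate(words):
            p = theta[d] * phi[:, w]                 # doc d's affinity x topic's affinity for word w
            z[d][i] = rng.choice(K, p=p / p.sum())
        # 2) Resample this document's topic proportions given its assignments.
        counts = np.bincount(z[d], minlength=K)
        theta[d] = rng.dirichlet(alpha + counts)
    # 3) Resample each topic's distribution over the vocabulary.
    for k in range(K):
        word_counts = np.zeros(V)
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                if z[d][i] == k:
                    word_counts[w] += 1
        phi[k] = rng.dirichlet(gamma + word_counts)
    return z, theta, phi
```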
In collapsed Gibbs sampling, we analytically integrate out all of these model parameters, the topic vocabulary distributions and the document-specific topic proportions. We then just sequentially sample each indicator variable assigning a given word to a specific topic, conditioned on all the other assignments made in that document and in every other document in the corpus. We went through a derivation of the form of this conditional distribution; specifically, there are two terms. One is how much a given document likes the specific topic, and the other is how much that topic likes the specific word being considered. We said that we multiply those two terms together, renormalize across all possible assignments we could make, and then use that distribution to sample a new topic indicator for that specific word. Then we cycle through all words in the document and all documents in the corpus; a minimal sketch of this resampling step appears at the end of this recap.

Finally, in this module we talked about how we can use the output of Gibbs sampling to do Bayesian inference. Remember, if we're thinking about making predictions in the Bayesian framework, we want to integrate over our uncertainty in what values the model parameters can take. So we talked about how we can take each one of our Gibbs samples, form predictions from that sample, and then average across those samples. Alternatively, and something that's very commonly done in practice, we can just look at the one sample that maximizes what we call the joint model probability and then use that to draw inferences.

So in summary, as you've seen in what is supposed to be just a brief recap, we've covered an enormous number of topics and some very, very advanced concepts. We looked at a bunch of different models as well as a bunch of different algorithms. And through this process we learned some machine learning concepts that are very general and very useful beyond just clustering and retrieval. For example, we talked about distance metrics that apply in many different domains. We talked about approximation algorithms, unsupervised learning tasks, probabilistic modeling, scalability through notions of data parallelism, and finally this idea of Bayesian models and Bayesian inference.
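To tie the collapsed Gibbs sampling step described above to something concrete, here is a minimal sketch of resampling the topic indicator for a single occurrence of a word. The count arrays and hyperparameter names (doc_topic_counts, topic_word_counts, topic_counts, alpha, gamma) are assumptions for illustration; the two terms that get multiplied and renormalized correspond to how much the document likes each topic and how much each topic likes the word.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_word_topic(d, w, old_k, doc_topic_counts, topic_word_counts,
                        topic_counts, alpha, gamma):
    """Collapsed Gibbs step: resample the topic of one occurrence of word w in doc d.

    doc_topic_counts  : D x K counts of word assignments per document
    topic_word_counts : K x V counts of each word assigned to each topic (corpus-wide)
    topic_counts      : length-K totals of words assigned to each topic
    """
    V = topic_word_counts.shape[1]
    # Remove this word's current assignment from the counts.
    doc_topic_counts[d, old_k] -= 1
    topic_word_counts[old_k, w] -= 1
    topic_counts[old_k] -= 1
    # Term 1: how much document d likes each topic (smoothed local counts).
    doc_term = doc_topic_counts[d] + alpha
    # Term 2: how much each topic likes word w (smoothed corpus-wide counts).
    word_term = (topic_word_counts[:, w] + gamma) / (topic_counts + V * gamma)
    # Multiply, renormalize across all possible topic assignments, and sample.
    p = doc_term * word_term
    new_k = rng.choice(len(p), p=p / p.sum())
    # Add the word back in under its newly sampled topic.
    doc_topic_counts[d, new_k] += 1
    topic_word_counts[new_k, w] += 1
    topic_counts[new_k] += 1
    return new_k
```

Cycling this step over every word in every document gives one pass of the collapsed sampler.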
And having gone through this course, you now have a really, really extensive set of tools to go out and tackle quite different problems than the ones we saw in the regression and classification courses.

[MUSIC]