Hey. You know the basic topic model, which is called PLSA, and now you know how to train it. So what are some other topic models out there, and what other applications can we solve with topic modeling?

I want to start with a nice application: the diary of Martha Ballard. This is a big diary; she kept it for 27 years, which is why it is rather complicated for people to read and analyze it. So researchers decided to apply topic modeling to it and see what topics are revealed in this diary. Here are some examples of the topics, shown as their top most probable words. Remember that you have your Phi matrix, which holds the probabilities of words in topics; these are exactly the words with the highest probabilities. You can see that the topics are rather intuitively interpretable: there is something about gardens, potatoes, and work in these gardens, and there is something about shopping, like sugar or flour. So you can look through these top words and name the topics, and that's nice.

What's nicer, you can look at how these topics change over time. For example, the gardening topic is very popular in her diary during summer and not very popular during winter, which makes perfect sense, right? Another topic, about emotions, has high probability during those periods of her life when emotional events happened. For example, one moment of high probability corresponds to the time when her husband was put into prison, somebody else died, and other things happened. So the historians can say, "OK, this is interpretable; we understand why this topic has high probability there."

Now, to be flexible and apply topics in many applications, we need to do a little bit more math. First, here is the model called Latent Dirichlet Allocation, and I guess this is the most popular topic model ever. It was proposed in 2003 by David Blei, and nowadays nearly every paper about topic models cites this work.
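To make the Phi matrix concrete, here is a tiny Python sketch of reading off the top most probable words of each topic from a words-by-topics matrix. The vocabulary and the random numbers are made up purely for illustration; they are not Martha Ballard's actual topics.

import numpy as np

# Phi: |V| x |T| matrix, Phi[w, t] = p(word w | topic t); each column sums to 1.
rng = np.random.default_rng(0)
vocab = ["garden", "potato", "dig", "sugar", "flour", "shop"]
phi = rng.random((len(vocab), 3))
phi /= phi.sum(axis=0, keepdims=True)  # normalize every topic's column

def top_words(phi, vocab, topic, n=3):
    """Return the n most probable words of one topic with their probabilities."""
    order = np.argsort(phi[:, topic])[::-1][:n]
    return [(vocab[w], round(float(phi[w, topic]), 3)) for w in order]

for t in range(phi.shape[1]):
    print("topic", t, top_words(phi, vocab, t))

Naming a topic by eye, as in the diary example, is exactly this: look at the few words that dominate a column of Phi.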
But, you know, this is not very different from the PLSA model. All it says is, "OK, we will still have the Phi and Theta parameters, but we are going to put Dirichlet priors on them." The Dirichlet distribution has a rather ugly form, and you do not need to memorize it; you can always Google it. The important thing here is that we say our parameters are not just fixed values: they have some distribution. That is why, as the output of our model, we are also going to get a distribution over parameters. So, not just two matrices of values, but a distribution over them. This is called the posterior distribution, and it will also be a Dirichlet distribution, just with other hyperparameters.

In another course of our specialization, the one devoted to Bayesian methods, you can learn many ways to estimate and train this model, so here I will just name a few. One way is Variational Bayes; another is Gibbs Sampling. All of them involve a lot of complicated math, so we are not going into those details right now. Instead, I am just going to show you the main path for developing new topic models. Usually people use probabilistic graphical models and Bayesian inference to propose new topic models, and they say, "OK, we will have more parameters, we will have more priors, and they will be connected in this or that way." So people draw these nice pictures of what happens in the models. Again, let us not go into the math details, but instead look at how these models can be applied.

One extension of the LDA model is the hierarchical topic model. You can imagine that you want your topics to form a hierarchy; for example, the topic about speech recognition would be a subtopic of the topic about algorithms. You can see that the root topic contains very general vocabulary, and this is actually not surprising: unfortunately, general vocabulary is always something we see with high probabilities, especially in root topics.
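If you want to try LDA in practice without touching that math, here is a minimal sketch using Gensim, one of the libraries mentioned at the end of this video. The toy documents and all parameter values are made up for illustration.

from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["garden", "potato", "dig", "garden"],
    ["sugar", "flour", "shop", "buy"],
    ["garden", "flour", "potato"],
]

dictionary = corpora.Dictionary(docs)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words counts

# alpha and eta are the Dirichlet priors over Theta and Phi; "auto" lets
# Gensim learn them from the data instead of fixing them by hand.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, alpha="auto", eta="auto", random_state=0)

for topic_id, words in lda.show_topics(num_topics=2, num_words=4, formatted=False):
    print(topic_id, [(w, round(p, 3)) for w, p in words])

The output is again a set of topics described by their most probable words, just like on the slides, only now estimated with Dirichlet priors in the model.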
In some models you can try to distill your topics: you can say, well, maybe we should have separate topics for the stop words, because we do not want to see them in our main topics. So we can play with that as well.

Another important extension of topic models is dynamic topic models. These are models that say topics can evolve over time: you have some keywords for a topic in one year, and they change in another year. Or you can see how the probabilities of the topics change. For example, you have some news flow, and you know that a topic about bank-related stuff is super popular this month but not that popular later.

One more extension is multilingual topic models. A topic is something that does not really depend on the language, because mathematics exists everywhere, right? We just express it with different terms in English, in Italian, in Russian, and in any other language. This model captures that intuition: we have topics that are the same for every language but are expressed with different terms. You usually train this model on parallel data, so you have two Wikipedia articles about the same topic, or better to say about the same particular concept, and you know that the topics of these articles should be similar, just expressed with different terms, and that's okay.

So, we have covered some extensions of topic models, and believe me, there are many more in the literature. One natural question you might have now is whether there is a way to combine all those requirements in one topic model. There can be different approaches here, and one approach that we develop here in our NLP lab is called Additive Regularization of Topic Models. The idea is super simple: we have the likelihood of the PLSA model, and now we add some extra regularizers to that likelihood with some coefficients. All we need is to formalize our requirements as regularizers and then tune those tau coefficients to say, for example, that we need a better hierarchy rather than better dynamics in the model. The objective is sketched right below.
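In symbols, the additive regularization objective described above can be written roughly as follows, where the R_i are the regularizers that formalize your requirements and the tau_i are their weights:

\sum_{d \in D} \sum_{w \in d} n_{dw} \,\ln \sum_{t \in T} \phi_{wt}\,\theta_{td}
\;+\; \sum_{i} \tau_i \, R_i(\Phi, \Theta) \;\longrightarrow\; \max_{\Phi,\,\Theta}

The first term is exactly the PLSA log-likelihood you already know; everything new about a particular model sits in the choice of the R_i and their tau_i.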
Just to give one example of what those regularizers can look like: imagine that we want the topics in our model to be diverse, so it would be great to have topics that are as different as possible. To do this, we can try to maximize the negative pairwise correlations between the topics, and this is exactly what is written in the bottom formula on the slide: you take pairs of topics and try to make them as different as possible.

Now, how can you train this model? Well, you can still use the EM algorithm. The E-step stays exactly the same as it was for the PLSA topic model. The M-step changes, but only slightly: the only new part, shown in green on the slide, is the derivatives of the regularizers with respect to your parameters. You need to add these terms to get the parameter estimates in the M-step. This is pretty straightforward: you formalize your criteria, take the derivatives, and build them into your model; sketches of these formulas are given after this part.

Now, one more example. In many applications we need to model not only the words in the texts but also some additional modalities. What I mean is metadata: users, maybe the authors of the papers, time stamps, categories, and many other things that go with the documents but are not just words. Can we somehow build them into our model? We can use exactly the same intuition: instead of one likelihood, let us have a weighted sum of likelihoods, one likelihood for every modality, weighted with modality coefficients. What do we have for every modality? Actually, a different vocabulary. We treat the tokens of the authors modality as a separate vocabulary, so every topic will now be not only a distribution over words but a distribution over authors as well. And if we have five modalities, every topic will be represented by five distinct distributions. One cool thing about multimodal topic models is that you represent all kinds of entities in this hidden space of topics.
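Here is a sketch of the formulas just mentioned, written the way the ARTM papers usually write them; the notation on the slides may differ slightly. The decorrelation regularizer penalizes pairwise similarity between topic columns of Phi:

R(\Phi) \;=\; -\frac{\tau}{2} \sum_{t \in T} \sum_{s \in T \setminus \{t\}} \sum_{w \in W} \phi_{wt}\,\phi_{ws}

The regularized M-step keeps the PLSA counts and only adds the derivative terms, with (x)_+ = max(x, 0):

\phi_{wt} \;\propto\; \Bigl( n_{wt} + \phi_{wt}\,\frac{\partial R}{\partial \phi_{wt}} \Bigr)_{+},
\qquad
\theta_{td} \;\propto\; \Bigl( n_{td} + \theta_{td}\,\frac{\partial R}{\partial \theta_{td}} \Bigr)_{+}

And the multimodal objective is a weighted sum of per-modality likelihoods, one term for every vocabulary W_m with its own weight lambda_m:

\sum_{m} \lambda_m \sum_{d \in D} \sum_{w \in W_m} n_{dw}\,\ln \sum_{t \in T} \phi_{wt}\,\theta_{td}
\;+\; \sum_{i} \tau_i \, R_i(\Phi, \Theta) \;\longrightarrow\; \max_{\Phi,\,\Theta}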
So this is a way to unify all the information in your model. For example, you can find the most probable topics for words and the most probable topics for time stamps, let's say, and then you can compare some time stamps and words and ask, "What are the most similar words for this day?" Here is an example that does exactly this. We had a corpus with time stamps for the documents, and we modeled topics both for words and for time stamps, and we found that the closest words for the time stamp corresponding to the Oscars date are "Oscar", "Birdman", and some other words that are really related to that date. So, once again, this is a way to embed all your different modalities into one space and build similarities between them.

OK. Now, what should you do if you want to build your own topic models? Well, you probably need some libraries. The BigARTM library is the implementation of the last approach that I mentioned; a short usage sketch is given at the end of this video. Gensim and MALLET also implement LDA topic models: Gensim is built for Python and MALLET for Java. And Vowpal Wabbit has an implementation of the online LDA topic model that is known to be super fast, so maybe it is also a good idea to check it out.

Finally, just a few words about visualizing topic models. You will never read through large collections by hand, and it is not so easy to represent the output of your model, those probability distributions, in a way people can understand. This slide shows one example of how to visualize the Phi matrix. We have a words-by-topics matrix here, and the words that correspond to each topic are grouped together, so we can see that this blue topic is about these terms, another one is about social networks, and so on. But visualization of topic models is a whole world of its own: this website contains 380 ways to visualize your topic models. So I want to end this video by asking you to explore them for a few moments, and you will see that topic models can build very different and colorful representations of your data.
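To close the loop on the libraries, here is a rough BigARTM sketch that fits a model with the decorrelation regularizer from earlier in this video. Class and parameter names follow the library's public tutorials, but treat them as approximate; the input path, tau value, and pass counts are placeholders you would replace with your own.

import artm

# Convert a bag-of-words collection (UCI format here) into BigARTM batches.
batch_vectorizer = artm.BatchVectorizer(data_path='docword.kos.txt',
                                        data_format='bow_uci',
                                        collection_name='kos',
                                        target_folder='kos_batches')

model = artm.ARTM(num_topics=15, dictionary=batch_vectorizer.dictionary)

# Decorrelation regularizer; tau is the weight you tune by hand.
model.regularizers.add(artm.DecorrelatorPhiRegularizer(name='decorrelator_phi',
                                                       tau=1.5e5))

# Track the top tokens of every topic during training.
model.scores.add(artm.TopTokensScore(name='top_tokens', num_tokens=8))

model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=15)

for topic_name in model.topic_names:
    print(topic_name, model.score_tracker['top_tokens'].last_tokens[topic_name])

From here you can add more regularizers or more modalities in the same way, and then feed the resulting topics into whichever visualization you like.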