Hey. You know the basic topic model, which is called PLSA, and now you know how to train it. So what are some other topic models out there, and what other applications can we solve with topic modeling?

I want to start with a nice application: the diary of Martha Ballard. This is a big diary; she kept it for 27 years, which is why it is rather complicated for people to read and analyze it. So researchers decided to apply topic modeling to it and see what topics are revealed in this diary. Here are some examples of the topics, shown as their top most probable words. Remember that you have your Phi matrix, which holds the probabilities of words in topics; these are exactly the words with the highest probabilities. You can see that the topics are rather intuitively interpretable: there is something about gardens, potatoes, and work in these gardens, and there is something about shopping, like sugar or flour. So you can look through these top words and name the topics, and that's nice.

What's nicer, you can look at how these topics change over time. For example, the gardening topic is very popular in her diary during summer and not very popular during winter, which makes perfect sense, right? Another topic, about emotions, has high probability during those periods of her life when emotional events happened. For example, one moment of high probability corresponds to the time when her husband was put into prison, somebody else died, and other things happened. So the historians can say, "OK, this is interpretable; we understand why this topic has high probability there."

Now, to be flexible and apply topics in many applications, we need to do a little bit more math. First, here is the model called Latent Dirichlet Allocation, and I guess this is the most popular topic model ever. It was proposed in 2003 by David Blei, and nowadays nearly every paper about topic models cites this work.
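To make the Phi matrix concrete, here is a tiny Python sketch of reading off the top most probable words of each topic from a words-by-topics matrix. The vocabulary and the random numbers are made up purely for illustration; they are not Martha Ballard's actual topics.

import numpy as np

# Phi: |V| x |T| matrix, Phi[w, t] = p(word w | topic t); each column sums to 1.
rng = np.random.default_rng(0)
vocab = ["garden", "potato", "dig", "sugar", "flour", "shop"]
phi = rng.random((len(vocab), 3))
phi /= phi.sum(axis=0, keepdims=True)  # normalize every topic's column

def top_words(phi, vocab, topic, n=3):
    """Return the n most probable words of one topic with their probabilities."""
    order = np.argsort(phi[:, topic])[::-1][:n]
    return [(vocab[w], round(float(phi[w, topic]), 3)) for w in order]

for t in range(phi.shape[1]):
    print("topic", t, top_words(phi, vocab, t))

Naming a topic by eye, as in the diary example, is exactly this: look at the few words that dominate a column of Phi.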
But, you know, this is not very different from the PLSA model. All it says is, "OK, we will still have the Phi and Theta parameters, but we are going to put Dirichlet priors on them." The Dirichlet distribution has a rather ugly form, and you do not need to memorize it; you can always Google it. The important thing here is that we say our parameters are not just fixed values: they have some distribution. That is why, as the output of our model, we are also going to get a distribution over parameters. So, not just two matrices of values, but a distribution over them. This is called the posterior distribution, and it will also be a Dirichlet distribution, just with other hyperparameters.

In another course of our specialization, the one devoted to Bayesian methods, you can learn many ways to estimate and train this model, so here I will just name a few. One way is Variational Bayes; another is Gibbs Sampling. All of them involve a lot of complicated math, so we are not going into those details right now. Instead, I am just going to show you the main path for developing new topic models. Usually people use probabilistic graphical models and Bayesian inference to propose new topic models, and they say, "OK, we will have more parameters, we will have more priors, and they will be connected in this or that way." So people draw these nice pictures of what happens in the models. Again, let us not go into the math details, but instead look at how these models can be applied.

One extension of the LDA model is the hierarchical topic model. You can imagine that you want your topics to form a hierarchy; for example, the topic about speech recognition would be a subtopic of the topic about algorithms. You can see that the root topic contains very general vocabulary, and this is actually not surprising: unfortunately, general vocabulary is always something we see with high probabilities, especially in root topics.
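If you want to try LDA in practice without touching that math, here is a minimal sketch using Gensim, one of the libraries mentioned at the end of this video. The toy documents and all parameter values are made up for illustration.

from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["garden", "potato", "dig", "garden"],
    ["sugar", "flour", "shop", "buy"],
    ["garden", "flour", "potato"],
]

dictionary = corpora.Dictionary(docs)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words counts

# alpha and eta are the Dirichlet priors over Theta and Phi; "auto" lets
# Gensim learn them from the data instead of fixing them by hand.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, alpha="auto", eta="auto", random_state=0)

for topic_id, words in lda.show_topics(num_topics=2, num_words=4, formatted=False):
    print(topic_id, [(w, round(p, 3)) for w, p in words])

The output is again a set of topics described by their most probable words, just like on the slides, only now estimated with Dirichlet priors in the model.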
In some models you can try to distill your topics: you can say, well, maybe we should have separate topics for the stop words, because we do not want to see them in our main topics. So we can play with that as well.

Another important extension of topic models is dynamic topic models. These are models that say topics can evolve over time: you have some keywords for a topic in one year, and they change in another year. Or you can see how the probabilities of the topics change. For example, you have some news flow, and you know that a topic about bank-related stuff is super popular this month but not that popular later.

One more extension is multilingual topic models. A topic is something that does not really depend on the language, because mathematics exists everywhere, right? We just express it with different terms in English, in Italian, in Russian, and in any other language. This model captures that intuition: we have topics that are the same for every language but are expressed with different terms. You usually train this model on parallel data, so you have two Wikipedia articles about the same topic, or better to say about the same particular concept, and you know that the topics of these articles should be similar, just expressed with different terms, and that's okay.

So, we have covered some extensions of topic models, and believe me, there are many more in the literature. One natural question you might have now is whether there is a way to combine all those requirements in one topic model. There can be different approaches here, and one approach that we develop here in our NLP lab is called Additive Regularization of Topic Models. The idea is super simple: we have the likelihood of the PLSA model, and now we add some extra regularizers to that likelihood with some coefficients. All we need is to formalize our requirements as regularizers and then tune those tau coefficients to say, for example, that we need a better hierarchy rather than better dynamics in the model. The objective is sketched right below.
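In symbols, the additive regularization objective described above can be written roughly as follows, where the R_i are the regularizers that formalize your requirements and the tau_i are their weights:

\sum_{d \in D} \sum_{w \in d} n_{dw} \,\ln \sum_{t \in T} \phi_{wt}\,\theta_{td}
\;+\; \sum_{i} \tau_i \, R_i(\Phi, \Theta) \;\longrightarrow\; \max_{\Phi,\,\Theta}

The first term is exactly the PLSA log-likelihood you already know; everything new about a particular model sits in the choice of the R_i and their tau_i.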
Just to give one example of what those regularizers can look like: imagine that we want the topics in our model to be diverse, so it would be great to have topics that are as different as possible. To do this, we can try to maximize the negative pairwise correlations between the topics, and this is exactly what is written in the bottom formula on the slide: you take pairs of topics and try to make them as different as possible.

Now, how can you train this model? Well, you can still use the EM algorithm. The E-step stays exactly the same as it was for the PLSA topic model. The M-step changes, but only slightly: the only new part, shown in green on the slide, is the derivatives of the regularizers with respect to your parameters. You need to add these terms to get the parameter estimates in the M-step. This is pretty straightforward: you formalize your criteria, take the derivatives, and build them into your model; sketches of these formulas are given after this part.

Now, one more example. In many applications we need to model not only the words in the texts but also some additional modalities. What I mean is metadata: users, maybe the authors of the papers, time stamps, categories, and many other things that go with the documents but are not just words. Can we somehow build them into our model? We can use exactly the same intuition: instead of one likelihood, let us have a weighted sum of likelihoods, one likelihood for every modality, weighted with modality coefficients. What do we have for every modality? Actually, a different vocabulary. We treat the tokens of the authors modality as a separate vocabulary, so every topic will now be not only a distribution over words but a distribution over authors as well. And if we have five modalities, every topic will be represented by five distinct distributions. One cool thing about multimodal topic models is that you represent all kinds of entities in this hidden space of topics.
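Here is a sketch of the formulas just mentioned, written the way the ARTM papers usually write them; the notation on the slides may differ slightly. The decorrelation regularizer penalizes pairwise similarity between topic columns of Phi:

R(\Phi) \;=\; -\frac{\tau}{2} \sum_{t \in T} \sum_{s \in T \setminus \{t\}} \sum_{w \in W} \phi_{wt}\,\phi_{ws}

The regularized M-step keeps the PLSA counts and only adds the derivative terms, with (x)_+ = max(x, 0):

\phi_{wt} \;\propto\; \Bigl( n_{wt} + \phi_{wt}\,\frac{\partial R}{\partial \phi_{wt}} \Bigr)_{+},
\qquad
\theta_{td} \;\propto\; \Bigl( n_{td} + \theta_{td}\,\frac{\partial R}{\partial \theta_{td}} \Bigr)_{+}

And the multimodal objective is a weighted sum of per-modality likelihoods, one term for every vocabulary W_m with its own weight lambda_m:

\sum_{m} \lambda_m \sum_{d \in D} \sum_{w \in W_m} n_{dw}\,\ln \sum_{t \in T} \phi_{wt}\,\theta_{td}
\;+\; \sum_{i} \tau_i \, R_i(\Phi, \Theta) \;\longrightarrow\; \max_{\Phi,\,\Theta}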
So this is a way to unify all the information in your model. For example, you can find the most probable topics for words and the most probable topics for time stamps, let's say, and then you can compare some time stamps and words and ask, "What are the most similar words for this day?" Here is an example that does exactly this. We had a corpus with time stamps for the documents, and we modeled topics both for words and for time stamps, and we found that the closest words for the time stamp corresponding to the Oscars date are "Oscar", "Birdman", and some other words that are really related to that date. So, once again, this is a way to embed all your different modalities into one space and build similarities between them.

OK. Now, what should you do if you want to build your own topic models? Well, you probably need some libraries. The BigARTM library is the implementation of the last approach that I mentioned; a short usage sketch is given at the end of this video. Gensim and MALLET also implement LDA topic models: Gensim is built for Python and MALLET for Java. And Vowpal Wabbit has an implementation of the online LDA topic model that is known to be super fast, so maybe it is also a good idea to check it out.

Finally, just a few words about visualizing topic models. You will never read through large collections by hand, and it is not so easy to represent the output of your model, those probability distributions, in a way people can understand. This slide shows one example of how to visualize the Phi matrix. We have a words-by-topics matrix here, and the words that correspond to each topic are grouped together, so we can see that this blue topic is about these terms, another one is about social networks, and so on. But visualization of topic models is a whole world of its own: this website contains 380 ways to visualize your topic models. So I want to end this video by asking you to explore them for a few moments, and you will see that topic models can build very different and colorful representations of your data.
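To close the loop on the libraries, here is a rough BigARTM sketch that fits a model with the decorrelation regularizer from earlier in this video. Class and parameter names follow the library's public tutorials, but treat them as approximate; the input path, tau value, and pass counts are placeholders you would replace with your own.

import artm

# Convert a bag-of-words collection (UCI format here) into BigARTM batches.
batch_vectorizer = artm.BatchVectorizer(data_path='docword.kos.txt',
                                        data_format='bow_uci',
                                        collection_name='kos',
                                        target_folder='kos_batches')

model = artm.ARTM(num_topics=15, dictionary=batch_vectorizer.dictionary)

# Decorrelation regularizer; tau is the weight you tune by hand.
model.regularizers.add(artm.DecorrelatorPhiRegularizer(name='decorrelator_phi',
                                                       tau=1.5e5))

# Track the top tokens of every topic during training.
model.scores.add(artm.TopTokensScore(name='top_tokens', num_tokens=8))

model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=15)

for topic_name in model.topic_names:
    print(topic_name, model.score_tracker['top_tokens'].last_tokens[topic_name])

From here you can add more regularizers or more modalities in the same way, and then feed the resulting topics into whichever visualization you like.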