Hi everyone. This week we have explored a lot of ways to build vector representations for words or for some pieces of text. This lesson is about topic modeling. Topic modeling is an alternative way to build vector representations for your document collections.

So, let us start with a brief introduction to the task. You are given a text collection and you want to build some hidden representation. You want to say: okay, there are some topics here, and every document is described by the topics that are discussed in it.

Now, what is a topic? Well, you can imagine that you can describe a topic with some words. For example, a topic such as weather is described by sky, rain, sun, and so on, and a topic such as mathematics is described by some mathematical terms, and probably the two do not even overlap at all.

So, you can think about it as soft biclustering. Why soft biclustering? First, it is biclustering because you cluster both words and documents. Second, it is soft because, as you will see, we are going to build probability distributions that softly assign words and documents to topics.

This is the formal way of saying the same thing. You are given a text collection, so you are given the counts of how many times every word occurs in every document. What you need to find is two kinds of probability distributions: first, the probability distribution over words for each topic, and second, the probability distribution over topics for each document. Importantly, this is just the definition of a topic, so you should not think that a topic is something complicated, as it is in real life or as linguists might describe it. For us, throughout this lesson, a topic is just a probability distribution. That's it.
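The slide with the formal statement is not reproduced in the transcript, so here is a sketch of it in the usual topic-modeling notation; the symbols n_dw, phi and theta are my own naming choices, picked to match the Phi and Theta matrices mentioned later in the lesson.

```latex
% Given: documents d \in D, a vocabulary of words w \in W, topics t \in T,
% and counts n_{dw} = how many times word w occurs in document d.
% Find two families of probability distributions:
\varphi_{wt} = p(w \mid t), \qquad \sum_{w \in W} \varphi_{wt} = 1 \quad \text{(words for each topic)}
\theta_{td}  = p(t \mid d), \qquad \sum_{t \in T} \theta_{td}  = 1 \quad \text{(topics for each document)}
```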
Where do we need this kind of model in real life? Well, actually everywhere, because everywhere you have big collections of documents. It can be news flows or social media messages, or some data from your own domain, for example research papers that you do not want to read, but about which you want to know that there are papers about this and that, and that they are connected in a certain way. So, you want a nice overview of the area to be built automatically, and topic models can do it for you.

Some other applications would be social network analysis or even dialogue systems, because you can imagine generating some text. You know how to generate text, right, from the previous week, but now you can make this text generation depend on the topics that you want to mention. There are many, many other applications, for example aggregation of news flows, when you have some news about politics, say, and you want to say that this topic is becoming popular nowadays. One other important application that I want to mention is exploratory search, which means that you want to say: here is a document that I am interested in, could you please find some similar documents and tell me how they are interconnected?

Now, let us do some math, so let us discuss probabilistic latent semantic analysis, PLSA. This is a topic model proposed by Thomas Hofmann in 1999. It is a very basic model that tries to predict words in documents, and it does so with a mixture of topics:

p(w | d) = Σ_t p(w | t, d) p(t | d) = Σ_t p(w | t) p(t | d).

So, do you understand what happens in the first equality of this formula? Well, this is the law of total probability. If you just do not care about the document d in the formula for now, you can notice that this is the law of total probability applied here. Just take a moment to understand this. Now, what about the second equality? Well, this one is not automatically correct; it is just our assumption. So, just for simplicity, we assume that the probability of a word given the topic does not depend on the document anymore. This is a conditional independence assumption. And this is all that we need to introduce the PLSA model.

Now I just want to give you some intuition about how that works. This is a generative story: a story of how the data is generated by our model. You have some probability distribution over topics for the document, and first you decide what the topic of the next word will be. Then, once you have decided on that, you can draw a certain word from the probability distribution for this topic. So, this model assumes that the text is generated not by authors, not by handwriting, but by some probabilistic procedure. First we toss a coin and decide which topic will come next, then we toss a coin again and decide what the exact word will be, and we go on like this through the whole text.
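To make this generative story concrete, here is a minimal Python sketch; the two topics, the tiny vocabulary, and all of the probabilities are invented purely for illustration, and this is only the assumed generation process, not a training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parameters, made up purely for illustration.
# phi[t] is p(w | t): the distribution over words for topic t.
phi = {
    "weather": {"sky": 0.4, "rain": 0.3, "sun": 0.3, "integral": 0.0},
    "math":    {"sky": 0.0, "rain": 0.0, "sun": 0.0, "integral": 1.0},
}
# theta is p(t | d): the distribution over topics for one document d.
theta = {"weather": 0.7, "math": 0.3}
topics = list(theta)

def generate_document(n_words):
    """Generate n_words tokens: toss a coin for the topic, then a coin for the word."""
    words = []
    for _ in range(n_words):
        t = rng.choice(topics, p=[theta[k] for k in topics])   # pick a topic from theta
        vocab = list(phi[t])
        w = rng.choice(vocab, p=[phi[t][v] for v in vocab])    # pick a word from phi[t]
        words.append(str(w))
    return words

print(generate_document(10))
```

Running it just produces a bag of words whose topical mix follows theta, which is exactly the "toss a coin for the topic, then for the word" procedure described above.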
Well, this is just one way to think about it. If you do not feel very comfortable with this way, I will give you another way of thinking about it. This is a matrix way of thinking about the same model. You can imagine that you have some data, which is just word-document co-occurrences: you know how many times each word occurs in each document. That is why you can compute distributions; you can compute the probabilities of words in documents. You just normalize those counts, and that's it.

Now you need to factorize this real matrix into two matrices of your parameters, Phi and Theta. One matrix, the Phi matrix, contains probability distributions over words, and the Theta matrix contains probability distributions over topics. In fact, every column in these matrices is a probability distribution. So, this is just a matrix form of the same formula at the top of the slide, and you can see that it holds for one element and therefore, obviously, for any element.

So, this is the introduction of the model, and in the next video we will figure out how to train it. So stay with me.
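If you want to see the matrix view from this lesson in code before the next video, here is a minimal numpy sketch; the matrix sizes are arbitrary and Phi and Theta are filled with random column-stochastic values as placeholders, since fitting them to the data is exactly what the next video covers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_topics, n_docs = 1000, 5, 100

# Toy word-document counts n_dw; in reality these come from your collection.
counts = rng.integers(0, 5, size=(n_words, n_docs))

# Normalizing each column gives the empirical p(w | d) for every document.
F = counts / counts.sum(axis=0, keepdims=True)

# Random placeholders for the parameters (training them is the next video):
# Phi   (words  x topics): every column is p(w | t), so each column sums to 1.
# Theta (topics x docs):   every column is p(t | d), so each column sums to 1.
Phi = rng.random((n_words, n_topics))
Phi /= Phi.sum(axis=0, keepdims=True)
Theta = rng.random((n_topics, n_docs))
Theta /= Theta.sum(axis=0, keepdims=True)

# PLSA approximates F by the product Phi @ Theta; element-wise this is
# p(w | d) = sum_t p(w | t) * p(t | d).
model = Phi @ Theta

print(F.shape, model.shape)                 # both (n_words, n_docs)
print(np.allclose(model.sum(axis=0), 1.0))  # columns of the model are distributions too
```

The only point of the sketch is the shape bookkeeping: Phi @ Theta has the same shape as the normalized count matrix, and its columns are again probability distributions over words.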