Hey. Let us understand how to train a PLSA model. Just to recap, this is a topic model that predicts words in documents by a mixture of topics. The model has two kinds of parameters, two kinds of probability distributions: the phi parameters stand for the probabilities of words in topics, and the theta parameters stand for the probabilities of topics in documents. Now, you have your probabilistic model of the data, and you have your data. How do you train the model? How do you estimate the parameters? Likelihood maximization is something that always helps us. So the top line on this slide is the log-likelihood of our model, and we need to maximize it with respect to our parameters. Now, let us make a few modifications to this formula. First, we apply the logarithm, so we get a sum of logarithms instead of the logarithm of a product. Then, we simply drop the probabilities of the documents, because they do not depend on our parameters; our model does not even say how to model them, so we just forget about them. What we care about are the probabilities of words in documents, and we substitute them by the sum over topics, because this is exactly what our model says. Great, so that's it: we want to maximize this likelihood, and we need to remember the constraints. Our parameters are probabilities, so they need to be non-negative, and they need to be normalized.
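To keep the objective in front of us, here it is written out in the usual PLSA notation (this is a reconstruction from the description above, since the slide itself is not reproduced here; $n_{dw}$ is the number of times word $w$ occurs in document $d$, $\phi_{wt} = p(w \mid t)$, and $\theta_{td} = p(t \mid d)$):

$$
\sum_{d \in D} \sum_{w \in d} n_{dw}\,\log \sum_{t \in T} \phi_{wt}\,\theta_{td} \;\longrightarrow\; \max_{\Phi,\,\Theta},
\qquad
\phi_{wt} \ge 0,\;\; \sum_{w} \phi_{wt} = 1,
\qquad
\theta_{td} \ge 0,\;\; \sum_{t} \theta_{td} = 1.
$$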
Now, you can notice that the term we need to maximize is not very nice: we have a logarithm of a sum, and it is not at all clear how to maximize something like that. But fortunately, we have the EM-algorithm; you may have heard about this algorithm in another course of our Specialization. For now, I just want to approach this algorithm intuitively. So let us start with some data. We are going to train our model on plain text, and this is all we have. Now, let us remember the generative model: we assume that every word in this text comes from some single topic, the one that was picked when we decided what the next word would be. So let us pretend, just for a moment, just for one slide, that we know these topics. Let us pretend that we know that the words "sky", "raining", and "clear up" come from, say, topic number 22, and so on; so we know these assignments. How would you then calculate the probabilities of words in topics? Say you have four words assigned to this topic, and you want to calculate the probability of "sky". This is how you do it. You just say, "Well, I have one such word out of these four words, so the probability will be one divided by four." By n_wt here, I denote the count of how many times this particular word was assigned to this particular topic. Now, can you imagine how we would estimate the probabilities of topics in the document for this colorful case? Well, it is just the same: we know that we have four words belonging to the red topic, and we have 54 words in our document, and that is why we get this probability in the example. Well, unfortunately, life is not like this. We do not know these colorful topic assignments; what we have is just plain text, and that is the problem. But can we somehow estimate those assignments? Can we somehow estimate the probabilities of the colors for every word? Yes, we can: Bayes' rule helps us here. We say that we need the probabilities of topics for each word in each document, and we apply Bayes' rule and the product rule. To understand this, I advise you to forget about d in all these formulas for a moment, and then everything becomes very clear. So we just apply these two rules, and we get estimates for the probabilities of our hidden variables, the probabilities of topics. Now, it is time to put everything together. We have the EM-algorithm, which has two steps, the E-step and the M-step. The E-step is about estimating the probabilities of the hidden variables, and this is exactly what we have just discussed.
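Written out, this E-step estimate is (again in the PLSA notation assumed above):

$$
p(t \mid d, w) \;=\; \frac{p(w \mid t)\,p(t \mid d)}{p(w \mid d)} \;=\; \frac{\phi_{wt}\,\theta_{td}}{\sum_{s \in T} \phi_{ws}\,\theta_{sd}}.
$$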
The M-step is about the updates for the parameters. We have discussed it for the simple case, when we know the topic assignments exactly. Now we do not know them exactly, so it is a bit more complicated to compute the n_wt counts. It is no longer just how many times a word was assigned to a topic, but it is still doable: we take the words, we take their counts, and we weight them by the probabilities that we know from the E-step. That is how we get estimates for n_wt. So this is not an integer counter anymore; it is a float-valued quantity, but it still has the same meaning and the same intuition. The EM-algorithm is a super powerful technique, and it can be used any time you have a model, you have observable data, and you have some hidden variables. These are all the formulas that we need for now. The main thing to understand is that to build your topic model, you repeat the E-step and the M-step iteratively: you scan your data, you compute the probabilities of topics using your current parameters, then you update the parameters using your current probabilities of topics, and you repeat this again and again. This iterative process converges, and hopefully you will get a nicely trained topic model.
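If it helps to see the whole loop in one place, here is a minimal sketch of one EM iteration in Python/NumPy. It assumes the corpus is given as a dense document-word count matrix; the function name, array layout, and the toy corpus at the bottom are my own illustration choices, not anything taken from the course materials.

```python
import numpy as np

def plsa_em_step(n_dw, phi, theta):
    """One EM iteration of PLSA.

    n_dw  : (D, W) array, n_dw[d, w]  = count of word w in document d
    phi   : (W, T) array, phi[w, t]   = p(w | t), each column sums to 1
    theta : (T, D) array, theta[t, d] = p(t | d), each column sums to 1
    """
    # E-step: p(t | d, w) is proportional to phi[w, t] * theta[t, d].
    # A dense (D, W, T) array -- fine for a toy corpus, too big for a real one.
    p_tdw = phi[np.newaxis, :, :] * theta.T[:, np.newaxis, :]
    p_tdw /= p_tdw.sum(axis=2, keepdims=True) + 1e-12

    # M-step: soft counts, i.e. word counts weighted by the E-step probabilities.
    n_wt = np.einsum('dw,dwt->wt', n_dw, p_tdw)   # how much of word w belongs to topic t
    n_td = np.einsum('dw,dwt->td', n_dw, p_tdw)   # how much of document d belongs to topic t

    # Normalize the soft counts to get the updated probability estimates.
    phi_new = n_wt / (n_wt.sum(axis=0, keepdims=True) + 1e-12)
    theta_new = n_td / (n_td.sum(axis=0, keepdims=True) + 1e-12)
    return phi_new, theta_new


# Toy usage: random (normalized) initialization, then a few dozen iterations.
rng = np.random.default_rng(0)
n_dw = rng.integers(0, 5, size=(20, 100)).astype(float)   # 20 documents, 100-word vocabulary
phi = rng.random((100, 3));  phi /= phi.sum(axis=0)        # 3 topics
theta = rng.random((3, 20)); theta /= theta.sum(axis=0)
for _ in range(50):
    phi, theta = plsa_em_step(n_dw, phi, theta)
```

In practice you would keep iterating until the log-likelihood from the formula above stops improving; since EM only finds a local maximum, the result depends on the random initialization.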