In previous videos, we derived the formulas for the E-step updates. Now, let us move on to the M-step.

On the M-step, what we want to do is maximize the expected logarithm of the joint distribution. Let's write it down: the expected value, with respect to q of theta and q of Z, of the logarithm of the joint distribution p of theta, Z, and W. And we want to maximize it with respect to our matrix Phi.

Before we write down the exact value of the logarithm, let's see which terms are constant; we can cross out the terms that do not depend on Phi. Actually, there are many of them here. The first term does not depend on Phi, so it is constant, and in the remaining part the only term that depends on Phi is this one, so we can cross the other terms out too.

Alright, so let's write down this formula. It is an expectation with respect to q of theta and q of Z of this function: the sum over all documents, the sum over all words in the document, so n from one to N_d, the number of words in the document, and the sum over all topics, t from one to capital T, of the indicator that the corresponding topic occurs at this position.
And finally, the logarithm of Phi at index t and w_{dn}, plus some constant. We are trying to maximize this with respect to the matrix Phi, and we should also satisfy two constraints.

The first one is that the cells of Phi are probabilities, so they should be non-negative: Phi_{tw} should be non-negative for any topic and for any word in the vocabulary. And also, since each row is a probability distribution, it should sum to one along the w's: if we sum Phi_{tw} over w from one to the vocabulary size, it should be equal to one for any topic.

The first constraint is actually already satisfied, since we have Phi under the logarithm. So we only need to enforce the second constraint. To do this, let's use the Lagrangian. The Lagrangian here would be equal to the original function plus the sum over all constraints, each multiplied by its Lagrange multiplier.

So, the Lagrangian L would be equal to the expected value with respect to q of theta and q of Z of the sum over d from one to capital D, the sum over n from one to the number of words, and finally the sum over topics, of the indicator that z_{dn} equals t, times the logarithm of Phi at index t and w_{dn}, plus the Lagrange multiplier terms. Those would be a sum over all constraints.
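Written out in one place, the M-step objective and its two constraints (using the indicator notation of the lecture) are:

```latex
\max_{\Phi}\;
\mathbb{E}_{q(\theta)\,q(Z)} \log p(\theta, Z, W \mid \Phi)
=
\sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{t=1}^{T}
\mathbb{E}\bigl[\,[z_{dn} = t]\,\bigr]\,
\log \varphi_{t,\,w_{dn}}
+ \mathrm{const},
\qquad
\text{s.t. } \varphi_{tw} \ge 0,\;\;
\sum_{w=1}^{V} \varphi_{tw} = 1 \;\;\forall t,
```

where V denotes the vocabulary size and [·] is the indicator function.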
We have capital T of them, one constraint per topic: the multiplier lambda_t times its constraint, which is the sum over all words of Phi_{tw}, minus one.

Alright, so we can take the expectation inside the summation and get a somewhat shorter formula. It will be the sum over documents, the sum over words, the sum over topics. The expectation of the indicator that z_{dn} equals t is actually gamma_{dn} at position t, so we can write it down as gamma_{dn} with the index t at the top. This is multiplied by the logarithm of Phi at index t and w_{dn}, plus the constraint terms: lambda_t, for t from one to the number of topics, times the sum over words of Phi_{tw} minus one.

The usual way to handle such problems is to compute the derivative of the Lagrangian with respect to the variables. So, let's derive the partial derivative of the Lagrangian with respect to some Phi_{tw}. We can put the derivative under the summations, and since we differentiate with respect to Phi_{tw} for a fixed topic t, only the terms for that topic survive: we get the sum over d and the sum over n of gamma_{dn} at position t times one over Phi_{t, w_{dn}}. And also, we have to ensure that the word index matches the index with respect to which we take the derivative.
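Collecting the pieces described above, the Lagrangian takes the form:

```latex
\mathcal{L}(\Phi, \lambda)
=
\sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{t=1}^{T}
\gamma_{dn}^{t}\,\log \varphi_{t,\,w_{dn}}
+
\sum_{t=1}^{T} \lambda_t \Bigl( \sum_{w=1}^{V} \varphi_{tw} - 1 \Bigr).
```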
So, we have to multiply by the indicator that this word matches that word: the indicator that w_{dn} equals w. And finally, we have to compute the derivative with respect to the constraint terms; there we get only one surviving term, which is lambda_t, and that is it.

This whole expression should be equal to zero. From this, we can derive the value of Phi_{tw}: whenever w_{dn} equals w, the term Phi_{t, w_{dn}} is just Phi_{tw}. So, Phi_{tw} equals the following ratio: in the numerator, the sum over d and the sum over n of gamma_{dn}^t times the indicator that the word w occurs at position n in document d, and this is divided by minus lambda_t.

What we can do next is sum this equation over all possible values of w. On the left-hand side, we add the sum with respect to w, and from our constraint this sum equals one. Alright, let me write it down carefully.
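The derivative computation just described can be written compactly as:

```latex
\frac{\partial \mathcal{L}}{\partial \varphi_{tw}}
=
\sum_{d=1}^{D} \sum_{n=1}^{N_d}
\gamma_{dn}^{t}\,
\frac{[\,w_{dn} = w\,]}{\varphi_{tw}}
+ \lambda_t
= 0
\quad\Longrightarrow\quad
\varphi_{tw}
=
-\frac{1}{\lambda_t}
\sum_{d=1}^{D} \sum_{n=1}^{N_d}
\gamma_{dn}^{t}\,[\,w_{dn} = w\,].
```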
This means that the minus lambda_t here actually equals the numerator summed with respect to all values of w. So, minus lambda_t equals the sum, let me write down just the letters, over all words, all documents, and all positions, of gamma_{dn} at position t times the indicator that the word at the given position is the one we need.

Alright, now that we know the value of lambda, we can plug it into this formula and get the final result. That is, Phi_{tw} equals the sum over d and n of gamma_{dn}^t times the indicator that w_{dn} equals w, divided by the same thing summed over all possible words in our vocabulary: the sum over w', d, and n of gamma_{dn}^t times the indicator that w_{dn} equals w'. And here is the update formula for Phi.

So, let's see our algorithm again. On the E-step, we iteratively update theta and Z until convergence. And on the M-step, we update Phi using this formula. So, now we are ready to train our model. However, we should do one more thing. Besides training, we need to know how to predict new values.
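The final update, Phi_{tw} proportional to the sum over d and n of gamma_{dn}^t times the indicator that w_{dn} equals w, is just a weighted count followed by a normalization. A minimal sketch in NumPy, assuming the responsibilities gamma are stored per document as (N_d, T) arrays and the words as integer index arrays (the function name and data layout are illustrative, not from the lecture):

```python
import numpy as np

def m_step_phi(gamma, words, V):
    """M-step update for the topic-word matrix Phi.

    gamma : list of (N_d, T) arrays, gamma[d][n, t] = responsibility of
            topic t for the word at position n of document d
    words : list of length-N_d integer arrays with the word indices w_dn
    V     : vocabulary size
    Returns a (T, V) matrix whose rows sum to one (assumes every topic
    receives some mass, so no row of counts is all zeros).
    """
    T = gamma[0].shape[1]
    counts = np.zeros((T, V))
    for g_d, w_d in zip(gamma, words):
        # accumulate sum_n gamma_dn^t [w_dn = w]; ufunc.at handles
        # repeated word indices within a document correctly
        np.add.at(counts.T, w_d, g_d)
    # normalize each topic's row: divide by the sum over all words w'
    return counts / counts.sum(axis=1, keepdims=True)
```

For example, a single three-word document with word indices [0, 1, 0] and responsibilities [[1, 0], [0.5, 0.5], [0, 1]] gives each topic the unnormalized counts [1, 0.5, 0], which normalize to [2/3, 1/3, 0].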
For example, suppose you have a new document in your collection and you want to predict the values of Z. Those form the markup of the words: for each word we want to assign a topic. We also want theta, the global distribution of topics in this document. Let's do it here.

Alright, so to do this, we want to approximate the posterior probability p of theta_{d*}, where d* is the index of our new document, and Z_{d*}, the topic assignments for the new document, given our training data and also the matrix Phi that we found with the EM algorithm.

To approximate it, let's use the mean-field approximation again. So, we try to find it in the form q of theta_{d*} times q of Z_{d*}, and we minimize the KL divergence between this factorized approximation and the posterior, with respect to q of theta_{d*} and q of Z_{d*}.

So, here is our formula for prediction. We can do the mean-field approximation with the same formulas that we derived for the E-step. So, we know how to train the model and we also know how to predict from it. And in the next video, you will see some extensions.
That is, how we can modify the model so that it has some desired properties.
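The prediction step described above can be sketched in code. The exact E-step formulas come from the earlier videos; the sketch below uses the standard mean-field coordinate updates for LDA, which alternate between q of Z (the per-word responsibilities) and q of theta (a Dirichlet), with Phi held fixed. All names, the initialization, and the symmetric Dirichlet prior alpha are assumptions for illustration:

```python
import numpy as np
from scipy.special import digamma

def infer_new_document(words, phi, alpha, n_iter=50):
    """Mean-field inference for a new document d*, with Phi fixed.

    words : integer array of word indices w_{d*,n}
    phi   : (T, V) topic-word matrix learned by EM
    alpha : symmetric Dirichlet prior parameter for theta
    Returns (gamma, a): per-word topic responsibilities (N, T) and the
    Dirichlet parameters of q(theta_{d*}).
    """
    T = phi.shape[0]
    a = np.full(T, alpha) + len(words) / T  # uniform start for q(theta)
    for _ in range(n_iter):
        # E_q[log theta_t] under Dirichlet(a)
        e_log_theta = digamma(a) - digamma(a.sum())
        # q(z_n = t) proportional to phi[t, w_n] * exp(E_q[log theta_t])
        gamma = phi[:, words].T * np.exp(e_log_theta)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # q(theta) = Dirichlet(alpha + sum_n gamma_n)
        a = alpha + gamma.sum(axis=0)
    return gamma, a
```

With a two-topic Phi where topic 0 strongly prefers word 0, a new document dominated by word 0 gets responsibilities concentrated on topic 0, as expected.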