In previous videos, we derived the formulas for the E-step updates. Now, let us move on to the M-step.

On the M-step, what we want to do is maximize the expected logarithm of the joint distribution. Let's write it down: the expected value, with respect to q of theta and q of Z, of the logarithm of the joint distribution p of theta, Z, and W. And we want to maximize it with respect to our matrix Phi.

Before we write down the exact value of the logarithm, let's see which terms are constant; we can cross out the terms that do not depend on Phi. Actually, there are many of them here. The first term does not depend on Phi, so it is constant, and in the remaining part the only term that depends on Phi is this one, so we can cross the other terms out too.

Alright, so let's write down this formula. It is an expectation with respect to q of theta and q of Z of this function: the sum over all documents, the sum over all words in the document, so n from one to N_d, the number of words in the document, and the sum over all topics, t from one to capital T, of the indicator that the corresponding topic occurs at this position.
And finally, the logarithm of Phi at index t and w_{dn}, plus some constant. We are trying to maximize this with respect to the matrix Phi, and we should also satisfy two constraints.

The first one is that the cells of Phi are probabilities, so they should be non-negative: Phi_{tw} should be non-negative for any topic and for any word in the vocabulary. And also, since each row is a probability distribution, it should sum to one along the w's: if we sum Phi_{tw} over w from one to the vocabulary size, it should be equal to one for any topic.

The first constraint is actually already satisfied, since we have Phi under the logarithm. So we only need to enforce the second constraint. To do this, let's use the Lagrangian. The Lagrangian here would be equal to the original function plus the sum over all constraints, each multiplied by its Lagrange multiplier.

So, the Lagrangian L would be equal to the expected value with respect to q of theta and q of Z of the sum over d from one to capital D, the sum over n from one to the number of words, and finally the sum over topics, of the indicator that z_{dn} equals t, times the logarithm of Phi at index t and w_{dn}, plus the Lagrange multiplier terms. Those would be a sum over all constraints.
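Written out in one place, the M-step objective and its two constraints (using the indicator notation of the lecture) are:

```latex
\max_{\Phi}\;
\mathbb{E}_{q(\theta)\,q(Z)} \log p(\theta, Z, W \mid \Phi)
=
\sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{t=1}^{T}
\mathbb{E}\bigl[\,[z_{dn} = t]\,\bigr]\,
\log \varphi_{t,\,w_{dn}}
+ \mathrm{const},
\qquad
\text{s.t. } \varphi_{tw} \ge 0,\;\;
\sum_{w=1}^{V} \varphi_{tw} = 1 \;\;\forall t,
```

where V denotes the vocabulary size and [·] is the indicator function.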
We have capital T of them, one constraint per topic: the multiplier lambda_t times its constraint, which is the sum over all words of Phi_{tw}, minus one.

Alright, so we can take the expectation inside the summation and get a somewhat shorter formula. It will be the sum over documents, the sum over words, the sum over topics. The expectation of the indicator that z_{dn} equals t is actually gamma_{dn} at position t, so we can write it down as gamma_{dn} with the index t at the top. This is multiplied by the logarithm of Phi at index t and w_{dn}, plus the constraint terms: lambda_t, for t from one to the number of topics, times the sum over words of Phi_{tw} minus one.

The usual way to handle such problems is to compute the derivative of the Lagrangian with respect to the variables. So, let's derive the partial derivative of the Lagrangian with respect to some Phi_{tw}. We can put the derivative under the summations, and since we differentiate with respect to Phi_{tw} for a fixed topic t, only the terms for that topic survive: we get the sum over d and the sum over n of gamma_{dn} at position t times one over Phi_{t, w_{dn}}. And also, we have to ensure that the word index matches the index with respect to which we take the derivative.
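Collecting the pieces described above, the Lagrangian takes the form:

```latex
\mathcal{L}(\Phi, \lambda)
=
\sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{t=1}^{T}
\gamma_{dn}^{t}\,\log \varphi_{t,\,w_{dn}}
+
\sum_{t=1}^{T} \lambda_t \Bigl( \sum_{w=1}^{V} \varphi_{tw} - 1 \Bigr).
```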
So, we have to multiply by the indicator that this word matches that word: the indicator that w_{dn} equals w. And finally, we have to compute the derivative with respect to the constraint terms; there we get only one surviving term, which is lambda_t, and that is it.

This whole expression should be equal to zero. From this, we can derive the value of Phi_{tw}: whenever w_{dn} equals w, the term Phi_{t, w_{dn}} is just Phi_{tw}. So, Phi_{tw} equals the following ratio: in the numerator, the sum over d and the sum over n of gamma_{dn}^t times the indicator that the word w occurs at position n in document d, and this is divided by minus lambda_t.

What we can do next is sum this equation over all possible values of w. On the left-hand side, we add the sum with respect to w, and from our constraint this sum equals one. Alright, let me write it down carefully.
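The derivative computation just described can be written compactly as:

```latex
\frac{\partial \mathcal{L}}{\partial \varphi_{tw}}
=
\sum_{d=1}^{D} \sum_{n=1}^{N_d}
\gamma_{dn}^{t}\,
\frac{[\,w_{dn} = w\,]}{\varphi_{tw}}
+ \lambda_t
= 0
\quad\Longrightarrow\quad
\varphi_{tw}
=
-\frac{1}{\lambda_t}
\sum_{d=1}^{D} \sum_{n=1}^{N_d}
\gamma_{dn}^{t}\,[\,w_{dn} = w\,].
```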
This means that the minus lambda_t here actually equals the numerator summed with respect to all values of w. So, minus lambda_t equals the sum, let me write down just the letters, over all words, all documents, and all positions, of gamma_{dn} at position t times the indicator that the word at the given position is the one we need.

Alright, now that we know the value of lambda, we can plug it into this formula and get the final result. That is, Phi_{tw} equals the sum over d and n of gamma_{dn}^t times the indicator that w_{dn} equals w, divided by the same thing summed over all possible words in our vocabulary: the sum over w', d, and n of gamma_{dn}^t times the indicator that w_{dn} equals w'. And here is the update formula for Phi.

So, let's see our algorithm again. On the E-step, we iteratively update theta and Z until convergence. And on the M-step, we update Phi using this formula. So, now we are ready to train our model. However, we should do one more thing. Besides training, we need to know how to predict new values.
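The final update, Phi_{tw} proportional to the sum over d and n of gamma_{dn}^t times the indicator that w_{dn} equals w, is just a weighted count followed by a normalization. A minimal sketch in NumPy, assuming the responsibilities gamma are stored per document as (N_d, T) arrays and the words as integer index arrays (the function name and data layout are illustrative, not from the lecture):

```python
import numpy as np

def m_step_phi(gamma, words, V):
    """M-step update for the topic-word matrix Phi.

    gamma : list of (N_d, T) arrays, gamma[d][n, t] = responsibility of
            topic t for the word at position n of document d
    words : list of length-N_d integer arrays with the word indices w_dn
    V     : vocabulary size
    Returns a (T, V) matrix whose rows sum to one (assumes every topic
    receives some mass, so no row of counts is all zeros).
    """
    T = gamma[0].shape[1]
    counts = np.zeros((T, V))
    for g_d, w_d in zip(gamma, words):
        # accumulate sum_n gamma_dn^t [w_dn = w]; ufunc.at handles
        # repeated word indices within a document correctly
        np.add.at(counts.T, w_d, g_d)
    # normalize each topic's row: divide by the sum over all words w'
    return counts / counts.sum(axis=1, keepdims=True)
```

For example, a single three-word document with word indices [0, 1, 0] and responsibilities [[1, 0], [0.5, 0.5], [0, 1]] gives each topic the unnormalized counts [1, 0.5, 0], which normalize to [2/3, 1/3, 0].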
For example, suppose you have a new document in your collection and you want to predict the values of Z. Those form the markup of the words: for each word we want to assign a topic. We also want theta, the global distribution of topics in this document. Let's do it here.

Alright, so to do this, we want to approximate the posterior probability p of theta_{d*}, where d* is the index of our new document, and Z_{d*}, the topic assignments for the new document, given our training data and also the matrix Phi that we found with the EM algorithm.

To approximate it, let's use the mean-field approximation again. So, we try to find it in the form q of theta_{d*} times q of Z_{d*}, and we minimize the KL divergence between this factorized approximation and the posterior, with respect to q of theta_{d*} and q of Z_{d*}.

So, here is our formula for prediction. We can do the mean-field approximation with the same formulas that we derived for the E-step. So, we know how to train the model and we also know how to predict from it. And in the next video, you will see some extensions.
That is, how we can modify the model so that it has some desired properties.
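The prediction step described above can be sketched in code. The exact E-step formulas come from the earlier videos; the sketch below uses the standard mean-field coordinate updates for LDA, which alternate between q of Z (the per-word responsibilities) and q of theta (a Dirichlet), with Phi held fixed. All names, the initialization, and the symmetric Dirichlet prior alpha are assumptions for illustration:

```python
import numpy as np
from scipy.special import digamma

def infer_new_document(words, phi, alpha, n_iter=50):
    """Mean-field inference for a new document d*, with Phi fixed.

    words : integer array of word indices w_{d*,n}
    phi   : (T, V) topic-word matrix learned by EM
    alpha : symmetric Dirichlet prior parameter for theta
    Returns (gamma, a): per-word topic responsibilities (N, T) and the
    Dirichlet parameters of q(theta_{d*}).
    """
    T = phi.shape[0]
    a = np.full(T, alpha) + len(words) / T  # uniform start for q(theta)
    for _ in range(n_iter):
        # E_q[log theta_t] under Dirichlet(a)
        e_log_theta = digamma(a) - digamma(a.sum())
        # q(z_n = t) proportional to phi[t, w_n] * exp(E_q[log theta_t])
        gamma = phi[:, words].T * np.exp(e_log_theta)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # q(theta) = Dirichlet(alpha + sum_n gamma_n)
        a = alpha + gamma.sum(axis=0)
    return gamma, a
```

With a two-topic Phi where topic 0 strongly prefers word 0, a new document dominated by word 0 gets responsibilities concentrated on topic 0, as expected.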