1 00:00:00,000 --> 00:00:00,951 [MUSIC] 2 00:00:00,951 --> 00:00:08,916 In this video, we will derive the formulas for the model. 3 00:00:08,916 --> 00:00:13,419 That is, we want to train the model by finding the optimal values of phi. 4 00:00:13,419 --> 00:00:18,281 We'll do this by maximizing the likelihood. 5 00:00:18,281 --> 00:00:23,480 So we take the probability of our data given the matrix phi 6 00:00:23,480 --> 00:00:30,326 and maximize it with respect to phi. 7 00:00:30,326 --> 00:00:35,695 We'll apply the variational EM algorithm for this purpose. 8 00:00:35,695 --> 00:00:37,927 In variational EM we have two steps. 9 00:00:43,393 --> 00:00:47,383 First, we minimize the KL divergence 10 00:00:47,383 --> 00:00:51,066 between the variational distribution, and in 11 00:00:51,066 --> 00:00:56,190 this case we look for the solution in the following form: 12 00:00:56,190 --> 00:01:03,823 a distribution over theta times a distribution over z, 13 00:01:03,823 --> 00:01:09,040 so between this product and the probability of theta and 14 00:01:09,040 --> 00:01:15,205 z, their posterior probability given the data. 15 00:01:15,205 --> 00:01:19,081 We minimize it 16 00:01:19,081 --> 00:01:24,227 with respect to q of theta and q of z. 17 00:01:24,227 --> 00:01:29,496 All right, this is the E-step. On the M-step, 18 00:01:34,630 --> 00:01:40,339 we maximize the expected value 19 00:01:43,211 --> 00:01:45,550 of the logarithm 20 00:01:47,210 --> 00:01:52,367 of the joint probability of theta, z, and w, 21 00:01:56,921 --> 00:02:01,085 taken under q of theta and q of z, and we maximize it with respect to phi. 22 00:02:03,165 --> 00:02:09,588 So our plan for now is to derive the formulas for the E-step, 23 00:02:09,588 --> 00:02:14,796 that is, we want to find q of theta and q of z. 24 00:02:14,796 --> 00:02:16,787 Let's start with q of theta. 25 00:02:21,493 --> 00:02:24,504 All right, so let's start with theta.
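[Editor's note] The two steps described so far can be written compactly as follows. This is a sketch in notation assumed here, not taken from the lecture slides: \Theta, Z and W stand for theta, z and the observed words w, \Phi for the matrix phi, and the conditioning on \Phi is written out explicitly even though it is left implicit when spoken.

```latex
% E-step: fit a factorized variational distribution to the posterior
\min_{q(\Theta),\, q(Z)} \;
  \mathrm{KL}\!\left( q(\Theta)\, q(Z) \,\middle\|\, p(\Theta, Z \mid W, \Phi) \right)

% M-step: update the parameters \Phi with q held fixed
\max_{\Phi} \;
  \mathbb{E}_{q(\Theta)\, q(Z)} \, \log p(\Theta, Z, W \mid \Phi)
```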
26 00:02:24,504 --> 00:02:29,559 The formulas for theta are as follows. 27 00:02:29,559 --> 00:02:30,493 The log of q of theta 28 00:02:30,493 --> 00:02:37,221 (this is the formula from the mean-field approximation) 29 00:02:37,221 --> 00:02:44,048 equals the expected value with respect to all variables except 30 00:02:44,048 --> 00:02:49,051 theta, and the only variable that we have left is z, so 31 00:02:49,051 --> 00:02:54,659 with respect to q of z, of the logarithm of this distribution: 32 00:02:57,069 --> 00:03:02,882 log of p of theta, z given w, plus some constant. 33 00:03:06,853 --> 00:03:09,724 We actually don't know this term. 34 00:03:09,724 --> 00:03:14,996 However, we can rewrite it using Bayes' formula. 35 00:03:14,996 --> 00:03:20,894 It is equal to the ratio of the joint probability 36 00:03:20,894 --> 00:03:24,281 (actually, this isn't Bayes' formula, 37 00:03:24,281 --> 00:03:29,527 it is just the definition of conditional probability) 38 00:03:29,527 --> 00:03:33,916 to the probability of w. 39 00:03:33,916 --> 00:03:39,151 This term, the denominator, does not depend 40 00:03:39,151 --> 00:03:43,967 on theta, so it's just a constant. 41 00:03:43,967 --> 00:03:49,540 And so we can rewrite it as the expectation 42 00:03:49,540 --> 00:03:54,551 with respect to q of z of the logarithm of the joint 43 00:03:54,551 --> 00:04:01,678 distribution over theta, z, and w, plus a constant. 44 00:04:01,678 --> 00:04:08,959 All right, we are estimating a distribution over theta, so 45 00:04:08,959 --> 00:04:15,257 before we plug in this value from this huge formula, 46 00:04:15,257 --> 00:04:20,082 let's see which terms are constant with respect to theta. 47 00:04:20,082 --> 00:04:22,441 This term depends on theta. 48 00:04:22,441 --> 00:04:24,746 We have theta here. 49 00:04:24,746 --> 00:04:25,433 And we have theta here. 50 00:04:25,433 --> 00:04:29,249 However, in this one we don't have theta.
51 00:04:29,249 --> 00:04:31,259 So this is actually a constant. 52 00:04:35,883 --> 00:04:40,350 All right, we still have a lot of terms. 53 00:04:40,350 --> 00:04:44,369 I'll write down the 54 00:04:44,369 --> 00:04:49,179 expectation with respect to q of z. 55 00:04:54,511 --> 00:04:58,214 And now we write out these terms. 56 00:04:58,214 --> 00:05:05,480 So it is the sum over all documents. 57 00:05:05,480 --> 00:05:11,886 Then we open a bracket: the sum over all topics of 58 00:05:14,765 --> 00:05:20,057 alpha t minus 1 times the logarithm of theta 59 00:05:20,057 --> 00:05:25,001 dt, plus the sum over all words and topics. 60 00:05:25,001 --> 00:05:29,371 So this would be the sum for n from 1 to N d, 61 00:05:29,371 --> 00:05:34,084 the number of words in the document, 62 00:05:34,084 --> 00:05:39,810 the sum over all topics that can be assigned to each word, 63 00:05:39,810 --> 00:05:45,547 of the indicator that the latent variable has the value t. 64 00:05:45,547 --> 00:05:53,550 So in this case we assign the topic t to the word n of the document d. 65 00:05:53,550 --> 00:05:59,450 This is multiplied by the logarithm of theta dt as well; we already have it here. 66 00:05:59,450 --> 00:06:04,367 And we have to close this bracket. 67 00:06:05,470 --> 00:06:09,875 All right, we have an expectation, and 68 00:06:09,875 --> 00:06:14,012 we can take it under the summation. 69 00:06:14,012 --> 00:06:17,458 This term does not depend on z. 70 00:06:17,458 --> 00:06:22,237 And the only dependence on z is here. 71 00:06:22,237 --> 00:06:27,325 So it will be the sum over all documents. 72 00:06:27,325 --> 00:06:31,610 This term will not change. 73 00:06:31,610 --> 00:06:37,652 So we have the sum for t from 1 to T, over all topics, of 74 00:06:37,652 --> 00:06:43,853 alpha t minus 1 times the logarithm of theta dt, plus, 75 00:06:43,853 --> 00:06:48,941 and then we take the expectation here, so 76 00:06:48,941 --> 00:06:57,541 this will be the sum over all words, the sum over all topics,
77 00:06:57,541 --> 00:07:02,505 the expectation with respect to q 78 00:07:02,505 --> 00:07:06,681 of z dn of the indicator, 79 00:07:11,121 --> 00:07:17,872 and finally, times the logarithm of theta dt. 80 00:07:17,872 --> 00:07:19,863 [SOUND] All right. 81 00:07:19,863 --> 00:07:24,977 So let's denote this value, 82 00:07:24,977 --> 00:07:29,690 the expectation, as gamma dn. 83 00:07:32,692 --> 00:07:38,102 Now we can group the terms that contain the logarithm 84 00:07:38,102 --> 00:07:42,936 of theta dt, and get the following expression. 85 00:07:47,062 --> 00:07:49,439 We'll have a sum over all documents, 86 00:07:54,032 --> 00:07:58,202 a sum over all topics. 87 00:08:03,032 --> 00:08:09,710 We'll have alpha t minus 1 from this term, 88 00:08:09,710 --> 00:08:17,534 plus the sum over all words of gamma dn. 89 00:08:21,644 --> 00:08:27,305 So the summation of gamma dn, and 90 00:08:27,305 --> 00:08:35,925 finally times the logarithm of theta dt. 91 00:08:35,925 --> 00:08:44,126 And actually we should add a constant here, here, and here. 92 00:08:44,126 --> 00:08:49,016 All right, so what is 93 00:08:49,016 --> 00:08:53,914 this distribution? 94 00:08:53,914 --> 00:08:55,510 Do you recognize its form? 95 00:08:55,510 --> 00:09:01,197 Actually, this is a Dirichlet distribution again. 96 00:09:01,197 --> 00:09:07,364 If we take the exponent of this function to get q of theta, 97 00:09:11,555 --> 00:09:15,534 we will see that it equals 98 00:09:15,534 --> 00:09:20,378 the product over all documents, 99 00:09:20,378 --> 00:09:25,395 the product over all topics, of 100 00:09:25,395 --> 00:09:30,585 theta dt to the power given by this 101 00:09:30,585 --> 00:09:36,124 term: alpha t plus the sum of gamma dn, minus 1, 102 00:09:36,124 --> 00:09:41,327 times some constant. 103 00:09:41,327 --> 00:09:45,628 So we can write the proportionality sign instead of equality. 104 00:09:45,628 --> 00:09:52,187 So actually this is a Dirichlet distribution.
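[Editor's note] The grouping and exponentiation steps spoken above can be written out as follows. This is a sketch under assumed notation: the superscript t on \gamma_{dn}^{t} is added here only to make the dependence on the topic explicit (the lecture says simply "gamma dn"), and [\cdot] denotes the indicator.

```latex
% Definition of gamma: expected value of the indicator under q(z_{dn})
\gamma_{dn}^{t} = \mathbb{E}_{q(z_{dn})} \, [z_{dn} = t]

% Log of the variational factor after grouping the terms with \log\theta_{dt}
\log q(\Theta) = \sum_{d=1}^{D} \sum_{t=1}^{T}
  \Bigl( \alpha_t - 1 + \sum_{n=1}^{N_d} \gamma_{dn}^{t} \Bigr) \log \theta_{dt}
  + \mathrm{const}

% Exponentiating gives a product of Dirichlet-shaped kernels, one per document
q(\Theta) \propto \prod_{d=1}^{D} \prod_{t=1}^{T}
  \theta_{dt}^{\,\alpha_t + \sum_{n=1}^{N_d} \gamma_{dn}^{t} - 1}
```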
105 00:09:52,187 --> 00:09:58,066 We can write down that q of theta equals the product 106 00:09:58,066 --> 00:10:03,154 over all documents of the distributions over theta d, 107 00:10:03,154 --> 00:10:07,241 and each of them is a standard distribution. 108 00:10:07,241 --> 00:10:11,754 You should recognize the form here. 109 00:10:11,754 --> 00:10:19,120 q of theta d would be a Dirichlet distribution 110 00:10:19,120 --> 00:10:22,306 with the following parameters. 111 00:10:22,306 --> 00:10:26,513 We should add one to the power, and 112 00:10:26,513 --> 00:10:30,584 we will have the following form. 113 00:10:30,584 --> 00:10:32,718 This would be the vector alpha 114 00:10:32,718 --> 00:10:38,451 plus the vector of sums of gamma dn. 115 00:10:38,451 --> 00:10:43,832 So as the parameter of this distribution, 116 00:10:43,832 --> 00:10:50,014 we sum up the vector alpha and the vector of sums of gamma dn. 117 00:10:50,014 --> 00:10:58,210 Note that gamma dn itself depends on t, since here we had the value t. 118 00:10:58,210 --> 00:11:05,370 All right, so we have derived the update formula for q of theta. 119 00:11:05,370 --> 00:11:08,219 Also note that here we use the value of gamma dn, 120 00:11:08,219 --> 00:11:13,027 and gamma dn can be computed 121 00:11:13,027 --> 00:11:17,839 only using the distribution q of z. 122 00:11:17,839 --> 00:11:21,681 So before we have the training algorithm, 123 00:11:21,681 --> 00:11:26,694 we have to derive the logarithm of q of z, and 124 00:11:26,694 --> 00:11:34,080 we'll also have to compute the expectation under this distribution of the indicator that z dn equals t. 125 00:11:34,080 --> 00:11:44,080 [MUSIC].
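[Editor's note] The update for q of theta d derived in this video can be illustrated with a short sketch. It assumes that the values gamma dn are already available as, for each word of one document, a list of T topic probabilities; how to obtain them from q of z is left to the next derivation. The function names update_q_theta and dirichlet_mean are hypothetical and chosen only for this illustration.

```python
def update_q_theta(alpha, gamma_d):
    """Return the Dirichlet parameters of q(theta_d) for one document.

    alpha   -- list of T prior parameters alpha_t
    gamma_d -- list of N_d lists; gamma_d[n][t] is gamma_dn for topic t
               (assumed to be given; it is computed from q(z) elsewhere)
    """
    num_topics = len(alpha)
    params = list(alpha)  # start from the prior vector alpha
    for gamma_dn in gamma_d:
        for t in range(num_topics):
            # add the expected topic assignment of word n to topic t
            params[t] += gamma_dn[t]
    return params


def dirichlet_mean(params):
    """Mean of a Dirichlet distribution: each parameter over their sum."""
    total = sum(params)
    return [p / total for p in params]


if __name__ == "__main__":
    # Toy document with two words and two topics (made-up numbers).
    alpha = [0.5, 0.5]
    gamma_d = [[1.0, 0.0], [0.25, 0.75]]
    params = update_q_theta(alpha, gamma_d)
    print(params)                  # [1.75, 1.25]
    print(dirichlet_mean(params))  # expected topic proportions under q(theta_d)
```

The sketch uses plain Python lists to stay self-contained; with an array library the same update is a single vector sum of alpha and the column sums of the gamma matrix.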