Hey. Let us understand how to train a PLSA model. Just to recap, this is a topic model that predicts words in documents by a mixture of topics. The model has two kinds of parameters, two kinds of probability distributions: the phi parameters stand for the probabilities of words in topics, and the theta parameters stand for the probabilities of topics in documents. Now, you have your probabilistic model of the data, and you have your data. How do you train the model? How do you estimate the parameters? Likelihood maximization is something that always helps us. So the top line on this slide is the log-likelihood of our model, and we need to maximize it with respect to our parameters. Now, let us make a few modifications to this formula. First, we apply the logarithm, so we get a sum of logarithms instead of the logarithm of a product. Then, we simply drop the probabilities of the documents, because they do not depend on our parameters; our model does not even say how to model them, so we just forget about them. What we care about are the probabilities of words in documents, and we substitute them by the sum over topics, because this is exactly what our model says. Great, so that's it: we want to maximize this likelihood, and we need to remember the constraints. Our parameters are probabilities, so they need to be non-negative, and they need to be normalized.
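To keep the objective in front of us, here it is written out in the usual PLSA notation (this is a reconstruction from the description above, since the slide itself is not reproduced here; $n_{dw}$ is the number of times word $w$ occurs in document $d$, $\phi_{wt} = p(w \mid t)$, and $\theta_{td} = p(t \mid d)$):

$$
\sum_{d \in D} \sum_{w \in d} n_{dw}\,\log \sum_{t \in T} \phi_{wt}\,\theta_{td} \;\longrightarrow\; \max_{\Phi,\,\Theta},
\qquad
\phi_{wt} \ge 0,\;\; \sum_{w} \phi_{wt} = 1,
\qquad
\theta_{td} \ge 0,\;\; \sum_{t} \theta_{td} = 1.
$$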
Now, you can notice that the term we need to maximize is not very nice: we have a logarithm of a sum, and it is not at all clear how to maximize something like that. But fortunately, we have the EM-algorithm; you may have heard about this algorithm in another course of our Specialization. For now, I just want to approach this algorithm intuitively. So let us start with some data. We are going to train our model on plain text, and this is all we have. Now, let us remember the generative model: we assume that every word in this text comes from some single topic, the one that was picked when we decided what the next word would be. So let us pretend, just for a moment, just for one slide, that we know these topics. Let us pretend that we know that the words "sky", "raining", and "clear up" come from, say, topic number 22, and so on; so we know these assignments. How would you then calculate the probabilities of words in topics? Say you have four words assigned to this topic, and you want to calculate the probability of "sky". This is how you do it. You just say, "Well, I have one such word out of these four words, so the probability will be one divided by four." By n_wt here, I denote the count of how many times this particular word was assigned to this particular topic. Now, can you imagine how we would estimate the probabilities of topics in the document for this colorful case? Well, it is just the same: we know that we have four words belonging to the red topic, and we have 54 words in our document, and that is why we get this probability in the example. Well, unfortunately, life is not like this. We do not know these colorful topic assignments; what we have is just plain text, and that is the problem. But can we somehow estimate those assignments? Can we somehow estimate the probabilities of the colors for every word? Yes, we can: Bayes' rule helps us here. We say that we need the probabilities of topics for each word in each document, and we apply Bayes' rule and the product rule. To understand this, I advise you to forget about d in all these formulas for a moment, and then everything becomes very clear. So we just apply these two rules, and we get estimates for the probabilities of our hidden variables, the probabilities of topics. Now, it is time to put everything together. We have the EM-algorithm, which has two steps, the E-step and the M-step. The E-step is about estimating the probabilities of the hidden variables, and this is exactly what we have just discussed.
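Written out, this E-step estimate is (again in the PLSA notation assumed above):

$$
p(t \mid d, w) \;=\; \frac{p(w \mid t)\,p(t \mid d)}{p(w \mid d)} \;=\; \frac{\phi_{wt}\,\theta_{td}}{\sum_{s \in T} \phi_{ws}\,\theta_{sd}}.
$$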
The M-step is about the updates for the parameters. We have discussed it for the simple case, when we know the topic assignments exactly. Now we do not know them exactly, so it is a bit more complicated to compute the n_wt counts. It is no longer just how many times a word was assigned to a topic, but it is still doable: we take the words, we take their counts, and we weight them by the probabilities that we know from the E-step. That is how we get estimates for n_wt. So this is not an integer counter anymore; it is a float-valued quantity, but it still has the same meaning and the same intuition. The EM-algorithm is a super powerful technique, and it can be used any time you have a model, you have observable data, and you have some hidden variables. These are all the formulas that we need for now. The main thing to understand is that to build your topic model, you repeat the E-step and the M-step iteratively: you scan your data, you compute the probabilities of topics using your current parameters, then you update the parameters using your current probabilities of topics, and you repeat this again and again. This iterative process converges, and hopefully you will get a nicely trained topic model.
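If it helps to see the whole loop in one place, here is a minimal sketch of one EM iteration in Python/NumPy. It assumes the corpus is given as a dense document-word count matrix; the function name, array layout, and the toy corpus at the bottom are my own illustration choices, not anything taken from the course materials.

```python
import numpy as np

def plsa_em_step(n_dw, phi, theta):
    """One EM iteration of PLSA.

    n_dw  : (D, W) array, n_dw[d, w]  = count of word w in document d
    phi   : (W, T) array, phi[w, t]   = p(w | t), each column sums to 1
    theta : (T, D) array, theta[t, d] = p(t | d), each column sums to 1
    """
    # E-step: p(t | d, w) is proportional to phi[w, t] * theta[t, d].
    # A dense (D, W, T) array -- fine for a toy corpus, too big for a real one.
    p_tdw = phi[np.newaxis, :, :] * theta.T[:, np.newaxis, :]
    p_tdw /= p_tdw.sum(axis=2, keepdims=True) + 1e-12

    # M-step: soft counts, i.e. word counts weighted by the E-step probabilities.
    n_wt = np.einsum('dw,dwt->wt', n_dw, p_tdw)   # how much of word w belongs to topic t
    n_td = np.einsum('dw,dwt->td', n_dw, p_tdw)   # how much of document d belongs to topic t

    # Normalize the soft counts to get the updated probability estimates.
    phi_new = n_wt / (n_wt.sum(axis=0, keepdims=True) + 1e-12)
    theta_new = n_td / (n_td.sum(axis=0, keepdims=True) + 1e-12)
    return phi_new, theta_new


# Toy usage: random (normalized) initialization, then a few dozen iterations.
rng = np.random.default_rng(0)
n_dw = rng.integers(0, 5, size=(20, 100)).astype(float)   # 20 documents, 100-word vocabulary
phi = rng.random((100, 3));  phi /= phi.sum(axis=0)        # 3 topics
theta = rng.random((3, 20)); theta /= theta.sum(axis=0)
for _ in range(50):
    phi, theta = plsa_em_step(n_dw, phi, theta)
```

In practice you would keep iterating until the log-likelihood from the formula above stops improving; since EM only finds a local maximum, the result depends on the random initialization.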