In this video, we'll finally see Latent Dirichlet Allocation. Let me remind you what topics and documents are. A document is a distribution over topics. For example, we can assign to a document a distribution like this: 80% cats and 20% dogs. A topic, in turn, is a distribution over words. For example, the topic "cats" would give the word "cat" probability 40% and the word "meow" 30%, while other words, like "dog", and function words, like "and", would have really low probability. The topic about dogs would give high probability to the words "dog" and "woof", and low probability to the others.

Let's see how we can generate, for example, the sentence "the cat meowed on the dog". The first word, "cat", is taken from the topic "cats": with 40% probability we could sample the word "cat". The second word, "meow", is also from the topic "cats", and it's sampled with 30% probability. And finally, the word "dog" is from the topic about dogs, and with 40% probability we could sample it.

So here's our model. We have a distribution over topics for document number d; we will call it theta_d. Then, for each word in the document, we assign a topic. For example, z_d1 would correspond to the topic of the first word in document d, and z_dn would correspond to the topic of the n-th word in document d. Each latent variable can take values from 1 to T, where T is the number of topics that we will try to find in our corpus. The corpus is the collection of documents. Then, from the corresponding topics, we can sample the words: for example, we sample the word w_d1 from the topic z_d1. The words can take values from 1 to V, where V is the size of the vocabulary.

What I have drawn is actually a Bayesian network. We can draw it using plate notation as follows. So here's our Bayesian network in plate notation. We have theta, the topic distribution for a document, and we repeat it for each document. Theta then generates z, the topics of the words, and finally from the topics we generate the words.
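To make this generative story concrete, here is a minimal Python sketch of the process just described. The two topics and their word probabilities are the toy cat/dog numbers from the lecture; the Dirichlet parameter alpha, the random seed, and the generate_document helper are illustrative assumptions, not something fixed by the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["cat", "meow", "dog", "woof", "the", "on"]  # V = 6 words
T = 2                                                # topics: 0 = cats, 1 = dogs

# Phi: T x V matrix of word probabilities per topic (each row sums to 1).
# Rows use the toy numbers from the lecture: "cat" 40% and "meow" 30%
# in the cats topic, "dog" 40% and "woof" 30% in the dogs topic.
phi = np.array([
    [0.40, 0.30, 0.05, 0.05, 0.10, 0.10],  # topic "cats"
    [0.05, 0.05, 0.40, 0.30, 0.10, 0.10],  # topic "dogs"
])

alpha = np.ones(T)  # symmetric Dirichlet prior on theta (assumed value)

def generate_document(n_words):
    """Sample one document: theta_d ~ Dirichlet(alpha), then for each
    word a topic z_dn ~ Cat(theta_d) and a word w_dn ~ Cat(phi[z_dn])."""
    theta_d = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z_dn = rng.choice(T, p=theta_d)            # pick a topic for this word
        w_dn = rng.choice(len(vocab), p=phi[z_dn]) # pick a word from that topic
        words.append(vocab[w_dn])
    return theta_d, words

theta_d, doc = generate_document(5)
print(theta_d, doc)
```

With a theta_d close to (0.8, 0.2), most sampled words would come from the cats topic, matching the 80% cats / 20% dogs document above.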
And we repeat it N times, of course, once for each word. The joint probability over w, z, and theta is written below. Let's try to interpret each component of it. The first factor says that for each document we generate topic probabilities from the prior p(theta_d). Then, for each word in this document, we select a topic with probability p(z_dn | theta_d). And finally, when we have a topic, we sample a word from this topic; this is the probability of the word w_dn given z_dn. So here's our final model.

So now we need to define these three probabilities: the probability of theta, of z given theta, and of w given z. The probability of theta is modeled, as I just said, as the Dirichlet distribution with some parameter alpha. This is a natural choice, since the components of theta should sum up to one, and we need some distribution over the simplex; so far, the Dirichlet is the only such distribution we have seen. The probability of a topic given theta is simply the corresponding component of the vector theta: the probability that z_dn equals t is theta_dt. This notation is a bit complex, but actually it is quite logical: we just take the component of the vector theta_d corresponding to the current topic.

All right, and finally we need to select the words. To select the words, we need to know the probabilities of the words in the corresponding topics; that is, we should somehow find the topics. We will store these probabilities in the matrix Phi, and the probability of a particular word can be found in row number z_dn and column number w_dn. So actually our goal will be to find this matrix. We have a few constraints on it: first of all, it should be non-negative, since we are modeling probabilities, and also each of its rows should sum up to one.

All right, so here are our four variables. We have the data w, which is known; we have the matrix Phi, which is unknown, and we will try to find it. And we also have the latent variables z and theta; we will try to find the distributions over them as well.
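The slide with the formula is not reproduced in the transcript; written out from the three components just described, the factorization of the LDA model is:

```latex
p(w, z, \theta \mid \Phi, \alpha)
  = \prod_{d=1}^{D} p(\theta_d \mid \alpha)
    \prod_{n=1}^{N_d} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \Phi),

\text{where}\quad
p(\theta_d \mid \alpha) = \mathrm{Dirichlet}(\theta_d \mid \alpha), \qquad
p(z_{dn} = t \mid \theta_d) = \theta_{dt}, \qquad
p(w_{dn} = v \mid z_{dn} = t, \Phi) = \Phi_{tv}.
```

Here D is the number of documents, N_d the number of words in document d, and Phi the T x V topic-word matrix whose rows are non-negative and sum to one, as stated above.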