[SOUND] In this video, we'll finally see Latent Dirichlet Allocation. Let me remind you what topics and documents are. A document is a distribution over topics. For example, we can assign to a document a distribution like this: 80% cats and 20% dogs. A topic, in turn, is a distribution over words. For example, the topic cats would give the word cat 40% probability and the word meow 30%, while other words, like dog, or words like 'and' and 'the', would have really low probability. The topic about dogs would have high probabilities for the words dog and woof, and low probabilities for the others. Let's see how we can generate, for example, the sentence 'the cat meowed on the dog'. The first word, cat, is taken from the topic cats, and with 40% probability we could sample it. The second word, meow, is also from the topic cats, and it's sampled with 30% probability. And finally, the word dog is from the topic about dogs, and with 40% probability we could sample it.

So here's our model. We have a distribution over topics for document number d; we will call it theta_d. Then, for each word in the document, we assign a topic. For example, z_d1 would correspond to the topic of the first word in document d, and the latent variable z_dn would correspond to the topic of the n-th word in document d. Each latent variable can take values from 1 to T, where T is the number of topics that we will try to find in our corpus (the corpus is a collection of documents). Then, from the corresponding topics, we can sample the words. For example, we'll sample the word w_d1 from the topic z_d1. The words can take values from 1 to V, where V is the size of the vocabulary.

What I've drawn now is actually a Bayesian network. We can draw it using plate notation as follows. So here's our Bayesian network in plate notation. We have theta; these are the topic probabilities for a document, and we repeat this once for each document. From theta we generate z, the topics of the words, and finally, from the topics, we generate the words themselves. We repeat this N times, once for each word.

The joint probability over w, z, and theta is written below. Let's try to interpret each component of it. The first one says that for each document, we generate topic probabilities from the distribution p(theta_d). Then, for each word in this document, we select a topic with probability p(z_dn | theta_d). And finally, when we have a topic, we sample a word from this topic; this is the probability of the word w_dn given z_dn. And so here's our final model.

Now we need to define these three probabilities: the probability of theta, the probability of z given theta, and the probability of w given z. The probability of theta is modelled as a Dirichlet distribution with some parameter alpha. This is a natural choice, since the components of theta should sum up to one, and we need some distribution over such vectors; so far, the Dirichlet is the only distribution of this kind we have seen. The probability of a topic given theta is simply equal to the corresponding component of the vector theta_d, namely the component indexed by z_dn. This notation is a bit complex, but it is actually quite logical: we just take the component of the vector theta_d corresponding to the current topic. All right, and finally we need to select the words. To select the words, we need to know the probabilities of the words in the corresponding topic. That is, we should somehow find the topics.
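For reference, the joint distribution described above can be written out as follows; this is the standard LDA factorization, so the exact notation on the slide may differ, and N_d here denotes the number of words in document d:

p(\theta, z, w) = \prod_{d=1}^{D} p(\theta_d) \prod_{n=1}^{N_d} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}),

with p(\theta_d) = \mathrm{Dirichlet}(\theta_d \mid \alpha) and p(z_{dn} = t \mid \theta_d) = \theta_{dt}. The last factor, p(w_{dn} \mid z_{dn}), is exactly the word-in-topic probability we need to define next.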
We will store those probabilities in the matrix Phi, and the probability of a particular word can be found in row number z_dn and column number w_dn. So actually, our goal will be to find this matrix. We have a few constraints on it: first of all, its entries should be non-negative, since we're modelling probabilities, and also each of its rows should sum up to one. All right, so here are our four variables. We have the data, which is known; we have the matrix Phi, which is unknown and which we will try to find; and we also have the latent variables z and theta, for which we will also try to find the distributions. [SOUND]
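To make the generative story concrete, here is a minimal sketch in Python using numpy; the toy dimensions and the variable names alpha, Phi, T, V are assumptions for illustration, not code from the course:

```python
import numpy as np

# Toy dimensions (assumed for illustration): T topics, V vocabulary words.
T, V = 2, 5
rng = np.random.default_rng(0)

# Dirichlet parameter for the per-document topic distribution theta_d.
alpha = np.ones(T)

# Phi is a T x V matrix: row t holds the word probabilities of topic t.
# Constraints from the video: non-negative entries, each row sums to one.
Phi = rng.random((T, V))
Phi /= Phi.sum(axis=1, keepdims=True)

def generate_document(n_words):
    """Sample one document following the LDA generative story."""
    theta_d = rng.dirichlet(alpha)         # p(theta_d) = Dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z_dn = rng.choice(T, p=theta_d)    # p(z_dn | theta_d) = theta_d[z_dn]
        w_dn = rng.choice(V, p=Phi[z_dn])  # p(w_dn | z_dn) = Phi[z_dn, w_dn]
        words.append(w_dn)
    return words

print(generate_document(10))  # e.g. a list of 10 word indices in {0, ..., V-1}
```

Inference goes in the opposite direction: given only the observed words, we would try to recover the matrix Phi and the distributions over z and theta, which is exactly the goal stated above.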