To wrap up our discussion of mixture models, I want to discuss how we can apply this mixture model to our document clustering task. This is something else that you guys are going to explore in the assignment. And remember, our goal is to discover groups of related articles, but in using our mixture model, what we're aiming to do is account for uncertainty in the assignments of data points to these clusters, like this article that lies between the blue and orange clusters. For our document representation, we're going to use our standard tf-idf vector, so a vector whose number of elements equals the size of the vocabulary. Based on this document representation, we can think of embedding our documents in R^V, the real space with V dimensions, where V is the size of the vocabulary. In this picture we're showing just two dimensions, of course, so it's as if we had a vocabulary of only two words. Then, based on this representation, our goal is to use mixtures of Gaussians to discover clusters that might look something like the following. And again, our mixtures of Gaussians are going to be able to provide soft assignments of data points to these clusters, and in the next section we're going to describe algorithmically how we compute these soft assignments.

Let's think a little bit more about applying just a vanilla mixture of Gaussians to this application, in particular to this high-dimensional document representation. In two dimensions, when we talked about the parameters of the Gaussian, there was a mean vector with two parameters that we're going to need to learn, which centers the distribution. But then there was the covariance matrix, which in two dimensions has four entries: the variance terms, giving the spread along each of our two dimensions, and the off-diagonal terms, which captured the correlation structure. We very quickly mentioned that it's a symmetric matrix, so it turns out these off-diagonal terms, sigma 1,2 and sigma 2,1, are equal. So really there are three unique parameters that we need to specify for this Gaussian in two dimensions, or three parameters that we're going to need to learn from the data. But when we're looking at V dimensions, and imagine V is really large, if we allowed a full covariance matrix, we would have V(V+1)/2 unique parameters to learn. And that's a lot, on the order of V squared parameters. The reason we get V(V+1)/2 instead of V squared is that the matrix is symmetric, so it's not quite V squared parameters.

Okay, so instead of specifying these full covariance matrices, what we're going to do is assume a diagonal form for the covariance matrix. There's going to be a variance term for each one of our dimensions, one through capital V, and all of the off-diagonal terms are going to be fixed to 0. So we're just going to be learning the variances along the different dimensions. And what this means is that, instead of having an arbitrarily oriented ellipse, we're going to constrain ourselves to axis-aligned ellipses. This represents a restrictive assumption, but I want to mention a few things. The first is that we can actually learn the weights on the different dimensions, and these can be different between the different clusters (the short sketch below just tallies how much this buys us in terms of parameter counts).
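To make the savings concrete, here is a tiny, purely illustrative tally of the per-cluster covariance parameter counts; the vocabulary size of 10,000 is a made-up number for the sake of the example, not one from the course.

```python
# Per-cluster covariance parameters for a Gaussian in V dimensions,
# under different covariance assumptions (V = 10,000 is hypothetical).
V = 10_000

full      = V * (V + 1) // 2   # arbitrary symmetric covariance: on the order of V^2
diagonal  = V                  # one learned variance per word, per cluster
spherical = 1                  # a single shared sigma^2, as in basic k-means

print(f"full: {full:,}   diagonal: {diagonal:,}   spherical: {spherical}")
# full: 50,005,000   diagonal: 10,000   spherical: 1
```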
So, in one cluster, we might learn that one word is more important than another word in forming the score of an observation under that cluster, and that could be different from the other clusters. So this represents a formulation that's still a lot more flexible than what you had in k-means, because remember, in k-means the typical formulation would just assume spherically symmetric clusters. That's a special case of axis-aligned ellipses where the variance terms are all the same: they're all sigma squared along the diagonal. And we said we could also do a weighted version of Euclidean distance, but that would give us axis-aligned ellipses that are exactly the same across the different clusters. And furthermore, in the case of k-means, we had to specify those weights, whereas in the next section we're going to show that we can actually learn the weights of this mixture model, so the weights on these different dimensions that are cluster specific. And in addition to learning the parameters of these different Gaussians from the data, we're also going to learn the soft assignments, so the probability that any observation is associated with each of the different clusters. [MUSIC]
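As a purely illustrative sketch of how this could look in code (not the assignment's implementation), here is one way to fit a diagonal-covariance mixture of Gaussians to tf-idf vectors using scikit-learn; the tiny corpus, the number of clusters K, and the max_features cap are all placeholder choices.

```python
# A minimal sketch, assuming scikit-learn: tf-idf vectors + a mixture of
# Gaussians restricted to diagonal (axis-aligned) covariances.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture

docs = [
    "stocks fell sharply on wall street",
    "investors worry about interest rates",
    "the team won the championship game",
    "the striker scored twice in the final",
]

# tf-idf representation: each document becomes a vector in R^V,
# where V is the vocabulary size (capped here to keep the example small).
vectorizer = TfidfVectorizer(max_features=50)
X = vectorizer.fit_transform(docs).toarray()   # GaussianMixture needs dense input

# covariance_type="diag" fixes every off-diagonal term to 0, so each
# cluster learns only V variances (one per word) instead of V(V+1)/2.
K = 2
gmm = GaussianMixture(n_components=K, covariance_type="diag", random_state=0)
gmm.fit(X)

# Soft assignments: responsibilities[i, k] is the probability that
# document i belongs to cluster k, rather than a single hard label.
responsibilities = gmm.predict_proba(X)
print(responsibilities.round(3))
```

Here the covariance_type="diag" option plays the role of the axis-aligned restriction described above, each cluster gets its own learned per-word variances, and predict_proba returns the soft assignments instead of the single hard assignment k-means would give.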