To wrap up our discussion of mixture models, I want to discuss how we can apply this mixture model to our document clustering task. This is something else that you guys are going to explore in the assignment. And remember, our goal is to discover groups of related articles, but in using our mixture model, what we're aiming to do is account for uncertainty in the assignments of data points to these clusters, like this article that lies between the blue and orange clusters. For our document representation, we're going to use our standard tf-idf vector, so a vector whose number of elements equals the size of the vocabulary. Based on this document representation, we can think of embedding our documents in R^V, the real space with V dimensions, where V is the size of the vocabulary. In this picture we're showing just two dimensions, of course, so it's as if we had a vocabulary of only two words. Then, based on this representation, our goal is to use mixtures of Gaussians to discover clusters that might look something like the following. And again, our mixtures of Gaussians are going to be able to provide soft assignments of data points to these clusters, and in the next section we're going to describe algorithmically how we compute these soft assignments.

Let's think a little bit more about applying just a vanilla mixture of Gaussians to this application, in particular to this high-dimensional document representation. In two dimensions, when we talked about the parameters of the Gaussian, there was a mean vector with two parameters that we're going to need to learn, which centers the distribution. But then there was the covariance matrix, which in two dimensions has four entries: the variance terms, giving the spread along each of our two dimensions, and the off-diagonal terms, which captured the correlation structure. We very quickly mentioned that it's a symmetric matrix, so it turns out these off-diagonal terms, sigma 1,2 and sigma 2,1, are equal. So really there are three unique parameters that we need to specify for this Gaussian in two dimensions, or three parameters that we're going to need to learn from the data. But when we're looking at V dimensions, and imagine V is really large, if we allowed a full covariance matrix, we would have V(V+1)/2 unique parameters to learn. And that's a lot, on the order of V squared parameters. The reason we get V(V+1)/2 instead of V squared is that the matrix is symmetric, so it's not quite V squared parameters.

Okay, so instead of specifying these full covariance matrices, what we're going to do is assume a diagonal form for the covariance matrix. There's going to be a variance term for each one of our dimensions, one through capital V, and all of the off-diagonal terms are going to be fixed to 0. So we're just going to be learning the variances along the different dimensions. And what this means is that, instead of having an arbitrarily oriented ellipse, we're going to constrain ourselves to axis-aligned ellipses. This represents a restrictive assumption, but I want to mention a few things. The first is that we can actually learn the weights on the different dimensions, and these can be different between the different clusters (the short sketch below just tallies how much this buys us in terms of parameter counts).
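To make the savings concrete, here is a tiny, purely illustrative tally of the per-cluster covariance parameter counts; the vocabulary size of 10,000 is a made-up number for the sake of the example, not one from the course.

```python
# Per-cluster covariance parameters for a Gaussian in V dimensions,
# under different covariance assumptions (V = 10,000 is hypothetical).
V = 10_000

full      = V * (V + 1) // 2   # arbitrary symmetric covariance: on the order of V^2
diagonal  = V                  # one learned variance per word, per cluster
spherical = 1                  # a single shared sigma^2, as in basic k-means

print(f"full: {full:,}   diagonal: {diagonal:,}   spherical: {spherical}")
# full: 50,005,000   diagonal: 10,000   spherical: 1
```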
So, in one cluster, we might learn that one word is more important than another word in forming the score of an observation under that cluster, and that could be different from the other clusters. So this represents a formulation that's still a lot more flexible than what you had in k-means, because remember, in k-means the typical formulation would just assume spherically symmetric clusters. That's a special case of axis-aligned ellipses where the variance terms are all the same: they're all sigma squared along the diagonal. And we said we could also do a weighted version of Euclidean distance, but that would give us axis-aligned ellipses that are exactly the same across the different clusters. And furthermore, in the case of k-means, we had to specify those weights, whereas in the next section we're going to show that we can actually learn the weights of this mixture model, so the weights on these different dimensions that are cluster specific. And in addition to learning the parameters of these different Gaussians from the data, we're also going to learn the soft assignments, so the probability that any observation is associated with each of the different clusters. [MUSIC]
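As a purely illustrative sketch of how this could look in code (not the assignment's implementation), here is one way to fit a diagonal-covariance mixture of Gaussians to tf-idf vectors using scikit-learn; the tiny corpus, the number of clusters K, and the max_features cap are all placeholder choices.

```python
# A minimal sketch, assuming scikit-learn: tf-idf vectors + a mixture of
# Gaussians restricted to diagonal (axis-aligned) covariances.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture

docs = [
    "stocks fell sharply on wall street",
    "investors worry about interest rates",
    "the team won the championship game",
    "the striker scored twice in the final",
]

# tf-idf representation: each document becomes a vector in R^V,
# where V is the vocabulary size (capped here to keep the example small).
vectorizer = TfidfVectorizer(max_features=50)
X = vectorizer.fit_transform(docs).toarray()   # GaussianMixture needs dense input

# covariance_type="diag" fixes every off-diagonal term to 0, so each
# cluster learns only V variances (one per word) instead of V(V+1)/2.
K = 2
gmm = GaussianMixture(n_components=K, covariance_type="diag", random_state=0)
gmm.fit(X)

# Soft assignments: responsibilities[i, k] is the probability that
# document i belongs to cluster k, rather than a single hard label.
responsibilities = gmm.predict_proba(X)
print(responsibilities.round(3))
```

Here the covariance_type="diag" option plays the role of the axis-aligned restriction described above, each cluster gets its own learned per-word variances, and predict_proba returns the soft assignments instead of the single hard assignment k-means would give.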