[MUSIC] In our third module, we continued this discussion of clustering, but within the context of probabilistic model-based approaches, specifically mixture models. We discussed how mixture models allow us to capture uncertainty in the assignment of data points to clusters, and we also motivated the use of mixture models as opposed to k-means by describing some of the failure modes of k-means. Because k-means assumes spherically symmetric clusters and computes assignments of data points to clusters based only on distances to the cluster centers, there are a number of scenarios in which k-means just really can't capture the underlying cluster structure very well.

In our discussion of mixture models, we used the very visually appealing application of trying to group images into related classes. So for example, here we might have an unlabeled, jumbled set of images that we somehow want to group into, in this case, sunset images, forest images, and cloud images. But we talked about the fact that we just have this data that's all mixed together, and the question is how we're going to handle this. Well, in our probabilistic approach, we treat the observed quantity, which we represented just as this RGB vector, as a random quantity and look at distributions on that vector. The distribution can have a really complicated form because it's mixing over these different classes. But to think about unjumbling this mess, we talked about introducing cluster-specific Gaussian components, or possibly other distributions, though in this application we're going to be looking at Gaussians. And then we said that our mixture model is just a weighted collection of these different components, with weights based on the relative proportions of each of these classes in the data set.

More formally, we said that our mixture of Gaussians model was defined in terms of a collection of Gaussian components, each defined by a specific mean and variance, as well as a mixture weight determining how heavily that class is represented in the mixture. All of these pictures are just for one-dimensional data, but we talked about how we can define mixtures of Gaussians in higher dimensions.

And then we turned back to our document clustering task. Here, remember, our document representation, if we think of using TF-IDF or just simple word count vectors, has dimension equal to capital V, the number of words in our vocabulary. So we talked about some of the challenges of scaling our mixture models to these high-dimensional scenarios. In particular, instead of allowing arbitrary Gaussian components in the mixture model, we said that in practice one would often want to specify a diagonal form for the Gaussian covariance matrix to dramatically reduce the number of parameters. But we said that this form is still much more flexible than what you have in k-means, and in particular we're still going to be learning the weightings on these different dimensions in our mixture of Gaussians model.

And then, having specified our mixture of Gaussians model, so understanding the components of this probabilistic model, we returned to how to do inference in this model. Remember, we're just going to present the algorithm with a data set, a set of points that are unlabeled, and we discussed an algorithm called expectation maximization, or the EM algorithm.
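Before recapping that algorithm, here is a minimal sketch of the kind of model just described: fitting a mixture of Gaussians with diagonal covariances using scikit-learn's GaussianMixture. The synthetic RGB-like data and the class labels in the comments are purely illustrative assumptions, not the course's own code or data.

```python
# A minimal sketch (not the course's notebooks): fitting a mixture of
# Gaussians with diagonal covariances via EM, using scikit-learn.
# The data here is synthetic and purely illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Three toy "image classes" in a 3-dimensional RGB-like feature space.
data = np.vstack([
    rng.normal(loc=[0.9, 0.5, 0.2], scale=0.05, size=(100, 3)),  # "sunsets"
    rng.normal(loc=[0.2, 0.6, 0.2], scale=0.05, size=(100, 3)),  # "forests"
    rng.normal(loc=[0.7, 0.7, 0.8], scale=0.05, size=(100, 3)),  # "clouds"
])

# covariance_type='diag' restricts each Gaussian to a diagonal covariance,
# reducing the number of parameters as discussed above.
gmm = GaussianMixture(n_components=3, covariance_type='diag', random_state=0)
gmm.fit(data)

print(gmm.weights_)                  # estimated mixture weights (class proportions)
print(gmm.means_)                    # estimated cluster means
print(gmm.predict_proba(data[:5]))   # soft assignments (cluster responsibilities)
```

Running something like this prints the recovered mixture weights, the cluster means, and the soft assignments, i.e., the probability that each image belongs to each cluster, which is exactly the kind of uncertainty the mixture model is meant to capture.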
That allows us to jointly infer both the model parameters as well as these soft assignments that describe our uncertainty in the assignment of data points to specific clusters. In particular, this EM algorithm is an iterative algorithm where the first step, which is the E-step, forms these soft assignments based on our current estimate of the model parameters. Then the second step, which is called the M-step, computes the maximum likelihood estimate of our model parameters given the soft assignments at the current iteration, and then we iterate again and again between these steps until convergence.

And just like k-means, we showed that EM can be described as a coordinate descent algorithm. So we get the same types of properties, where you get convergence to a local mode of the objective. So, like k-means, initialization matters a lot, and typically people rerun the algorithm many times and choose the best solution. Another thing that we described was that for our mixture models we have a lot more parameters than we do for k-means, but we presented some ways to think about handling, or at least ameliorating, these issues of overfitting.

In pictures, we presented a set of illustrations showing EM in practice, where over iterations of the E- and M-steps we see how we're able to learn soft assignments of observations to specific clusters as well as the cluster parameters. Even at convergence of the algorithm, we can still get these uncertainties in whether a given data point is assigned to one cluster versus another cluster. And this is actually something we want. We want these uncertainties as part of the output of this algorithm.

And finally, at the end of the third module, we related very formally the expectation maximization algorithm for mixtures of Gaussians to k-means. In particular, we said that if we look at a mixture of Gaussians with spherically symmetric Gaussians, so diagonal covariances with the same elements along the diagonal, and then we shrink that variance to 0 and run our EM algorithm, the output is exactly the same as k-means, because what we end up doing is making hard assignments of data points to specific clusters based just on the distance to the cluster center. So indeed, EM for mixtures of Gaussians is really a generalization of the k-means algorithm. [MUSIC]
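To make that last connection concrete, here is a minimal sketch (illustrative only, not the course's assignment code) of the E- and M-steps for a mixture of spherical Gaussians with a shared, fixed variance. As that variance is shrunk toward zero, the soft assignments produced by the E-step collapse to the hard assignments k-means would make. The data, function names, and parameter choices are all illustrative assumptions.

```python
# A minimal sketch (illustrative only, not the course's assignment code) of the
# E- and M-steps for a mixture of spherical Gaussians with a shared, fixed
# variance sigma2. As sigma2 shrinks toward 0, the responsibilities from the
# E-step collapse to the hard assignments k-means would make.
import numpy as np

def e_step(data, means, weights, sigma2):
    # Soft assignments: log mixture weight plus spherical Gaussian log-density,
    # normalized across clusters for each data point.
    d = data.shape[1]
    sq_dists = ((data[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    log_resp = (np.log(weights)[None, :]
                - 0.5 * sq_dists / sigma2
                - 0.5 * d * np.log(2 * np.pi * sigma2))
    log_resp -= log_resp.max(axis=1, keepdims=True)   # numerical stabilization
    resp = np.exp(log_resp)
    return resp / resp.sum(axis=1, keepdims=True)

def m_step(data, resp):
    # Maximum likelihood updates of the mixture weights and cluster means
    # given the current soft assignments.
    counts = resp.sum(axis=0)
    weights = counts / len(data)
    means = (resp.T @ data) / counts[:, None]
    return means, weights

# Tiny synthetic 1-D example with two well-separated clusters.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 0.3, 50), rng.normal(2, 0.3, 50)])[:, None]

for sigma2 in [4.0, 1e-4]:                 # moderate variance vs. near-zero variance
    means = np.array([[-1.0], [1.0]])
    weights = np.array([0.5, 0.5])
    for _ in range(20):                    # alternate E- and M-steps
        resp = e_step(data, means, weights, sigma2)
        means, weights = m_step(data, resp)
    print(sigma2, resp[:3].round(3))       # responsibilities approach 0/1 as sigma2 -> 0
```

With the larger variance, the printed responsibilities stay genuinely soft; with the near-zero variance, each data point is effectively assigned entirely to its nearest cluster center, mirroring the k-means behavior described above.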