Let's now spend some time getting into the nitty-gritty details of the EM algorithm, and also provide a comparison between EM for mixtures of Gaussians and k-means. Well, to begin with, we talked about the converged solution of EM in the series of pie-chart plots that we just showed. Well, the question is, do we know that the EM algorithm is going to converge? Well, you can actually recast the E and M steps of EM as alternating maximizations of a single objective, and from this we see that EM is actually a coordinate-ascent algorithm. And under mild conditions, you can prove that EM indeed converges to a local mode of the objective. And in your assignment, you're going to monitor convergence of the EM algorithm by computing the log likelihood of the data given the current estimates of our model parameters and responsibilities. Well, based on this nonconvex landscape and convergence to a local mode, how we think of initializing the EM algorithm can be really critical to the quality of the local mode that we find, as well as to the convergence rate of our algorithm. And there are many ways we can think about initializing. One is to randomly select some data points as our initial centers, do hard assignments of observations to the nearest cluster center, and then use those assignments to form our initial parameter estimates; our EM algorithm would iterate from there. Another alternative would be to pick the cluster centers sequentially, like we described in our k-means++ algorithm, instead of just randomly choosing k different data points all at once. Or, we could actually run k-means and use the final solution of that as our cluster centers for this initialization scheme. And there are also other approaches. For example, in speech applications, it's really popular to gradually grow the mixture model by splitting, and sometimes merging, clusters until you have k clusters formed.
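To make the convergence monitoring concrete, here is a minimal sketch (my own toy code, not the assignment's): a tiny EM loop for a 1-D mixture of two Gaussians, initialized from two randomly selected data points, that records the log likelihood after every iteration. Because EM is coordinate ascent on its objective, the recorded log likelihood never decreases.

```python
import numpy as np

# Toy data: two well-separated 1-D Gaussian clusters.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])

def mixture_densities(x, pi, mu, var):
    # dens[i, k] = pi_k * N(x_i | mu_k, var_k)
    return pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Crude initialization: two randomly selected data points as the initial means.
pi = np.array([0.5, 0.5])
mu = x[rng.choice(len(x), size=2, replace=False)].copy()
var = np.array([1.0, 1.0])

log_liks = []
for _ in range(25):
    dens = mixture_densities(x, pi, mu, var)
    resp = dens / dens.sum(axis=1, keepdims=True)   # E-step: responsibilities
    nk = resp.sum(axis=0)                           # M-step: soft counts ...
    pi = nk / len(x)                                # ... proportions,
    mu = (resp * x[:, None]).sum(axis=0) / nk       # ... means,
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk  # ... variances
    log_liks.append(np.log(mixture_densities(x, pi, mu, var).sum(axis=1)).sum())

# EM is coordinate ascent, so the log likelihood never decreases (up to rounding).
assert all(b >= a - 1e-8 for a, b in zip(log_liks, log_liks[1:]))
print("final log likelihood:", round(log_liks[-1], 2))
```

In practice you would stop iterating once the increase in log likelihood between iterations falls below a small tolerance.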
Having discussed convergence and initialization, let's now spend a little bit of time discussing a potential pitfall of the EM algorithm that we've described. And this all stems from the fact that maximum likelihood estimation can overfit the data. So as an example of this, let's imagine that we just have two clusters: the green cluster and the blue cluster. And let's assume that at some iteration of our algorithm, only one observation is assigned to the green cluster and all the other observations are assigned to the blue cluster. When I'm speaking now, I'm talking in terms of hard assignments, but the same thing I'm going to describe holds if we're using soft assignments instead. For this single observation in the green cluster, what's the setting that's going to maximize the likelihood of the data in that cluster? Well, for the green cluster, we're simply going to set the mean of the cluster exactly equal to that data point, and then we're going to shrink its variance all the way to zero, right around that data point. And for the blue cluster in this example, we're going to shift its mean over and increase its variance to capture all the data points that were shown. Okay, so this doesn't seem like a great solution, right? Because we have this one data point with this tiny, tiny, infinitesimally small cluster tied right to that data point's value, and then these two sets of observations, which to me looked pretty separable, all lumped together in one much more diffuse cluster. But the likelihood of this assignment is actually infinite, all driven by the fact that the likelihood of the green observation, under a Gaussian that's absolutely certain that the only value it can take is the value that was observed, is infinite. And so that's going to completely dominate the overall likelihood of this set of assignments, and we're going to see that this is actually the preferred solution.
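A quick numerical illustration of this (mine, not from the lecture): the density of a Gaussian evaluated at its own mean is 1 / sqrt(2 * pi * var), so as the variance shrinks toward zero around the lone green observation, the likelihood of that observation grows without bound.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    # Density of N(mu, var) evaluated at x.
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

point = 1.7  # stand-in value for the single observation in the green cluster
for var in [1.0, 1e-2, 1e-4, 1e-8]:
    # The mean sits exactly on the observation, so the density is 1/sqrt(2*pi*var).
    print(f"var = {var:g}: density at the observation = {gaussian_pdf(point, point, var):.3e}")
```

Each hundredfold shrink in the variance grows the density tenfold, so the likelihood contribution of that single point diverges as the variance goes to zero.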
Luckily, in practice you almost never get into assignments in this very simplistic case of having just that one green observation in a cluster, but this is something that really can happen, and you have to think carefully about it, especially when you're thinking about ways to initialize your algorithm. But the problem can actually be much, much worse in high dimensions. To think about this, let's go back to our document clustering example. Here we're going to assume that we have some large vocabulary, and every document is represented by maybe just a word-count vector, or a tf-idf vector, over that vocabulary. And let's imagine that for a given cluster, only one document in that cluster (again, I'm talking in terms of hard assignments, but things generalize) has the word w in it, or more generally, that all of the documents in that cluster happen to have the same count of a specific word w. Then what's the maximum likelihood estimate of the parameters in that cluster for that word w? Just looking at that dimension, well, we simply set the mean equal to that observed count, either from that one document or from all those documents that have the same count, and then we shrink the variance to zero about that value. So this isn't actually that contrived of an example, because if we're in a very high-dimensional space, so a very large vocabulary, we really could imagine cases, especially if we have enough clusters to partition up our data, where all the documents in a given cluster might not have a word at all, or there might be just one document that has the word. So this kind of setup really, really can happen in high dimensions, and what ends up happening as a result is that in later iterations we go to compute the likelihood of some other document that has word w.
And if that document has some other count associated with word w than what this cluster has specified, then it's going to have zero likelihood, or zero responsibility, under this given cluster, because this cluster says, nope, you can only have the value that I've observed in my documents so far. And so what we're going to see is that our responsibilities are going to shoot to either zero, if a document doesn't agree on just this one dimension, this one vocabulary word, or one, if we're in that cluster and we have agreement on that dimension. And I want to emphasize that the way you really end up seeing this in practice is as an underflow issue. You're going to get these covariance matrices whose inverses you can't compute, even when you're shrinking the variances very, very small and not setting them exactly equal to zero. So this can be just a practical issue where your code is going to break if you don't correct for this potential problem. And there's a very simple fix that we can consider, which we're going to call either smoothing the parameter estimates or regularizing our updates. This simple fix is just not to let our variance parameters go to zero. And to do this, every time we go to estimate our covariance, we're simply going to add a very small diagonal term to that covariance matrix to ensure that none of the variances are zero. As an alternative approach, you can take a more formal Bayesian approach and place priors over the model parameters: the cluster proportions, the means, and the covariances of the mixture model. And what you can view this as is placing what are called pseudo-observations in each one of the clusters. So it's as if you've fixed some set of observations with some values in every cluster, and that's going to bias our maximum likelihood estimates to account for those observations.
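The simple diagonal-smoothing fix can be sketched as follows (my own minimal version, with hypothetical names): after computing the responsibility-weighted covariance for a cluster, add a small term eps * I to the diagonal so that no variance can collapse to exactly zero, keeping the matrix invertible.

```python
import numpy as np

def smoothed_covariance(X, resp, mu, eps=1e-6):
    """Responsibility-weighted covariance of X (n x d) about mean mu, plus eps*I."""
    diff = X - mu
    cov = (resp[:, None] * diff).T @ diff / resp.sum()
    return cov + eps * np.eye(X.shape[1])

# Degenerate cluster: every assigned document has the same count for word 0.
X = np.array([[1.0, 3.0],
              [1.0, 4.0],
              [1.0, 5.0]])
resp = np.array([1.0, 1.0, 1.0])  # hard assignments written as responsibilities
mu = (resp[:, None] * X).sum(axis=0) / resp.sum()

cov = smoothed_covariance(X, resp, mu)
# Without eps, the first variance would be exactly 0 and cov would be singular.
assert cov[0, 0] > 0 and np.isfinite(np.linalg.inv(cov)).all()
print(cov)
```

The first dimension's variance ends up equal to eps instead of zero, so the inverse (and hence the Gaussian density) stays computable.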
And because of that structure, it's going to avoid cases where you degenerate by not having any actual observations of a certain value. And so we can think of this as smoothing our parameter estimates, but in this Bayesian approach it would be smoothing the cluster proportions and the means and the covariances, rather than the simple approach that we've specified here, where we just smooth our covariance estimate. And you're going to see this type of smoothing when you go to do EM for document clustering in the assignment.
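As one small illustration of the pseudo-observation idea, here is a sketch of smoothing just the cluster proportions (the function name and the choice alpha = 1 are mine): adding alpha pseudo-counts to each cluster's soft count keeps every proportion strictly positive, even for a cluster that currently has no observations at all.

```python
import numpy as np

def smoothed_proportions(soft_counts, alpha=1.0):
    # soft_counts[k] = N_k, the sum of responsibilities for cluster k;
    # alpha acts like alpha pseudo-observations placed in every cluster.
    counts = np.asarray(soft_counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

print(smoothed_proportions([0.0, 10.0, 90.0]))           # empty cluster keeps some mass
print(smoothed_proportions([0.0, 10.0, 90.0], alpha=0))  # unsmoothed MLE: exact zero
```

The same pseudo-observation reasoning extends to the means and covariances, which is the form of smoothing the assignment uses.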