Let's now spend some time getting into the nitty-gritty details of the EM algorithm, and also provide a comparison between EM for mixtures of Gaussians and k-means. Well, to begin with, we talked about the converged solution of EM in the series of pie-chart plots that we just showed. Well, the question is, do we know that the EM algorithm is going to converge? Well, you can actually recast the E and M steps of EM as alternating maximizations of a single objective, and from this we see that EM is actually a coordinate-ascent algorithm. And under mild conditions, you can prove that EM indeed converges to a local mode of the objective. And in your assignment, you're going to monitor convergence of the EM algorithm by computing the log likelihood of the data given the current estimates of our model parameters and responsibilities. Well, based on this nonconvex landscape and convergence to a local mode, how we think of initializing the EM algorithm can be really critical to the quality of the local mode that we find, as well as to the convergence rate of our algorithm. And there are many ways we can think about initializing. One is to randomly select some data points as our initial centers, do hard assignments of observations to the nearest cluster center, and then use those assignments to form our initial parameter estimates; our EM algorithm would iterate from there. Another alternative would be to pick the cluster centers sequentially, like we described in our k-means++ algorithm, instead of just randomly choosing k different data points all at once. Or, we could actually run k-means and use the final solution of that as our cluster centers for this initialization scheme. And there are also other approaches. For example, in speech applications, it's really popular to gradually grow the mixture model by splitting, and sometimes merging, clusters until you have k clusters formed.
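To make the convergence monitoring concrete, here is a minimal sketch (my own toy code, not the assignment's): a tiny EM loop for a 1-D mixture of two Gaussians, initialized from two randomly selected data points, that records the log likelihood after every iteration. Because EM is coordinate ascent on its objective, the recorded log likelihood never decreases.

```python
import numpy as np

# Toy data: two well-separated 1-D Gaussian clusters.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])

def mixture_densities(x, pi, mu, var):
    # dens[i, k] = pi_k * N(x_i | mu_k, var_k)
    return pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Crude initialization: two randomly selected data points as the initial means.
pi = np.array([0.5, 0.5])
mu = x[rng.choice(len(x), size=2, replace=False)].copy()
var = np.array([1.0, 1.0])

log_liks = []
for _ in range(25):
    dens = mixture_densities(x, pi, mu, var)
    resp = dens / dens.sum(axis=1, keepdims=True)   # E-step: responsibilities
    nk = resp.sum(axis=0)                           # M-step: soft counts ...
    pi = nk / len(x)                                # ... proportions,
    mu = (resp * x[:, None]).sum(axis=0) / nk       # ... means,
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk  # ... variances
    log_liks.append(np.log(mixture_densities(x, pi, mu, var).sum(axis=1)).sum())

# EM is coordinate ascent, so the log likelihood never decreases (up to rounding).
assert all(b >= a - 1e-8 for a, b in zip(log_liks, log_liks[1:]))
print("final log likelihood:", round(log_liks[-1], 2))
```

In practice you would stop iterating once the increase in log likelihood between iterations falls below a small tolerance.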
Having discussed convergence and initialization, let's now spend a little bit of time discussing a potential pitfall of the EM algorithm that we've described. And this all stems from the fact that maximum likelihood estimation can overfit the data. So as an example of this, let's imagine that we just have two clusters: the green cluster and the blue cluster. And let's assume that at some iteration of our algorithm, only one observation is assigned to the green cluster and all the other observations are assigned to the blue cluster. When I'm speaking now, I'm talking in terms of hard assignments, but the same thing I'm going to describe holds if we're using soft assignments instead. For this single observation in the green cluster, what's the setting that's going to maximize the likelihood of the data in that cluster? Well, for the green cluster, we're simply going to set the mean of the cluster exactly equal to that data point, and then we're going to shrink its variance all the way to zero, right around that data point. And for the blue cluster in this example, we're going to shift its mean over and increase its variance to capture all the data points that were shown. Okay, so this doesn't seem like a great solution, right? Because we have this one data point with this tiny, tiny, infinitesimally small cluster tied right to that data point's value, and then these two sets of observations, which to me looked pretty separable, all lumped together in one much more diffuse cluster. But the likelihood of this assignment is actually infinite, all driven by the fact that the likelihood of the green observation, under a Gaussian that's absolutely certain that the only value it can take is the value that was observed, is infinite. And so that's going to completely dominate the overall likelihood of this set of assignments, and we're going to see that this is actually the preferred solution.
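A quick numerical illustration of this (mine, not from the lecture): the density of a Gaussian evaluated at its own mean is 1 / sqrt(2 * pi * var), so as the variance shrinks toward zero around the lone green observation, the likelihood of that observation grows without bound.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    # Density of N(mu, var) evaluated at x.
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

point = 1.7  # stand-in value for the single observation in the green cluster
for var in [1.0, 1e-2, 1e-4, 1e-8]:
    # The mean sits exactly on the observation, so the density is 1/sqrt(2*pi*var).
    print(f"var = {var:g}: density at the observation = {gaussian_pdf(point, point, var):.3e}")
```

Each hundredfold shrink in the variance grows the density tenfold, so the likelihood contribution of that single point diverges as the variance goes to zero.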
Luckily, in practice you almost never get into assignments in this very simplistic case of having just that one green observation in a cluster, but this is something that really can happen, and you have to think carefully about it, especially when you're thinking about ways to initialize your algorithm. But the problem can actually be much, much worse in high dimensions. To think about this, let's go back to our document clustering example. Here we're going to assume that we have some large vocabulary, and every document is represented by maybe just a word-count vector, or a tf-idf vector, over that vocabulary. And let's imagine that for a given cluster, only one document in that cluster (again, I'm talking in terms of hard assignments, but things generalize) has the word w in it, or more generally, that all of the documents in that cluster happen to have the same count of a specific word w. Then what's the maximum likelihood estimate of the parameters in that cluster for that word w? Just looking at that dimension, well, we simply set the mean equal to that observed count, either from that one document or from all those documents that have the same count, and then we shrink the variance to zero about that value. So this isn't actually that contrived of an example, because if we're in a very high-dimensional space, so a very large vocabulary, we really could imagine cases, especially if we have enough clusters to partition up our data, where all the documents in a given cluster might not have a word at all, or there might be just one document that has the word. So this kind of setup really, really can happen in high dimensions, and what ends up happening as a result is that in later iterations we go to compute the likelihood of some other document that has word w.
And if that document has some other count associated with word w than what this cluster has specified, then it's going to have zero likelihood, or zero responsibility, under this given cluster, because this cluster says, nope, you can only have the value that I've observed in my documents so far. And so what we're going to see is that our responsibilities are going to shoot to either zero, if a document doesn't agree on just this one dimension, this one vocabulary word, or one, if we're in that cluster and we have agreement on that dimension. And I want to emphasize that the way you really end up seeing this in practice is as an underflow issue. You're going to get these covariance matrices whose inverses you can't compute, even when you're shrinking the variances very, very small and not setting them exactly equal to zero. So this can be just a practical issue where your code is going to break if you don't correct for this potential problem. And there's a very simple fix that we can consider, which we're going to call either smoothing the parameter estimates or regularizing our updates. This simple fix is just not to let our variance parameters go to zero. And to do this, every time we go to estimate our covariance, we're simply going to add a very small diagonal term to that covariance matrix to ensure that none of the variances are zero. As an alternative approach, you can take a more formal Bayesian approach and place priors over the model parameters: the cluster proportions, the means, and the covariances of the mixture model. And what you can view this as is placing what are called pseudo-observations in each one of the clusters. So it's as if you've fixed some set of observations with some values in every cluster, and that's going to bias our maximum likelihood estimates to account for those observations.
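The simple diagonal-smoothing fix can be sketched as follows (my own minimal version, with hypothetical names): after computing the responsibility-weighted covariance for a cluster, add a small term eps * I to the diagonal so that no variance can collapse to exactly zero, keeping the matrix invertible.

```python
import numpy as np

def smoothed_covariance(X, resp, mu, eps=1e-6):
    """Responsibility-weighted covariance of X (n x d) about mean mu, plus eps*I."""
    diff = X - mu
    cov = (resp[:, None] * diff).T @ diff / resp.sum()
    return cov + eps * np.eye(X.shape[1])

# Degenerate cluster: every assigned document has the same count for word 0.
X = np.array([[1.0, 3.0],
              [1.0, 4.0],
              [1.0, 5.0]])
resp = np.array([1.0, 1.0, 1.0])  # hard assignments written as responsibilities
mu = (resp[:, None] * X).sum(axis=0) / resp.sum()

cov = smoothed_covariance(X, resp, mu)
# Without eps, the first variance would be exactly 0 and cov would be singular.
assert cov[0, 0] > 0 and np.isfinite(np.linalg.inv(cov)).all()
print(cov)
```

The first dimension's variance ends up equal to eps instead of zero, so the inverse (and hence the Gaussian density) stays computable.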
And because of that structure, it's going to avoid cases where you degenerate by not having any actual observations of a certain value. And so we can think of this as smoothing our parameter estimates, but in this Bayesian approach it would be smoothing the cluster proportions and the means and the covariances, rather than the simple approach that we've specified here, where we just smooth our covariance estimate. And you're going to see this type of smoothing when you go to do EM for document clustering in the assignment.
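As one small illustration of the pseudo-observation idea, here is a sketch of smoothing just the cluster proportions (the function name and the choice alpha = 1 are mine): adding alpha pseudo-counts to each cluster's soft count keeps every proportion strictly positive, even for a cluster that currently has no observations at all.

```python
import numpy as np

def smoothed_proportions(soft_counts, alpha=1.0):
    # soft_counts[k] = N_k, the sum of responsibilities for cluster k;
    # alpha acts like alpha pseudo-observations placed in every cluster.
    counts = np.asarray(soft_counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

print(smoothed_proportions([0.0, 10.0, 90.0]))           # empty cluster keeps some mass
print(smoothed_proportions([0.0, 10.0, 90.0], alpha=0))  # unsmoothed MLE: exact zero
```

The same pseudo-observation reasoning extends to the means and covariances, which is the form of smoothing the assignment uses.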