[MUSIC] Finally, I wanted to draw a connection between mixtures of Gaussians and our k-means algorithm. To do this, let's consider a mixture of Gaussians where we have spherically symmetric Gaussians that are the same across the different clusters, just with different cluster centers. This is equivalent to having covariance matrices with sigma squared along the diagonal, where that sigma squared is exactly the same for every element of the diagonal. So this is the covariance matrix associated with each of the different clusters. And then what we're going to do is take this variance term and drive it to 0, so that we have these infinitesimally tight clusters. Well, because we have these spherically symmetric clusters, when we go to compute the relative likelihood of assigning an observation to one cluster versus another, that just depends on the relative distances to the cluster centers. And because we've driven those variances to 0, those relative likelihoods go to either 1 or 0. This happens because of the certainty the clusters have, indicated by their 0 variances, or, in the limit, 0 variances. And of course, when we're computing our cluster responsibilities, we're also weighing in the relative proportions of these different clusters, but that is completely dominated by the significant differences in the likelihoods, with that ratio being either 0 or 1. So what ends up happening here is that data points are fully assigned to a single cluster, just based on the distance to that cluster center. And this is just like what we see in k-means. So when we apply our EM algorithm to this mixture of Gaussians, with variances that are of course the same across dimensions but shrinking to 0, in the E-step, when we're estimating our responsibilities, we get responsibilities that are just 0 or 1, for the reasons we just described. Then, when we go to do our M-step, we just update our cluster means; the variances are fixed, in the limit, at 0. And when we're estimating our cluster means, and of course our cluster proportions too, those cluster means just look at the data points that are now hard assigned to that cluster and use them to estimate the mean of that cluster. So what we see is that these two steps of our EM algorithm are exactly what we do in k-means, where we make hard assignments of observations to clusters just based on the distance to the cluster center, and then we recenter the clusters, that is, we just compute the average of the data points assigned to each cluster. [MUSIC]
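To make this concrete, here is a minimal Python sketch of the idea described above. It is not code from the lecture; the function names, the toy data, and the particular sigma squared values are illustrative assumptions. It computes E-step responsibilities for spherical Gaussians that share one variance, then shows how the soft assignments collapse toward 0/1 hard assignments as that variance shrinks, at which point the M-step mean update is just the average of the points assigned to each cluster, exactly the k-means update.

```python
import numpy as np

def e_step_responsibilities(X, means, weights, sigma_sq):
    # Responsibilities for a mixture of spherical Gaussians that all share
    # the same variance sigma_sq (hypothetical helper, not course code).
    # Squared distance from every point to every cluster center.
    dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    # Unnormalized log-responsibility: log(pi_k) - ||x - mu_k||^2 / (2 sigma^2).
    log_resp = np.log(weights)[None, :] - dists / (2.0 * sigma_sq)
    # Normalize in log space for numerical stability, then exponentiate.
    log_resp -= log_resp.max(axis=1, keepdims=True)
    resp = np.exp(log_resp)
    return resp / resp.sum(axis=1, keepdims=True)

def m_step_means(X, resp):
    # Responsibility-weighted averages; with 0/1 responsibilities this is
    # exactly the k-means re-centering step.
    return (resp.T @ X) / resp.sum(axis=0)[:, None]

# Toy data: two small groups of points near (0, 0) and (1, 1).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (5, 2)), rng.normal(1.0, 0.1, (5, 2))])
means = np.array([[0.0, 0.0], [1.0, 1.0]])
weights = np.array([0.5, 0.5])

for sigma_sq in [1.0, 0.1, 1e-6]:
    resp = e_step_responsibilities(X, means, weights, sigma_sq)
    print(f"sigma^2 = {sigma_sq:g}: largest responsibility per point =",
          np.round(resp.max(axis=1), 3))
    print("  updated means:\n", np.round(m_step_means(X, resp), 3))
# As sigma^2 shrinks, the responsibilities approach hard 0/1 assignments based
# only on the distance to the nearest center, and the mean update becomes the
# plain average of the points assigned to each cluster.
```

With this toy setup, the responsibilities are visibly soft at sigma squared equal to 1 and effectively 0 or 1 at the smallest variance, where the resulting mean updates coincide with k-means recentering each group of points.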