[MUSIC] So we've talked about the k-means clustering algorithm and then a smart initialization for the algorithm. But now let's turn to assessment: how do we measure the quality of a clustering, and what's a method for choosing the number of clusters?

So as an example, here are two of the clusterings that we found based on two different initializations of our k-means algorithm. And a question is, which one do I prefer? Which one is better? Well, to assess this, we can return to the k-means objective: what are we trying to minimize here? k-means is trying to minimize the sum of the squared distances between our observations and the cluster centers. So we're minimizing, over the set of all cluster indicators for our observations x i and the set of all cluster centers mu j, that is, over all possible assignments of observations to clusters and all possible placements of those cluster centers, the sum of these squared distances. Here, the inner sum is the sum of squared distances within cluster j, and then we're summing over all clusters.

So we can think about using this as a measure of quality for a given clustering. In particular, if the sum of the squared distances is small, then that represents a better clustering according to the k-means objective than one that's large. To see this, here we have all these small distances in the pink, or fuchsia, cluster, and all of these little distances in the green cluster. But when we look at the blue cluster, we have some small distances, but a bunch of really, really big distances. So when we sum the squares of all these distances, we get a very large number. Whereas here, we're summing, for each cluster, over smaller numbers on average. So we're going to say that this cluster here is more heterogeneous, meaning that there are more dissimilar objects within a given cluster, whereas here we end up with tighter clusters, where things are more homogeneous.

So we're going to refer to this metric as cluster heterogeneity: the sum of the heterogeneities over all of our different clusters. We want this sum to be small, as it is in examples like this one, where we have these tight, nice clusters. And just to be clear, even though we're calling this a measure of quality, lower is better; we want smaller sums of squared distances.

Okay, so let's spend some time thinking about what happens to this heterogeneity metric as k, the number of clusters, increases. Well, when there are more clusters available to the algorithm, we can refine the clusters more and more to our observed data. And so this can lead to our classic notion of overfitting, though not quite classic, because now we're in an unsupervised setting. But it's a similar kind of idea, where the model is very, very closely tailored to the data in a way that might not be descriptive in general. As an extreme case of this, imagine that we set the number of clusters equal to the number of observed data points. Then we can define each one of our observations to be its own cluster center. So what's the heterogeneity of that clustering? It's 0. And why is it 0? Because we're looking at the sum, over all of our clusters, of the squared distance of each observation to its cluster center, and each cluster center is the observation itself, so every one of those distances is 0.
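Just to write down what we said in symbols: if we let z i denote the cluster indicator of observation x i (the indicator notation here is mine) and mu j the center of cluster j, the objective we just described is

    \min_{\{z_i\},\,\{\mu_j\}} \; \sum_{j=1}^{k} \; \sum_{i\,:\,z_i = j} \lVert x_i - \mu_j \rVert_2^2

where the inner sum is the heterogeneity of cluster j and the outer sum totals it over all clusters. And as a minimal sketch of the same computation in code (a hypothetical helper of my own, not the course's implementation):

    import numpy as np

    def heterogeneity(X, assignments, centers):
        """Sum of squared distances from each observation to its assigned
        cluster center: the quantity the k-means objective minimizes."""
        total = 0.0
        for j, mu in enumerate(centers):
            members = X[assignments == j]          # observations assigned to cluster j
            total += np.sum((members - mu) ** 2)   # squared distances to center j
        return total

Lower values of this quantity mean tighter, more homogeneous clusters.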
So, in general, the lowest possible heterogeneity that we can achieve decreases as we increase k. But I want to emphasize that this holds for the lowest possible value, not for any given run of our k-means algorithm. Let's say we run it with k + 1 clusters; well, the heterogeneity of that converged solution could actually be larger than the heterogeneity of a converged solution where we ran the algorithm with just k clusters. But when we're thinking over the set of possible heterogeneities, the trend is that the lowest achievable value decreases with k.

So a question is, how do we think about choosing the number of clusters? Well, we said that the lowest possible heterogeneity we can achieve decreases as we increase the number of clusters, and we said lower heterogeneity is better. But then we also said we don't want too many clusters, because that won't really describe groupings that are very natural; we'd just over-partition the data into these little groups and get very low heterogeneity, but not a good description, not good structure extracted from the data. So somehow we need to trade off between these two things.

And so what we can do is use a heuristic, where we look at something called the elbow of the curve. Here it's a little bit unclear whether the elbow is here or here, so I'll just say, choose one of these values of k. Maybe I would prefer this value of k, because the difference in heterogeneity relative to much, much larger values isn't very big, but this should provide a more parsimonious description of my data than using a much, much larger k. But this is just a heuristic. So in general, there's really no single right answer for what the value of k should be, and you're going to explore this in depth in the assignment associated with this module.

There's one last thing I wanted to mention here, which is that so far this plot is a plot of the lowest possible heterogeneity. Of course, we're not going to compute all possible clusterings for every value of k to define this curve. So in practice, what people do is run multiple random restarts of the algorithm. Even if we're using k-means++, we can mix random initializations in as well. Do a bunch of different restarts for every value of k, then take the clustering that provides the lowest measure of heterogeneity, plot that as a function of k, and choose k based on that, as in the sketch below. But again, that itself is very computationally intensive, so in practice people do a wide range of things.

But it's worth mentioning that how you choose k really, really matters for what the resulting clusters represent. If you choose k = 2, you're just dividing your data into two groups; if there are really more than two inherently, very clearly separated clusters, you're not going to capture that. But if you choose a million clusters, you're going to get very, very finely tuned clusters, obviously depending on the size of your data; I'm trying to give an example that's very large, so bear with me. The point is that you really need to look at the data and think about what these things mean in your application. And when it's hard to look at the data, you can start doing things like the heuristic we're talking about here, where you select the elbow of this curve. But again, you'll get a lot of hands-on practice with this in the assignment.
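To make that restart procedure concrete, here's a minimal sketch of the elbow computation. It uses scikit-learn's KMeans as a stand-in implementation (my substitution, not the course's code); its inertia_ attribute is exactly the sum of squared distances to the assigned centers, i.e. the heterogeneity:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    def best_heterogeneity(X, k, n_restarts=10, seed=0):
        """Lowest heterogeneity found over several restarts for a given k,
        mixing k-means++ and purely random initializations."""
        best = np.inf
        for r in range(n_restarts):
            for init in ('k-means++', 'random'):
                km = KMeans(n_clusters=k, init=init, n_init=1,
                            random_state=seed + r).fit(X)
                best = min(best, km.inertia_)   # inertia_ = sum of squared distances
        return best

    X = np.random.RandomState(0).randn(300, 2)   # stand-in for real data
    ks = list(range(1, 11))
    curve = [best_heterogeneity(X, k) for k in ks]

    plt.plot(ks, curve, marker='o')              # look for the bend, the "elbow"
    plt.xlabel('k (number of clusters)')
    plt.ylabel('lowest heterogeneity found')
    plt.show()

The bend in the resulting curve is the elbow we were just talking about: past it, adding more clusters keeps lowering heterogeneity, but only a little. [MUSIC]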