[MUSIC] K-means might seem like a heuristic algorithm, but you can actually view it as an example of a coordinate descent algorithm. In particular, k-means iterates between two steps. One is assigning observations to the nearest cluster center, and the other is updating the cluster centers. Well, this first step, where we assign observations to the nearest cluster center, we can clearly see from this equation is equivalent to minimizing the given objective. For the second step, it's a little bit less clear, but we can actually rewrite this second step as well in terms of the minimization of an objective. In particular, minimizing the sum of squared distances, because remember that computing the mean of a set of observations is equivalent to computing its center of mass, and that's exactly what minimizing the sum of squared distances gives us. So now what we see is that we're iterating between two different steps, where each step represents a minimization of some objective. And from this we can see that this is an example of a coordinate descent algorithm. And if the notion of a coordinate descent algorithm isn't sounding very familiar, that's probably because you didn't take the regression course. That's really, really sad. Go back, take the regression course. No, you don't have to take the regression course, but if you're not familiar with coordinate descent, then you can refer to that course, and the videos in that course, to get a more detailed explanation of these types of algorithms. But the idea here in particular is that we're iterating between minimizing over our z variables given our mu's, so minimizing an objective over our cluster indicators given the cluster centers, and then looking at a minimization to find our cluster centers given our assigned cluster indicators. Okay, so now let's get back to this idea of convergence of k-means. So first, does this algorithm converge, and if it does, what does it converge to?
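To make the coordinate descent view concrete, here is a minimal sketch of k-means written as two alternating exact minimizations of the same objective. The function name, the synthetic two-blob data, and the hand-picked initial centers are all illustrative assumptions, not something from the lecture:

```python
import numpy as np

def kmeans_coordinate_descent(X, k, n_iters=20, init=None, seed=0):
    """K-means as coordinate descent on sum_i ||x_i - mu_{z_i}||^2:
    each step exactly minimizes the objective over one block of variables."""
    rng = np.random.default_rng(seed)
    centers = (np.asarray(init, dtype=float).copy() if init is not None
               else X[rng.choice(len(X), size=k, replace=False)].astype(float))
    for _ in range(n_iters):
        # Minimize over z given mu: assign each point to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)
        # Minimize over mu given z: each center moves to the center of mass
        # (the mean) of its assigned points, which minimizes the sum of
        # squared distances for that cluster.
        for j in range(k):
            if np.any(z == j):
                centers[j] = X[z == j].mean(axis=0)
    objective = ((X - centers[z]) ** 2).sum()
    return z, centers, objective

# Two well-separated synthetic blobs; centers initialized near each blob
# so the algorithm converges cleanly (hypothetical data for illustration).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, size=(20, 2)),
               rng.normal(8, 0.2, size=(20, 2))])
z, centers, obj = kmeans_coordinate_descent(X, 2, init=[[0, 0], [8, 8]])
```

Because each step can only lower (or leave unchanged) the shared objective, the objective value is non-increasing across iterations, which is exactly the coordinate descent argument used for convergence below.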
Well, the first thing to note is that the k-means objective landscape is this really complex, non-convex thing. So because of that, we can rule out being guaranteed to converge to a global optimum. But a question now is: do we converge at least to a local optimum? And because we can cast k-means as a coordinate descent algorithm, we know that we are indeed converging to a local optimum. But in a lot of problems, like I said, there's a really complicated landscape to this k-means objective. So there are lots and lots of local modes, and the quality of the solutions at these local modes can vary quite dramatically. So let's look at an example of this. Here's a data set; the black points are data points. And what we're showing by these colored crosses are our initialized centers. So here is one initialization of our k-means algorithm. And then what we're showing in the other plot, with the colored circles, are the labels of the clusters after convergence of this algorithm. And convergence of this algorithm, one way to assess that, is just when our assignments stop changing. And what we see is that we've taken this little group of observations in the bottom left-hand corner and divided it into two clusters, and then we have this big, huge blue cluster capturing everything else in the data set. But if we initialize our algorithm with different cluster centers, then we converge to a completely different solution. And this one might look a little bit more reasonable, depending on what the structure of the data might be, where we end up with three clusters, each of which forms a bit tighter of a group. But let's look at another initialization, where here the groupings actually look similar. So if we flip back and forth between these two initializations, the groupings look similar, but the cluster labels that we associated with each of these have swapped around.
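One way to see this sensitivity to initialization in code is to restart k-means from several random initializations and compare the final objective values; different starts can settle into local modes of very different quality. The helper `run_kmeans` and the three made-up point groups below are illustrative assumptions, not the data set shown in the lecture:

```python
import numpy as np

def run_kmeans(X, init_centers, n_iters=50):
    """Plain k-means from given starting centers; returns labels and objective."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iters):
        z = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(len(centers)):
            if np.any(z == j):
                centers[j] = X[z == j].mean(axis=0)
    return z, ((X - centers[z]) ** 2).sum()

rng = np.random.default_rng(0)
# Three tight, well-separated groups (made-up data just for illustration).
X = np.vstack([rng.normal(c, 0.3, size=(30, 2))
               for c in [(0, 0), (5, 0), (0, 5)]])

# Restart from several random initializations and compare final objectives:
# a bad start can split one group in two and merge the others.
objectives = []
for seed in range(5):
    r = np.random.default_rng(seed)
    init = X[r.choice(len(X), size=3, replace=False)]
    _, obj = run_kmeans(X, init)
    objectives.append(obj)
print(sorted(objectives))
```

A common practical response (used, for example, by scikit-learn's `KMeans` via its `n_init` parameter) is exactly this: run from many initializations and keep the solution with the lowest objective.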
The pink cluster has become the blue, and the blue has become the pink. Well, there's nothing significant about any of our cluster labels, cluster one, two, or three. Over different initializations, which observation gets assigned to cluster one, cluster two, or cluster three doesn't matter; what actually is significant are the groups of observations that are grouped together. So what matters is whether observations xi and xj are assigned to the same group, not whether that group is labeled group one, group two, or group three. Okay, but the other thing to note here are these observations highlighted with this orange circle, where there's really a lot of uncertainty about whether they should be in the blue cluster or this pink cluster. And if you look back at our other solution, and we're going to flip back and forth between these, we see that the assignment of those observations has changed. There's no longer consistency in which grouping those observations appear. In one case they appear in the second cluster, the middle cluster, and in the other case they appear in that top right-hand-side cluster. And this notion of uncertainty in our clustering is really important, and it's something we're going to come back to in the next module. But the point that we want to emphasize here is that k-means is very sensitive to initialization, and you can actually get very crazy solutions, even crazier than what we've shown in these few examples, just by converging to a local mode. [MUSIC]
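The idea that only the groupings matter, not the label names, can be checked directly in code: two label vectors describe the same clustering exactly when they agree on which pairs of points are grouped together. This small helper (the name `same_grouping` is my own, not from the course) compares the pairwise co-assignment relations, so it is invariant to any relabeling of the clusters:

```python
import numpy as np

def same_grouping(z1, z2):
    """True if two label vectors induce the same partition of the points,
    ignoring which integer happens to name each cluster."""
    z1, z2 = np.asarray(z1), np.asarray(z2)
    # Points i and j are 'together' iff they share a label; two clusterings
    # are the same grouping iff they agree on every such pair.
    co1 = z1[:, None] == z1[None, :]
    co2 = z2[:, None] == z2[None, :]
    return bool((co1 == co2).all())

print(same_grouping([0, 0, 1, 1], [2, 2, 0, 0]))  # True: labels swapped, same groups
print(same_grouping([0, 0, 1, 1], [0, 1, 1, 0]))  # False: genuinely different groups
```

Comparing clusterings this way is what more refined measures like the (adjusted) Rand index build on.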