So, by now we've spent quite a bit of time talking about the minimum cost spanning tree problem. There are a number of reasons why. The first is that it's just a uniquely great problem for the study of greedy algorithms. You can propose a bunch of different greedy algorithms, and, quite unusually, all of them seem to work. So you get correct greedy algorithms, but it's quite subtle what's driving their correctness. You also get a lot of practice arguing about graphs, making exchange arguments, and so on. The second reason it's been worthwhile spending time on these algorithms is that they've given us some more practice with data structures and how to use them to speed up algorithms, namely heaps to speed up Prim's algorithm and the union-find data structure to speed up Kruskal's algorithm. The third reason we're talking about them is that they have applications in their own right. That's the subject of this video and the next, and we're going to focus on applications to clustering problems. So let me begin by talking about the goal of clustering informally, and then I'll let you pin me down to a precise objective function on the next slide. In a clustering problem, the input is n points that we think of as being embedded in space. Now, it's actually quite rare that the underlying problem we care about is intrinsically geometric, that it's literally points in space. Usually we're representing something we care about, maybe web pages, maybe images, maybe records in a database, as points in space. And given a bunch of objects, we want to cluster them into, in some sense, coherent groups. For those of you coming from a machine learning background, you'll often hear this problem referred to as unsupervised learning, meaning the data is unlabeled: we're just looking for patterns in data that is not annotated. This is obviously a fairly wishy-washy description of a problem, so let's be a little more precise. We're going to assume that part of the input is what we'll call a similarity measure, meaning that for any two objects we have a function giving us a number indicating how similar, or really how dissimilar, they are to each other. In keeping with the geometric metaphor, we'll refer to this function as a distance function. One thing that's cool is that we don't need to impose many assumptions on this distance function. The one thing we will assume is that it's symmetric, meaning the distance from p to q is the same as the distance from q to p. So what are some examples? Well, if you really want to go with the geometric metaphor, if you're representing these points as points in R^m for some dimension m, you can just use the Euclidean distance, or, if you prefer, some other norm, like say l1 or l-infinity. In many application domains, there are widely accepted similarity or distance measures. One example, for sequences, as we discussed in the introductory lecture, would be the penalty of the best alignment between two genome fragments. So now that we have this distance function, what would it mean to have coherent groups? Well, things which have small distance from each other, which are similar, should generally be in the same group, and things that are very dissimilar, that have large distance between them, you would expect to mostly be in different groups.
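To make the distance-function idea concrete, here's a minimal sketch, in Python, of a few standard choices for points in R^m; the function names are my own, and which measure is appropriate depends entirely on the application.

```python
from math import sqrt

# Three common distance functions on points in R^m, each given as a
# tuple of coordinates.  All three are symmetric: d(p, q) == d(q, p).

def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def l1(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def l_infinity(p, q):
    return max(abs(a - b) for a, b in zip(p, q))
```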
So how can we evaluate how good a clustering is, how well it's doing the job of putting nearby points together and dissimilar points in different groups? Well, to be honest, there are many ways of approaching and formalizing that problem. The approach we're going to take is an optimization-based approach: we're going to posit an objective function on clusterings and then seek out the clustering that optimizes that objective function. I want to warn you that's not the only way to approach the problem. There are other interesting approaches, but optimization is a natural one. Furthermore, just like in our scheduling application, there's more than one objective function that people study and that is well motivated. One very popular objective function is the k-means objective, which I encourage you to look up and read more about. For this lecture, I'm just going to adopt one specific objective function. It's natural enough, but it's by no means the only one or even the primary one. But it'll serve the purpose of studying a natural greedy algorithm, related to minimum spanning tree algorithms, which is optimal in a precise sense. So let me develop the terminology needed to state the objective function and the optimization problem precisely. One issue that always comes up in clustering problems is how many clusters you're going to use. To keep things simple in this video, we're going to assume that part of the input is a number k indicating how many clusters you're supposed to use. That is, we assume you know a priori how many clusters you want. In some application domains, this is a totally reasonable assumption. You might know, for example, that you want exactly two clusters, that is, k = 2. In some domains you may have good domain knowledge from past experience telling you how many clusters you expect to need; that's all fine and good. Also, in practice, if you don't really know what k is supposed to be, you can go ahead and run the algorithm we're going to talk about for a bunch of different values of k and then use some metric, or just eyeball it, to figure out which solution is best. So the objective function we're going to look at is defined in terms of separated pairs of points, that is, points that are assigned to distinct clusters. Now, if you have more than one cluster, there are inevitably going to be separated pairs: some points are going to be in one group and other points are going to be in a different group. So separated points are inevitable, and the most alarming separated points are the ones that are the most similar, the ones that have the smallest distance. If points are separated, we want them to be far apart, so we're particularly concerned with nearby points that are separated. That's going to be our objective function value, called the spacing of a clustering: it's the distance between the closest pair of separated points. Now, what do we want from the spacing of a clustering? Well, we want all of the separated points to be as far apart as possible. That is, we want the spacing to be big; the bigger the better. So that naturally motivates the formal problem statement. You're given as input the distance measure, that is, the distance between each pair of points, and you're also told the desired number of clusters k. Amongst all ways of clustering the points into k clusters, find the clustering that makes the spacing as big as possible.
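Just to pin the definition down, here's a small Python sketch of how one might compute the spacing of a given clustering; the argument names (dist, cluster_of) are my own illustration, not notation from the lecture.

```python
from itertools import combinations

def spacing(points, dist, cluster_of):
    """Spacing of a clustering: the minimum distance over all separated
    pairs, i.e. pairs of points assigned to distinct clusters.

    points     -- list of the n points
    dist       -- symmetric distance function dist(p, q)
    cluster_of -- cluster label for each point, indexed like points
    """
    return min(dist(points[i], points[j])
               for i, j in combinations(range(len(points)), 2)
               if cluster_of[i] != cluster_of[j])
```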
So let's develop a greedy algorithm that seeks to make the spacing as big as possible. And to facilitate the discussion, I'll use an example point set with just six black points up here in the upper right part of the slide. Now, the key idea behind this greedy algorithm is to not worry about the constraint that, at the end of the day, we can only output k different clusters. We're actually going to be infeasible, with too many clusters, throughout the algorithm. Only at the conclusion of the algorithm will we be down to k clusters, which will be our final feasible solution. So that frees us up to initialize the procedure with the degenerate solution where each point is in its own cluster. In our example point set, we have these six pink isolated clusters. In general, you're going to have n clusters, and we've got to get down to k. Now, let's remember what the spacing objective is. In the spacing objective, you go over all of the separated pairs of points; for this degenerate solution, that's all pairs of points. And you look at the most alarming separated pair, that is, the pair that's closest together. So the spacing is the distance between the closest pair of separated points. Now, in a greedy algorithm, you want to increase your objective function as much as possible. But actually, things are pretty cut and dried in this scenario. Suppose I give you a clustering and you want to make the spacing bigger; the only way you can do that is by taking the currently closest pair of separated points and making them not separated any more, that is, putting them in the same cluster. So it's in some sense obvious what you want to do to make the objective function go up at all: you have to look at the pair of points that is defining the objective, the closest pair of separated points, and you have to fuse their clusters so that they're no longer separated. In this example, it looks to me like the closest pair of points, which of course are separated, is this pair in the upper right. So, if we want to make the spacing bigger, we fuse them into a common cluster. We started with six clusters; now we're down to five. So now we reevaluate the spacing of this new clustering. We ask, what is the closest pair of separated points? That would seem to me to be this pair in the bottom right. And again, the only way we can increase the spacing by merging clusters is to merge these two isolated clusters into one. Now, we do it again. We ask, which pair of points determines the current spacing, that is, which currently separated pair of points are nearest to each other? That would look to me like this pair on the rightmost part of the picture. The only way to merge two clusters and make the spacing actually go up is to merge the clusters containing the pair of points determining the current spacing. So in this case, two different clusters with two points each collapse into a single cluster with four points. Let's assume that we wanted three clusters anyway, which is where we are. So at this point, the greedy algorithm is going to halt. So let me now spell out the pseudocode of the greedy algorithm more generally, but it's exactly what you'd expect given the discussion so far. All right. So, I'd like you to stare at this pseudocode for ten seconds, or however long you need, and try to relate it to an algorithm that we've seen in the course, in particular, an algorithm that we've seen quite recently.
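Since the pseudocode itself lives on the slide, here is one possible rendering of the greedy procedure in Python, using a union-find structure to track clusters. Treat it as a sketch: the names and the specific union-find details are my own choices rather than anything fixed by the lecture.

```python
from itertools import combinations

def greedy_max_spacing_clustering(points, dist, k):
    """Repeatedly fuse the two clusters containing the closest pair of
    separated points, starting from n singleton clusters and stopping
    as soon as only k clusters remain."""
    n = len(points)
    parent = list(range(n))              # union-find parent pointers

    def find(x):                          # find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Every pair of points, in order of increasing distance.
    pairs = sorted(combinations(range(n), 2),
                   key=lambda e: dist(points[e[0]], points[e[1]]))

    clusters = n
    for i, j in pairs:
        if clusters == k:                 # down to k clusters: halt
            break
        ri, rj = find(i), find(j)
        if ri != rj:                      # closest separated pair: fuse clusters
            parent[ri] = rj
            clusters -= 1

    return [find(i) for i in range(n)]    # a cluster label per point
```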
I hope it reminds you strongly of an algorithm we've already covered. Specifically, I hope you see a strong resemblance between this greedy algorithm and Kruskal's algorithm for computing a minimum cost spanning tree. Indeed, we can think of this greedy clustering algorithm as being exactly the same as Kruskal's minimum cost spanning tree algorithm, except aborted early: stopped when there are k components remaining, that is, before the final k - 1 edge additions. So, just to make sure the correspondence is clear: what is the graph, what are the edges, what are the edge costs? Well, the objects in the clustering problem, that is, the points, correspond to the vertices of the graph. The other part of the input to the clustering problem is the distances, which are given for every pair of points, so those play the role that edge costs played in the minimum spanning tree problem. Since we have a distance, or an edge cost, for every single pair of points, we can think of the edge set in the clustering problem as being the complete graph. This type of agglomerative clustering, this idea of fusing components one at a time using an MST-like criterion, has a name: it's called single-link clustering. Single-link clustering is a good idea. If you work at all with clustering problems or unsupervised learning problems, it definitely should be a tool in your toolbox. We're going to justify its existence in one particular way in the next video, when we show that it does indeed maximize the spacing over all possible k-clusterings. But even if you don't care about the spacing objective function per se, you want to be familiar with single-link clustering. It has many other nice properties as well.
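Because single-link clustering is such a standard tool, you rarely have to implement it yourself. As one illustration, scikit-learn's agglomerative clustering supports single linkage; the point set below is just a made-up toy example with three well-separated pairs.

```python
from sklearn.cluster import AgglomerativeClustering

# Six toy points forming three well-separated pairs, clustered with k = 3.
X = [[0.0, 0.0], [0.1, 0.2],
     [5.0, 5.1], [5.2, 5.0],
     [9.0, 0.1], [9.1, 0.0]]

labels = AgglomerativeClustering(n_clusters=3, linkage='single').fit_predict(X)
print(labels)   # each nearby pair lands in the same cluster;
                # the exact label numbering may vary
```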