So, by now we've spent quite a bit of time talking about the minimum cost spanning tree problem. There are a number of motivations for doing so. The first is that it's just a uniquely great problem for the study of greedy algorithms. You can propose a bunch of different greedy algorithms, and, quite unusually, all of them seem to work. So you get correct greedy algorithms, but it's quite subtle what's driving their correctness. You also get a lot of practice arguing about graphs, making exchange arguments, and so on. The second reason it's been worthwhile spending time on these algorithms is that it has given us some more practice with data structures and how to use them to speed up algorithms, namely heaps to speed up Prim's algorithm, and the union-find data structure to speed up Kruskal's algorithm. The third reason we're talking about them is that they have applications in their own right. That's the subject of this video and the next, and we're going to focus on applications to clustering problems.

So let me begin by just talking about the goal of clustering informally, and then I'll let you pin me down to a precise objective function on the next slide. In a clustering problem, the input is n points that we think of as being embedded in space. It's actually quite rare that the underlying problem we care about is intrinsically geometric, that it's actually intrinsically about points in space. Usually we're representing something we care about, maybe web pages, maybe images, maybe records in a database, as points in space. And given a bunch of objects, we want to cluster them into, in some sense, coherent groups. For those of you coming from a machine learning background, you'll often hear this problem referred to as unsupervised learning, meaning the data is unlabeled; we're just looking for patterns in data that is not annotated.

This is obviously a fairly wishy-washy description of a problem, so let's be a little bit more precise. We're going to assume that part of the input is what we're going to call a similarity measure, meaning for any two objects, we're going to have a function giving us a number indicating how similar, or really, rather, how dissimilar, they are to each other.
In keeping with the geometric metaphor, we're going to refer to this function as a distance function. One thing that's cool is that we don't need to impose many assumptions on this distance function. The one thing we're going to assume is that it's symmetric, meaning the distance from p to q is the same as the distance from q to p.

So what are some examples? Well, if you really want to go with the geometric metaphor, if you're representing these points in the space R^m for some dimension m, you can just use the Euclidean distance, or, if you prefer, some other norm, like, say, l1 or l-infinity. In many application domains, there are widely accepted similarity or distance measures. One example, for sequences, as we discussed in the introductory lecture, would be the penalty of the best alignment between two genome fragments.

So now that we have this distance function, what would it mean to have coherent groups? Well, things which have small distance from each other, which are similar, should generally be in the same group, and things that are very dissimilar, that have large distance between them, you would expect to mostly be in different groups. So how can we evaluate how good a clustering is, how well it's doing the job of putting nearby points together and dissimilar points in different groups? Well, to be honest, there are many ways of approaching and formalizing that problem. The approach we're going to take is an optimization-based approach: we're going to posit an objective function on clusterings and then seek out the clustering that optimizes that objective function. I want to warn you that's not the only way to approach the problem; there are other interesting approaches, but optimization is a natural one. Furthermore, just like in our scheduling application, there's more than one objective function that people study and that is well-motivated. One very popular objective function would be, say, the k-means objective, which I encourage you to look up and read more about. For this lecture, I'm just going to adopt one specific objective function. It's natural enough, but it's by no means the only one or even the primary one. But it'll serve the purpose of studying a natural greedy algorithm, related to minimum spanning tree algorithms, which is optimal in a precise sense.
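To pin down what "distance function" can mean concretely, here is a minimal Python sketch of the three norms mentioned above; the function names and the choice of plain tuples for points are my own illustrative assumptions, not anything from the lecture. Each function is symmetric, which is the one property we're assuming.

```python
import math

# Each point is a tuple of m coordinates, e.g. (x, y) for m = 2.

def euclidean(p, q):
    """Standard Euclidean (l2) distance between two points in R^m."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def l1(p, q):
    """l1 ("Manhattan") distance: sum of coordinate-wise absolute differences."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def l_infinity(p, q):
    """l-infinity distance: largest coordinate-wise absolute difference."""
    return max(abs(pi - qi) for pi, qi in zip(p, q))

# All three satisfy d(p, q) == d(q, p), the only assumption made here.
```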
So let me develop the terminology needed to state the objective function and the optimization problem precisely. One issue that always comes up in clustering problems is how many clusters you're going to use. To keep things simple in this video, we're going to assume that part of the input is a number k indicating how many clusters you're supposed to use. So we're going to assume you know a priori how many clusters you want. In some application domains, this is a totally reasonable assumption. You might know, for example, that you want exactly two clusters, that k = 2. In some domains you may have good domain knowledge from past experience telling you how many clusters you expect to need; that's all fine and good. Also, in practice, if you don't really know what k is supposed to be, you can go ahead and run the algorithm we're going to talk about for a bunch of different values of k and then just eyeball the results to figure out which is the best solution.

So the objective function we're going to look at is defined in terms of separated pairs of points, that is, points that are assigned to distinct clusters. Now, if you have more than one cluster, inevitably there are going to be some separated pairs of points: some points are going to be in one group, other points are going to be in a different group. So separated points are inevitable, and the most alarming separated points are the ones that are the most similar, the ones that have the smallest distance. If points are separated, we want them to be far apart, so we're particularly concerned with nearby points that are separated. That's going to be our objective function value, called the spacing of a clustering: it's the distance between the closest pair of separated points. Now, what do we want from the spacing of a clustering? Well, we want all of the separated points to be as far apart as possible. That is, we want the spacing to be big; the bigger the better.

So that naturally motivates the formal problem statement. You're given as input the distance measure, that is, you're told the distance between each pair of points, and you're also told the desired number of clusters. Amongst all ways of clustering the points into k clusters, find the clustering which makes the spacing as big as possible.
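As a sanity check on the definition, here is a small Python sketch that computes the spacing of a given clustering, the minimum distance over all pairs of points assigned to distinct clusters. The representation (a clustering as a list of lists of points) and the function name are my own assumptions for illustration.

```python
from itertools import combinations

def spacing(clusters, dist):
    """Spacing of a clustering: the smallest distance between any pair of
    points that lie in *different* clusters.

    clusters: a list of lists, each inner list holding the points of one cluster
    dist:     a symmetric distance function on pairs of points
    """
    best = float("inf")
    # Look at every pair of distinct clusters ...
    for a, b in combinations(clusters, 2):
        # ... and every separated pair of points, one from each cluster.
        for p in a:
            for q in b:
                best = min(best, dist(p, q))
    return best

# Example, using the euclidean helper from the sketch above: with clusters
# {(0,0), (1,0)} and {(5,0)}, the closest separated pair is (1,0) and (5,0),
# so spacing([[(0, 0), (1, 0)], [(5, 0)]], euclidean) returns 4.0.
```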
So let's develop a greedy algorithm that seeks to make the spacing as big as possible. To facilitate the discussion, I'll use an example point set with just six black points up here in the upper right part of the slide.

Now, the key idea behind this greedy algorithm is to not worry about the constraint that, at the end of the day, we can only output k different clusters. We're actually going to be infeasible, we'll have too many clusters, throughout the algorithm. Only at the conclusion of the algorithm will we be down to k clusters, which will be our final feasible solution. That frees us up to initialize the procedure with the degenerate solution where each point is in its own cluster.

So in our example point set, we have these six pink isolated clusters. In general, you're going to have n clusters, and we've got to get down to k. Now, let's remember what the spacing objective is. In the spacing objective, you go over all of the separated pairs of points, so for this degenerate solution, that's all pairs of points, and you look at the most alarming separated pair, that is, the two that are closest to each other. So the spacing is the distance between the closest pair of separated points. Now, in a greedy algorithm, you want to increase your objective function as much as possible, but actually, things are pretty cut and dried in this scenario. Suppose I give you a clustering and you want to make the spacing bigger: the only way you can do that is by taking the currently closest pair of separated points and making them not separated any more, that is, putting them in the same cluster. So it's in some sense obvious what you want to do to make the objective function go up at all. You've got to look at the pair of points that is defining the objective, the closest pair of separated points, and you have to fuse them. You have to fuse their clusters so that they're no longer separated.

In this example, it looks to me like the closest pair of points, which of course are separated, is this pair in the upper right. So, if we want to make the spacing bigger, we fuse them into a common cluster. So we started with six clusters; now we're down to five. Now we reevaluate the spacing of this new clustering. We ask, what is the closest pair of separated points?
That would seem to me to be this pair in the bottom right, and again, the only way we can increase the spacing by merging clusters is to merge these two isolated clusters into one. Now we do it again. We ask, which pair of points determines the current spacing, which is the currently separated pair of points that are nearest to each other? That, to me, would look like this pair on the rightmost part of the picture. The only way to merge two clusters and make the spacing actually go up is to merge the clusters containing the pair of points determining the current spacing. So in this case, two different clusters with two points each would collapse into a single cluster with four points. Let's assume that we wanted three clusters anyway, which is where we are, so at this point, the greedy algorithm is going to halt.

So let me now spell out the pseudocode of the greedy algorithm more generally, but it's exactly what you'd expect given the discussion so far. All right. I'd like you to stare at this pseudocode for ten seconds, or however long you need, and try to relate it to an algorithm that we've seen in the course, in particular, an algorithm that we've seen quite recently. I hope it reminds you strongly of an algorithm we've already covered. Specifically, I hope you see a strong resemblance between this greedy algorithm and Kruskal's algorithm for computing a minimum cost spanning tree. Indeed, we can think of this greedy clustering algorithm as being exactly the same as Kruskal's minimum cost spanning tree algorithm, except aborted early: stopped when there are k components remaining, that is, before the final k - 1 edge additions.

So, just to make sure the correspondence is clear: what is the graph, what are the edges, what are the edge costs? Well, the objects in the clustering problem, that is, the points, correspond to vertices in a graph. The other part of the input of the clustering problem is the distances, which are given for every pair of points, so those play the role that edge costs were playing in the minimum spanning tree problem.
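The pseudocode on the slide isn't reproduced in the transcript, so as a stand-in, here is a minimal Python sketch of the loop as just described; the representation (clusters as lists of points) and the function name are my own illustrative choices, not the lecture's.

```python
from itertools import combinations

def greedy_clustering(points, dist, k):
    """Repeatedly fuse the two clusters containing the closest separated
    pair of points, until exactly k clusters remain."""
    # Degenerate starting solution: each point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the closest pair of separated points; they define the spacing.
        best, best_pair = float("inf"), None
        for i, j in combinations(range(len(clusters)), 2):
            for p in clusters[i]:
                for q in clusters[j]:
                    if dist(p, q) < best:
                        best, best_pair = dist(p, q), (i, j)
        # Fuse their two clusters so this pair is no longer separated.
        i, j = best_pair
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters
```

Recomputing the closest separated pair from scratch each round is the most literal translation of the walkthrough; the Kruskal-style sketch after the next passage shows how the same merge sequence falls out of sorting the pairs once and stopping early.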
Since we have a distance, or an edge cost, for every single pair of points, we can think of the edge set in the clustering problem as being the complete graph, because we have an edge cost, or a distance, for every single pair.

This type of agglomerative clustering, this idea of fusing components one at a time using an MST-like criterion, has a name: it's called single-link clustering. Single-link clustering is a good idea; if you work at all with clustering problems or unsupervised learning problems, it definitely should be a tool in your toolbox. We're going to justify its existence in one particular way in the next video, when we show that it does indeed maximize the spacing over all possible k-clusterings. But even if you don't care about the spacing objective function per se, you want to be familiar with single-link clustering. It has many other nice properties as well.
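To make the correspondence concrete, here is a hedged Python sketch, not the lecture's own code, of single-link clustering written literally as Kruskal's algorithm aborted early: sort all pairs of points (the "edges" of the complete graph) by distance once, process them with a union-find data structure, and stop as soon as k components remain. The UnionFind class is a bare-bones implementation with path halving and union by rank; all names here are my own.

```python
class UnionFind:
    """Minimal union-find with path halving and union by rank."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False                    # already in the same cluster
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True

def single_link_clustering(points, dist, k):
    """Kruskal's MST algorithm on the complete graph, stopped at k components."""
    n = len(points)
    # The "edges" are all pairs of points, with the distance as the edge cost.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    uf = UnionFind(n)
    components = n
    for d, i, j in edges:
        if components == k:
            break                           # aborted early: k clusters remain
        if uf.union(i, j):                  # skip pairs already merged
            components -= 1
    # Group the points by their cluster representative.
    clusters = {}
    for i in range(n):
        clusters.setdefault(uf.find(i), []).append(points[i])
    return list(clusters.values())
```

Because the pairs are sorted once up front, the next edge examined is always the currently closest pair of points, and the edges skipped because their endpoints already share a cluster are exactly the pairs that are no longer separated; so this produces the same merge sequence as the direct loop above, with the running time of Kruskal's algorithm.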