This video will prove the correctness of our greedy algorithm for clustering. We'll show that it maximizes the spacing over all possible k-clusterings. You might have hoped that we could deduce the correctness of this greedy algorithm for clustering immediately from our correctness proofs for various greedy minimum spanning tree algorithms. Unfortunately, that doesn't seem to be the case. In the minimum cost spanning tree problem, we're focusing on minimizing the sum of the edge costs. Here we're looking at a different objective, maximizing the spacing. We do need to do a proof from scratch. That said, the arguments we'll use should look familiar to you, not just from the exchange-type arguments we used when we proved the cut property, but also, going back further, they might remind you even more of our greedy algorithms for scheduling.

So let's now set up the notation for the proof. As usual, we're going to look at the output of our algorithm. It achieves some objective function value, some spacing. We're going to look at an arbitrary competitor, some other proposed clustering, and we're going to show that we're at least as good, that our spacing is at least as large. So specifically, we'll denote the clusters in the output of our algorithm by C1 up to CK.
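To make the setup concrete, here is a minimal Python sketch of the greedy clustering algorithm under discussion: repeatedly merge the closest pair of separated points, using a union-find over point indices exactly as Kruskal's algorithm does over vertices, until K clusters remain. The function name and the distance-callback interface are my own illustrative choices, not from the lecture.

```python
import itertools

def greedy_k_clustering(points, k, dist):
    """Merge the closest separated pair until k clusters remain;
    return the clusters and their spacing (the distance of the
    merge the algorithm would perform next)."""
    parent = list(range(len(points)))

    def find(x):  # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Consider point pairs in order of increasing distance,
    # exactly as Kruskal's algorithm considers edges.
    pairs = sorted(itertools.combinations(range(len(points)), 2),
                   key=lambda pq: dist(points[pq[0]], points[pq[1]]))

    clusters, spacing = len(points), None
    for p, q in pairs:
        if find(p) == find(q):
            continue            # p and q already share a cluster
        if clusters == k:       # the next merge determines the spacing
            spacing = dist(points[p], points[q])
            break
        parent[find(p)] = find(q)   # direct merger of two clusters
        clusters -= 1

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values()), spacing
```

For example, on the 1-D point set `[0, 1, 10, 11, 30]` with k = 3 and absolute difference as the distance, the sketch merges the two tight pairs and reports the gap between them as the spacing.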
Our clustering has some spacing, some distance between the closest pair of separated points; call it capital S. We're going to denote our competitor, some alternative k-clustering, by C-hat 1 up to C-hat K. What is it that we're trying to show? We want to show that this arbitrary other clustering has spacing no larger than ours. If we can show that, then, because this clustering was arbitrary, it means the greedy clustering has spacing as large as any other, so it's maximizing the spacing; that's what we want to prove. Put differently, we want to exhibit a pair of points separated by this clustering C-hat 1 up to C-hat K, such that the distance between those separated points is S or smaller.

So let me just quickly dispose of a trivial case. If the C-hats are the same as the C's, possibly up to a renaming, then of course exactly the same pairs of points are separated in each of the clusterings, so the spacing is exactly the same. So that's not a case we have to worry about. The interesting case, then, is when the C-hats differ fundamentally from the C's, when they're not merely a permutation of the clusters in the greedy clustering. And the maneuver we're going to do here is similar in spirit to what we did in our scheduling correctness proof.
Way back in our scheduling correctness proof, we argued that any schedule that differs from the greedy one suffers from, in some sense, a local flaw: we identified an adjacent pair of jobs that was, in some sense, out of order with respect to the greedy ordering. The analog here is, we're going to argue that, for any clustering which is not merely a permutation of the greedy clustering, there has to be a pair of points which is classified differently in the C-hats relative to the C's. By differently, I mean they're clustered together in the greedy clustering: these points p and q belong to the same cluster, C sub i. Yet in this alternative clustering, which is not just a permutation of the greedy clustering, they're placed in different clusters, maybe p in C-hat i and q in some other C-hat j.

So I want to now split the proof into an easy case and a tricky case. To explain why the easy case is easy, let's observe a property that this greedy clustering algorithm has. The algorithm's philosophy is that the squeaky wheel should get the grease. That is, the separated pair of points that are closest to each other are the ones that should get merged.
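The existence claim above can be checked mechanically: if two K-clusterings of the same points, given as one cluster label per point, differ by more than a renaming of labels, then some pair of points must share a cluster in one and be split in the other. A small hypothetical helper, for illustration only:

```python
def split_pair(greedy_labels, other_labels):
    """Return a pair (p, q) clustered together under greedy_labels
    but split under other_labels, or None if no such pair exists.
    Such a pair must exist whenever the two k-clusterings differ
    by more than a renaming of the cluster indices."""
    n = len(greedy_labels)
    for p in range(n):
        for q in range(p + 1, n):
            if greedy_labels[p] == greedy_labels[q] and \
               other_labels[p] != other_labels[q]:
                return p, q
    return None
```

For instance, `split_pair([0, 0, 1, 1, 2], [1, 1, 0, 0, 2])` returns `None`, since the second labeling is just a renaming of the first, while splitting any tight pair produces a witness.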
So for this reason, because it's always the closest separated pair that gets merged, if you look at the sequence of point pairs that get merged together, the pairs that determine the spacing in each successive iteration, the distances between these worst-separated points only go up over time. At the beginning of the algorithm, the closest pair of points in the entire point set are the ones that get directly merged. Then those are out of the picture, and now some further-away pair of separated points determines the spacing, and they get merged. Once they've been coalesced, there is still some further-away pair of points which is now the closest separated pair; they get merged, and so on. So if you look at the sequence of distances between the pairs of points that are directly merged by the greedy algorithm, that sequence is only going up over time. And this sequence culminates with the final spacing S of the greedy algorithm. In some sense, the spacing of the output of the greedy algorithm is the distance between the point pair that would get merged if we ran the greedy algorithm one more iteration, but unfortunately we're not allowed to do that. Okay? So the point is, for every pair of points directly merged by the greedy algorithm, they're always a distance at most S away from each other.
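This monotonicity property is easy to observe by instrumenting the greedy algorithm to record the distance of every direct merger it performs. A sketch, assuming Python with a union-find over point indices; the function name and interface are illustrative:

```python
import itertools

def merge_distances(points, k, dist):
    """Run the greedy algorithm and record the distance of every
    direct merger, plus the final spacing S (the distance of the
    merge it would perform next)."""
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = sorted(itertools.combinations(range(len(points)), 2),
                   key=lambda pq: dist(points[pq[0]], points[pq[1]]))
    merges, clusters = [], len(points)
    for p, q in pairs:
        if find(p) == find(q):
            continue
        d = dist(points[p], points[q])
        if clusters == k:
            return merges, d  # d is the spacing S
        parent[find(p)] = find(q)
        merges.append(d)
        clusters -= 1
    return merges, None  # degenerate case: k too small to stop early
```

Because the pairs are scanned in sorted order, the recorded merge distances are nondecreasing, and every one of them is at most the spacing S that the sequence culminates in, which is exactly the property the proof relies on.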
So the easy case, then, is when this pair of points p, q, which on the one hand lie in a common greedy cluster, but on the other hand lie in different clusters of the C-hats, were at some point not merely in the same cluster but actually directly merged by the greedy algorithm. If, at some iteration, they determined the spacing and were picked by the greedy algorithm to have their clusters merged, then we just argued that the distance between p and q is no more than the spacing capital S of the greedy clustering. And since p and q lie in different clusters of the C-hats, they're separated by the C-hats, and therefore they upper bound the spacing of the C-hats. Maybe there's some even closer pair separated by the C-hats, but at the very least p and q are separated, so they upper bound the spacing of the C-hat clustering. So that's what we wanted to prove. We wanted to show that this alternative clustering didn't have better spacing than our greedy clustering; its spacing had to be at most as big, at most capital S. So in this easy case, when p and q are directly merged by the greedy algorithm, we're done. So the tricky case is when p and q are only indirectly merged, and you may be wondering at the moment, what does that mean?
How did two points wind up in the same cluster if they weren't, at some point, directly merged? So let's draw a picture and see how that can happen.

So the issue is that two points p and q might wind up in a common greedy cluster not because the greedy algorithm ever explicitly considered that point pair, but rather because of a path, or cascade, of direct mergers of other point pairs. Imagine, for example, that at some iteration of the greedy algorithm the point p was considered explicitly along with the point a1, where here a1 is meant to be different than q. So that's a direct merger, and p and a1 wind up in the same cluster; their clusters are merged. Maybe the same thing happened to the point q with some point a sub l which is different than p. Sooner or later, maybe at some other time, some totally unrelated pair of points a2 and a3 are directly merged, and then at some point a1 and a2 are considered by the greedy algorithm, because they're the closest pair of separated points, and they get merged. And so on. So the edges in this picture are meant to indicate direct mergers, pairs of points that are explicitly fused because they determine the spacing at some point of the greedy iteration.
But at the end of the day, the greedy clustering is going to reflect the results of all of these mergers. So in case you're feeling confused, let me just point out that we saw this exact same thing going on when we were talking about minimum spanning trees in Kruskal's algorithm. At an intermediate point in Kruskal's algorithm, after it's added some edges but before it's constructed a spanning tree, the intermediate state, as we discussed, is a bunch of different connected components. Vertices that have a chosen edge between them are, of course, going to be in the same connected component. But then again, a connected component can have long paths in it. So you could have vertices that are in the same connected component in an intermediate state of Kruskal's algorithm despite the fact that we haven't chosen an edge directly between them; rather, there's a path of chosen edges between them. It's exactly the same thing going on here. Now, what we have going for us is that, if a pair of points, as discussed, was directly merged, we know they're close: the distance between them is at most the spacing, capital S. We really don't know anything, frankly, about the distance between pairs of points that were not directly merged.
They just sort of accidentally wound up in a common cluster. But this turns out to be good enough. This is actually sufficient to argue that this competitor clustering, the C-hats, has spacing no more than S, no better than ours. Let's see why.

So given that p and q are in a common greedy cluster, it must mean there was a path of direct mergers that forced them to be in the same cluster. So let's let the intermediate points involved in that path be denoted a1 up to a sub l. So here's the part of the proof where we basically reduce the tricky case to the easy case. So we've got this pair of points p, q. Now remember, not only are they in a common greedy cluster, but they're in different clusters in our competitor, the C-hats. So the point p is in some cluster, call it C-hat i, and q is in something else; in particular, it's not in C-hat i. Now, imagine you go on a hike. You start at the point p, and you hike along this path. You traverse these direct mergers toward q. Now, you're starting inside C-hat i, and you end up outside. So at some point on your hike, you will traverse the boundary. You will, for the first time, escape from C-hat i and wind up in some other cluster. So that has to happen.
And let's call a sub j and a sub j+1 the consecutive pair of points at which you go from inside this cluster to outside this cluster. And now we're back in the easy case. Now we're dealing with a separated pair that was directly merged by the greedy algorithm. Remember that we set up this path to be a path of direct mergers, so in particular, a sub j and a sub j+1 were directly merged; therefore their distance is at most S. And again, by virtue of being directly merged, their distance is at most the spacing of the greedy clustering, and yet, as a pair separated by the C-hats, it's also an upper bound on the spacing of the C-hats. This means the spacing S of our greedy clustering is as good as the competitor's. Since the competitor was arbitrary, ours is optimal. That completes the proof.
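As a sanity check on the theorem, one can compute the spacing of any proposed clustering directly and confirm it never beats the greedy spacing. A hypothetical helper, assuming each clustering is given as one cluster label per point:

```python
import itertools

def spacing(points, labels, dist):
    """Spacing of a clustering (one label per point): the minimum
    distance over all pairs of points placed in different clusters."""
    return min(dist(points[p], points[q])
               for p, q in itertools.combinations(range(len(points)), 2)
               if labels[p] != labels[q])

pts = [0, 1, 10, 11, 30]
d = lambda a, b: abs(a - b)
greedy = [0, 0, 1, 1, 2]   # the 3-clustering the greedy algorithm produces here
rival = [0, 0, 0, 1, 2]    # an arbitrary competitor 3-clustering
```

On this tiny example the greedy clustering has spacing 9, while the rival clustering separates the nearby points 10 and 11 and so has spacing 1, consistent with the theorem.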