This video will prove the correctness of our greedy algorithm for clustering. We'll show that it maximizes the spacing over all possible k-clusterings. You might have hoped that we could deduce the correctness of this greedy algorithm for clustering immediately from our correctness proofs for various greedy minimum spanning tree algorithms. Unfortunately, that doesn't seem to be the case. In the minimum cost spanning tree problem, we're focusing on minimizing the sum of the edge costs. Here we're looking at a different objective, maximizing the spacing. We do need to do a proof from scratch. That said, the arguments we'll use should look familiar to you, not just from the exchange-type arguments we used when we proved the cut property, but also, going back further, they might remind you even more of our greedy algorithms for scheduling.

So let's now set up the notation for the proof. As usual, we're going to look at the output of our algorithm. It achieves some objective function value, some spacing. We're going to look at an arbitrary competitor, some other proposed clustering, and we're going to show that we're at least as good, that our spacing is at least as large. So specifically, we'll denote the clusters in the output of our algorithm by C1 up to CK.
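To make the setup concrete, here is a minimal Python sketch of the greedy clustering algorithm under discussion: repeatedly merge the closest pair of separated points, using a union-find over point indices exactly as Kruskal's algorithm does over vertices, until K clusters remain. The function name and the distance-callback interface are my own illustrative choices, not from the lecture.

```python
import itertools

def greedy_k_clustering(points, k, dist):
    """Merge the closest separated pair until k clusters remain;
    return the clusters and their spacing (the distance of the
    merge the algorithm would perform next)."""
    parent = list(range(len(points)))

    def find(x):  # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Consider point pairs in order of increasing distance,
    # exactly as Kruskal's algorithm considers edges.
    pairs = sorted(itertools.combinations(range(len(points)), 2),
                   key=lambda pq: dist(points[pq[0]], points[pq[1]]))

    clusters, spacing = len(points), None
    for p, q in pairs:
        if find(p) == find(q):
            continue            # p and q already share a cluster
        if clusters == k:       # the next merge determines the spacing
            spacing = dist(points[p], points[q])
            break
        parent[find(p)] = find(q)   # direct merger of two clusters
        clusters -= 1

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values()), spacing
```

For example, on the 1-D point set `[0, 1, 10, 11, 30]` with k = 3 and absolute difference as the distance, the sketch merges the two tight pairs and reports the gap between them as the spacing.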
Our clustering has some spacing, some distance between the closest pair of separated points; call it capital S. We're going to denote our competitor, some alternative k-clustering, by C-hat 1 up to C-hat K. What is it that we're trying to show? We want to show that this arbitrary other clustering has spacing no larger than ours. If we can show that, then, because this clustering was arbitrary, it means the greedy clustering has spacing as large as any other, so it's maximizing the spacing; that's what we want to prove. Put differently, we want to exhibit a pair of points separated by this clustering C-hat 1 up to C-hat K, such that the distance between those separated points is S or smaller.

So let me just quickly dispose of a trivial case. If the C-hats are the same as the C's, possibly up to a renaming, then of course exactly the same pairs of points are separated in each of the clusterings, so the spacing is exactly the same. So that's not a case we have to worry about. The interesting case, then, is when the C-hats differ fundamentally from the C's, when they're not merely a permutation of the clusters in the greedy clustering. And the maneuver we're going to do here is similar in spirit to what we did in our scheduling correctness proof.
Way back in our scheduling correctness proof, we argued that any schedule that differs from the greedy one suffers from, in some sense, a local flaw: we identified an adjacent pair of jobs that was, in some sense, out of order with respect to the greedy ordering. The analog here is, we're going to argue that, for any clustering which is not merely a permutation of the greedy clustering, there has to be a pair of points which is classified differently in the C-hats relative to the C's. By differently, I mean they're clustered together in the greedy clustering: these points p and q belong to the same cluster, C sub i. Yet in this alternative clustering, which is not just a permutation of the greedy clustering, they're placed in different clusters, maybe p in C-hat i and q in some other C-hat j.

So I want to now split the proof into an easy case and a tricky case. To explain why the easy case is easy, let's observe a property that this greedy clustering algorithm has. The algorithm's philosophy is that the squeaky wheel should get the grease. That is, the separated pair of points that are closest to each other are the ones that should get merged.
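The existence claim above can be checked mechanically: if two K-clusterings of the same points, given as one cluster label per point, differ by more than a renaming of labels, then some pair of points must share a cluster in one and be split in the other. A small hypothetical helper, for illustration only:

```python
def split_pair(greedy_labels, other_labels):
    """Return a pair (p, q) clustered together under greedy_labels
    but split under other_labels, or None if no such pair exists.
    Such a pair must exist whenever the two k-clusterings differ
    by more than a renaming of the cluster indices."""
    n = len(greedy_labels)
    for p in range(n):
        for q in range(p + 1, n):
            if greedy_labels[p] == greedy_labels[q] and \
               other_labels[p] != other_labels[q]:
                return p, q
    return None
```

For instance, `split_pair([0, 0, 1, 1, 2], [1, 1, 0, 0, 2])` returns `None`, since the second labeling is just a renaming of the first, while splitting any tight pair produces a witness.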
So for this reason, because it's always the closest separated pair that gets merged, if you look at the sequence of point pairs that get merged together, the pairs that determine the spacing in each successive iteration, the distances between these worst-separated points only go up over time. At the beginning of the algorithm, the closest pair of points in the entire point set are the ones that get directly merged. Then those are out of the picture, and now some further-away pair of separated points determines the spacing, and they get merged. Once they've been coalesced, there is still some further-away pair of points which is now the closest separated pair; they get merged, and so on. So if you look at the sequence of distances between the pairs of points that are directly merged by the greedy algorithm, that sequence is only going up over time. And this sequence culminates with the final spacing S of the greedy algorithm. In some sense, the spacing of the output of the greedy algorithm is the distance between the point pair that would get merged if we ran the greedy algorithm one more iteration, but unfortunately we're not allowed to do that. Okay? So the point is, for every pair of points directly merged by the greedy algorithm, they're always a distance at most S away from each other.
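This monotonicity property is easy to observe by instrumenting the greedy algorithm to record the distance of every direct merger it performs. A sketch, assuming Python with a union-find over point indices; the function name and interface are illustrative:

```python
import itertools

def merge_distances(points, k, dist):
    """Run the greedy algorithm and record the distance of every
    direct merger, plus the final spacing S (the distance of the
    merge it would perform next)."""
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = sorted(itertools.combinations(range(len(points)), 2),
                   key=lambda pq: dist(points[pq[0]], points[pq[1]]))
    merges, clusters = [], len(points)
    for p, q in pairs:
        if find(p) == find(q):
            continue
        d = dist(points[p], points[q])
        if clusters == k:
            return merges, d  # d is the spacing S
        parent[find(p)] = find(q)
        merges.append(d)
        clusters -= 1
    return merges, None  # degenerate case: k too small to stop early
```

Because the pairs are scanned in sorted order, the recorded merge distances are nondecreasing, and every one of them is at most the spacing S that the sequence culminates in, which is exactly the property the proof relies on.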
So the easy case, then, is when this pair of points p, q, which on the one hand lie in a common greedy cluster, but on the other hand lie in different clusters of the C-hats, were at some point not merely in the same cluster but actually directly merged by the greedy algorithm. If, at some iteration, they determined the spacing and were picked by the greedy algorithm to have their clusters merged, then we just argued that the distance between p and q is no more than the spacing capital S of the greedy clustering. And since p and q lie in different clusters of the C-hats, they're separated by the C-hats, and therefore they upper bound the spacing of the C-hats. Maybe there's some even closer pair separated by the C-hats, but at the very least p and q are separated, so they upper bound the spacing of the C-hat clustering. So that's what we wanted to prove. We wanted to show that this alternative clustering didn't have better spacing than our greedy clustering; its spacing had to be at most as big, at most capital S. So in this easy case, when p and q are directly merged by the greedy algorithm, we're done. So the tricky case is when p and q are only indirectly merged, and you may be wondering at the moment, what does that mean?
How did two points wind up in the same cluster if they weren't, at some point, directly merged? So let's draw a picture and see how that can happen.

So the issue is that two points p and q might wind up in a common greedy cluster not because the greedy algorithm ever explicitly considered that point pair, but rather because of a path, or cascade, of direct mergers of other point pairs. Imagine, for example, that at some iteration of the greedy algorithm the point p was considered explicitly along with the point a1, where here a1 is meant to be different than q. So that's a direct merger, and p and a1 wind up in the same cluster; their clusters are merged. Maybe the same thing happened to the point q with some point a sub l which is different than p. Sooner or later, maybe at some other time, some totally unrelated pair of points a2 and a3 are directly merged, and then at some point a1 and a2 are considered by the greedy algorithm, because they're the closest pair of separated points, and they get merged. And so on. So the edges in this picture are meant to indicate direct mergers, pairs of points that are explicitly fused because they determine the spacing at some point of the greedy iteration.
But at the end of the day, the greedy clustering is going to reflect the results of all of these mergers. So in case you're feeling confused, let me just point out that we saw this exact same thing going on when we were talking about minimum spanning trees in Kruskal's algorithm. At an intermediate point in Kruskal's algorithm, after it's added some edges but before it's constructed a spanning tree, the intermediate state, as we discussed, is a bunch of different connected components. Vertices that have a chosen edge between them are, of course, going to be in the same connected component. But then again, a connected component can have long paths in it. So you could have vertices that are in the same connected component in an intermediate state of Kruskal's algorithm despite the fact that we haven't chosen an edge directly between them; rather, there's a path of chosen edges between them. It's exactly the same thing going on here. Now, what we have going for us is that, if a pair of points, as discussed, was directly merged, we know they're close: the distance between them is at most the spacing, capital S. We really don't know anything, frankly, about the distance between pairs of points that were not directly merged.
They just sort of accidentally wound up in a common cluster. But this turns out to be good enough. This is actually sufficient to argue that this competitor clustering, the C-hats, has spacing no more than S, no better than ours. Let's see why.

So given that p and q are in a common greedy cluster, it must mean there was a path of direct mergers that forced them to be in the same cluster. So let's let the intermediate points involved in that path be denoted a1 up to a sub l. So here's the part of the proof where we basically reduce the tricky case to the easy case. So we've got this pair of points p, q. Now remember, not only are they in a common greedy cluster, but they're in different clusters in our competitor, the C-hats. So the point p is in some cluster, call it C-hat i, and q is in something else; in particular, it's not in C-hat i. Now, imagine you go on a hike. You start at the point p, and you hike along this path. You traverse these direct mergers toward q. Now, you're starting inside C-hat i, and you end up outside. So at some point on your hike, you will traverse the boundary. You will, for the first time, escape from C-hat i and wind up in some other cluster. So that has to happen.
And let's call a sub j and a sub j+1 the consecutive pair of points at which you go from inside this cluster to outside this cluster. And now we're back in the easy case. Now we're dealing with a separated pair that was directly merged by the greedy algorithm. Remember that we set up this path to be a path of direct mergers, so in particular, a sub j and a sub j+1 were directly merged; therefore their distance is at most S. And again, by virtue of being directly merged, their distance is at most the spacing of the greedy clustering, and yet, as a pair separated by the C-hats, it's also an upper bound on the spacing of the C-hats. This means the spacing S of our greedy clustering is as good as the competitor's. Since the competitor was arbitrary, ours is optimal. That completes the proof.
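As a sanity check on the theorem, one can compute the spacing of any proposed clustering directly and confirm it never beats the greedy spacing. A hypothetical helper, assuming each clustering is given as one cluster label per point:

```python
import itertools

def spacing(points, labels, dist):
    """Spacing of a clustering (one label per point): the minimum
    distance over all pairs of points placed in different clusters."""
    return min(dist(points[p], points[q])
               for p, q in itertools.combinations(range(len(points)), 2)
               if labels[p] != labels[q])

pts = [0, 1, 10, 11, 30]
d = lambda a, b: abs(a - b)
greedy = [0, 0, 1, 1, 2]   # the 3-clustering the greedy algorithm produces here
rival = [0, 0, 0, 1, 2]    # an arbitrary competitor 3-clustering
```

On this tiny example the greedy clustering has spacing 9, while the rival clustering separates the nearby points 10 and 11 and so has spacing 1, consistent with the theorem.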