So, by now we've spent quite a bit of time talking about the minimum cost spanning tree problem. There are a number of motivations for doing so. The first is that it's just a uniquely great problem for the study of greedy algorithms. You can propose a bunch of different greedy algorithms, and, quite unusually, all of them seem to work. So you get correct greedy algorithms, but it's quite subtle what's driving their correctness. You also get a lot of practice arguing about graphs, making exchange arguments, and so on. The second reason it's been worthwhile spending time on these algorithms is that it has given us some more practice with data structures and how to use them to speed up algorithms, namely heaps to speed up Prim's algorithm, and the union-find data structure to speed up Kruskal's algorithm. The third reason we're talking about them is that they have applications in their own right. That's the subject of this video and the next, and we're going to focus on applications to clustering problems.

So let me begin by just talking about the goal of clustering informally, and then I'll let you pin me down to a precise objective function on the next slide. In a clustering problem, the input is n points that we think of as being embedded in space. It's actually quite rare that the underlying problem we care about is intrinsically geometric, that it's actually intrinsically about points in space. Usually we're representing something we care about, maybe web pages, maybe images, maybe records in a database, as points in space. And given a bunch of objects, we want to cluster them into, in some sense, coherent groups. For those of you coming from a machine learning background, you'll often hear this problem referred to as unsupervised learning, meaning the data is unlabeled; we're just looking for patterns in data that is not annotated.

This is obviously a fairly wishy-washy description of a problem, so let's be a little bit more precise. We're going to assume that part of the input is what we're going to call a similarity measure, meaning for any two objects, we're going to have a function giving us a number indicating how similar, or really, rather, how dissimilar, they are to each other.
In keeping with the geometric metaphor, we're going to refer to this function as a distance function. One thing that's cool is that we don't need to impose many assumptions on this distance function. The one thing we're going to assume is that it's symmetric, meaning the distance from p to q is the same as the distance from q to p.

So what are some examples? Well, if you really want to go with the geometric metaphor, if you're representing these points in the space R^m for some dimension m, you can just use the Euclidean distance, or, if you prefer, some other norm, like, say, l1 or l-infinity. In many application domains, there are widely accepted similarity or distance measures. One example, for sequences, as we discussed in the introductory lecture, would be the penalty of the best alignment between two genome fragments.

So now that we have this distance function, what would it mean to have coherent groups? Well, things which have small distance from each other, which are similar, should generally be in the same group, and things that are very dissimilar, that have large distance between them, you would expect to mostly be in different groups. So how can we evaluate how good a clustering is, how well it's doing the job of putting nearby points together and dissimilar points in different groups? Well, to be honest, there are many ways of approaching and formalizing that problem. The approach we're going to take is an optimization-based approach: we're going to posit an objective function on clusterings and then seek out the clustering that optimizes that objective function. I want to warn you that's not the only way to approach the problem; there are other interesting approaches, but optimization is a natural one. Furthermore, just like in our scheduling application, there's more than one objective function that people study and that is well-motivated. One very popular objective function would be, say, the k-means objective, which I encourage you to look up and read more about. For this lecture, I'm just going to adopt one specific objective function. It's natural enough, but it's by no means the only one or even the primary one. But it'll serve the purpose of studying a natural greedy algorithm, related to minimum spanning tree algorithms, which is optimal in a precise sense.
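To pin down what "distance function" can mean concretely, here is a minimal Python sketch of the three norms mentioned above; the function names and the choice of plain tuples for points are my own illustrative assumptions, not anything from the lecture. Each function is symmetric, which is the one property we're assuming.

```python
import math

# Each point is a tuple of m coordinates, e.g. (x, y) for m = 2.

def euclidean(p, q):
    """Standard Euclidean (l2) distance between two points in R^m."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def l1(p, q):
    """l1 ("Manhattan") distance: sum of coordinate-wise absolute differences."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def l_infinity(p, q):
    """l-infinity distance: largest coordinate-wise absolute difference."""
    return max(abs(pi - qi) for pi, qi in zip(p, q))

# All three satisfy d(p, q) == d(q, p), the only assumption made here.
```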
So let me develop the terminology needed to state the objective function and the optimization problem precisely. One issue that always comes up in clustering problems is how many clusters you're going to use. To keep things simple in this video, we're going to assume that part of the input is a number k indicating how many clusters you're supposed to use. So we're going to assume you know a priori how many clusters you want. In some application domains, this is a totally reasonable assumption. You might know, for example, that you want exactly two clusters, that k = 2. In some domains you may have good domain knowledge from past experience telling you how many clusters you expect to need; that's all fine and good. Also, in practice, if you don't really know what k is supposed to be, you can go ahead and run the algorithm we're going to talk about for a bunch of different values of k and then just eyeball the results to figure out which is the best solution.

So the objective function we're going to look at is defined in terms of separated pairs of points, that is, points that are assigned to distinct clusters. Now, if you have more than one cluster, inevitably there are going to be some separated pairs of points: some points are going to be in one group, other points are going to be in a different group. So separated points are inevitable, and the most alarming separated points are the ones that are the most similar, the ones that have the smallest distance. If points are separated, we want them to be far apart, so we're particularly concerned with nearby points that are separated. That's going to be our objective function value, called the spacing of a clustering: it's the distance between the closest pair of separated points. Now, what do we want from the spacing of a clustering? Well, we want all of the separated points to be as far apart as possible. That is, we want the spacing to be big; the bigger the better.

So that naturally motivates the formal problem statement. You're given as input the distance measure, that is, you're told the distance between each pair of points, and you're also told the desired number of clusters. Amongst all ways of clustering the points into k clusters, find the clustering which makes the spacing as big as possible.
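As a sanity check on the definition, here is a small Python sketch that computes the spacing of a given clustering, the minimum distance over all pairs of points assigned to distinct clusters. The representation (a clustering as a list of lists of points) and the function name are my own assumptions for illustration.

```python
from itertools import combinations

def spacing(clusters, dist):
    """Spacing of a clustering: the smallest distance between any pair of
    points that lie in *different* clusters.

    clusters: a list of lists, each inner list holding the points of one cluster
    dist:     a symmetric distance function on pairs of points
    """
    best = float("inf")
    # Look at every pair of distinct clusters ...
    for a, b in combinations(clusters, 2):
        # ... and every separated pair of points, one from each cluster.
        for p in a:
            for q in b:
                best = min(best, dist(p, q))
    return best

# Example, using the euclidean helper from the sketch above: with clusters
# {(0,0), (1,0)} and {(5,0)}, the closest separated pair is (1,0) and (5,0),
# so spacing([[(0, 0), (1, 0)], [(5, 0)]], euclidean) returns 4.0.
```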
So let's develop a greedy algorithm that seeks to make the spacing as big as possible. To facilitate the discussion, I'll use an example point set with just six black points up here in the upper right part of the slide.

Now, the key idea behind this greedy algorithm is to not worry about the constraint that, at the end of the day, we can only output k different clusters. We're actually going to be infeasible, we'll have too many clusters, throughout the algorithm. Only at the conclusion of the algorithm will we be down to k clusters, which will be our final feasible solution. That frees us up to initialize the procedure with the degenerate solution where each point is in its own cluster.

So in our example point set, we have these six pink isolated clusters. In general, you're going to have n clusters, and we've got to get down to k. Now, let's remember what the spacing objective is. In the spacing objective, you go over all of the separated pairs of points, so for this degenerate solution, that's all pairs of points, and you look at the most alarming separated pair, that is, the two that are closest to each other. So the spacing is the distance between the closest pair of separated points. Now, in a greedy algorithm, you want to increase your objective function as much as possible, but actually, things are pretty cut and dried in this scenario. Suppose I give you a clustering and you want to make the spacing bigger: the only way you can do that is by taking the currently closest pair of separated points and making them not separated any more, that is, putting them in the same cluster. So it's in some sense obvious what you want to do to make the objective function go up at all. You've got to look at the pair of points that is defining the objective, the closest pair of separated points, and you have to fuse them. You have to fuse their clusters so that they're no longer separated.

In this example, it looks to me like the closest pair of points, which of course are separated, is this pair in the upper right. So, if we want to make the spacing bigger, we fuse them into a common cluster. So we started with six clusters; now we're down to five. Now we reevaluate the spacing of this new clustering. We ask, what is the closest pair of separated points?
That would seem to me to be this pair in the bottom right, and again, the only way we can increase the spacing by merging clusters is to merge these two isolated clusters into one. Now we do it again. We ask, which pair of points determines the current spacing, which is the currently separated pair of points that are nearest to each other? That, to me, would look like this pair on the rightmost part of the picture. The only way to merge two clusters and make the spacing actually go up is to merge the clusters containing the pair of points determining the current spacing. So in this case, two different clusters with two points each would collapse into a single cluster with four points. Let's assume that we wanted three clusters anyway, which is where we are, so at this point, the greedy algorithm is going to halt.

So let me now spell out the pseudocode of the greedy algorithm more generally, but it's exactly what you'd expect given the discussion so far. All right. I'd like you to stare at this pseudocode for ten seconds, or however long you need, and try to relate it to an algorithm that we've seen in the course, in particular, an algorithm that we've seen quite recently. I hope it reminds you strongly of an algorithm we've already covered. Specifically, I hope you see a strong resemblance between this greedy algorithm and Kruskal's algorithm for computing a minimum cost spanning tree. Indeed, we can think of this greedy clustering algorithm as being exactly the same as Kruskal's minimum cost spanning tree algorithm, except aborted early: stopped when there are k components remaining, that is, before the final k - 1 edge additions.

So, just to make sure the correspondence is clear: what is the graph, what are the edges, what are the edge costs? Well, the objects in the clustering problem, that is, the points, correspond to vertices in a graph. The other part of the input of the clustering problem is the distances, which are given for every pair of points, so those play the role that edge costs were playing in the minimum spanning tree problem.
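The pseudocode on the slide isn't reproduced in the transcript, so as a stand-in, here is a minimal Python sketch of the loop as just described; the representation (clusters as lists of points) and the function name are my own illustrative choices, not the lecture's.

```python
from itertools import combinations

def greedy_clustering(points, dist, k):
    """Repeatedly fuse the two clusters containing the closest separated
    pair of points, until exactly k clusters remain."""
    # Degenerate starting solution: each point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the closest pair of separated points; they define the spacing.
        best, best_pair = float("inf"), None
        for i, j in combinations(range(len(clusters)), 2):
            for p in clusters[i]:
                for q in clusters[j]:
                    if dist(p, q) < best:
                        best, best_pair = dist(p, q), (i, j)
        # Fuse their two clusters so this pair is no longer separated.
        i, j = best_pair
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters
```

Recomputing the closest separated pair from scratch each round is the most literal translation of the walkthrough; the Kruskal-style sketch after the next passage shows how the same merge sequence falls out of sorting the pairs once and stopping early.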
Since we have a distance, or an edge cost, for every single pair of points, we can think of the edge set in the clustering problem as being the complete graph, because we have an edge cost, or a distance, for every single pair.

This type of agglomerative clustering, this idea of fusing components one at a time using an MST-like criterion, has a name: it's called single-link clustering. Single-link clustering is a good idea; if you work at all with clustering problems or unsupervised learning problems, it definitely should be a tool in your toolbox. We're going to justify its existence in one particular way in the next video, when we show that it does indeed maximize the spacing over all possible k-clusterings. But even if you don't care about the spacing objective function per se, you want to be familiar with single-link clustering. It has many other nice properties as well.
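To make the correspondence concrete, here is a hedged Python sketch, not the lecture's own code, of single-link clustering written literally as Kruskal's algorithm aborted early: sort all pairs of points (the "edges" of the complete graph) by distance once, process them with a union-find data structure, and stop as soon as k components remain. The UnionFind class is a bare-bones implementation with path halving and union by rank; all names here are my own.

```python
class UnionFind:
    """Minimal union-find with path halving and union by rank."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False                    # already in the same cluster
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True

def single_link_clustering(points, dist, k):
    """Kruskal's MST algorithm on the complete graph, stopped at k components."""
    n = len(points)
    # The "edges" are all pairs of points, with the distance as the edge cost.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    uf = UnionFind(n)
    components = n
    for d, i, j in edges:
        if components == k:
            break                           # aborted early: k clusters remain
        if uf.union(i, j):                  # skip pairs already merged
            components -= 1
    # Group the points by their cluster representative.
    clusters = {}
    for i in range(n):
        clusters.setdefault(uf.find(i), []).append(points[i])
    return list(clusters.values())
```

Because the pairs are sorted once up front, the next edge examined is always the currently closest pair of points, and the edges skipped because their endpoints already share a cluster are exactly the pairs that are no longer separated; so this produces the same merge sequence as the direct loop above, with the running time of Kruskal's algorithm.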