1
00:00:03,300 --> 00:00:09,312
So we're going to complete our study of
MST algorithms by considering some context

2
00:00:09,530 --> 00:00:15,470
both as an unsolved problem in theoretical
computer science and as a practical

3
00:00:15,470 --> 00:00:18,975
problem.
So the unsolved problem in theoretical

4
00:00:18,975 --> 00:00:24,300
computer science that has defied
researchers for many decades is, it is

5
00:00:24,300 --> 00:00:27,675
possible to find a linear-time MST
algorithm?

6
00:00:27,675 --> 00:00:33,600
Is there an algorithm that only examines
each edge at most once on the average?

7
00:00:33,600 --> 00:00:36,980
Now, this doesn't have.
That much practical comp.

8
00:00:36,980 --> 00:00:43,148
Consequence since the versions of Primm's
algorithm that we've talked about can get

9
00:00:43,148 --> 00:00:49,170
the running time down quite low for the
sparse graphs, the huge sparse graphs that

10
00:00:49,170 --> 00:00:52,548
we encounter in practice.
But it's still, its...

11
00:00:52,548 --> 00:00:55,706
Like union find, it's a tantalizing
problem.

12
00:00:55,706 --> 00:00:59,084
Unlike union find, it hasn't been resolved
yet.

13
00:00:59,084 --> 00:01:04,959
Union find remember, we couldn't get a
linear algorithm but at least Tarjan

14
00:01:04,959 --> 00:01:10,100
proved that no such algorithm exists.
For MST, we're not even there yet.

15
00:01:10,392 --> 00:01:18,489
And a lot of people have worked on it.
So let's go with the simple model where

16
00:01:18,489 --> 00:01:24,584
what you get to do is compare edges.
Compare weights on edges.

17
00:01:24,870 --> 00:01:32,870
In 1975 Yao proved that there exists an
algorithm that's worst case running time

18
00:01:32,870 --> 00:01:37,727
is e log, log v.
In 1976, Cheriton and Tarjan came up with

19
00:01:37,727 --> 00:01:44,680
another such algorithm.
And then Fredman and Tarjan found the e

20
00:01:44,680 --> 00:01:52,333
plus e log v algorithm or e log star v For
MST's in'84.

21
00:01:53,152 --> 00:01:56,067
Here's another one.
E log, log star v.

22
00:01:56,310 --> 00:02:02,300
Now remember, log star v is the number of
times you take log to get to one.

23
00:02:02,300 --> 00:02:06,995
So it's, less than, six, in the natural
universe.

24
00:02:07,238 --> 00:02:11,286
So these are very, very close to linear
algorithms.

25
00:02:11,529 --> 00:02:15,819
The names in orange are people who work at
Princeton.

26
00:02:15,819 --> 00:02:20,271
So, we're now trumpeting Princeton a
little bit here.

27
00:02:20,271 --> 00:02:26,910
In'97, Chazelle showed that, its close to
the, inverse Ackerman function.

28
00:02:26,910 --> 00:02:32,000
That, even more slowly growing than log
star v.

29
00:02:32,623 --> 00:02:39,168
In 2000, got rid of the log factor, so
very, very close to linear.

30
00:02:40,610 --> 00:02:46,304
That now.
In 2002, Optimal, well, let's, talk about

31
00:02:46,304 --> 00:02:49,719
that.
They, they, they showed an, an algorithm

32
00:02:49,719 --> 00:02:53,579
that it, better not to talk about the
theory of that.

33
00:02:53,579 --> 00:02:57,365
[laugh].
And that the, still, the open question is,

34
00:02:57,588 --> 00:03:03,156
is there, an algorithm whose worst case
running time is guaranteed to be

35
00:03:03,156 --> 00:03:07,387
proportional to e?
Or could, someone prove that no such

36
00:03:07,387 --> 00:03:11,915
algorithm exists?
It's one of the most, tantalizing open

37
00:03:11,915 --> 00:03:18,281
questions in computer science.
As we get into, graph algorithms in more

38
00:03:18,281 --> 00:03:22,183
detail.
We'll see some other examples of problems

39
00:03:22,183 --> 00:03:28,713
for which we know pretty good algorithms
but would like to know whether there are

40
00:03:28,713 --> 00:03:33,411
better algorithms or not.
And MST is a fine example of that.

41
00:03:33,411 --> 00:03:38,827
That's the orange means Princeton.
There is a randomized linear time

42
00:03:38,827 --> 00:03:45,357
algorithm that was discovered in 1995, but
that's not the same as solving it worst

43
00:03:45,357 --> 00:03:48,658
case in linear time.
So that's one context.

44
00:03:48,891 --> 00:03:54,950
Mxt is an important problem that's still
been studied by theoretical computer

45
00:03:54,950 --> 00:03:59,068
scientist and we still don't know the best
algorithm.

46
00:03:59,301 --> 00:04:04,140
Here's another one, So-called Euclidean
MST.

47
00:04:04,140 --> 00:04:09,742
And this one is what's appropriate in some
practical situations.

48
00:04:09,742 --> 00:04:16,482
So now you have points in the plane and
the graph is an implicit dense graph.

49
00:04:16,482 --> 00:04:23,659
That is, we take as an edge in the graph,
the distance between this point and every

50
00:04:23,659 --> 00:04:27,469
other point.
So if there's n points there's n squared

51
00:04:27,469 --> 00:04:31,439
edges because every point's connected to
every other point.

52
00:04:31,439 --> 00:04:36,487
And what we want is, in that graph, we
want to find the subset of edges that

53
00:04:36,487 --> 00:04:41,669
connects all the points, that's minimal.
That's actually, in a lot of practical

54
00:04:41,669 --> 00:04:46,515
problems, that's what you want.
So as it stands, the algorithms that we've

55
00:04:46,515 --> 00:04:51,764
talked about are not useful for this
beacause they're going to take quadratic

56
00:04:51,764 --> 00:04:56,206
time, because e's quadratic.
That's how many edges there are in the

57
00:04:56,206 --> 00:04:59,727
graph.
So you know, you could just go ahead and

58
00:04:59,727 --> 00:05:05,082
build the graph with the N squared over
two distances and run Prim's algorithm.

59
00:05:05,285 --> 00:05:09,284
But that's not very satisfying for a huge
number of points.

60
00:05:09,487 --> 00:05:15,169
It's actually not
Too difficult to exploit the geometry of

61
00:05:15,169 --> 00:05:19,976
the situation.
And get it done in time proportional to N

62
00:05:19,976 --> 00:05:23,953
log N.
What is typically done is to build a sub

63
00:05:23,667 --> 00:05:31,221
graph, where each point is connected to a
certain number of points that are close to

64
00:05:31,221 --> 00:05:34,740
it.
There's a particular graph called the

65
00:05:34,740 --> 00:05:40,234
Voronoi diagram, or the Delaunay
triangulation, that does that.

66
00:05:40,234 --> 00:05:47,837
And it's been proved, number one that,
that graph has linear number of edges not,

67
00:05:48,097 --> 00:05:55,303
quadratic, and it's also the MST is a sub
graph of the d-linear triangulation.

68
00:05:55,303 --> 00:06:01,380
So you can get it done in linear
arrhythmic time for Euclidean MST.

69
00:06:01,678 --> 00:06:10,346
Separate problem related but still a very
interesting in many practical

70
00:06:10,346 --> 00:06:17,718
applications.
Here's another application in se, several

71
00:06:17,718 --> 00:06:22,700
scientific studies, there's the idea of
clustering.

72
00:06:22,928 --> 00:06:29,241
And so what we wanna do is, we have a set
of objects and they're related by a

73
00:06:29,241 --> 00:06:35,782
distance function that specifies how close
they are and what we wanna do is divide

74
00:06:35,782 --> 00:06:41,487
the objects into a given number k of
coherent groups so that objects in

75
00:06:41,487 --> 00:06:47,116
different clusters are far apart.
So wanna see how things cluster together.

76
00:06:47,116 --> 00:06:53,581
And here's a really old example of a
application of this where there was an

77
00:06:53,581 --> 00:07:00,132
outbook, outbreak of cholera deaths in.
London in the 1850s and, if you plot where

78
00:07:00,132 --> 00:07:06,671
all of the deaths happened, scientific
study could find that they were clustered

79
00:07:06,671 --> 00:07:11,738
around certain places.
And, actually they were able to identify

80
00:07:11,738 --> 00:07:18,195
well pumps that were leading to the, the
cholera just by doing clustering study.

81
00:07:18,195 --> 00:07:23,998
And, that is a very special case.
There are many, many other applications

82
00:07:23,998 --> 00:07:30,700
where clustering is an important process,
an important thing to be able to compute.

83
00:07:30,926 --> 00:07:37,356
So like mobile networks for web search
there's an idea of the distance between

84
00:07:37,356 --> 00:07:41,591
doc, documents and you wanna categorize
them in clusters.

85
00:07:41,818 --> 00:07:47,718
There's the, all the objects that have
been discovered in the sky, you wanna

86
00:07:47,718 --> 00:07:55,300
cluster them together in a reasonable way.
And all kinds of. Of processing having to

87
00:07:55,300 --> 00:08:01,557
do with huge data bases, trying to get
information that seems close together to

88
00:08:01,557 --> 00:08:06,212
be close together into a relatively small
number of clusters.

89
00:08:06,441 --> 00:08:11,020
So there's, a, a approach called single
link clustering.

90
00:08:11,020 --> 00:08:18,237
Where you talk about the single length,
the distance between two clusters equaling

91
00:08:18,237 --> 00:08:23,756
the distance between the two closest
objects, one in each cluster.

92
00:08:23,756 --> 00:08:29,191
And so, so-called single length clustering
is given at integer K.

93
00:08:29,191 --> 00:08:36,239
Find the K clustering that maximizes the
distance between the two closest clusters.

94
00:08:36,239 --> 00:08:40,230
So that's a well defined computational
problem.

95
00:08:40,439 --> 00:08:45,896
And there's a very well-known algorithm,
in the science literature for this

96
00:08:45,896 --> 00:08:51,843
problem, signal, signal-link clustering.
Form of e-clusters, find the closest pair

97
00:08:51,843 --> 00:08:57,160
of objects such that each object's in a
different cluster, and merge the two

98
00:08:57,160 --> 00:09:00,098
clusters.
And repeat until they're exactly

99
00:09:00,098 --> 00:09:03,456
k-clusters.
You'll find this algorithm in the

100
00:09:03,456 --> 00:09:07,583
scientific literature.
What's amazing is that, this is

101
00:09:07,583 --> 00:09:13,530
Crussical's algorithm, just stop when you
found the k-connected components, so that,

102
00:09:13,740 --> 00:09:21,578
the, Or another thing you could do is just
run Prim's algorithm and then after you've

103
00:09:21,578 --> 00:09:28,201
run Prim's algorithm get rid of the
largest edges until you're left with

104
00:09:28,201 --> 00:09:32,643
k-clusters.
So out of all the efficient Algorithms

105
00:09:32,643 --> 00:09:38,638
that we've talked about are gonna apply
for single-link clustering.

106
00:09:38,638 --> 00:09:45,947
And actually scientists who also know some
computer science now are able to handle

107
00:09:46,194 --> 00:09:52,435
huge problems that would not be possible
without efficient algorithms.

108
00:09:52,682 --> 00:09:59,744
This is just one, one example where a, a
cancer study where experiments are

109
00:09:59,991 --> 00:10:07,227
connecting genes with the way they're
expressed in different parts of the body,

110
00:10:07,227 --> 00:10:11,264
and trying to cluster together tumors in
similar tissues.

111
00:10:11,264 --> 00:10:16,788
And again, such experimental results can
amount, result in huge amounts of data,

112
00:10:16,788 --> 00:10:22,030
and MST algorithms are playing a role in
scientific research of this type.

113
00:10:22,030 --> 00:10:26,260
That's our context for minimal spanning
trees.