So now we understand why Kruskal's algorithm is correct, why it always computes a minimum cost spanning tree. In this video, we'll turn our attention to implementation issues. We'll begin with a straightforward implementation of Kruskal's algorithm. That will give us a polynomial running time bound, which is good, but we'd like to do better. So then we'll show how deploying a suitable data structure, something you haven't seen before, the Union-Find data structure, allows us to speed up Kruskal's algorithm to be competitive with Prim's algorithm. That is, we'll get a near-linear running time bound of O(M log N).

So let's just briefly review the very elegant pseudocode for Kruskal's algorithm. It's a greedy algorithm: it considers the cheapest edges first, all the way up to the most expensive. So we begin with a sorting preprocessing step to put the edges in sorted order. For notational convenience, let's just rename the edges so that edge 1 is the cheapest and edge M is the most expensive. We then have our single linear scan in this for loop, and we just grab edges whenever we can, okay? So we maintain this evolving set capital T, which at the end of the algorithm will be our spanning tree. Now, what forces us to exclude an edge from this set capital T? Well, if it creates a cycle with the edges already chosen, obviously that's a no-go; we can't have cycles in our final output. But as long as including an edge doesn't create a cycle, we go ahead and optimistically include it. And as we've seen, this is a correct algorithm: it always outputs the minimum cost spanning tree.
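To make the discussion concrete, here is a minimal Python sketch of this straightforward implementation. The function name kruskal_mst, the representation of edges as (cost, u, v) triples over vertices 0..n-1, and the helper creates_cycle are my own illustrative choices, not part of the lecture's pseudocode; the cycle-check helper itself is sketched a bit further below, after we discuss how to do it.

```python
def kruskal_mst(n, edges):
    """Straightforward Kruskal: `edges` is a list of (cost, u, v) triples
    over vertices 0..n-1; returns the chosen tree edges."""
    edges.sort()                  # sorting preprocessing step: O(M log N)
    T = []                        # the evolving set capital T
    for cost, u, v in edges:      # single linear scan, cheapest edge first
        # only include the edge if it doesn't close a cycle with T
        if not creates_cycle(T, u, v):
            T.append((u, v))      # optimistically include the edge
    return T
```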
So, what would be the running time of this algorithm, if we just straightforwardly implement the pseudocode on this slide? Well, let's just consider the algorithm step by step. In the first step we sort the edges, and that's going to take O(M log N) time. Now don't forget, whenever we're speaking about graphs, we have the convention that M denotes the number of edges and N denotes the number of vertices. So you might justifiably wonder why I wrote M log N for the running time of the sorting step instead of M log M, since after all, what we're sorting are the edges, and there are M of them. Well, what I'm using here is that, in this context, we can use log N and log M interchangeably with each other inside big-O notation. Why is that true? Well, recall when we first discussed graphs in part one, we noticed that there can't be too many edges. The number of edges M is at most quadratic in the number of vertices; it's at most big-O of N squared. So if M is at most N squared, then log M is at most 2 log N, and the 2 is suppressed in the big-O notation. So log M and log N are interchangeable in this context. Notice that for the minimum cost spanning tree problem, you may as well assume that there are no parallel edges, that the graph is simple. If you have a bunch of parallel edges between a given pair of vertices, you can just throw out all but the cheapest one; that's the only one you'll ever need.

So, moving on to the main loop, it's pretty obvious how many iterations we have there: we have M iterations. So all we need to figure out is how much work we have to do in each iteration. So what is it each iteration needs to accomplish? It needs to check whether or not adding the current edge to the edges we've already chosen creates a cycle. I claim that can be done in time linear in the number of vertices; that is, it can be done in O(N) time. So how do we accomplish this? Well, we need two quick observations. First of all, and this is something we've seen in arguments in the previous videos, checking whether or not a new edge, say with endpoints U and V, is going to create a cycle boils down to checking whether or not there's already a path between U and V in the edges capital T chosen so far. If there is already a U-V path, adding this edge will close the loop and create a cycle. If there currently is no U-V path, then adding this edge will not create a cycle. The second observation is, well, how do we check if there's a path from U to V in the edges we've already chosen? We already know how to do that, just using graph search. So you can use breadth-first search, you can use depth-first search, it doesn't matter. You just start at the vertex U and you see whether you reach V or not. If you reach it, there's a path; if you don't reach it, there's no path. Breadth-first search, depth-first search, whatever: it takes time linear in the size of the graph that you're searching.
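Here is one hedged way the creates_cycle helper from the earlier sketch could be implemented, using breadth-first search restricted to the edges already in T. The adjacency-dictionary representation is an assumption of mine, not something specified in the lecture.

```python
from collections import deque, defaultdict

def creates_cycle(T, u, v):
    """Return True if adding edge (u, v) to the edges in T would create a
    cycle, i.e., if v is already reachable from u using only T's edges."""
    adj = defaultdict(list)          # adjacency lists for the edges in T
    for a, b in T:
        adj[a].append(b)
        adj[b].append(a)
    # breadth-first search from u, over the chosen edges only
    seen, queue = {u}, deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True              # a u-v path exists; the edge would close a cycle
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False                     # no u-v path, so the edge is safe to add
```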
And since we only need to search among the edges that are in capital T, and there are going to be at most N minus 1 of them, linear time in this context means O(N), O of the number of vertices, because that also bounds the number of edges in capital T among which we're searching for a path.

So, adding up all of this work, what do we have? We have the sorting preprocessing step, which takes O(M log N) time, and then we have these M iterations of the for loop, each taking O(N) time. The latter term dominates, so the overall running time is O(M times N). So this, coincidentally, is the same running time we got from the straightforward implementation of Prim's algorithm, and I'll make the same comments here. This is a reasonable running time; it's polynomial in the input size. It's way better than checking all of the exponentially many spanning trees that the graph might have. But we certainly would like to do better. We'd certainly love to have an implementation of Kruskal's algorithm that gets us a near-linear running time bound, and that's the plan. How are we going to do it? Well, really, the work that we're doing here over and over again, which is kind of a bummer, is these cycle checks. In every single iteration, we're spending time linear in the number of vertices to check for a cycle, and the question is, can we speed that up? And the Union-Find data structure will actually, believe it or not, allow us to check for a cycle in constant time.

So if we actually had a data structure that could implement constant-time cycle checks, then we'd have to spend only constant time for each iteration of this loop. So the loop overall would take only time linear in the number of edges, O(M). If we got that, then, believe it or not, the sorting preprocessing step would become the bottleneck in the running time of Kruskal's algorithm. Our running time would drop from O(M times N) down to near-linear, down to O(M log N).

So let me now tell you a little bit about this magical data structure that's going to give us constant-time cycle checks. I'm just going to give you the high-level picture, and how it connects to Kruskal's algorithm, on this slide. We'll look at the details of the data structure in the next video. I also want to warn you that I'm not going to discuss, in this pair of videos, the state of the art in Union-Find data structures.
I'm going to give you a fairly primitive version, but one that is nevertheless sufficient to give us our desired O(M log N) running time for Kruskal's algorithm. So if you're interested, there is some optional material about different implementations of Union-Find that use some super cool ideas, like union by rank and path compression, that give you different, and in some senses better, operation times. But the quick and dirty version of Union-Find that I'm going to discuss here is sufficient for our present needs.

And so the raison d'ĂȘtre of a Union-Find data structure is to maintain a partition of a set of objects. So in this picture, the overall rectangle is meant to denote a set of objects, and then C1, C2, C3, and C4 are disjoint subsets whose union is the entire set. That's what I mean by a partition of a group of objects. We're not going to ask too much of this data structure; we're only going to ask it to support two operations. No prizes for guessing what those two operations are called. In the find operation, we give it an object from this universe, and we ask the data structure to return to us the name of the group to which that object belongs. So for example, if we handed it an object in the middle of this rectangle, we'd expect it to return to us the name C3. The union operation, by contrast, takes as input the names of two groups, and what we want the data structure to do is to fuse those two groups together. That is, the objects in the first group and the objects in the second group should all coalesce and now be in one single group.
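As a rough sketch of the kind of quick-and-dirty Union-Find the lecture has in mind: each object can simply store the name of its group, so find is a single constant-time lookup. The class name UnionFind, the integer group labels, and the choice to relabel the smaller group on a union are my own assumptions here; the lecture's actual implementation details are deferred to the next video.

```python
class UnionFind:
    """Quick-and-dirty Union-Find. leader[x] is the name of the group
    containing object x, so find is a single O(1) lookup."""

    def __init__(self, n):
        self.leader = list(range(n))               # every object starts in its own group
        self.members = {i: [i] for i in range(n)}  # the objects in each group

    def find(self, x):
        return self.leader[x]                      # constant-time group lookup

    def union(self, g1, g2):
        # relabel the smaller group's members into the larger group, so any
        # single object is relabeled at most O(log n) times across all unions
        if len(self.members[g1]) < len(self.members[g2]):
            g1, g2 = g2, g1
        for x in self.members[g2]:
            self.leader[x] = g1
        self.members[g1].extend(self.members.pop(g2))
```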
So why might such a data structure be useful for speeding up Kruskal's algorithm? To see the connection, think of Kruskal's algorithm as working conceptually in the following way. Initially, when the algorithm starts and the set capital T is empty, each of the vertices is on its own, an isolated component. And then, each time Kruskal's algorithm adds a new edge to the set capital T, what that does is take two current connected components and fuse them into a single connected component.

So for example, toward the end of Kruskal's algorithm, maybe it's included enough edges that the tree capital T constructed so far has only four different connected components. And maybe it's about to add a new edge (U, V), where of course U and V must be in different connected components with respect to the edges chosen so far. So this new edge addition, at this iteration of Kruskal's algorithm, is going to fuse the connected components of U and V into a single one. That corresponds to taking the union of the groups to which U and V respectively belong.

So, to be a little more precise about it: what are going to be the objects contained in the Union-Find data structure in Kruskal's algorithm? They're going to correspond to the vertices; it's the vertices coalescing that we want to keep track of. And what are going to be the groups in the partition that we maintain? They're just going to correspond to the connected components with respect to the edges that Kruskal's algorithm has already committed to. And with these semantics, it's clear that every time Kruskal's algorithm adds a new edge to its set capital T, we have to invoke the union operation to fuse two connected components into one.
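Putting the pieces together, here is a hedged sketch of how the faster Kruskal's algorithm might look, with the Union-Find structure sketched above standing in for the graph-search cycle check; as before, the function and variable names are my own. The cycle check becomes two constant-time find calls: the new edge creates a cycle exactly when its endpoints already have the same leader.

```python
def kruskal_mst_fast(n, edges):
    """Kruskal with Union-Find cycle checks: two O(1) find calls per
    iteration, so the O(M log N) sort becomes the bottleneck."""
    edges.sort()                       # still the O(M log N) sorting step
    uf = UnionFind(n)                  # one singleton group per vertex
    T = []
    for cost, u, v in edges:
        g_u, g_v = uf.find(u), uf.find(v)
        if g_u == g_v:                 # same component already: would create a cycle
            continue
        T.append((u, v))
        uf.union(g_u, g_v)             # fuse the two connected components
    return T
```

For instance, on a triangle with edge costs 1, 2, and 3, this sketch returns the two cheapest edges and skips the third, since by then its endpoints share a leader.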