So let me now introduce you to the union-find data structure. Let's not lose sight of our goal: our goal is to be able to check for cycles in Kruskal's algorithm in constant time.

The first and most basic idea behind this union-find implementation is that we're going to maintain a linked structure for each group, that is, for each connected component with respect to the edges Kruskal has chosen thus far. By a linked structure, I just mean that each vertex of the graph is going to have an extra pointer field. In addition, within each group, each connected component, we're going to designate one vertex, and we don't care which one, as the leader vertex of that component.

All right, so what is the point of these extra pointers at each vertex, and what is the point of having these leaders? Well, a key invariant that we're going to maintain is that a given vertex, with its extra pointer, points to the leader vertex of its connected component. So, for example, maybe we have two different connected components with three vertices each, one containing the vertices u, v, and w, another containing the vertices x, y, and z. Any of the three vertices in a component could be that component's leader.
So perhaps u happens to be the leader vertex of the first component, and x happens to be the leader vertex of the second component. The invariant then says that every vertex should be pointing to the leader of its component. So v and w should be pointing to u, and u, which has its own pointer, should be pointing to itself. Similarly, in the second component, x points to itself, y points to x, and z also points to x. So, in blue here are the actual edges of the graph, and in green are these extra pointers that we've invented, where everybody points back to the leader of their component.

One very simple thing in this setup, which turns out to be a good idea, is that each component in effect inherits the name of its leader vertex. So we refer to a group, to a component, via the vertex that happens to be its representative, its leader.

And what's kind of amazing is that even this very simple scaffolding on the connected components is enough to support constant-time cycle checks, provided the invariant is satisfied. Well, how do we do that? Remember that checking whether adding the edge (u, v) is going to create a cycle boils down to checking whether there's already a path between u and v.
Well, there's already a path between u and v if and only if they're in the same connected component. Given two vertices u and v, how do we know whether they're in the same connected component? We just follow the respective leader pointers and see if we wind up at the same place. If they're in the same component, we get the same leader; if they're in different components, we get different leaders. So checking for a cycle just involves comparing the leader pointers of u and v for equality, and that is clearly constant time.

More generally, the way you implement the Find operation for this flavor of disjoint-union data structure is that you simply return the leader pointer of the object you are handed. So if you're given a vertex, you just follow the leader pointer of that vertex, and you return wherever you wind up.

So that's pretty excellent: as long as the invariant of this simple data structure is satisfied, we have our desired implementation of constant-time cycle checks. But, and this is certainly a recurring theme in our data structure discussions, whenever you have a data structure that needs to maintain an invariant, you have to think about the operations that change the data structure, like, in this case, a union fusing two groups together.
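To make this concrete, here is a minimal Python sketch of the leader-pointer idea and the constant-time cycle check. This is an illustration added to the transcript, not the course's reference code; the class and function names are my own.

```python
# A minimal sketch of the leader-pointer idea (illustrative names,
# not the course's reference implementation).
class UnionFind:
    def __init__(self, vertices):
        # Invariant: leader[v] is the leader of v's connected component.
        # Initially every vertex is its own component, hence its own leader.
        self.leader = {v: v for v in vertices}

    def find(self, v):
        # Constant time: just return the leader pointer.
        return self.leader[v]

def creates_cycle(uf, a, b):
    # Adding edge (a, b) creates a cycle iff a and b already share a leader.
    return uf.find(a) == uf.find(b)

# The two-component example from the lecture: {u, v, w} led by u,
# and {x, y, z} led by x.
uf = UnionFind(["u", "v", "w", "x", "y", "z"])
uf.leader["v"] = uf.leader["w"] = "u"
uf.leader["y"] = uf.leader["z"] = "x"
```

With this setup, `creates_cycle(uf, "v", "w")` reports a cycle, since both vertices lead to u, while `creates_cycle(uf, "w", "y")` does not, since the leaders u and x differ.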
You have to worry: does the invariant get destroyed when you do that operation? And if it does, how are you going to restore the invariant without doing undue work?

So in the present context of Kruskal's algorithm, here's how this plays out. We're happily doing our constant-time cycle checks. Whenever an edge creates a cycle, we don't do anything: we skip the edge, we don't change our data structure, and we move on. The issue is when we have a new edge that doesn't create a cycle, that is, when the cycle check comes back negative. Now Kruskal's algorithm dictates that we add this edge to the set capital T that we're building, and that fuses two connected components together. But remember, we have this invariant: every vertex should point to the leader of its component. Well, if we had component A and component B, every vertex was pointing to the leader of its respective component. Now, when these components fuse into one, we've got to do some updating of leader pointers. In particular, there used to be two leaders; now there has to be just one, and we have to rewire the leader pointers to restore the invariant.

So, to make sure you're clear on this important problem, let me ask you the following question.
Consider the case when, at some point in Kruskal's algorithm, a new edge is added and two connected components fuse into one. Now, to restore the invariant, you have to do some leader pointer updates. In the worst case, asymptotically, how many leader pointer updates might be required in order to restore the invariant?

The answer to this question, somewhat alarmingly, is the third answer: restoring the invariant might require a number of pointer updates linear in the number of vertices n. One easy way to see that is to imagine the very last edge that Kruskal adds to the set T, the one that fuses the final two connected components into one. For all you know, those two components have exactly the same size, n/2 vertices each. You're going down from two leader pointers to one, and one of those sets of n/2 vertices is going to have to inherit the leader pointer from the other side. So one of the two sets is going to have n/2 vertices get their leader pointer updated.

So that's a bummer. We were hoping for a near-linear time bound, but if each one of our linear number of edge additions might trigger a linear number of leader pointer updates, that seems to give rise to a quadratic time bound.
But remember, when I introduced the union-find data structure, I said that first idea was only idea number one, so presumably there's an idea number two. And here it is, and it's very natural: if you were coding up an implementation of the union-find data structure, you would probably do this optimization yourself.

All right, so consider the moment in time at which some component A merges with some component B. Each of these two components currently has its own respective leader vertex, and all vertices in each group point to that leader vertex. Now, when they fuse, what are you going to do? The first obvious thing is to say: let's not bother computing some totally new leader; let's either reuse the leader from group A or the leader from group B. That way, if, for example, we retain the leader from group A, the only leader pointers that need to be rewired are the ones that come from component B. The vertices in component A can keep their same leader vertex and their same leader pointers as before. So that's the first point: let's just have the new union of two components inherit the leader of one of its constituent parts.

Now, if you're going to retain one of the two leaders, which one are you going to retain? Maybe one of the components has 1,000 vertices.
The other component only has 100 vertices. Well, given the choice, you're certainly going to keep the leader from the bigger, first component. That way, you only have to rewire the 100 leader pointers of the second group. If you instead kept the leader of the second group, you would have to rewire the 1,000 pointers from the first group, and that seems silly and wasteful. So the obvious way to implement a merge is to keep the leader of the bigger group and rewire everybody from the second, smaller group.

You should notice that, in order to actually implement this optimization, where you always retain the leader of the larger group, you have to be able to quickly determine which of the two groups is larger. But you can augment the data structure we've discussed to facilitate that: with each group, you just keep a count of how many vertices are in that group. So you maintain a size field for each group. That allows you to check, in constant time, the populations of two different groups and figure out, again in constant time, which one is bigger. And notice that when you fuse two groups together, it's easy to maintain the size field: it's just the sum of the sizes of the two constituent parts.

So let me now revisit the question from the previous slide.
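The merge rule just described, keep the larger group's leader and rewire only the smaller group's pointers, might be sketched as follows. This is again my own illustrative Python, not the course's code; here each group's member list doubles as its size field.

```python
# Union-find with the "rewire the smaller group" optimization
# (illustrative sketch, not the course's reference implementation).
class UnionFind:
    def __init__(self, vertices):
        self.leader = {v: v for v in vertices}
        # members[l] lists the vertices in the group led by l;
        # its length serves as the group's size field.
        self.members = {v: [v] for v in vertices}

    def find(self, v):
        return self.leader[v]  # constant-time Find

    def union(self, a, b):
        la, lb = self.find(a), self.find(b)
        if la == lb:
            return  # already the same group; nothing to do
        # Keep the leader of the larger group (constant-time size check)...
        if len(self.members[la]) < len(self.members[lb]):
            la, lb = lb, la
        # ...and rewire only the smaller group's leader pointers.
        for v in self.members[lb]:
            self.leader[v] = la
        # The fused group's size is the sum of the two parts' sizes.
        self.members[la].extend(self.members.pop(lb))
```

So after merging a 1,000-vertex group with a 100-vertex group, only the 100 pointers of the smaller group get rewired, exactly as in the discussion above.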
In the worst case, given this optimization, how many leader pointers, asymptotically, might you have to rewire when you merge two components together? Well, unfortunately, the song remains the same. The answer is still the third one, and for exactly the same reason as on the previous slide. It still might be the case that, say, in the final iteration of Kruskal's algorithm, you're merging two components that both have size n/2. So no matter which leader you choose, you're going to be stuck updating the leader pointers of n/2, that is, theta of n, vertices. So while this is clearly a smart practical optimization, it doesn't seem to be buying us anything in our asymptotic analysis of the running time.

However, what if we think about the work done in all of these leader pointer updates in the following different way? Rather than asking how many updates a single merging of two components might trigger, let's adopt a vertex-centric view. Suppose you're one of the vertices of this graph. Initially, at the beginning of Kruskal's algorithm, you're in your own isolated connected component: you just point to yourself; you're your own leader. Then, as Kruskal's algorithm runs its course, your leader pointer will periodically get updated.
At some point, you're no longer pointing to yourself; you're pointing to some other vertex. Then at some point your pointer gets updated again, and you're pointing to yet another vertex, and so on. How many times, over the entire trajectory of Kruskal's algorithm, in light of our new optimization, might you, as some vertex of this graph, have your leader pointer updated?

Well, here's the very cool answer: the answer is the second one. While it remains true that, even if you always take the union of two groups and inherit the leader pointer of the larger one, a given fusion might trigger a linear number of leader pointer updates, each vertex will only see its leader pointer updated a logarithmic number of times over the course of Kruskal's algorithm.

What is the reason for this? Well, suppose you're a vertex in some group that has, say, twenty vertices, so you're one of twenty. Now suppose at some point your leader pointer gets updated. Why did that happen? It means that your group of twenty vertices merged with some other group that has to be at least as big. Remember, your leader pointer only gets rewired in a fusion if you were in the smaller group.
So you're joining a group at least as big as yours, and therefore the size of the union, the size of your new community, your new connected component, is at least double the size of your previous one. So the bottom line is: every time you, as a vertex, have your leader pointer updated, the population of the component to which you belong is at least twice as large as before. Now, you start in a connected component of size one, and no connected component is going to have more than n vertices. So the number of doublings you might have to endure is at most log base two of n, and that bounds how many leader pointer updates you will see as a vertex in this graph.

So, in light of that very cool observation, we can now give a good running time analysis of Kruskal's algorithm using the union-find data structure, using this scaffolding on the component structure to do cycle checks. We have not, of course, changed the preprocessing step: we're sorting the edges from cheapest to most expensive at the beginning of the algorithm, and that still takes big O of m log n time. So what work do we have to do beyond this sorting preprocessing step? Well, fundamentally, Kruskal's algorithm is just about these cycle checks.
In each iteration of the for loop, we have to check whether adding a given edge would create a cycle with the edges we've already added. And the whole point of the union-find scaffolding, these linked structures, is that we can check for cycles in constant time: given an edge, we look at the leader pointers of its endpoints, and a cycle will be created if and only if those leader pointers are identical. So for the cycle checking, we do only constant time in each of the O(m) iterations.

The final source of work is maintaining this union-find data structure, restoring the invariant each time we add a new edge to the set capital T. And here the good idea is that we're not going to bound the worst-case number of leader pointer updates per iteration, because that could be too expensive: that could be up to linear in just a single iteration. Rather, we're going to do a global analysis, thinking about the total number of leader pointer updates that ever occur. On the previous slide, we observed that a single vertex will only endure a logarithmic number of leader pointer updates. So, summing over all n vertices, the total work done for leader pointer updates is only O(n log n).
So even though we might do a linear amount of pointer updates in just one iteration of this for loop, we still have this global upper bound of n log n on the total number of leader pointer updates. Very cool.

So, looking over this tally, we observe the stunning fact that the bottleneck for this implementation of Kruskal's algorithm is actually the sorting. We do more work in the preprocessing step, m log n, than we do in the entire for loop, which is only m plus n log n. So that gives us an overall running time bound of big O of m log n, matching the theoretical performance we achieved with Prim's algorithm implemented with heaps. So, just as with our heap-based implementation of Prim's algorithm, we get a running time which is almost linear, only a logarithmic factor slower than reading the input. And just as with Prim's algorithm, you should find Kruskal's algorithm to be very competitive in practice.
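Putting the whole discussion together, here is a sketch of Kruskal's algorithm on top of this union-find structure. Again, this is my own illustrative Python under the assumption that vertices are labeled 0 through n-1 and edges arrive as (cost, u, v) triples; it is not the course's reference code.

```python
def kruskal(n, edges):
    """Return a minimum spanning tree of an undirected connected graph.

    n: number of vertices, labeled 0..n-1.
    edges: iterable of (cost, u, v) triples.
    """
    leader = list(range(n))               # leader pointer per vertex
    members = {v: [v] for v in range(n)}  # group member lists, keyed by leader
    T = []
    for cost, u, v in sorted(edges):      # O(m log n) sorting preprocessing
        if leader[u] == leader[v]:        # constant-time cycle check
            continue                      # skip: edge would create a cycle
        T.append((cost, u, v))
        la, lb = leader[u], leader[v]
        if len(members[la]) < len(members[lb]):
            la, lb = lb, la               # keep the larger group's leader
        for w in members[lb]:             # rewire only the smaller group
            leader[w] = la
        members[la].extend(members.pop(lb))
    return T

# A 4-vertex example: the square 0-1-2-3-0 plus one diagonal.
example = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 3, 0), (5, 0, 2)]
mst = kruskal(4, example)  # picks the edges of cost 1, 2, and 3
```

The sort dominates, just as in the analysis above: everything inside the loop is constant-time cycle checks plus the globally bounded leader pointer rewiring.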