So welcome to this optional sequence of videos on state-of-the-art implementations of the union-find data structure. Now, as advanced optional material, let me make a few comments before we get started. The first comment is, I'm going to assume that you're in a particularly motivated mood. Now, no one's forcing you to take these algorithms courses, so I'm always assuming that you're a very motivated individual. But that's going to hold even more true than usual in these advanced videos. And the way that plays out is, while I'm going to hold myself to the same exacting standards of clarity that I usually do, I'm going to ask a little bit more of you, the viewer. I'm going to perhaps dot a few less i's and cross a few less t's than I usually do. And so you may find yourself periodically needing to pause and think through some of the details that I'm glossing over.

The second comment is that I'm going to give short shrift to applications of the union-find data structure. That's not because there aren't any; there are plenty of applications of this data structure. But for these videos, we're really just going to immerse ourselves in the beauty and the depth of ideas behind the design of the data structures, and especially the analysis of their performance.
The final comment is to keep in mind that this is some seriously next-level material. It is totally normal that the first time you see this stuff, you find it confusing, you find it difficult to understand. So confusion should not discourage you. It does not represent any intellectual failing on your part. Rather, keep in mind, it represents an opportunity to get even smarter. So, with those comments in mind, let's go ahead and proceed to a different approach to the union-find data structure, via lazy unions.

So let's have one slide with a quick review of what we already know about the union-find data structure. If you recall, we discussed this in the context of a fast implementation of Kruskal's minimum spanning tree algorithm. The raison d'ĂȘtre of a union-find data structure is to maintain a partition of a universe of objects. So in the context of Kruskal's algorithm, the objects were the vertices, and the groups we wanted to maintain were the connected components with respect to the edges we had committed ourselves to so far. The data structure should support two operations. No prizes for guessing the names of those two operations. First is the Find operation: given an object from the universe, return the name of that object's group.
So for example, if x was in the middle of this square that I've put on the right, representing the universe X, Find would return C3, the name of the group that contains x. So in the context of Kruskal's algorithm, we used the Find operation to check for cycles. How do we know whether adding a new edge would create a cycle with respect to the edges we've already chosen? Well, it will if and only if the endpoints of that edge are already in the same connected component, that is, if and only if the two Find operations return the same answer.

In the Union operation, you're given not one object but two, call them x and y, and the responsibility of the operation is to merge the groups that contain x and y. So for example, in the picture, if y was in C4 and x was, as before, in C3, then your job is to fuse the groups C3 and C4 into a single group. In the context of Kruskal's algorithm, we needed this operation when we added a new edge; this fused together two of the existing connected components into one. So, we already went through one implementation of union-find; that was in order to get a blazingly fast implementation of Kruskal's algorithm, so let's just review how that worked.
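As an aside, here's a minimal sketch of how Kruskal's algorithm consumes these two operations. This is my own illustrative code, not from the lecture: it uses the simplest possible (unoptimized) eager union-find internally, and the function and variable names are invented for the example.

```python
def kruskal_mst(num_vertices, edges):
    """edges: list of (weight, u, v) tuples, vertices numbered 0..num_vertices-1.
    Returns the list of edges in a minimum spanning tree."""
    leader = list(range(num_vertices))       # simplest eager union-find

    def find(x):
        return leader[x]                     # constant-time group lookup

    def union(x, y):
        old, new = find(x), find(y)
        for v in range(num_vertices):        # relabel x's entire group (slow, but simple)
            if leader[v] == old:
                leader[v] = new

    mst = []
    for weight, u, v in sorted(edges):       # scan edges in increasing weight order
        if find(u) != find(v):               # same answer => edge would close a cycle
            mst.append((weight, u, v))
            union(u, v)
    return mst
```

For instance, on a 4-cycle with edge weights 1, 2, 3, 4, the weight-4 edge is the one skipped by the cycle check.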
With each group, we associated a linked structure, so each object had one pointer associated with it, and the invariant was: in a given group, there was some representative leader object, and everybody in that group pointed to the leader of that group. So for example, on the right I'm showing you a group with three objects x, y, and z, and x would be the leader of this group. All three of the objects point directly to x, and that would just be the output of the Find operation: the leader of the group. We use that as the group name.

Now the part which is cool and obvious about this approach is that Find operations take constant time. All you do is return the leader of the given object. Now the tricky part was analyzing the cost of Union operations. The problem here is that, to maintain the invariant that every object of a group points to its leader, when you fuse two groups you have to update a bunch of the objects' leaders; you can only have one leader for the one new group. The simple but totally crucial optimization that we discussed was: when two groups merge, you update the leader pointers of all objects in the smaller group to point to the leader of the bigger group. That is, the new fused group inherits the leader from the bigger of its constituent parts.
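The eager, leader-pointer scheme just reviewed can be sketched as follows. This is a minimal illustration under my own naming conventions, not the lecture's exact code; the auxiliary `members` lists exist only so that the smaller group can be found and relabeled quickly.

```python
class EagerUnionFind:
    """Eager union-find: every object stores its group's leader directly."""

    def __init__(self, n):
        self.leader = list(range(n))            # leader[i] = leader of i's group
        self.members = [[i] for i in range(n)]  # members[l] = objects currently led by l

    def find(self, x):
        # Constant time: just read the leader field.
        return self.leader[x]

    def union(self, x, y):
        a, b = self.find(x), self.find(y)
        if a == b:
            return                              # already in the same group
        # Crucial optimization: relabel only the smaller group.
        if len(self.members[a]) < len(self.members[b]):
            a, b = b, a
        for obj in self.members[b]:
            self.leader[obj] = a                # smaller group inherits bigger's leader
        self.members[a].extend(self.members[b])
        self.members[b] = []
```

Each relabel doubles (at least) the size of the group an object belongs to, which is exactly the O(n log n) bound for n unions discussed next.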
If we do that optimization, it's still the case that a single Union might take linear time, theta(n) time, but a sequence of n Unions takes only O(n log n) time. And that's because each object endures at most a logarithmic number of leader updates, because every time its leader pointer gets updated, the population of the group it inhabits doubles.

So in this sequence of videos we are going to discuss a different approach to implementing the union-find data structure. And let's just go all-in and try to get away with updating only a single pointer in each Union. All right, well how are we going to do that? Well, I think if we look at a picture, it'll be clear what the approach is going to be. So, let's look at a simple example: there are just six objects in the world, in two groups. One, two, and three are in one group, with leader one; four, five, and six are in the second group, with leader four. So, with our previous implementation of union-find, since the two groups have equal size, we pick one arbitrarily to supply the new leader. Let's say four is going to be the new leader, and now we update objects one, two, and three so that they point directly to their new leader, four. And so the new idea is very simple.
Let's just, sort of as a shorthand for rewiring all of one, two, and three to point to four, rewire one's pointer to point to four. And then it's understood that two and three, as descendants of one, also now have the new leader four as well. So as a result, we do again get a directed tree afterwards, but we don't get one with just depth one; we don't get this shallow, bushy tree. We get a deeper one that has two levels below the root.

So let me give you another way of thinking about this, in terms of an array representation. Here I also want to make the point that, you know, while conceptually it's going to be very useful to think of this union-find data structure in terms of these directed trees, when you're actually implementing this data structure, this is not how you do it. You wouldn't bother with actual pointers between the different objects. You'd just have a simple array, indexed by the nodes, and in the entry corresponding to a node i, you would just store the name of i's parent. For instance, if we reframe this exact same example in the array representation, before we do the union, what have we got? Well, objects 1, 2, and 3 all have parent 1, and objects 4, 5, and 6 all have parent 4. So in the array, we'd have a 1 in the first three entries, and a 4 in the second three entries.
So in the old solution, we'd have to update the parent pointers of objects 1, 2, and 3; they'd all point directly to 4, so that gives us an array entirely of 4's. Whereas in the new solution, the only node whose parent pointer gets updated is that of object 1: it gets updated from itself to 4. And in particular, objects 2 and 3 still point to object 1, not directly to object 4.

So that's how Union works in a simple example; how does it work in general? Well, in general, you're given two objects, each of which belongs to some group. You can think of these two groups conceptually as directed trees. If you follow the parent pointers, all parent pointers lead to the root vertex, and we identify those root vertices with the leaders of these two groups. Now, when you have to merge the two trees together, how do you do it? Well, you pick one of the groups and you look at its root. Currently, it points to itself. You change its parent pointer to point to the other group's leader; that is, you install one of the two roots as a new child of the other tree's root. So, I hope the pros and cons of this alternative approach to the union-find data structure are intuitively clear. The pro comes in the Union operation, which is now very simple.
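The array picture for this six-object example is easy to reproduce in code. Below is a small sketch of my own (index 0 is left unused so the objects are numbered 1 through 6, as in the lecture), contrasting the old eager update with the new lazy one.

```python
# Parent-array representation of the six-object example.
# Index 0 is a dummy so that objects are numbered 1..6 as in the lecture.
parent = [0, 1, 1, 1, 4, 4, 4]   # objects 1,2,3 have parent 1; 4,5,6 have parent 4

# Old (eager) union: every member of {1,2,3} is rewired directly to 4.
eager = parent[:]
for obj in (1, 2, 3):
    eager[obj] = 4               # entries 1..6 all become 4

# New (lazy) union: only the root of one tree is rewired.
lazy = parent[:]
lazy[1] = 4                      # objects 2 and 3 still point to 1, which points to 4
```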
Now you might be tempted to just say Union takes constant time, because all you do is update one pointer, but it's actually not so simple. Right, because remember, you are given two objects x and y, and in general you aren't linking x and y together; they might be somewhere deep in a tree. You are linking roots together. So what the Union operation reduces to is two invocations of Find: you have to find x's root r1, you have to find y's root r2, and then you just do constant-time linking, either r1 to r2 or vice versa.

Now the issue with this lazy union approach is that it's not at all obvious, and in fact it's not going to be true, that the Find operation takes constant time. Remember, previously, when we actually did all this hard work of updating all these leader pointers every time we had a Union, it guaranteed that whenever there was a Find, boom, we just look in the field, we return the leader pointer, we're done. Here, parent pointers do not point directly to roots. Rather, you have to traverse, in general, a sequence of parent pointers from a given object x to go all the way up to the root of the corresponding tree.
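Putting the pieces together, a lazy-union implementation, before any optimizations, might look like the sketch below. This is my own minimal rendering of the idea; in particular, which root gets installed under which is still an arbitrary choice here, and that is exactly the choice the upcoming union-by-rank discussion is about.

```python
class LazyUnionFind:
    """Lazy unions: each object stores one parent pointer; roots point to themselves."""

    def __init__(self, n):
        self.parent = list(range(n))   # parent[i] = i's parent; initially all roots

    def find(self, x):
        # NOT constant time: walk parent pointers all the way up to the root.
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, x, y):
        # Two invocations of Find, then one constant-time pointer update.
        r1, r2 = self.find(x), self.find(y)
        if r1 != r2:
            self.parent[r1] = r2       # arbitrary choice: install r1 under r2
```

The worry, of course, is that with arbitrary linking these trees can grow deep, making `find` slow; controlling the depth is the point of the optimizations to come.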
So because of these trade-offs, these pros and cons, it's really not obvious at all whether this lazy union approach to the union-find data structure is a good idea, whether it's going to have any payoffs. So that's going to take some really quite subtle analysis, which is the main point of the lectures to come. It's also going to take a couple of optimizations, and in the next video, we'll start with the first one. You might be wondering, okay, so when you do a Union and you have to install one root under the other, how do you make that choice? The right answer is something called union by rank. That's coming up next.