1
00:00:01,520 --> 00:00:05,280
So, welcome to this optional sequence of
videos on state of the art implementations

2
00:00:05,280 --> 00:00:07,600
of the union to find data structure.

3
00:00:07,600 --> 00:00:09,710
Now as advanced in optional material,

4
00:00:09,710 --> 00:00:12,260
let me make a few comments
before we get started.

5
00:00:14,860 --> 00:00:17,790
So the first comment is I'm going to
assume that you're in a particularly

6
00:00:17,790 --> 00:00:18,910
motivated mood.

7
00:00:18,910 --> 00:00:21,550
Now, no one's forcing you to
take these algorithm courses, so

8
00:00:21,550 --> 00:00:23,800
I'm always assuming that you're
very motivated individual,

9
00:00:23,800 --> 00:00:27,640
but that's going to hold even more true
than usual in these advanced videos.

10
00:00:27,640 --> 00:00:31,590
And the way that plays out is while I'm
going to hold myself to the same exacting

11
00:00:31,590 --> 00:00:35,362
standards of clarity that I usually do,
I'm going to ask a little bit more of you,

12
00:00:35,362 --> 00:00:35,890
the viewer.

13
00:00:35,890 --> 00:00:41,180
I'm going to perhaps dot a few less Is,
cross a few less Ts than I usually do.

14
00:00:41,180 --> 00:00:43,820
So you may find yourself
periodically needing to pause and

15
00:00:43,820 --> 00:00:46,250
think through some of the details
that I'm glossing over.

16
00:00:46,250 --> 00:00:50,311
The second comment is I'm going to give
is sort of short shrift to applications

17
00:00:50,311 --> 00:00:52,103
of the union find data structure.

18
00:00:52,103 --> 00:00:55,922
That's not because there aren't any,
there's plenty of applications of this

19
00:00:55,922 --> 00:00:59,910
data structure but for these videos we're
really just going to immerse ourselves in

20
00:00:59,910 --> 00:01:03,502
the beauty and the depth of ideas behind
the design of the data structures and

21
00:01:03,502 --> 00:01:05,830
especially the analysis
of their performance.

22
00:01:06,990 --> 00:01:10,190
The final comment is to keep in
mind that this is some seriously

23
00:01:10,190 --> 00:01:12,040
next level material.

24
00:01:12,040 --> 00:01:15,910
It is totally normal that the first time
you see this stuff, you find it confusing.

25
00:01:15,910 --> 00:01:18,070
You find it difficult to understand.

26
00:01:18,070 --> 00:01:20,220
So confusion should not discourage you.

27
00:01:20,220 --> 00:01:23,550
It does not represent any
intellectual failing on your part.

28
00:01:23,550 --> 00:01:28,410
Rather, keep in mind, it represents
an opportunity to get even smarter.

29
00:01:28,410 --> 00:01:31,098
So, with those comments in mind,
let's go ahead and

30
00:01:31,098 --> 00:01:35,357
proceed to a different approach to the
Union-find data structure via lazy unions.

31
00:01:38,011 --> 00:01:40,962
So let's have one slide with a quick
review of what we already know about

32
00:01:40,962 --> 00:01:42,400
the Union-find data structure.

33
00:01:42,400 --> 00:01:45,990
Recall we discussed this in the context
of a fast implementation of

34
00:01:45,990 --> 00:01:50,760
Kruskal's minimum spanning tree algorithm
The raison d'etre of a union-find

35
00:01:50,760 --> 00:01:55,980
data structure is to maintain
a partition of a universe of objects.

36
00:01:55,980 --> 00:02:00,005
So in the context of Kruskal's algorithm,
the objects were the vertices, and

37
00:02:00,005 --> 00:02:03,969
the groups we wanted to maintain were
the connected components with respect to

38
00:02:03,969 --> 00:02:06,205
the edges we committed ourselves to so
far.

39
00:02:08,482 --> 00:02:10,970
The data structure should
support two operations.

40
00:02:10,970 --> 00:02:14,450
No prizes for guessing the names
of those two operations.

41
00:02:14,450 --> 00:02:15,900
First is the find operation.

42
00:02:15,900 --> 00:02:20,720
This is given an object from the universe
return the name of that object's group.

43
00:02:23,370 --> 00:02:26,670
So for example if X was in
the middle of this square that I put

44
00:02:26,670 --> 00:02:30,160
on the right representing the universe
capital X we would return a C3.

45
00:02:30,160 --> 00:02:32,040
The name of the group that contains X.

46
00:02:34,530 --> 00:02:38,510
So in the context of Kruskal's algorithm,
we use find operation to check for cycles.

47
00:02:38,510 --> 00:02:41,290
How did we know whether adding a new edge
would create a cycle with respect to

48
00:02:41,290 --> 00:02:42,350
the edges we've already chosen?

49
00:02:42,350 --> 00:02:43,330
Well, it would be if and

50
00:02:43,330 --> 00:02:47,040
only if the end points of that edge were
already in the same connected component.

51
00:02:47,040 --> 00:02:50,309
That is, if and only if the two find
operations returned the same answer.

52
00:02:52,270 --> 00:02:55,810
In the Union Operation you're given out
one object but you call them x and y,

53
00:02:55,810 --> 00:02:59,630
and then the responsibility of
the operation is to merge the groups

54
00:02:59,630 --> 00:03:00,790
that contain x and y.

55
00:03:00,790 --> 00:03:04,690
So for example in the picture y was in C4,
x was as before in C3.

56
00:03:04,690 --> 00:03:08,592
Then your job is to fuse the groups C3 and
C4 into a single group.

57
00:03:11,024 --> 00:03:12,954
In the context of Kruskal's algorithm,

58
00:03:12,954 --> 00:03:15,280
we needed this operation
when we added a new edge.

59
00:03:15,280 --> 00:03:19,580
This fused together two of the existing
connecting components into one.

60
00:03:19,580 --> 00:03:21,410
So that was exactly the union operation.

61
00:03:23,130 --> 00:03:25,960
So we all ready went through one
implementation of union and find.

62
00:03:25,960 --> 00:03:29,070
That was order to get a blazingly fast
implementation of Kruskal's algorithm.

63
00:03:29,070 --> 00:03:30,876
So let's just review how that works.

64
00:03:33,546 --> 00:03:36,682
With each group,
we associated a linked structure, so

65
00:03:36,682 --> 00:03:39,770
each object had one pointer
associated with it.

66
00:03:39,770 --> 00:03:41,720
And the invariant was in a given group,

67
00:03:41,720 --> 00:03:44,440
there was some representative
leader object.

68
00:03:44,440 --> 00:03:47,620
And everybody in that group pointed
to the leader of that group.

69
00:03:50,070 --> 00:03:53,620
So for example, on the right I'm showing
you a group with three objects, x, y, and

70
00:03:53,620 --> 00:03:56,080
z, and
x would be the leader of this group.

71
00:03:56,080 --> 00:03:58,370
All three of the objects
point directly to x, and

72
00:03:58,370 --> 00:04:01,950
that would just be the output of the find
operation, the leader of the group.

73
00:04:01,950 --> 00:04:03,540
We use that as the group name.

74
00:04:05,030 --> 00:04:06,130
Now the part which is cool and

75
00:04:06,130 --> 00:04:09,860
obvious about this approach is our
find operations take constant time.

76
00:04:09,860 --> 00:04:12,510
All you do is return the leader
of the given object.

77
00:04:12,510 --> 00:04:15,910
Now the tricky part was analyzing
the cost of Union operations.

78
00:04:15,910 --> 00:04:18,710
So the problem here is that to
maintain the invariant that

79
00:04:18,710 --> 00:04:20,540
every object of a group
points to its leader.

80
00:04:20,540 --> 00:04:24,130
When you fuse two groups, you have to
update a bunch of the objects' leaders.

81
00:04:24,130 --> 00:04:25,950
We'd only have one leader for
the one new group.

82
00:04:27,630 --> 00:04:31,830
The simple, but totally crucial,
optimization that we discussed was

83
00:04:31,830 --> 00:04:36,160
when two groups merge you update
the leader pointers of all objects in

84
00:04:36,160 --> 00:04:39,450
the smaller group to point to
the leader of the bigger group.

85
00:04:39,450 --> 00:04:42,400
That is the new fused
group inherits the leader

86
00:04:42,400 --> 00:04:45,140
from the bigger of its constituent parts.

87
00:04:45,140 --> 00:04:49,290
If we do that optimization, it's still the
case that a single union might take linear

88
00:04:49,290 --> 00:04:55,820
time fade event time but a sequence of n
unions takes only big o of n log in time.

89
00:04:55,820 --> 00:04:59,934
And that's because each object endures
at most a logarithmic number of leader

90
00:04:59,934 --> 00:05:03,988
updates, cause every time it's leader
pointer gets updated the population

91
00:05:03,988 --> 00:05:06,059
of the group that it inhabits doubles.

92
00:05:07,831 --> 00:05:11,751
So in this sequence of videos we are going
to discuss a different approach to

93
00:05:11,751 --> 00:05:14,191
implementing the union
find data structure.

94
00:05:15,710 --> 00:05:17,260
And let's just go all in and

95
00:05:17,260 --> 00:05:22,300
try to get away with updating only
a single pointer in each union.

96
00:05:23,960 --> 00:05:25,380
All right,
well how are we going to do that?

97
00:05:25,380 --> 00:05:27,061
Well, I think if we look at a picture,

98
00:05:27,061 --> 00:05:29,230
it'll be clear what
the approach is going to be.

99
00:05:31,173 --> 00:05:32,670
So let's look at a simple example.

100
00:05:32,670 --> 00:05:35,680
There's just six objects in the world and
currently they're in two groups.

101
00:05:35,680 --> 00:05:38,190
One, two, and
three are in one group with leader one.

102
00:05:38,190 --> 00:05:41,640
Four, five, and six are in
the second group with leader four.

103
00:05:41,640 --> 00:05:45,940
So, with our previous implementation of
union find and the two groups have equal

104
00:05:45,940 --> 00:05:49,230
size, so we pick one arbitrarily
to represent the new leader.

105
00:05:49,230 --> 00:05:51,230
Let's say four is going to
be the new leader.

106
00:05:51,230 --> 00:05:53,652
And now we update objects one,
two, and three so

107
00:05:53,652 --> 00:05:56,324
that they point directly
to their new leader, four.

108
00:05:57,650 --> 00:06:01,635
And so the new idea is very simple
let's just sort of as a short hand for

109
00:06:01,635 --> 00:06:05,110
re-wiring all of one, two and
three to point to four,

110
00:06:05,110 --> 00:06:09,890
let's just re-wire one pointer to point to
four and then it's understood that two and

111
00:06:09,890 --> 00:06:14,340
three as descendants of one also now
have the new leader four as well.

112
00:06:16,060 --> 00:06:19,880
So as a result, we do again get
a directed tree afterwards,

113
00:06:19,880 --> 00:06:21,900
but we don't get one
with just depth of one.

114
00:06:21,900 --> 00:06:24,110
We don't get as shallow, pushy a tree.

115
00:06:24,110 --> 00:06:26,980
We get a deeper one that has
two levels below the root.

116
00:06:29,295 --> 00:06:31,635
So, let me give you another way of
thinking about this in terms of an array

117
00:06:31,635 --> 00:06:32,725
representation.

118
00:06:32,725 --> 00:06:35,965
And here I also want to make a point
that why conceptually it's going

119
00:06:35,965 --> 00:06:38,925
to be very useful to think of these union
find data structure in terms of these

120
00:06:38,925 --> 00:06:40,270
directed trees.

121
00:06:40,270 --> 00:06:44,015
You're actually implementing this data
structure this is not how you'd do it.

122
00:06:44,015 --> 00:06:47,450
You wouldn't bother with actually
pointers between the different objects.

123
00:06:47,450 --> 00:06:50,500
You'd just have a simply
array indexed by the nodes.

124
00:06:50,500 --> 00:06:54,420
And in the entry corresponding to
a node i you would just store the name

125
00:06:54,420 --> 00:06:55,930
of i's parent.

126
00:06:55,930 --> 00:07:00,710
For instance if we reframe this exact
same example in an array representation

127
00:07:00,710 --> 00:07:02,370
before we do the union what do we got?

128
00:07:02,370 --> 00:07:05,480
Well objects one, two, and
three all have parent one.

129
00:07:05,480 --> 00:07:08,420
Objects four, five, and
six all have parent four.

130
00:07:08,420 --> 00:07:10,890
So in the array we'd have a one.

131
00:07:10,890 --> 00:07:13,820
In the first three entries and
a four in the second three entries.

132
00:07:14,830 --> 00:07:18,730
So in the old solution we have to update
the parent pointers of objects one two and

133
00:07:18,730 --> 00:07:20,530
three, they're all going to
point directly to four, so

134
00:07:20,530 --> 00:07:22,100
that gives us an array entirely of fours.

135
00:07:22,100 --> 00:07:27,330
Whereas in the new solution
the only node whose parent

136
00:07:27,330 --> 00:07:32,050
pointer gets updated is that of object
one, it gets updated from itself to four.

137
00:07:34,500 --> 00:07:36,388
And in a particular objects two and

138
00:07:36,388 --> 00:07:39,720
three still point to object one
not directly to object four.

139
00:07:42,090 --> 00:07:43,960
So that's how union works so
in simple example.

140
00:07:43,960 --> 00:07:45,125
How does it work in general?

141
00:07:45,125 --> 00:07:48,980
Well, in general, you're given two
objects, each belongs to some group.

142
00:07:48,980 --> 00:07:52,300
You can think of these two groups
conceptually as directed trees.

143
00:07:52,300 --> 00:07:56,660
If you follow all the parent corners, all
parent corners lead to the root vertex.

144
00:07:56,660 --> 00:08:01,388
We identify those root vertices with
the leaders of these two groups.

145
00:08:01,388 --> 00:08:05,110
Now when you have to merge the two
trees together, how do you do it?

146
00:08:05,110 --> 00:08:06,580
Well, you pick one of the groups and

147
00:08:06,580 --> 00:08:08,960
you look at its roots
currently it points to itself.

148
00:08:08,960 --> 00:08:12,490
You can change its parent pointer to
point to the other groups leader.

149
00:08:12,490 --> 00:08:17,834
That is you install one of the two roots
as a new child of the other trees root.

150
00:08:19,742 --> 00:08:23,013
So I hope the pros and cons of this
alternative approach to the union find

151
00:08:23,013 --> 00:08:25,290
data structure are intuitively clear.

152
00:08:25,290 --> 00:08:30,050
The win, the pro, comes in the union
operation, which is now very simple.

153
00:08:30,050 --> 00:08:32,630
Now you might be tempted to just
say union takes constant time,

154
00:08:32,630 --> 00:08:35,900
because all you do is update one pointer,
It's actually not so simple, right?

155
00:08:35,900 --> 00:08:38,920
Because remember you're given two objects,
x and y, and

156
00:08:38,920 --> 00:08:40,750
in general you're not linking x and
y together.

157
00:08:40,750 --> 00:08:42,550
They might be somewhere deep in a tree.

158
00:08:42,550 --> 00:08:44,140
You're linking roots together.

159
00:08:44,140 --> 00:08:47,880
So what is true is that
the union operation reduces two

160
00:08:47,880 --> 00:08:49,310
two indications of find.

161
00:08:49,310 --> 00:08:51,110
You have to find x's root.

162
00:08:51,110 --> 00:08:53,950
R sub 1, you have to y's root r sub 2,

163
00:08:53,950 --> 00:08:59,110
and then you just do constant time
linking either r1 to r 2 or vice versa.

164
00:08:59,110 --> 00:09:03,080
Now the issue with this lazy union
approach is that it's not at all obvious,

165
00:09:03,080 --> 00:09:07,010
and in fact it's not going to be true,
that find operation takes constant time.

166
00:09:07,010 --> 00:09:10,520
Remember, previously when we actually
did this hard work of updating all these

167
00:09:10,520 --> 00:09:13,860
leader pointers every time we had a union
it guaranteed that whenever there is

168
00:09:13,860 --> 00:09:17,680
a find boom, we just look in the field,
we return the leader pointer, we're done.

169
00:09:17,680 --> 00:09:21,690
Here, parent pointers do not
point directly to roots rather

170
00:09:21,690 --> 00:09:26,430
you have to traverse in general a sequence
of parent pointers from a given object x

171
00:09:26,430 --> 00:09:29,170
to go all the way up to the root
of the correspondence tree.

172
00:09:32,070 --> 00:09:33,350
So because of these trade offs.

173
00:09:33,350 --> 00:09:36,260
These pros and cons, it's really
not obvious at all whether this

174
00:09:36,260 --> 00:09:39,570
lazy union approach to the union
find data structure is a good idea.

175
00:09:39,570 --> 00:09:41,038
Whether it's going to have any pay offs.

176
00:09:41,038 --> 00:09:44,090
So that's going to take really some
quite subtle analysis which is the main

177
00:09:44,090 --> 00:09:45,550
point of the lectures to come.

178
00:09:45,550 --> 00:09:47,790
It's also going to take
a couple of optimizations and

179
00:09:47,790 --> 00:09:49,800
the next video will start
with the first one.

180
00:09:49,800 --> 00:09:51,920
You might be wondering, okay,
so when you do a union and

181
00:09:51,920 --> 00:09:55,430
you have to install one root under
the other, how do you make that choice?

182
00:09:55,430 --> 00:09:58,270
So the right answer is
something called union by rank.

183
00:09:58,270 --> 00:09:59,514
That's coming up next.