1
00:00:00,012 --> 00:00:04,197
In this video we'll provide for the first 
time concrete evidence that the lazy 

2
00:00:04,197 --> 00:00:07,397
union approach to the Union-Find data 
structure is viable. 

3
00:00:07,397 --> 00:00:11,666
Specifically, we'll prove that the worst 
case running time of both the find and 

4
00:00:11,666 --> 00:00:16,019
the union operation is logarithmic in n, 
the number of objects stored in the data 

5
00:00:16,019 --> 00:00:18,981
structure. 
We are going to do even better later 

6
00:00:18,981 --> 00:00:22,302
once, we introduce a second optimization 
known as path impression. 

7
00:00:22,302 --> 00:00:25,963
But an important stepping stone is to 
understand just is why, just union by 

8
00:00:25,963 --> 00:00:29,102
rank, already gets us to a reasonable 
algorithmic run-time. 

9
00:00:29,102 --> 00:00:33,420
So a quick review of the lazy union 
approach to implementing the union fine 

10
00:00:33,420 --> 00:00:36,166
data structure. 
So, with each note we're going to 

11
00:00:36,166 --> 00:00:39,755
maintain a parent pointer. 
And it's no longer the case that we 

12
00:00:39,755 --> 00:00:43,600
insist the parent pointer point directly 
to the leader of a group. 

13
00:00:43,600 --> 00:00:48,142
Rather we just insist that the collection 
Collection of parent pointers, induces a 

14
00:00:48,142 --> 00:00:51,616
collection of directed trees. 
The root of each tree, that is, the node 

15
00:00:51,616 --> 00:00:55,196
which is it's own parent, we're going to 
define as the leader of that group. 

16
00:00:55,196 --> 00:00:59,060
So, given any old object, x, how do you 
implement find? How do you, figure out 

17
00:00:59,060 --> 00:01:03,036
what the leader vertex is? Well, you just 
traverse parent pointers, up until you 

18
00:01:03,036 --> 00:01:07,311
get to the root of that particular group. 
So for this implementation of the Find 

19
00:01:07,311 --> 00:01:11,220
operation, the worst case running time is 
just going to be the longest path of 

20
00:01:11,220 --> 00:01:15,211
parent pointers that you ever have to 
traverse, to get from an object to some 

21
00:01:15,211 --> 00:01:17,431
root. 
So the way we're going to quantify that 

22
00:01:17,431 --> 00:01:20,289
is using these ranks. 
So this is again, a field that we 

23
00:01:20,289 --> 00:01:23,799
maintain for each object. 
And for now, this will break down later, 

24
00:01:23,799 --> 00:01:28,658
but for now before we have path 
-pression, we're going to maintain the 

25
00:01:28,658 --> 00:01:34,049
invariant that the rank of an object x is 
exactly the largest number of pointers 

26
00:01:34,049 --> 00:01:37,439
you have to traverse, from some leaf, to 
get to x. 

27
00:01:37,439 --> 00:01:42,956
As a consequence, the biggest rank of any 
object is the longest path from any leaf 

28
00:01:42,956 --> 00:01:46,296
to any root. 
And that's going to be an upper bound on 

29
00:01:46,296 --> 00:01:49,942
the worst case running time of the find 
op- Operation. 

30
00:01:49,942 --> 00:01:54,751
So let's move on to the union operation. 
So here, given 2 objects, x and y, you 

31
00:01:54,751 --> 00:01:59,170
need to fuse their 2 trees, their 2 
groups, so you find the roots of the 2 

32
00:01:59,170 --> 00:02:02,315
trees, so you call a find on x, you call 
a find on y. 

33
00:02:02,315 --> 00:02:07,232
That gives you their 2 respective roots, 
and now you install 1 as a new child. 

34
00:02:07,232 --> 00:02:09,698
Of the other. 
Now we saw in a quiz in the last video, 

35
00:02:09,698 --> 00:02:13,362
if you're not careful about which root 
you install as a child under the other, 

36
00:02:13,362 --> 00:02:16,949
you can wind up with these long chains. 
And be stuck with a linear worse case 

37
00:02:16,949 --> 00:02:20,078
time for both find and union. 
So instead we have this union by rank 

38
00:02:20,078 --> 00:02:22,571
optimization. 
Which says, well, we want to keep our 

39
00:02:22,571 --> 00:02:25,752
trees from getting scraggly. 
And the way we're going to do that is. 

40
00:02:25,752 --> 00:02:29,889
When we have a shallow tree and a deep 
tree we make the shallow tree shall under 

41
00:02:29,889 --> 00:02:33,827
the root of the deep one, that prevents 
the tree from getting even deeper. 

42
00:02:33,827 --> 00:02:37,914
Now there is a situation where the two 
trees have exactly the same depths that 

43
00:02:37,914 --> 00:02:42,051
is where the two roots have exactly the 
same rank, in that case we just proceed 

44
00:02:42,051 --> 00:02:44,795
arbitrarily. 
Then when we merge two trees that both 

45
00:02:44,795 --> 00:02:48,891
had a common rank r, its important that 
in the new tree, the rank is gone up to 

46
00:02:48,891 --> 00:02:50,964
r+1. 
So we need the update, we need the 

47
00:02:50,964 --> 00:02:54,182
incremental rank of the new root to 
reflect that increase. 

48
00:02:54,182 --> 00:02:59,017
So that's where we've already been. 
Where are we going to next? Well the plan 

49
00:02:59,017 --> 00:03:03,407
for this video is to show that, with the 
union by rank, optimization. 

50
00:03:03,407 --> 00:03:08,177
The maximum rank of any node, is always, 
bounded above, by log, base 2 (n). 

51
00:03:08,177 --> 00:03:11,627
Where n is the number of objects in the 
data structure. 

52
00:03:11,627 --> 00:03:16,792
Now we just said, the worst case running 
time of find, is governed, by the maximum 

53
00:03:16,792 --> 00:03:19,862
rank. 
So the logarithmic maximum rank means 

54
00:03:19,862 --> 00:03:23,027
logarithm run time of find. 
that also carries over to the union 

55
00:03:23,027 --> 00:03:25,592
operation. 
Remember union is just 2 finds plus 

56
00:03:25,592 --> 00:03:29,602
constant work to rewire 1 pointer, so 
that's going to give us algorithm time 

57
00:03:29,602 --> 00:03:32,672
value on both operations. 
So let's see why that's true. 

58
00:03:32,672 --> 00:03:37,938
So let's begin the analyses with a few 
simple, but useful properties that follow 

59
00:03:37,938 --> 00:03:42,668
immediately from our invariant. 
From the way that we change the ranks of 

60
00:03:42,668 --> 00:03:45,425
objects as we do finds and as we do 
unions. 

61
00:03:45,425 --> 00:03:49,632
So the first simple property is focus on 
your favorite object. 

62
00:03:49,632 --> 00:03:52,053
x. 
And, watch this objects rank change, over 

63
00:03:52,053 --> 00:03:55,245
the course of the data structure, as we 
do finds and unions. 

64
00:03:55,245 --> 00:03:59,258
How can it change? Well, when we do a 
find we don't change anything, all the 

65
00:03:59,258 --> 00:04:02,359
ranks stay the same. 
When we do a union, all the ranks stay 

66
00:04:02,359 --> 00:04:04,747
the same. 
Well, except there is 1 case in the 

67
00:04:04,747 --> 00:04:08,901
union, where the rank, of a single node, 
gets bumped up by 1, gets increased. 

68
00:04:08,901 --> 00:04:13,212
So ranks only go up, over time, for all 
of the Objects, that's property one. 

69
00:04:13,212 --> 00:04:17,632
So the second property is again pretty 
much trivial, but really, really useful. 

70
00:04:17,632 --> 00:04:21,547
So what is the situation in which 
somebody's rank gets bumped up by 1? 

71
00:04:21,547 --> 00:04:25,197
We're going to take the union of two 
trees that have a common rank. 

72
00:04:25,197 --> 00:04:29,487
And then whichever of the two roots that 
we pick to be the root of the new bigger 

73
00:04:29,487 --> 00:04:32,997
tree That's the object whose rank gets 
bumped up by 1. 

74
00:04:32,997 --> 00:04:37,231
So new roots of this fused tree. 
So in particular, the only type of 

75
00:04:37,231 --> 00:04:40,513
objects that can ever get a rank increase 
is a root. 

76
00:04:40,513 --> 00:04:43,592
If you're not a root, your rank will not 
go up. 

77
00:04:43,592 --> 00:04:48,827
Furthermore, once you're not a root in 
this data structure, you will never be a 

78
00:04:48,827 --> 00:04:53,050
root again in the future. 
There is no process by which, you shed 

79
00:04:53,050 --> 00:04:56,112
your parent. 
Once you have a parent other than 

80
00:04:56,112 --> 00:04:59,618
yourself, you will always have exactly 
that parent. 

81
00:04:59,618 --> 00:05:04,782
Putting those two observations together 
we find that, as soon as an object X. 

82
00:05:04,782 --> 00:05:09,667
Becomes a non root but as soon as it has 
a in parent other than itself it rank is 

83
00:05:09,667 --> 00:05:14,522
frozen for the rest of time forever more. 
The third and final simple property 

84
00:05:14,522 --> 00:05:19,352
follows from a formula we mentioned in 
the last video about computing ranks. 

85
00:05:19,352 --> 00:05:24,352
So remember the rank of a node in general 
is going to be one more than the maximum 

86
00:05:24,352 --> 00:05:28,557
rank of any of its children. 
So if you have a child and there is some 

87
00:05:28,557 --> 00:05:31,742
path from a leaf to that child, it takes 
5 hops. 

88
00:05:31,742 --> 00:05:34,953
The path to you from that child is going 
to take 6 hops. 

89
00:05:34,953 --> 00:05:39,586
As a consequence as you go from the leaf 
up to the root you will see a strictly 

90
00:05:39,586 --> 00:05:43,797
increasing sequence of ranks. 
The rank of a parent is always strictly 

91
00:05:43,797 --> 00:05:46,552
more than the rank of all of those 
children. 

92
00:05:46,552 --> 00:05:48,840
So that's it for the immediate 
properties. 

93
00:05:48,840 --> 00:05:51,808
Let's go to a property which is a little 
less immediate. 

94
00:05:51,808 --> 00:05:55,771
But still this next lemma, which I'm 
going to call the rank lemma, it's the 

95
00:05:55,771 --> 00:05:58,814
best kind of lemma. 
So on the one hand, it's just not that 

96
00:05:58,814 --> 00:06:01,318
hard to prove. 
I'll give you a full proof in the 

97
00:06:01,318 --> 00:06:04,732
following 2 slides. 
On the other hand, it's really powerful. 

98
00:06:04,732 --> 00:06:08,787
It's going to play a crucial role in the 
analysis were doing right now. 

99
00:06:08,787 --> 00:06:12,696
A logarithmic run time bound, with a 
union by rank optimization, and we'll 

100
00:06:12,696 --> 00:06:16,972
keep using it again as a workforce, once 
we introduce path compression, and prove 

101
00:06:16,972 --> 00:06:20,653
better bounds on the operations. 
So what's the rank limit say? Well it 

102
00:06:20,653 --> 00:06:24,912
controls the population size of objects, 
that have a given (no period) Rank, so we 

103
00:06:24,912 --> 00:06:28,387
want it to apply at all intermediate 
stages of our data structure, so we're 

104
00:06:28,387 --> 00:06:30,637
going to consider an arbitrary sequence 
of unions. 

105
00:06:30,637 --> 00:06:32,962
You can throw in some finds as well. 
I don't care. 

106
00:06:32,962 --> 00:06:36,662
Finds don't change the data structure, so 
they're totally irrelevant, so think 

107
00:06:36,662 --> 00:06:39,037
about a sequence of unions, a sequence of 
mergers. 

108
00:06:39,037 --> 00:06:41,412
The claim is, for every non-negative 
integer, r. 

109
00:06:41,412 --> 00:06:45,491
The number of objects that have rank 
exactly r at this time is at must n. 

110
00:06:45,491 --> 00:06:48,311
the total number of objects divided by 2 
to the r. 

111
00:06:48,311 --> 00:06:52,440
So for example, if our rank is 0. 
It says that at must n objects have rank 

112
00:06:52,440 --> 00:06:55,600
0, so it is a trivial statement because 
only n objects. 

113
00:06:55,600 --> 00:06:59,616
But at any given time the number of 
objects that have rank 1 is at most n 

114
00:06:59,616 --> 00:07:04,002
over 2, the number of objects that have 
rank 2 is at most no over 4 and so on. 

115
00:07:04,002 --> 00:07:09,327
And if you think about it, if we succeed 
in proving the rank Lemma, we're pretty 

116
00:07:09,327 --> 00:07:13,852
much done, showing the efficacy of the 
union by rank optimization. 

117
00:07:13,852 --> 00:07:19,252
So in particular, once you take r, the, 
in this, key Lemma, to be log base 2 (n), 

118
00:07:19,252 --> 00:07:24,197
it says that there's at most 1 object. 
That has rank log2(n). 

119
00:07:24,197 --> 00:07:28,137
And there can't be any objects that have 
rank strictly larger. 

120
00:07:28,137 --> 00:07:32,712
That is, this limit implies that the 
maximum rank at all times is bounded 

121
00:07:32,712 --> 00:07:36,232
above by log2(n). 
And remember, the maximum rank is the 

122
00:07:36,232 --> 00:07:41,327
longest path of pointers, traversals, you 
ever need to get from a leaf to a root. 

123
00:07:41,327 --> 00:07:46,187
And that means the most amount of work 
we'll ever do in a find, and therefore, 

124
00:07:46,187 --> 00:07:51,857
in a union is O (log n) Okay, so I've now 
teased you with the consequences of the 

125
00:07:51,857 --> 00:07:57,967
rank lemma, assuming that it's true, but 
why is it true? Let's turn to the proof. 

126
00:07:57,967 --> 00:08:03,382
I'm going to break the proof down into 
two claims, claim one and claim two. 

127
00:08:03,382 --> 00:08:08,057
We'll see that the two claims easily 
imply the rank Rank Lemma. 

128
00:08:08,057 --> 00:08:12,625
So claim 1 asks you to consider 2 
objects, X and Y, that have exactly the 

129
00:08:12,625 --> 00:08:16,024
same rank R. 
And the claim asserts that the sub-trees 

130
00:08:16,024 --> 00:08:20,542
of these 2 objects have to be disjoint. 
They have no objects in common. 

131
00:08:20,542 --> 00:08:25,369
And here by the sub-tree of an object, I 
just mean the other objects that can 

132
00:08:25,369 --> 00:08:29,313
reach this one by following a sequence. 
Of, parent pointers. 

133
00:08:29,313 --> 00:08:33,235
So that is the subtree at x, is the 
objects from which you can reach x. 

134
00:08:33,235 --> 00:08:36,499
The subtree at y is the objects from 
which you can reach y. 

135
00:08:36,499 --> 00:08:41,054
The second claim, is that, if you look at 
any object that has rank r, and you look 

136
00:08:41,054 --> 00:08:45,435
at it's subtree, that is, if you look at 
the number of objects that can reach, 

137
00:08:45,435 --> 00:08:49,245
this object x by following pointers, 
there have to be a lot of them. 

138
00:08:49,245 --> 00:08:54,289
There have to be at least 2 raised to 
that objects rank Are, objects in it's 

139
00:08:54,289 --> 00:09:01,317
subtree? Notice that, if we prove claim 1 
and claim 2, then the Rank Lemma, follows 

140
00:09:01,317 --> 00:09:04,618
easily. 
Why? Well, fix a value for, r. 

141
00:09:04,618 --> 00:09:10,212
2, 10, I don't care what. 
Look at all the nodes that have this rank 

142
00:09:10,212 --> 00:09:12,553
R. 
By claim 2, each of them has at least 2 

143
00:09:12,553 --> 00:09:17,123
to the R objects that could reach them. 
And by claim 1, these have to be disjoint 

144
00:09:17,123 --> 00:09:20,069
sets of objects. 
Well, there's only N objects to go 

145
00:09:20,069 --> 00:09:24,261
around, and if each of these disjoint 
sets has at least 2 to the R of them, 

146
00:09:24,261 --> 00:09:27,559
there are going to be at most N over 2 to 
the R such groups. 

147
00:09:27,559 --> 00:09:31,232
That is at most N over 2 to the R nodes, 
objects with this rank R. 

148
00:09:31,232 --> 00:09:36,580
So we've reduced the proof of the rank 
Lemma to proving claims 1 and 2, I will 

149
00:09:36,580 --> 00:09:40,352
do them in turn. 
So for claim 1 let me go via the contra 

150
00:09:40,352 --> 00:09:46,199
positive, that is, I will assume that the 
conclusion is false, and I will show that 

151
00:09:46,199 --> 00:09:51,389
the hypothesis must then also be false. 
So, lets assume that we have 2 no, 

152
00:09:51,389 --> 00:09:55,269
objects x and y, and their subtrees are 
not, disjoint. 

153
00:09:55,269 --> 00:10:00,322
That is, there exists an object z, from 
which You can reach X and from the same 

154
00:10:00,322 --> 00:10:03,672
object Z, you can also reach Y by a 
sequence parent pointers. 

155
00:10:03,672 --> 00:10:08,057
Well now let's use the fact that we're 
dealing with the directed tree, right, so 

156
00:10:08,057 --> 00:10:12,306
if you start with an object Z. 
There's only a unique parent point, or 2, 

157
00:10:12,306 --> 00:10:15,799
follow each time. 
So that is, all of the objects reachable 

158
00:10:15,799 --> 00:10:20,011
from z, they form a directed path, 
leading up to the root of z's group. 

159
00:10:20,011 --> 00:10:24,562
So the only way for both x and y to be 
reachable from z, they have to both be on 

160
00:10:24,562 --> 00:10:27,593
this path. 
If they're both on this path, then 1 has 

161
00:10:27,593 --> 00:10:31,906
to be an ancestor of the other. 
So now we're going to use the third of 

162
00:10:31,906 --> 00:10:37,247
our simple properties that we observed. 
That is, on every path to the root, ranks 

163
00:10:37,247 --> 00:10:41,639
strictly go up, each time. 
So, whichever of x or y is an ancestor of 

164
00:10:41,639 --> 00:10:46,834
the other, that has strictly higher rank. 
Therefore x and y do not have the same 

165
00:10:46,834 --> 00:10:49,952
rank. 
That completes the proof, of claim 1. 

166
00:10:49,952 --> 00:10:55,681
So lets move on to claim 2. 
Remember, claim 2 is search that, an 

167
00:10:55,681 --> 00:11:02,292
object of rank r, necessarily has 2 ^ r 
objects, or more, in its subtree. 

168
00:11:02,292 --> 00:11:09,521
That's how many objects can actually 
reach this object x, by following Parent 

169
00:11:09,521 --> 00:11:12,106
pointers. 
So for this proof we're going to proceed 

170
00:11:12,106 --> 00:11:16,527
by induction on the number of operations, 
and again remember fine operations have 

171
00:11:16,527 --> 00:11:19,507
no effect on the data structure, so we 
can ignore them. 

172
00:11:19,507 --> 00:11:23,397
So it's just by induction on the number 
of union operations that happen. 

173
00:11:23,397 --> 00:11:27,775
So for the base case, when, before we've 
done any unions whatsoever, we're doing 

174
00:11:27,775 --> 00:11:30,258
just fine. 
Every object has a rank of 0 and the 

175
00:11:30,258 --> 00:11:32,632
sub-tree size of every object is equal to 
1. 

176
00:11:32,632 --> 00:11:35,212
That object itself, also known as 2 to 
the 0. 

177
00:11:35,212 --> 00:11:38,052
Zero. 
Now for the inductive step, there's an 

178
00:11:38,052 --> 00:11:41,972
easy case and a hard case. 
The easy case is where nobody's rank 

179
00:11:41,972 --> 00:11:46,712
changes, where we do a union, and 
everybody's rank stays exactly the same. 

180
00:11:46,712 --> 00:11:51,082
In this case, we're golden. 
Why? Well, when you do a union, sub-tree 

181
00:11:51,082 --> 00:11:53,412
sizes only go up. 
There's only more. 

182
00:11:53,412 --> 00:11:57,272
Pointers so there's only more objects 
that can reach any given other objects. 

183
00:11:57,272 --> 00:11:59,545
So sub-tree sizes go up, ranks stay the 
same. 

184
00:11:59,545 --> 00:12:03,289
If we had this inequality of sub-tree 
size as being at least 2 ^ r before, we 

185
00:12:03,289 --> 00:12:05,026
have it equally well now. 
Now. 

186
00:12:05,026 --> 00:12:09,709
So the interesting case is when 
somebody's ranked actually changes. 

187
00:12:09,709 --> 00:12:14,492
How can that happen? Well it happens in 
only one particular way that we 

188
00:12:14,492 --> 00:12:18,213
understand well. 
Looking at a union operation between 

189
00:12:18,213 --> 00:12:22,110
objects X and Y. 
Suppose the roots of these objects are S1 

190
00:12:22,110 --> 00:12:25,312
and S2 respectively. 
It's only when these. 

191
00:12:25,312 --> 00:12:29,824
Two roots have the same rank, let's call 
that common rank R, that somebodies rank 

192
00:12:29,824 --> 00:12:32,905
gets changed. 
In particular, we're going to break ties 

193
00:12:32,905 --> 00:12:36,918
as we did in the previous video. 
S2 will be the root of the fused tree, S1 

194
00:12:36,918 --> 00:12:40,576
will become a child of it. 
And, in that case, S2s rank gets bumped 

195
00:12:40,576 --> 00:12:42,381
up by 1. 
It goes from R To r + 1. 

196
00:12:42,381 --> 00:12:45,675
Now notice, in this case, we do have 
something to prove. 

197
00:12:45,675 --> 00:12:50,185
What are we trying to establish? We're 
trying to establish that every subtree 

198
00:12:50,185 --> 00:12:54,806
size is big, as a function of the rank. 
So, s2's rank has gone up, and therefore 

199
00:12:54,806 --> 00:12:59,767
the lowerbound, the bar that we have to 
meet, for the subtree size, has also Gone 

200
00:12:59,767 --> 00:13:03,532
up, it's doubled. 
So in this case we actually have to 

201
00:13:03,532 --> 00:13:08,547
scrutinize s2's new sub-tree. 
So what is its new sub-tree? Well, it's 

202
00:13:08,547 --> 00:13:14,012
really just composed from its old 
sub-tree, and it inherits s1 and all of 

203
00:13:14,012 --> 00:13:18,348
its sub-trees. 
Well, in that case, we know that s2's new 

204
00:13:18,348 --> 00:13:24,105
subtree size, the nubmer of no objects 
that can reach it, is just, it's old 

205
00:13:24,105 --> 00:13:27,832
subtree size, plus the old subtree size, 
of s1. 

206
00:13:27,832 --> 00:13:34,032
But then w're in good shape because we 
have the inductive hypothesis to rescue 

207
00:13:34,032 --> 00:13:37,457
us. 
So remember, before this union, by the 

208
00:13:37,457 --> 00:13:43,847
inductive hypothesis, for every object 
with a given rank, say r, it had at least 

209
00:13:43,847 --> 00:13:48,977
2^r objects in its sub tree. 
So S1, and S2, both had rank r before 

210
00:13:48,977 --> 00:13:52,287
this. 
Unions, before this union, both of their 

211
00:13:52,287 --> 00:13:57,287
subtree were at least two to the r. 
So as two subtree sizes bounded below by 

212
00:13:57,287 --> 00:14:02,037
two to the r plus two to the r, a 
quantity also known as two raised to the 

213
00:14:02,037 --> 00:14:05,587
r plus one. 
Quite conveniently, r plus one is S two's 

214
00:14:05,587 --> 00:14:10,987
new rank, so S two's new bigger rank, its 
subtree size is still meeting the lower 

215
00:14:10,987 --> 00:14:15,307
bound, meeting the target of two raised 
to, New rank, 2 ^ r + 1. 

216
00:14:15,307 --> 00:14:21,017
So that completes the inductive step, 
therefore it completes the proof of claim 

217
00:14:21,017 --> 00:14:25,467
2, that objects of rank r have subtree 
sizes at least 2 ^ r. 

218
00:14:25,467 --> 00:14:31,332
Therefore completes the proof of the rank 
Lemma, that for every rank r, there's at 

219
00:14:31,332 --> 00:14:35,379
most n / 2 ^ r nodes of rank r. 
And remember the rank Lemma, implies that 

220
00:14:35,379 --> 00:14:39,217
the maximum rank, at all times, is 
bounded by log base 2 (n), as long as 

221
00:14:39,217 --> 00:14:42,812
you're using union by rank. 
And that implies, that with this first 

222
00:14:42,812 --> 00:14:47,215
optimization, the worst case running time 
of union, and find, are both O (log n), 

223
00:14:47,215 --> 00:14:50,158
where n is the number of objects, in the 
data structure.