So I don't blame you if you're feeling a little bit restless. You've now watched three videos about this alternative approach to the union-find data structure based on lazy unions, and, frankly, we don't have many tangible things to show for it. We already had a perfectly fine implementation of this data structure based on eager unions. That gave us constant-time finds. And yes, union could be linear in the worst case, but it was logarithmic on average over a sequence of unions. And now we have this other implementation where both of our operations require logarithmic time. It's true that it's a worst-case bound, and it's true that union by rank is pretty cool, but still, the bottom line up to this point is not that compelling. But I've still got another trick up my sleeve. Just watch what happens once we employ a second optimization known as path compression.

The optimization is very natural. I suspect many serious programmers would come up with it on their own if they were tasked with implementing union-find with union by rank. So to motivate this optimization, let's think about what would be our worst nightmare as someone maintaining such a data structure and hoping for good performance.
Well, remember that the running time of a find is proportional to the number of parent pointers you have to traverse; that is, it's proportional to the depth of the object on which it's invoked. So what does the worst-case find look like? It's going to be a leaf, and moreover the leaf that is furthest away from its corresponding root. Now, on the one hand, we're using union by rank, so we know this depth can't be too bad: it's going to be at most O(log n). However, there will in general be leaves that are Theta(log n) hops away from their root. And for all we know, we're going to get an endless sequence of find operations where every single one is invoked on a leaf that's a logarithmic number of hops away from its root. So, for example, in the pink tree that I've shown in the upper right of the slide, maybe someone just keeps invoking find from the object 1 over and over and over again. Then we're going to suffer a logarithmic number of steps on every single operation. But if you think about it, it's totally pointless to keep redoing the work of previous finds, to keep re-traversing the same parent pointers over and over again.

So, for example, in the pink tree that I've shown in the upper right, imagine that find is invoked from the object 1. We traverse the 3 parent pointers, we discover the ancestors 4 and 6 before terminating at 7, and we discover that 7 is the leader, or the root, corresponding to the object 1. Now remember, we do not care that 4 and 6 are ancestors of 1; that is uninteresting. We only visited them to discover what we actually cared about: that 7 is the leader, the root, corresponding to 1. Well, let's just cache that information now that we've computed it. This is, basically, reconstructing the eager union-find implementation: let's just rewire 1's parent pointer to point straight to 7, so that we don't need to recompute 4 and 6 in some later find operation. And more generally, after we've invoked find from object 1, we may as well rewire 4's parent pointer as well, to point directly to its leader 7.

So that, then, is path compression. When find is invoked from some node x and you traverse parent pointers from x up to its root, call it r, then for every object that you visit along this path from x to r, you rewire the parent pointer to point directly to r. So r doesn't have its parent pointer changed; it still points to itself.
The penultimate object on this path doesn't have its parent pointer changed either; it was already pointing to the root r. But anything below the root and its immediate descendant on this path from x will have its parent pointer updated, and it will be updated to point directly to r. Now, we couldn't get away with this if the tree had to be binary, but it doesn't have to be binary. We couldn't get away with this if we had to satisfy some other constraint, like a search tree property, but we don't. So nothing's stopping us from just caching this information about who everyone's leaders are, because that's really the only information we're responsible for exporting from this union-find data structure.

Pictorially, if you like thinking about the trees, what path compression does in effect is make the tree shallower and more bushy. So in our example in the upper right, it rips out 1 and 4 and makes them immediate descendants of the root 7. If you prefer to think in terms of the array representation, remember that in the array representation, the array entry for i records the parent of the object i. So, in our pink tree, 5, 6, and 7 all point to 7; they all have parent 7. Whereas 2 and 3 have parent 5, 4 has parent 6, and 1 has parent 4.
And after applying path compression following a find invoked at the object 1, 1 and 4 are redirected to have parent 7.

So what are the pros and cons of the path compression optimization? Well, the con side is quite minor. Really, we're just doing a little multitasking on find, and that introduces a small constant-factor overhead. You're already doing work proportional to the number of hops on the path from x to its root, and we're still just doing constant work for each node on that path: extra constant work per node, after the fact, to rewire its parent pointer to point to the root. The pro should be obvious. This is going to speed up all subsequent find operations; you're really making sure you don't redo redundant work, that you don't traverse the same parent pointers over and over and over again. So what's clear is that finds will speed up. What's really not clear is whether they'll speed up by a lot. Is this only going to improve the performance of the find operation by, say, a factor of 2? Or will we get something fundamentally better than the logarithmic performance we were stuck with without path compression? So, let me spend a slide on the subtle point of how ranks interact with the path compression optimization.
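The two-pass rewiring just described can be sketched in a few lines. This is not the lecture's official code, just a minimal illustration assuming a parent list where each root points to itself, tried out on the pink tree from the slide (index 0 is unused padding so that objects are numbered 1 through 7):

```python
# A minimal sketch of find with path compression (not the lecture's
# official code). Assumes parent[i] is the parent of object i, and a
# root r satisfies parent[r] == r.
def find(parent, x):
    # First pass: follow parent pointers up to the root r.
    r = x
    while parent[r] != r:
        r = parent[r]
    # Second pass: rewire every object on the path to point directly to r.
    while parent[x] != r:
        parent[x], x = r, parent[x]   # set parent[x] = r, then step up the old path
    return r

# The pink tree from the slide (index 0 unused; objects are 1..7):
# 1 -> 4 -> 6 -> 7, 2 -> 5, 3 -> 5, 5 -> 7, and 7 is the root.
parent = [0, 4, 5, 5, 6, 7, 7, 7]
assert find(parent, 1) == 7
assert parent[1] == 7 and parent[4] == 7   # 1 and 4 now point straight to 7
assert parent[6] == 7                      # the penultimate node was unchanged
```

Note that the second loop stops at the root's immediate child, matching the observation above that neither r nor the penultimate object on the path has its pointer changed.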
So the plan is, we're going to manipulate rank fields with path compression in exactly the same way as we did when we didn't have path compression. How did we manipulate them previously? Well, at the birth of the union-find data structure, everybody has rank 0. Ranks never change except possibly when you do a union. When you do a union, you look at the roots of the two objects whose groups you have to union, and you look at their ranks. If one of them has strictly bigger rank, that becomes the root of your new merged tree, and the root with the smaller rank is installed as a child underneath it. If you union two roots that have exactly the same rank, you pick arbitrarily which of them is the new root, and you bump up its rank by 1. That is how we manipulated ranks before, and that is exactly how we're going to manipulate them now. So in particular, when we apply path compression, when we rewire parent pointers following a find, we do not touch anybody's ranks; we leave them all exactly the same. So again, ranks only change in one specific case, in a very specific way: namely, during a union, when we merge two trees that have exactly the same rank. When we do that kind of union, that kind of merge, whichever of the old roots winds up being the new root, we bump up its rank by 1. That is the only modification we ever make to any ranks.

Now, it might bother you that we don't recompute the ranks when we apply path compression, and in a way I sort of hope it does bother you. Because by not recomputing ranks, by just manipulating them exactly as before, we're actually losing the semantics that ranks had in the previous video. The key invariant we had, that a node's rank exactly equals the worst-case search time to that object, is now lost. The easiest way to see what I'm talking about is probably through an example, so let me redraw the same tree we had on the previous slide, both before and after we apply path compression following a find invoked at the object 1. What I've done here is, in the top tree, before the path compression has been applied, I've labeled ranks just as they were in the previous video: for each node in the top tree, its rank is equal to the length of the longest path from a leaf to that node. Then I applied path compression from the object 1, resulting in the bushier, more shallow tree on the bottom, and as I said, I did not touch anybody's rank when I applied this path compression; I didn't do any recomputation. And now you see we've lost the semantics of the ranks.
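The union rule just described can be sketched as follows, under the same assumptions as before (a parent list where roots point to themselves); again a minimal illustration rather than the lecture's official code. The point to notice is that the rank array is only ever touched inside union, never by the path compression inside find:

```python
# A minimal sketch of union by rank (not the lecture's official code).
# parent[i] is the parent of object i; rank[i] matters only at roots.
def find(parent, x):
    r = x
    while parent[r] != r:
        r = parent[r]
    while parent[x] != r:
        parent[x], x = r, parent[x]   # path compression: ranks are NOT touched
    return r

def union(parent, rank, x, y):
    rx, ry = find(parent, x), find(parent, y)
    if rx == ry:
        return                        # already in the same group
    if rank[rx] < rank[ry]:
        rx, ry = ry, rx               # make rx the root with (weakly) larger rank
    parent[ry] = rx                   # smaller-rank root installed underneath
    if rank[rx] == rank[ry]:
        rank[rx] += 1                 # equal ranks: bump the new root's rank by 1

# Four objects, everybody rank 0 at birth.
parent, rank = list(range(4)), [0] * 4
union(parent, rank, 0, 1)   # equal ranks: 0 becomes the root, rank 1
union(parent, rank, 2, 3)   # equal ranks: 2 becomes the root, rank 1
union(parent, rank, 0, 2)   # equal ranks again: 0 becomes the root, rank 2
assert find(parent, 3) == 0 and rank[0] == 2
```

The tie-breaking here (keeping the first root when ranks are equal) is the arbitrary choice the lecture mentions; any choice works.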
Just to give a simple example, there are now leaves that don't have rank 0; they have rank strictly bigger than 0. What we can say is that for every node, the rank is an upper bound: not necessarily equal to, but at least as big as, the length of the longest path from a leaf to that node. But because of the shortcuts we've installed, we no longer have the equality that we had before.

There is good news, however: we can definitely salvage a lot of the technology that we developed in the last video for the union-by-rank analysis, and apply it fruitfully here to the case with path compression. In particular, any statement that we made last video that talks only about the ranks of objects, and not about their parent pointers per se, will be just as true with path compression as without. Why? Well, think about making two copies of a union-find data structure, with exactly the same set of objects, and running them in parallel on exactly the same sequence of union and find operations. By definition, we manipulate the ranks of objects in exactly the same way in the two copies; it doesn't matter whether there's path compression or not. So at every moment in time, every object will have exactly the same rank in one copy as it does in the other. Path compression or no, it doesn't matter: exactly the same rank.
What's going to be different in the two copies is the parent pointers. In some sense, they're going to be further along, they're going to point further up into the tree, in the copy with path compression. But no matter: any statement that's purely about ranks is going to be equally true with or without path compression. So in particular, remember the rank lemma that we proved in the last video, which says there are at most n/2^r objects of rank r. That was true without path compression; it's going to be true with path compression, and we're going to use it in the forthcoming analysis. Another fact which is still true with path compression, in fact in some sense is only more true with path compression, is that whenever you traverse a parent pointer, you get to a node with strictly larger rank. Remember, in the last video we said that as you go up toward the root, you see a strictly increasing sequence of ranks. Here, each pointer with path compression represents a sequence of pointers without path compression, so certainly you will still only be increasing in rank each time you traverse a pointer.

Let's now quantify the benefit of path compression by proving new performance guarantees on the operations in this union-find data structure.
In fact, the running time bounds are going to blow away the logarithmic bound that we had when we were only doing union by rank, without path compression. We're going to do the analysis in two steps, and it's going to be historically accurate, in the sense that I'm first going to show you a theorem by Hopcroft and Ullman, which gives an excellent, but not quite optimal, bound on the performance of this union-find data structure. Then we're going to build on the Hopcroft-Ullman ideas to prove Tarjan's even better bound on the performance of this data structure.

So here is the performance guarantee that Hopcroft and Ullman proved back in 1973, about 40 years ago. They said: consider a union-find data structure implemented as we've been discussing, with lazy unions, union by rank, and path compression. Consider an arbitrary sequence of m find and union operations, and suppose the data structure contains n objects. Then the total work that the data structure does to process this entire sequence of m operations is O(m log* n). What is log* of n, you ask? Well, it's the iterated logarithm operator. So, by analogy, remember when we first demystified the logarithm.
Way back at the beginning of part 1, we noticed that log base 2 of a number is just: you type that number into your calculator and count the number of times you divide by 2 until the result drops to 1 or below. So for log*, you type the number n into your calculator and count the number of times you need to hit the log button before the result drops to 1 or below.

It totally boggles the mind just how slowly this log* function grows. In fact, let me give you a quick quiz, just to make sure that you appreciate the glacial growth of the log* function.

So the correct answer is B: it is 5. And you should probably spend a couple of seconds just reflecting on how absurd this is, right? You take this ridiculous number, 2 to the 65,536. Remember, the number of atoms in the known universe is estimated to be way smaller than this; it's just 10 to the 80. So you take this ridiculous number, you apply the log* function, and you get back 5. Why do you get back 5? Well, take the logarithm once: what do you get back? 65,536, also known as 2 to the 16. So you take the log again, and you get back 16, also known as 2 to the 4. You take the log again, you get back 4, also known as 2 to the 2. You take the log again, you get back 2, and then one more time gets you down to 1.
So you can think of log* as the inverse of the tower function, where the tower function takes a positive integer t and raises 2 to the 2 to the 2, and so on, t times. And this in particular shows that log*, despite its glacial growth, is not bounded by a constant: this function is not O(1). For any constant c you can think of, I can, in principle, exhibit a large enough n so that log* of n is bigger than your c. Namely, I can just set n to be 2 to the 2 to the 2, and so on, c + 1 times; log* of that will be bigger than c.

So that's the Hopcroft-Ullman performance guarantee: that, on average over a sequence of m union and find operations, you spend only this ridiculously small log* n amount per operation. We're going to prove it in the next video.
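The "count how many times you hit the log button" definition above translates directly into code. A minimal sketch, checked against the quiz's value 2 to the 65,536 (which Python handles fine as a big integer):

```python
import math

# A minimal sketch of the iterated logarithm log*(n): count how many
# times you must apply log base 2 before the value drops to 1 or below.
def log_star(n):
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

# The quiz's example: log*(2^65536) = 5, via the chain
# 2^65536 -> 65536 -> 16 -> 4 -> 2 -> 1.
assert log_star(2**65536) == 5
assert log_star(2**16) == 4   # 65536 -> 16 -> 4 -> 2 -> 1
```

Since the intermediate values here are exact powers of two, each application of `math.log2` is exact, so the chain of values matches the hand computation in the lecture step for step.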