So, in this video, we're going to prove Tarjan's inverse Ackermann bound on the performance of the union-find data structure with union by rank and path compression. Frankly, I hope you're kind of excited. This is one of the crown jewels in the entire history of algorithms and data structures.

Let me remind you what the bound says. You've got a union-find data structure, we're doing lazy unions, we're doing union by rank, we're doing path compression. And for an arbitrary sequence of m find and union operations, the total amount of work you do over the sequence of operations is bounded above by m times alpha of n, where n is the number of objects in the data structure and alpha is the inverse Ackermann function that we defined in the previous video.

I want to give a shout-out here to Dexter Kozen's beautiful book, The Design and Analysis of Algorithms. It is his analysis that I will be following closely here.

So, it's true that merely stating Tarjan's bound is non-trivial. We needed an entire video defining the Ackermann function and its inverse just so we could make sense of the statement. That said, we're well positioned to prove Tarjan's bound. In fact, the template of the proof is going to very much mirror what we already did for the log star analysis by Hopcroft and Ullman.

So, in that spirit, let's review the two basic workhorses, the two building blocks, that drove forward the Hopcroft-Ullman analysis. The first building block is the rank lemma, which dates back all the way to the videos before path compression. Remember, the rank lemma gives us control on how many objects can exist with a given rank, and in particular that upper bound is decreasing exponentially with the rank: there are at most n/2^r objects with a given rank r. But, of course, we need a second building block to exploit the fact that we're using the path compression optimization. So, how do we quantify the progress made by path compression? Well, we argue that every time a node has its parent pointer updated in path compression, it acquires a new parent with strictly larger rank. So, you can think of the Hopcroft-Ullman analysis as a clever and essentially optimal exploitation of these two building blocks.
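For reference, here is a minimal sketch, in Python, of the data structure these two building blocks are talking about: lazy unions, union by rank, and path compression. The class and method names are just illustrative, not code from the lecture.

class UnionFind:
    """Bare-bones union-find sketch with union by rank and path compression."""

    def __init__(self, n):
        self.parent = list(range(n))  # every object starts out as its own root
        self.rank = [0] * n           # a rank only changes while the object is a root

    def find(self, x):
        # First walk up to the root, then repoint every node on the path
        # directly at it.  Each repointed node gets a new parent of strictly
        # larger rank than its old parent, which is building block number two.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, x, y):
        # Union by rank: the root of smaller rank is pointed at the root of
        # larger rank; a rank only increases when the two ranks are tied.
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1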
How did we use them? Well, we defined the notion of rank blocks, and each rank block has size 2 raised to the size of the previous rank block. Why were rank blocks useful? Well, they allowed us to measure progress as we followed parent pointers. We defined an object to be good if following its parent pointer catapulted you into a subsequent rank block. Because there's only a log star number of rank blocks, you can only visit log star good nodes on any given find operation. So, that gave us a per-operation bound of log star on visits to good nodes. And then, for visits to bad nodes, we used the second building block to argue that every time you visit a bad node, the rank of its parent increases. And that can only happen so many times before the rank of this object's parent is so much bigger than this object's rank that the node has to be good; you have to make a lot of progress by following its parent pointer.

And the point is, if we want to do better than the Hopcroft-Ullman analysis, we're not going to do better by taking these same two building blocks and exploiting them in a better way; that can't be done. So, what we need to do is have a better version of one of the two building blocks. The key idea in Tarjan's analysis is to have a stronger version of the second building block. We're going to argue that path compression doesn't merely increase the rank of a node's parent by one, but in fact it typically increases the rank of an object's parent by a lot. And what's kind of amazing is that the length of the proof we're about to do is really basically the same as that of the Hopcroft-Ullman analysis, and the steps match up almost perfectly. So, the bound is even better, the argument is even more ingenious, and it's even harder to imagine how one would come up with this idea oneself. But as far as checking it, as far as understanding it, the complexity level is roughly the same as what we've already done in the log star analysis. So, here we go.

So, in this slide, I'm going to give you a definition. And the point of the definition is to measure how much progress we make in terms of ranks when we follow a node's parent pointer. The definition here is going to play exactly the same role that the definition of rank blocks played in the Hopcroft-Ullman analysis. What I'm going to define is a number, a statistic, measuring the difference, the gap, between the rank of an object and the rank of its parent. This definition is only going to make sense for non-root objects x.
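Because everything that follows leans on the Ackermann hierarchy from the previous video, here is a tiny reference sketch of it, assuming the recursive definition given there: A_0 is the successor function, and for k at least 1, A_k(r) applies A_{k-1} to r, r times. The function name is mine, and the values explode so fast that this is only usable for sanity checks on tiny inputs.

def ackermann(k, r):
    """A_k(r): A_0(r) = r + 1, and A_k(r) = A_{k-1} applied r times starting from r."""
    if k == 0:
        return r + 1
    value = r
    for _ in range(r):
        value = ackermann(k - 1, value)
    return value

# Sanity checks matching the values quoted in the lecture:
assert ackermann(1, 5) == 10          # A_1 is the doubling function
assert ackermann(2, 3) == 3 * 2 ** 3  # A_2(r) = r * 2^r
assert ackermann(3, 2) == 2048        # A_3(2) = A_2(A_2(2)) = A_2(8)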
One thing I want you to remember from previous videos is that once an object is not a root, it will never again be a root, and its rank will never again change. So, non-root objects have their rank fixed forevermore.

We're going to use the notation delta of x for the statistic. And the definition of delta of x is the largest value of k such that A_k (this is the Ackermann function, remember) applied to the rank of this object x is bounded above by the rank of x's parent. So, the bigger the gap between the rank of x's parent and x's rank, the bigger the value of delta of x.

Let me talk through some simple examples to make sure this makes sense, and also review the Ackermann function a little bit. First of all, let me just point out that delta of x is always non-negative for any non-root object x. Why is this true? Well, remember, and this goes all the way back to our union by rank discussion, the rank of an object's parent is always at least one more than the rank of that object. And the function A_0, recall, we defined as the successor function. So, for every object that's not a root, it's always true that its parent has rank at least one more than its own, and that means we can always at least take k to be zero and have the inequality be satisfied.

Now, when is it going to be true that an object x has a delta value of at least one? Well, that is equivalent to saying we can take k to be 1 in this inequality and have the inequality hold. So, let's remember what the function A_1 is. In the previous video, we discovered that it was just the doubling function. So, we can take k=1 and have this inequality satisfied if and only if the rank of the node's parent is at least double the rank of that object. Similarly, an object x has a delta value of at least 2 if and only if we can take k=2 on the right-hand side of the definition and have the inequality be satisfied. So, let's recall what the function A_2 is: A_2 of r is just r*2^r. So, an object x has delta value at least 2 if and only if its parent's rank is substantially bigger than its own rank, in the sense that if it has rank r, its parent's rank has to be at least r*2^r. And in general, the bigger the value of delta, the bigger the gap between the rank of the object x and the rank of its parent.
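To make the definition concrete, here is the statistic as a small illustrative calculation; the names delta and ackermann_capped are mine, not the lecture's. The capped helper exists only so that we never build the astronomically large Ackermann values, and the guard on the rank reflects the fact that for ranks 0 and 1 the maximum in the definition need not exist, which is one reason those ranks will get a free pass shortly.

def ackermann_capped(k, r, cap):
    """min(A_k(r), cap): the same recursion as A_k, but values stop growing at cap."""
    if r >= cap:
        return cap
    if k == 0:
        return min(r + 1, cap)
    value = r
    for _ in range(r):
        value = ackermann_capped(k - 1, value, cap)
        if value >= cap:
            return cap
    return value

def delta(rank_x, rank_parent):
    """Largest k with A_k(rank_x) <= rank_parent, for a non-root x of rank at least 2."""
    assert rank_x >= 2 and rank_parent > rank_x
    k = 0
    while ackermann_capped(k + 1, rank_x, rank_parent + 1) <= rank_parent:
        k += 1
    return k

# Examples matching the discussion above, for an object of rank 3:
assert delta(3, 5) == 0    # parent's rank is less than double
assert delta(3, 6) == 1    # at least double, but less than 3 * 2^3 = 24
assert delta(3, 24) == 2   # at least 3 * 2^3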
And, of course, because the Ackermann function grows so ridiculously quickly, the gap between this object's rank and its parent's rank grows massively as you take delta to be larger and larger.

So, now that we understand this better, one thing that should be clear is that the delta value of an object x can only go up over time, right? Remember, we're talking only about non-root objects x, so x's rank itself is frozen forevermore. The only thing that's changing is that it's potentially acquiring new parents through path compression, and each new parent's rank is bigger than the previous one's. So, the gap in the ranks is only going up, and therefore the delta value of this object x can only go up over time.

One final comment. You recall that in the Hopcroft-Ullman analysis, the way we defined rank blocks ensured that there weren't too many of them; there were only log star many. Similarly here, the way we're defining the statistic delta guarantees that there are not too many distinct delta values that objects can take on. We've already argued that delta is an integer that's at least zero, and we can also bound it from above, at least for any object x that has rank at least 2, which will be good enough for our purposes. If an object x has rank at least 2, its value of delta cannot be any larger than the inverse Ackermann function of n, than alpha of n. And this is really just by the way we defined the inverse Ackermann function alpha: we defined it as the smallest value of k such that applying A_k to 2 catapults you beyond n. And since ranks of objects are bounded above by n (actually, they're even bounded above by log n, but in particular by n), that means you will never see a delta value bigger than alpha of n for any object x that has rank at least two.
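Tying the sketches together, here is the inverse Ackermann function in the same illustrative style, reusing ackermann() from the earlier sketch; the name alpha and the early return are mine. The early return is there because A_4(2) = A_3(2048) is vastly larger than any input a real machine could hold, so every realistic n beyond 2048 has alpha equal to 4.

def alpha(n):
    """Smallest k such that A_k(2) >= n; reuses ackermann() from the sketch above."""
    if n > 2048:
        # A_3(2) = 2048, and A_4(2) = A_3(2048) dwarfs anything storable,
        # so every n you could actually hand this function lands at 4.
        return 4
    k = 0
    while ackermann(k, 2) < n:
        k += 1
    return k

# A few values: alpha(3) == 0, alpha(4) == 1, alpha(8) == 2, alpha(2048) == 3.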
So, that completes the definition of the statistic delta of x. And, once again, the point of this statistic is to quantify progress through rank space as we traverse a parent pointer from an object. In the Hopcroft-Ullman analysis, we used rank blocks for that purpose; here, we're using the statistic delta of x. And now, we have to redefine good and bad objects to reflect this different measurement of progress, of gaps between ranks of objects and ranks of their parents. So, that's what I'm going to provide for you on this slide, the definition of good and bad nodes.

This definition will play exactly the same role as it did in the Hopcroft-Ullman analysis. For the good nodes, we'll be able to just get a per-operation bound on how many visits we make to them. And then we'll be stuck with, and this will be the main part of the proof, a global analysis arguing that the total number of visits to bad nodes over a sequence of operations cannot be too big.

So, the bad objects will be those that meet the following set of four criteria. If an object fails to meet any of these four criteria, it will be called good. The first two criteria will be familiar from the Hopcroft-Ullman analysis. First of all, roots get a pass, and direct descendants of roots also get a pass. So, to be bad, it must be the case that you're neither a root nor the child of a root. What's the motivation for these two criteria? Well, it's exactly the same as it was in the Hopcroft-Ullman analysis: it's just to ensure that bad nodes, after they're visited on a find, get their parent pointer updated. Remember, when you do path compression, after you've done a find from an object x all the way up to its corresponding root, everybody's parent pointer gets updated except for the root and except for the child of the root; those two nodes keep the same parent pointer. So, to make sure we have progress, we want to exclude the root and the direct descendants of the root from being bad; we'll account for them separately.

The third criterion is not conceptually important; it's just for technical convenience, sort of an artifact of the way I've defined the inverse Ackermann function. We're going to also give nodes that have rank zero or one a free pass. So, to be bad, we're only going to worry about the case where the rank is at least 2.

The final criterion is really the ingenious one, and it's the one that ensures that when you do path compression, typically an object's parent's rank will increase extremely rapidly, not just by 1. And this criterion says that for an object x to be bad, it has to be the case that it has some ancestor y, that is, an object you get to by traversing parent pointers from x, that has exactly the same value of delta. So, if an object x has delta of x equal to 2, for it to be bad, there has to be some ancestor object y that also has a delta value equal to 2. And again, if x fails to meet any of these four conditions, then x gets a pass; x is called a good object.
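Just to collect the four criteria in one place, here is the test written out as an illustrative predicate, again with made-up names, reusing delta from the earlier sketch and the parent and rank arrays from the union-find sketch. Nothing in the proof ever needs to run this; it simply restates the definition.

def is_bad(x, parent, rank):
    """x is bad exactly when it meets all four criteria from the slide."""
    if parent[x] == x:                  # criterion 1: x is a root
        return False
    if parent[parent[x]] == parent[x]:  # criterion 2: x is a child of a root
        return False
    if rank[x] < 2:                     # criterion 3: ranks 0 and 1 get a pass
        return False
    # Criterion 4: some proper ancestor y of x (a non-root, so delta(y) is
    # defined) has exactly the same delta value as x.
    dx = delta(rank[x], rank[parent[x]])
    y = parent[x]
    while parent[y] != y:               # walk the ancestors, stopping at the root
        if delta(rank[y], rank[parent[y]]) == dx:
            return True
        y = parent[y]
    return False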
Now, I said that the role of this definition is going to be to split the work done into two groups, and that we're going to be able to easily bound, even on a per-operation basis, the number of visits to good nodes. So, the next quiz asks you to think about that carefully.

Alright. So, the correct answer is B: it's theta of the inverse Ackermann function of n. And this is really one of the points of how we define good and bad nodes: we guarantee that there cannot be too many good nodes whenever we do a find operation. To see why this is true, let's just count up the number of nodes along a path that can fail to meet each of the four criteria. Well, the first criterion was that we give roots a free pass, so there's at most one node that's good because it's a root. Similarly, on a given path there's at most one object that's a direct descendant of the root. Because ranks always strictly go up when you follow parent pointers, there can be at most one object of rank zero and at most one object of rank one, so at most two objects get a free pass under the third criterion.

Now let's think about the interesting case: nodes that could be good because they fail to meet the fourth criterion. How many can there be? Well, let's focus on a particular value of delta. So, let's think about all of the objects along a path that have delta of x equal to 3. There may well be lots of objects on this path with delta of x equal to 3; maybe there's like 17 of them. Call the highest such object, the one closest to the root, z. Of these 17 objects with delta of x equal to 3, z is going to be the only one that is good. The other 16 objects with delta of x equal to 3 are going to be bad. Why? Well, any object other than z has an ancestor, namely z, with exactly the same delta value. So those 16 objects do meet the fourth criterion, and they will be classified as bad. Summarizing, for each of the roughly alpha of n possible values of delta, only the uppermost object, the object closest to the root with that particular delta value, can possibly be classified as good. So, that bounds by alpha of n the number of objects that fail to meet this fourth criterion.
That gives us the overall bound on the number of good nodes on any path.
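Worked out explicitly, the tally per find operation is: at most one good node because it's the root, at most one because it's a child of the root, at most two because they have rank 0 or 1 (ranks strictly increase along the path), and at most one per possible delta value, of which there are at most alpha of n plus 1 (the integers 0 through alpha of n). That's at most alpha of n plus 5 good nodes visited on any single find, which matches the theta of alpha of n answer to the quiz.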