In this video, we're going to prove our first performance guarantee on the union-find data structure with path compression; it's the one first established by Hopcroft and Ullman. Let me briefly remind you of the guarantee. Consider a union-find data structure that uses lazy unions, union by rank, and the path compression optimization. Consider an arbitrary sequence of, say, m union and find operations. The guarantee is that over the course of this sequence of operations, the total work you do is at most m, the number of operations, times the glacially growing function log* of n. Remember, log* of n is defined as the number of times you have to apply the logarithm function to n before you get a result which is at most one. Recall that for 2 raised to the 65536, a totally absurd number, log* of that number is merely 5. Now, the theorem is true no matter what m is, no matter how many operations you do, whether you do very few or a whole lot of them. I'm going to focus on the interesting case where m is at least asymptotically as big as n, that is, m is Omega(n). I'll let you think through, in the privacy of your own home, why the arguments we're about to see imply the theorem no matter what m is. One final comment about this guarantee before we turn to its proof. Notice what the theorem is not saying. The theorem is not saying that every single one of these find and union operations runs in O(log* n) time. That's because that statement, which would be stronger, is in general false. There will, in general, be operations that take more than log* n time. On the other hand, we know no operation is going to be slower than logarithmic. Even when we didn't have path compression, no operation took worse than O(log n) time, and we're not going to do any worse with path compression. So the worst-case time per operation is O(log n), and some operations may indeed run that slowly.
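To make the log* function concrete, here is a small sketch (not from the video) that computes it by repeatedly applying the base-2 logarithm until the result drops to at most one:

```python
import math

def log_star(n):
    """Number of times log2 must be applied to n before the result is <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

# log* grows glacially: for 2^65536, the iterates are
# 2^65536 -> 65536 -> 16 -> 4 -> 2 -> 1, so log* is only 5.
```

For example, `log_star(16)` is 3 (16 → 4 → 2 → 1), and even `log_star(2**65536)` is just 5.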
But over a sequence of m operations, the average amount of work we do on a per-operation basis is only O(log* n). This is exactly the sort of so-called "amortized analysis" that we did in our very first union-find implementation, with eager unions. There we agreed that particular unions might take linear time, but over a sequence of unions, we spend only logarithmic time per union on average. The same thing is happening here, except that we're getting a radically better average running time bound, log* n, over the sequence. So, before I launch into the proof details, let's just spend one slide talking about the proof plan, and in particular about the intuition behind the performance that we're trying to make precise, that we're trying to encode in a mathematical way, in the proof that's about to follow. Well, if we're hoping to prove a bound better than the O(log n) we were stuck with without path compression, it had better be the case that installing all these shortcuts really does qualitatively speed up finds. And on some level it's sort of clear that things have to be sped up: we're replacing an old chain of pointers with a single pointer, so you can only go faster. But how do we keep track of this progress? How do we take that intuition and make it into a rigorous guarantee? So here's the idea. Let's focus on an object x which is no longer a root. At this point its parent is somebody other than itself. And one thing I want you to remember from our union-by-rank analysis is that once an object is not a root, its rank is frozen forever. Now, we proved that in the context where there was no path compression, but we manipulate ranks in exactly the same way when there is path compression: if you're an object and you're no longer a root, there is no way your rank will ever change again. Now what we're certainly hoping is true is that finds that originate at, say, this object x are running fast.
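To keep the objects of the analysis concrete, here is a minimal sketch (my own, not code from the video) of union-find with lazy unions, union by rank, and path compression. Note how `rank` is only ever updated at a root, so once an object acquires a parent its rank is frozen, exactly as the analysis requires:

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))  # each object starts as its own root
        self.rank = [0] * n           # all ranks start at 0

    def find(self, x):
        # first pass: walk up to the root
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # second pass (path compression): point every visited
        # object directly at the root
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        # union by rank: install the shallower root under the deeper one
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1  # only a root's rank ever changes
```

For example, after `union(0, 1)`, `union(2, 3)`, `union(0, 2)` on four objects, `find(1)` and `find(3)` return the same root.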
Not only that, but they should even be getting faster and faster as time goes on, because we're doing more and more path compressions. So here's the big idea. The way we're going to reason about the worst-case running time of the find operation, or equivalently, the longest sequence of parent pointers we might have to traverse to get to a root, is to think about the sequence of ranks that we observe as we traverse these parent pointers from an object x up to the root. Let me give you an example. What would be the worst-case situation? Well, suppose we have a data structure where the maximum rank is something like 100. The worst-case sequence of ranks, the longest sequence we would ever see, would be if we start a find operation at an object with rank 0; we traverse its parent pointer and get to its parent, which has rank 1; we traverse its parent pointer to a node of rank 2, then 3, then 4, and so on, all the way up to 100. Now remember, ranks have to strictly increase whenever we traverse a parent pointer. As we discussed, that was true with or without path compression. So, in this worst-case situation, to go from rank 0 to rank 100, you have to traverse 100 pointers. That would be a bummer. Wouldn't it be nice if, every time we traversed a parent pointer, the rank increased not just by 1 but by a much bigger number? For example, if we went from 0 to 10 to 20 to 30 to 40, and so on, that would guarantee that we'd get to the max-rank node, at rank 100, in only 10 steps. So again, the bottom line is: a better, larger lower bound on the gap in rank between an object and its parent implies more rapid progress through the possible ranks of the nodes that we can see, and that translates to faster finds, to fewer parent pointer traversals. So keep that idea in mind: big gaps between ranks imply rapid progress.
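The arithmetic here is simply that the number of parent-pointer traversals equals the number of rank increases along the path, so bigger guaranteed rank jumps mean fewer hops. A tiny sketch of the two scenarios just described:

```python
def hops(rank_sequence):
    # one parent-pointer traversal per consecutive pair of ranks observed
    return len(rank_sequence) - 1

worst_case = list(range(0, 101))      # ranks 0, 1, 2, ..., 100
big_gaps = list(range(0, 101, 10))    # ranks 0, 10, 20, ..., 100

assert hops(worst_case) == 100  # rank +1 per hop: 100 traversals
assert hops(big_gaps) == 10     # rank +10 per hop: only 10 traversals
```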
I want to propose, as a progress measure for a given non-root object x, the gap between x's rank, which again, remember, is frozen forevermore, and the rank of its current parent. This progress measure is a good one for two reasons. First, as we just discussed, if you have a handle on this gap, if you can lower bound it, then that gives you an upper bound on the search time. Second, this gap allows us to quantify the benefit of path compression, of installing these shortcuts. Specifically, whenever you install a new shortcut, you rewire an object's parent pointer to point higher up in the tree; its new parent has rank strictly bigger than its old one. That means this gap only gets bigger. Summarizing: path compression improves this progress measure. That is, if an object x previously had a parent p and then has its parent pointer rewired to a different node p', the rank of p' is bigger than the rank of p. And just to make sure this is crystal clear, let's draw a couple of cartoon examples. So first, just in the abstract, think about an object x. Suppose it has some parent p, and suppose the root of the tree is some p', an ancestor strictly further up in the tree. Remember, ranks always increase as you go up the tree, as you follow the parent pointers. So that means p's rank is strictly bigger than x's, and p's rank, this is the important thing, is strictly smaller than p''s. So, when you rewire x's parent pointer to point no longer to p but rather to p', x acquires a new parent, and its new parent is an ancestor of its old one; ergo, it must have strictly bigger rank. For that reason, the gap between x's rank, which is fixed forevermore, and the rank of its new parent is bigger than the gap between its rank and the rank of its old parent. You can also see this effect in action with the running seven-object example we were using in the last video.
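The cartoon example above can be sketched in a few lines (with hypothetical ranks chosen for illustration, not the video's numbers): rewiring x's parent from p to an ancestor p' of p strictly increases x's rank gap.

```python
rank = {'x': 0, 'p': 3, 'p_prime': 7}  # ranks strictly increase up the tree
parent = {'x': 'p', 'p': 'p_prime'}

gap_before = rank[parent['x']] - rank['x']  # 3 - 0 = 3
parent['x'] = 'p_prime'                     # path compression rewires x
gap_after = rank[parent['x']] - rank['x']   # 7 - 0 = 7

assert gap_after > gap_before  # the progress measure only improves
```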
So I've shown that example, that tree, in pink, both before and after path compression. Before path compression, I defined the ranks just as with union by rank, so that the rank of each node equals the length of the longest path from a leaf to that node. And of course we don't change the ranks when we apply path compression. And what do we observe? Exactly two of the objects had their parent pointers rewired. Namely, objects 1 and 4 had their parent pointers rewired to point directly to 7. As a consequence, the gap in rank between object 1 and its parent has leaped from merely 1, the difference between its rank and the rank of 4, up to 3, the difference between its rank and that of its new parent, 7. Similarly, object 4's rank gap has jumped from 1 to 2: its rank is only 1 less than that of its old parent, 6, but it's 2 less than that of its new parent, 7. So, believe it or not, we now actually have the two crucial building blocks for the Hopcroft-Ullman analysis. Building block number one is the rank lemma that we discussed a couple of videos ago: the fact that, with or without path compression, you can't have too many objects with a given rank r; there are at most n/2^r of them. The second building block is the one we just covered: every time a parent pointer is updated under path compression, the gap between the rank of that object and the rank of its new parent has to grow. The rest of the proof is just an optimal exploitation of those two building blocks. Let me now show you the details.
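As a quick empirical sanity check of the rank lemma, here is a self-contained sketch (a hypothetical implementation, not course code) that runs random unions and then verifies that the number of objects with rank r never exceeds n/2^r:

```python
import random
from collections import Counter

class UF:
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:  # union by rank
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1

n = 256
uf = UF(n)
random.seed(0)
for _ in range(1000):
    uf.union(random.randrange(n), random.randrange(n))

# rank lemma: at most n / 2^r objects have rank r
counts = Counter(uf.rank)
assert all(counts[r] <= n / 2**r for r in counts)
```

This doesn't prove the lemma, of course, but it illustrates exactly the quantity the upcoming proof will exploit.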