In this video, we're going to prove our first performance guarantee on the union-find data structure with path compression, namely the bound first established by Hopcroft and Ullman. Let me briefly remind you about the guarantee. Consider a union-find data structure where you're using lazy unions, you're doing union by rank, and you're using the path compression optimization. Consider an arbitrary sequence of, say, m union and find operations. The guarantee is that over the course of this sequence of operations, the total work that you do is at most m, the number of operations, times the glacially growing function log* of n. Remember, log* of n is defined as the number of times you have to apply the logarithm function to n before you get a result that is at most one. Remember, 2 raised to the 65536 is a totally absurd number, and log* of that number is merely five.
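To make that definition concrete, here is a minimal sketch of the iterated logarithm in Python (the function name log_star is my own illustrative choice, not a standard library function):

```python
import math

def log_star(n):
    """Iterated logarithm: how many times log2 must be applied
    to n before the result drops to at most 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

print(log_star(2 ** 16))     # 4: 65536 -> 16 -> 4 -> 2 -> 1
print(log_star(2 ** 65536))  # 5: just one more application than for 65536
```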
So the theorem is true no matter what m is, no matter how many operations you do, whether you do very few or a whole lot of them. I'm going to focus on the interesting case where m is at least asymptotically as big as n, that is, m is Omega of n. I'll let you think through, in the privacy of your own home, why the arguments we're about to see imply the theorem no matter what m is.

So, one final comment about this guarantee before we turn to its proof. Notice what the theorem is not saying. The theorem is not saying that every single one of these find and union operations runs in big-O of log*(n) time. And that's because that statement, which would be stronger, is in general false: there are going to be operations, in general, that take more than log*(n) time. Now, on the one hand, we know no operation is going to be slower than logarithmic. Even when we didn't have path compression, no operation was worse than O(log n) time, and we're not going to do any worse with path compression. So the worst-case time bound is O(log n), and some operations may indeed run that slowly. But over a sequence of m operations, the average amount of work we do on a per-operation basis is only log*(n). This is exactly the sort of so-called "amortized analysis" that we did in our very first union-find implementation, with eager unions. There, we agreed that particular unions might take linear time, but over a sequence of unions, we spend only logarithmic time per union. The same thing is true here, with the exception that we're getting a radically better on-average running time bound, of log*(n), over a sequence.

So, before I launch into the proof details, let's just spend one slide talking a little bit about the proof plan, and in particular, about the intuition behind the performance that we're trying to make precise, that we're trying to encode in a mathematical way, in the proof that's about to follow. Well, if we're hoping to prove a bound better than the log n we were stuck with without path compression, it had better be the case that installing all of these shortcuts really, qualitatively speeds up finds. And on some level, it's sort of clear that things have to be sped up: we're replacing a whole chain of pointers with a single pointer, so you can only go faster.
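To fix notation for what follows, here is a minimal sketch, in Python, of the data structure we're analyzing: lazy unions with union by rank and one common two-pass implementation of path compression. The class and method names are my own, not anything fixed by the lecture:

```python
class UnionFind:
    """Lazy unions with union by rank and path compression (a sketch)."""

    def __init__(self, n):
        self.parent = list(range(n))  # each object starts as its own root
        self.rank = [0] * n           # a rank can change only while its object is a root

    def find(self, x):
        # First pass: follow parent pointers up to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): rewire every object on the path
        # to point directly at the root, an ancestor of its old parent.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        # Union by rank: install the shallower root under the deeper one.
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1  # only a root's rank ever changes
```

The second pass of find is where the shortcuts get installed: every object visited on the way to the root is rewired to point directly at that root.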
But how do we keep track of this progress? How do we compile that intuition into a rigorous guarantee? So here's the idea. Let's focus on an object x which is no longer a root; at this point, its parent is somebody other than itself. And one thing I want you to remember from our union-by-rank analysis is that once an object is not a root, its rank is frozen forever. Now, we proved that in the context where there was no path compression, but again, we manipulate ranks in exactly the same way when there is path compression: if you're an object and you're no longer a root, there is no way your rank will ever change again. Now, what we're certainly hoping is true is that finds that originate at, say, this object x are running fast. Not only that, but they should even be getting faster and faster as time goes on, because we're doing more and more path compression.

So here's the big idea. The way we're going to reason about the worst-case running time of the find operation, or equivalently, the longest sequence of parent pointers we might have to traverse to get to a root, is to think about the sequence of ranks that we observe as we traverse these parent pointers from an object x up to the root. Let me give you an example. What would be the worst-case situation? Well, the worst case would be if we have a data structure where, let's say, the maximum rank is something like 100. The worst-case sequence of ranks, the longest sequence we would ever see, would be if we start a find operation at an object with rank 0. We traverse its parent pointer and get to its parent, which has rank 1. We traverse that node's parent pointer to a node of rank 2, then 3, then 4, and so on, all the way up to 100. Now remember, ranks have to strictly increase whenever we traverse a parent pointer; as we discussed, that's true with or without path compression. So in the worst-case situation, to go from rank 0 to rank 100, you have to traverse 100 pointers. So that would be a bummer. Wouldn't it be nice if, every time we traversed a parent pointer, the rank increased not just by 1, but by a much bigger number? So for example, if we went from 0 to 10 to 20 to 30 to 40, and so on, that would guarantee that we'd get to the maximum-rank node, 100, in only 10 steps. So again, the bottom line is: a better, larger lower bound on the gap in ranks between an object and its parent implies more rapid progress through the possible ranks of the nodes we see, and that translates to faster finds, to fewer parent pointer traversals.

So with that idea in mind, that big gaps between ranks imply rapid progress, I want to propose, as a progress measure for a given non-root object x, the gap between x's rank, which again, remember, is frozen forevermore, and the rank of its current parent. This progress measure is a good one for two reasons. First, as we just discussed, if you have a handle on this gap, if you can lower bound it, then that gives you an upper bound on the search time. Second, this gap allows us to quantify the benefit of path compression, of installing these shortcuts. Specifically, whenever you install a new shortcut, you rewire an object's parent pointer to point higher up in the tree, and its new parent is going to have rank strictly bigger than its old one. That means this gap is only going to get bigger. Summarizing: path compression improves this progress measure. That is, if an object x previously had a parent p and then has its parent pointer rewired to a different node p', the rank of p' is bigger than the rank of p.
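Here's a quick sanity check of that gap-growth claim, using the UnionFind sketch from above. The particular tree built here is just a toy example of my own, not the one from the lecture:

```python
uf = UnionFind(8)
for i in range(0, 8, 2):
    uf.union(i, i + 1)   # four rank-1 trees: {0,1}, {2,3}, {4,5}, {6,7}
uf.union(0, 2)           # a rank-2 tree
uf.union(4, 6)           # another rank-2 tree
uf.union(0, 4)           # one rank-3 tree, rooted at object 0

def gap(x):
    # Rank gap between a non-root object and its current parent.
    return uf.rank[uf.parent[x]] - uf.rank[x]

print(gap(5))   # 2: object 5 (rank 0) points at object 4 (rank 2)
uf.find(5)      # path compression rewires 5 to the root, object 0
print(gap(5))   # 3: the gap to the new parent (rank 3) is strictly bigger
```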
And just to make sure this is crystal clear, let's draw a couple of cartoon examples. So first, just in the abstract, think about an object x. Suppose it has some parent p, and suppose the root of the tree is some p', some ancestor strictly further up in the tree. Remember, ranks always increase as you go up the tree, as you follow the parent pointers. So that means p's rank is strictly bigger than x's, and p's rank, and this is the important thing, is strictly smaller than p-prime's. So when you rewire x's parent pointer to point no longer to p but rather to p', x acquires a new parent, and its new parent is an ancestor of its old one; ergo, it must have strictly bigger rank. For that reason, the gap between x's rank, which is fixed forevermore, and the rank of its new parent is bigger than the gap between its rank and the rank of its old parent.

You can also see this effect in action with the running seven-object example we were using in the last video. So I've shown that example, that tree, in pink, both before and after path compression. And again, before path compression, I just defined the ranks as with union by rank, so that the rank of each node is equal to the length of the longest path from a leaf to that node. And of course, we don't change the ranks when we apply path compression. And what do we observe? Exactly two of the objects had their parent pointers rewired, namely, objects 1 and 4, which were rewired to point directly to object 7. As a consequence, the gap in rank between object 1 and its parent has leaped from merely 1, the difference between its rank and the rank of object 4, up to 3, the difference between its rank and that of its new parent, object 7. Similarly for object 4: its rank gap has jumped from 1 to 2. Its rank is only 1 less than that of its old parent, object 6, but it's 2 less than that of its new parent, object 7.

So, believe it or not, we now actually have the two crucial building blocks for the Hopcroft-Ullman analysis. Building block number 1 is the rank lemma that we discussed a couple of videos ago: the fact that, with or without path compression, you can't have too many objects with a given rank r; there are at most n/2^r of them. The second building block is the one we just covered: every time you update a parent pointer under path compression, the gap between the rank of that object and the rank of its new parent has to grow. The rest of the proof is just an optimal exploitation of these two building blocks.
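As a quick empirical illustration of building block number 1, here is a check of the rank lemma's n/2^r bound on a batch of random unions, again using the UnionFind sketch from earlier; the parameters and seed are arbitrary choices of mine:

```python
import random
from collections import Counter

random.seed(1)                      # arbitrary seed, just for reproducibility
n = 1024
uf = UnionFind(n)
for _ in range(5 * n):              # plenty of random unions (each does finds internally)
    uf.union(random.randrange(n), random.randrange(n))

counts = Counter(uf.rank[x] for x in range(n))
for r in sorted(counts):
    assert counts[r] <= n / 2 ** r  # rank lemma: at most n / 2^r objects of rank r
    print(f"rank {r}: {counts[r]} objects (bound: {n // 2 ** r})")
```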
Let me now show you the details.