In this video, we're going to prove our first performance guarantee on the union-find data structure with path compression. This is the bound first established by Hopcroft and Ullman.

Let me briefly remind you about the guarantee. Consider a union-find data structure where you're using lazy unions, union by rank, and the path compression optimization, and consider an arbitrary sequence of, say, m union and find operations. The guarantee is that over the course of this sequence of operations, the total work that you do is at most m, the number of operations, times the glacially growing function log* of n. Remember, log* n is defined as the number of times you have to apply the logarithm function to n before you get a result that is at most 1. Recall that for 2 raised to the 65,536, a totally absurd number, log* of that number is merely 5.

The theorem is true no matter what m is, no matter how many operations you do, whether you do very few or a whole lot of them. I'm going to focus on the interesting case where m is at least asymptotically as big as n, that is, where m is Ω(n). I'll let you think through in the privacy of your own home why the arguments that we're about to see imply the theorem no matter what m is.

One final comment about this guarantee before we turn to its proof. Notice what the theorem is not saying. The theorem is not saying that every single one of these find and union operations runs in O(log* n) time, and that's because that stronger statement is, in general, false. In general there will be operations that take more than log* n time. Now, on the one hand, we know no operation is going to be slower than logarithmic time: even when we didn't have path compression, no operation was worse than log n time, and we're not going to do any worse with path compression. So the worst-case time bound per operation is log n, and some operations may indeed run that slowly. But over a sequence of m operations, the average amount of work we do on a per-operation basis is only log* of n. This is exactly the sort of so-called amortized analysis that we did in our very first union-find implementation with eager unions. There we agreed that particular unions might take linear time, but over a sequence of unions we spent only logarithmic time per union. The same thing is happening here, except that we're getting a radically better average bound of log* n per operation over the sequence.
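To keep the object of study concrete, here is a minimal Python sketch of the data structure being analyzed, lazy unions with union by rank and path compression, along with a small helper that computes log* n exactly as defined above. This is an illustrative sketch with my own naming, not the course's reference implementation.

```python
import math

def log_star(n):
    """Number of times log2 must be applied to n before the result is at most 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

class UnionFind:
    """Lazy unions with union by rank and path compression (illustrative sketch)."""

    def __init__(self, n):
        self.parent = list(range(n))   # parent[x] == x exactly when x is a root
        self.rank = [0] * n            # an object's rank changes only while it is a root

    def find(self, x):
        # First pass: walk parent pointers up to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): rewire every object on the path to the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        # Union by rank: the root of smaller rank points to the root of larger rank.
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1

print(log_star(2 ** 65536))   # 5, as in the example above
```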
So before I launch into the proof details, let's just spend one slide talking a little bit about the proof plan. In particular, what is the intuition behind the performance gain that we're trying to make precise, that we're trying to encode in a mathematical way in the proof that's about to follow?

Well, if we're hoping to prove a bound better than log n, which is what we were stuck with without path compression, it had better be the case that installing all of these shortcuts really does qualitatively speed up finds and unions. And on some level it's sort of clear that things have to speed up: we're replacing an old chain of pointers with a single pointer, so you can only go faster. But how do we keep track of this progress? How do we convert this intuition into a rigorous guarantee?

So here's the idea. Let's focus on an object x which is no longer a root; at this point its parent is somebody other than itself. One thing I want you to remember from our union by rank analysis is that once an object is not a root, its rank is frozen forevermore. Now, we proved that in the context where there was no path compression, but remember that we manipulate ranks in exactly the same way when there is path compression, so it remains true: if you're an object and you're no longer a root, there is no way your rank will ever change again.

Now, what we're certainly hoping is that finds that originate at, say, this object x run fast. Not only that, they should be getting faster and faster as time goes on, because we're doing more and more path compression.

So here's the big idea. The way we're going to reason about the worst-case running time of the find operation, or equivalently, the longest sequence of parent pointers we might have to traverse to get to a root, is by thinking about the sequence of ranks that we observe as we traverse these parent pointers from an object x up to the root.
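As a concrete way to picture the quantity the proof tracks, here is a small helper, my own illustration rather than anything from the lecture, that records the ranks observed while walking parent pointers from an object x up to its root. The parent and rank arrays are plain lists, as in the sketch above.

```python
def ranks_along_path(parent, rank, x):
    """Ranks seen while following parent pointers from x up to its root.

    With or without path compression, this sequence is strictly increasing.
    """
    seq = [rank[x]]
    while parent[x] != x:
        x = parent[x]
        seq.append(rank[x])
    return seq

# Example: a chain 0 -> 1 -> 2 -> 3 with ranks 0, 1, 2, 3.
parent = [1, 2, 3, 3]
rank = [0, 1, 2, 3]
print(ranks_along_path(parent, rank, 0))   # [0, 1, 2, 3]
```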
Let me give you an example. What would be the worst-case situation? Well, the worst case would be a data structure where, say, the maximum rank is something like 100, and the longest sequence of ranks we could ever see is the following: we start a find operation at an object with rank 0, we traverse a parent pointer and reach a parent with rank 1, we traverse its parent pointer to a node with rank 2, then 3, then 4, and so on, all the way up to 100. Now remember, ranks have to strictly increase whenever we traverse a parent pointer; as we discussed, that is true with or without path compression. So in this worst-case situation, to get from rank 0 to rank 100, you have to traverse 100 pointers.

That would be a bummer. Wouldn't it be nice if, every time we traversed a parent pointer, the rank increased not just by one but by a much bigger number? For example, if we went from 0 to 10, to 20, to 30, to 40, and so on, that would guarantee that we reach the maximum-rank node, with rank 100, in only ten steps. So the bottom line is: a larger lower bound on the gap in rank between an object and its parent implies more rapid progress through the possible ranks of the nodes that we see, and that translates into faster finds, into fewer parent pointer traversals.

So with that idea in mind, that big gaps between ranks imply rapid progress, I want to propose a progress measure for a given non-root object x: the gap between x's rank, which remember is frozen forevermore, and the rank of its current parent.

This progress measure is a good one for two reasons. First, as we just discussed, if you have a handle on this gap, if you can lower bound it, then that gives you an upper bound on the search time. Second, this gap allows us to quantify the benefit of path compression, of installing these shortcuts. Specifically, whenever you install a new shortcut, you rewire an object's parent pointer to point higher up in the tree, and its new parent is going to have rank strictly bigger than its old one. That means this gap can only get bigger. Summarizing: path compression improves this progress measure.
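To put a number on the first of those two reasons, here is the arithmetic from the rank-100 example above written out: if every hop up the tree increases the rank by at least some amount gap, then a find starting from a rank-0 object traverses at most about max_rank / gap parent pointers. The value 100 is just the lecture's illustrative maximum rank.

```python
max_rank = 100                # illustrative maximum rank from the example above
for gap in (1, 10):
    # If each parent-pointer hop raises the rank by at least `gap`,
    # a find starting at a rank-0 object needs at most this many hops.
    print(f"gap >= {gap:2d}: at most {max_rank // gap} hops")
# gap >=  1: at most 100 hops
# gap >= 10: at most 10 hops
```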
To spell out why path compression improves the measure: if an object x previously had a parent p and then has its parent pointer rewired to a different node p prime, then the rank of p prime is strictly bigger than the rank of p.

Just to make sure this is crystal clear, let's draw a couple of cartoon examples. First, in the abstract, think about an object x. Suppose it has some parent p, and suppose the root of the tree is some p prime, an ancestor strictly further up in the tree. Remember, ranks always increase as you go up the tree, as you follow parent pointers. That means p's rank is strictly bigger than x's, and, this is the important part, p prime's rank is strictly bigger than p's. So when you rewire x's parent pointer to point no longer to p but rather to p prime, x acquires a new parent, and its new parent is an ancestor of its old one; ergo it must have strictly bigger rank. For that reason, the gap between x's rank, which is fixed forevermore, and the rank of its new parent is bigger than the gap between its rank and the rank of its old parent.

You can also see this effect in action with the running seven-object example we were using in the last video. I've shown that example tree in pink, both before and after path compression. Before path compression, I defined the ranks just as with union by rank, so the rank of each node equals the length of the longest path from a leaf to that node, and of course we don't change the ranks when we apply path compression. And what do we observe? Exactly two of the objects had their parent pointers rewired, namely objects 1 and 4, which now point directly to 7. As a consequence, the gap in rank between object 1 and its parent has leaped from merely one, the difference between its rank and the rank of 4, up to three, the difference between its rank and that of its new parent 7. Similarly, object 4's rank gap has jumped from one to two: its rank is only one less than that of its old parent 6, but it is two less than that of its new parent 7.
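Here is a small self-contained sketch that reproduces those gap jumps. It sets up only the path 1 -> 4 -> 6 -> 7 with the ranks 0, 1, 2, 3 that the lecture's numbers imply; where the remaining objects of the seven-object example sit does not affect this path, so their placement is omitted as an assumption of convenience.

```python
# Objects are named 1..7 (index 0 is unused padding).
parent = list(range(8))
rank = [0] * 8
parent[1], parent[4], parent[6] = 4, 6, 7     # the path 1 -> 4 -> 6 -> 7
rank[4], rank[6], rank[7] = 1, 2, 3           # ranks 0, 1, 2, 3 along the path

def find(x):
    """Find with path compression, as in the sketch earlier in this video."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

def gap(x):
    """The progress measure: rank of x's current parent minus x's (frozen) rank."""
    return rank[parent[x]] - rank[x]

print(gap(1), gap(4))    # before compression: 1 1
find(1)                  # rewires objects 1 and 4 to point directly to 7
print(gap(1), gap(4))    # after compression:  3 2
```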
So, believe it or not, we actually have the two crucial building blocks for the Hopcroft and Ullman analysis. Building block number one is the rank lemma that we discussed a couple of videos ago: with or without path compression, you can't have too many objects with a given rank r; there can be at most n/2^r of them. The second building block is the one we just covered: every time you update a parent pointer under path compression, the gap between the rank of that object and the rank of its new parent has to grow. The rest of the proof is just an optimal exploitation of these two building blocks. Let me now show you the details.
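As a quick, optional sanity check of building block number one, here is a small experiment of my own, not part of the lecture: after a batch of random unions by rank, no rank r is held by more than n/2^r objects. Path compression is omitted because, as noted above, it never changes any ranks.

```python
import random
from collections import Counter

def check_rank_lemma(n=1000, trials=2000, seed=0):
    """Empirically check that at most n / 2^r objects end up with rank r."""
    random.seed(seed)
    parent = list(range(n))
    rank = [0] * n

    def find(x):               # plain find; the rank lemma holds with or
        while parent[x] != x:  # without path compression
            x = parent[x]
        return x

    for _ in range(trials):    # random unions by rank
        a, b = find(random.randrange(n)), find(random.randrange(n))
        if a == b:
            continue
        if rank[a] < rank[b]:
            a, b = b, a
        parent[b] = a          # smaller-rank root points to larger-rank root
        if rank[a] == rank[b]:
            rank[a] += 1

    for r, count in sorted(Counter(rank).items()):
        assert count <= n / 2 ** r
        print(f"rank {r}: {count} objects (bound {n // 2 ** r})")

check_rank_lemma()
```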