I need to begin by setting up some notation and definitions. The first one is that of rank blocks. So if we're given a value of n, the number of objects in the data structure, we define the rank blocks as follows. 0 gets its own block, as does 1. The next block is {2, 3, 4}. The next block starts at 5, and its last entry is 2 raised to the final rank of the previous block — 2 raised to the 4th, also known as 16. The next block starts at 17 and goes up to 2 raised to the last rank of the previous block, so 2 raised to the 16, also known as 65,536. And so on, until we have enough rank blocks to capture n. Actually, if you think about it, because ranks can only be log n, we only have to go up to log n, but that doesn't affect anything by more than a constant, so let's just go ahead all the way up to n. So, how many rank blocks are there? Well, focus just on the largest member of each of the rank blocks. The largest member of a given rank block is 2 raised to the largest member of the previous rank block. So, for the t-th rank block, roughly, the largest entry is 2 to the 2 to the 2, raised t times. How many times do you need to do that before you exceed n? By definition, you need log*(n).
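The rank-block construction just described can be sketched as follows — a minimal illustration (not code from the course), where the first three blocks are written out as in the lecture and each subsequent block runs from k+1 up to 2^k:

```python
import math

def rank_blocks(n):
    """Rank blocks as (lo, hi) pairs: {0}, {1}, {2..4}, {5..16}, {17..65536}, ...
    Each new block ends at 2 raised to the last rank of the previous block;
    we stop once a block's top reaches n."""
    blocks = [(0, 0), (1, 1), (2, 4)]   # first three blocks, as in the lecture
    while blocks[-1][1] < n:
        k = blocks[-1][1]               # last rank of the previous block
        blocks.append((k + 1, 2 ** k))  # next block: {k+1, ..., 2^k}
    return blocks

def log_star(n):
    """Iterated logarithm: how many times log2 must be applied to n
    before the result drops to 1 or below."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count
```

For example, `rank_blocks(100)` yields `[(0, 0), (1, 1), (2, 4), (5, 16), (17, 65536)]` — five blocks, matching the lecture's count of roughly log*(n) plus a constant.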
So that's how many rank blocks there are — that's where log*(n) enters the analysis. Now, I realize this must seem like a very inscrutable definition, these weird rank blocks, so let me tell you how you should think about them. They're meant to encode an intuition that we had on the previous slide, where we said, you know, we should be happy if the rank of an object's parent is a lot bigger than the rank of that object itself. Why? Well, when we traverse that object's parent pointer, we make a ton of progress through the rank space, and you can only make a ton of progress through the rank space a limited number of times. So that means a small number of parent traversals, and that means fast finds. So these rank blocks are just a very clever way of defining sufficient progress — of deciding when we're going to be happy with the gap between the rank of an object and the rank of its parent. Specifically, we're happy whenever these two ranks lie in different rank blocks; we're not happy if they lie in the same rank block. So, for example, if you're at an object that has rank 8 and its parent has rank 18, then we're happy, because 18 is in the rank block after the one that contains 8. On the other hand, if you go from rank 8 to rank 15, then we're not happy, okay?
We're going to call that not a big gap between the rank of the object and the rank of the object's parent. So let's build on that idea to make another definition. Consider a given snapshot in time — that is, we've done some sequence of finds and some sequence of unions. We're going to call some objects at this point in time good and some of them bad. Here are the criteria by which you are good. First of all, if you're the root of your tree, you're going to be good. If you're a direct descendant of the root — that is, if your parent is the root of your tree — then you're also good. If not, though — if you're deeper in the tree — then you're going to be good only if your parent's rank is in a strictly larger rank block than your own rank. So, as in the previous example: if your rank is 8 and your parent's is 18, then you're good. If your rank is 8, your parent's is 15, and your parent's not a root, then you're going to be bad. The role of this definition is to split the work that we do into two parts. First of all, we do some work visiting good nodes. Second of all, we do some work visiting bad nodes. The work we do visiting good nodes will be very easy to bound, no problem. We'll have to do a separate global analysis to bound the total work that we do visiting bad nodes.
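The three criteria for being good can be sketched as a small predicate. This is only an illustration: the `parent` and `rank` maps and the hardcoded `blocks` list are hypothetical names standing in for the state of the data structure, not identifiers from the course.

```python
def block_index(r, blocks):
    """Index of the rank block (lo, hi) containing rank r."""
    for i, (lo, hi) in enumerate(blocks):
        if lo <= r <= hi:
            return i
    raise ValueError("rank out of range")

def is_good(x, parent, rank, blocks):
    """An object is good if it is (1) a root, (2) a direct descendant
    of the root, or (3) its parent's rank lies in a strictly later
    rank block than its own rank."""
    if parent[x] == x:                      # criterion 1: root
        return True
    if parent[parent[x]] == parent[x]:      # criterion 2: child of the root
        return True
    return block_index(rank[parent[x]], blocks) > block_index(rank[x], blocks)
```

On the lecture's example — an object of rank 8 deep in the tree — a parent of rank 18 makes it good (blocks {5..16} vs. {17..65536}), while a parent of rank 15 leaves it bad (both ranks in {5..16}).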
This dichotomy is exactly the same as something we faced when analyzing Kruskal's algorithm, when we implemented it using the union-find data structure with eager unions. There, you'll recall, there were some parts of the work in Kruskal's algorithm that were easy to bound iteration by iteration — for example, every cycle check cost only constant time — but then there was this more complicated type of work, namely all of the leader-pointer updates, that we had to bound globally via a separate argument. The exact same thing is going to happen here. Good nodes we'll be able to bound operation by operation, whereas for the bad nodes we'll have a global analysis controlling the total work done over all of the operations. So, more precisely, I've set up the definitions so that the work done visiting good nodes is bounded by O(log* n) in every single operation. Indeed, how many good nodes could you possibly visit during a find operation, say from some object x? So you start at x, and you traverse these parent pointers going all the way up to the root. Well, there's the root, and there's the direct descendant of the root — that's 2, so let's set those aside. What about the other good nodes that you encounter on your path? Well, by definition, when you visit a good node,
the rank of its parent is in a bigger rank block than the rank of that node itself. That is, every time you traverse a parent pointer from a good node, you progress to a subsequent rank block. Well, there are only log* n rank blocks, so you can only progress to a subsequent rank block log* n times. So the total number of good nodes you're going to see is O(log* n). So now, let's go ahead and express the total amount of work that we do over all of the find and union operations using two quantities: the work on the good nodes, which we now know is just log* n in each operation, plus the visits to the bad nodes, which at the moment we have no idea how big it is. So now, let's proceed to the global bound on the total amount of work that we perform on bad nodes — and this is really the crux of the whole theorem. So, let's just review the definitions real quick. What does it mean that you're a bad node? First of all, you're not the root. Second of all, you're not a direct descendant of the root — that is, you have a grandparent. And third, it is not the case that your parent's rank is in a later rank block; it is in exactly the same rank block as your own rank. That's what it means to be bad. So how much work do we spend on bad nodes? Well, let's analyze it one rank block at a time.
So, fix an arbitrary rank block; let's say, for some integer k, its smallest rank is k+1 and its biggest rank is 2^k. Now, I told you what our two main building blocks were. The first one is the rank lemma — I'm going to ask you to remember that in a second. But first I want to put to use our other building block, which is that path compression increases the rank gap between objects and their parents. That's what we're going to use right now. Specifically, consider a find operation and a bad object x that it visits. By virtue of being bad, x is not the root and is not a direct descendant of the root. So the root is a higher-up ancestor than its parent. Therefore, x's parent will be changed in the subsequent path compression: it will be rewired to point directly to the root, a strict ancestor of its previous parent. Therefore, the rank of its new parent will be strictly bigger than the rank of its previous parent. And that's going to keep happening every time that x is visited while it's bad: it keeps getting new parents, and those new parents keep having ranks strictly bigger than the previous one. Well, how many times can this happen before the rank of x's parent has grown so big that it lies in a subsequent rank block?
The biggest value in x's rank block is 2^k, and remember, x is a non-root, so its rank is frozen forever — it's always stuck in this rank block. Once its parent's rank gets updated, let's say, at least 2^k times, then that rank has to be so big that it lies in the next rank block. At that point, x is no longer bad: its parent pointer makes so much progress that it goes to another rank block, so now we're going to call it good. And, of course, once x becomes good in this fashion, it will be good forevermore. It is not a root, it will never be a root again, its rank is frozen forever, and its parent's rank can only go up. So once you're good — once your parent's rank is sufficiently large — it will be sufficiently large for the rest of time. Alright, so we're almost there. Let's just make sure we don't forget anything that we have done. So, we're bounding the total work in two ways. First of all, we visit log* n good nodes in each operation, so over all m operations that's O(m log* n) for the good nodes. Plus, there's some number of visits over the m operations to the bad nodes. We're going to bound that work globally, but we're going to proceed rank block by rank block. So we fix one rank block, with ranks k+1 up to 2^k.
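The counting argument just made can be sketched in a few lines. This is an illustration of the worst case, assuming (pessimistically) that each bad visit raises x's parent's rank by exactly 1; the function name is hypothetical, not from the course.

```python
def max_bad_visits(r, k):
    """Upper bound on visits to a bad object x whose frozen rank r lies
    in the block {k+1, ..., 2**k}. Each bad visit triggers a path
    compression that strictly increases x's parent's rank; once that
    rank exceeds 2**k it is in a later block, and x is good forever."""
    parent_rank = r + 1           # a parent's rank always exceeds x's rank
    visits = 0
    while parent_rank <= 2 ** k:  # parent's rank still in x's own block
        parent_rank += 1          # compression: strictly larger parent rank
        visits += 1
    return visits
```

For any rank r in the block, the count comes out to at most 2^k - r, which is bounded by 2^k — the per-object bound the analysis uses.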
What we showed on the last slide is that for every object x whose final frozen rank lies somewhere in this rank block, the number of times it gets visited while it's bad — the number of times it can be visited before it becomes good forevermore — is bounded above by 2^k. So, having now used one of our key building blocks, that path compression increases the ranks of parents, let's use the other building block, the rank lemma. The rank lemma, if you recall, says that at any given moment in time, and for any possible rank r, the number of objects that currently have rank r cannot be any bigger than n/2^r. So let's use the rank lemma to upper-bound how many objects could possibly have their final frozen rank lying somewhere in this rank block — that is, somewhere between k+1 and 2^k. So, let's just sum over all of the ranks in the rank block, starting at k+1 and going up to 2^k. By the rank lemma, for a given value of i, there are at most n/2^i objects that eventually wind up with rank i, and by the usual geometric sum, this whole thing can be upper-bounded by n/2^k. So, this is now starting to look like a magical coincidence. Of course, we made a number of definitions in the analysis; specifically, we structured the rank blocks so that this magic would happen.
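The geometric sum in this step can be checked directly — a small sketch (hypothetical helper name) of the bound sum over i from k+1 to 2^k of n/2^i, which never exceeds n/2^k:

```python
def block_population_bound(n, k):
    """Rank-lemma bound on how many objects can end up with final rank
    in the block {k+1, ..., 2**k}: at most n / 2**i objects per rank i,
    summed over the whole block."""
    return sum(n / 2 ** i for i in range(k + 1, 2 ** k + 1))
```

Since the infinite tail sum of n/2^i from i = k+1 onward equals exactly n/2^k, the finite sum is strictly below it — so a rank block with at most n/2^k inhabitants, each visited at most 2^k times while bad, contributes at most n bad visits in total.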
Specifically, the number of inhabitants of a rank block, n/2^k, times the maximum number of visits to an inhabitant of the rank block while it's bad, 2^k, is actually independent of the rank block. We multiply these two things together — the number of visits per object, 2^k, and the number of objects, n/2^k — and what do we get? We get n. Now, this was only counting the visits to bad objects in a given rank block, but there aren't many rank blocks — remember, there are only log* n of them. So that means the total amount of work spent visiting the bad nodes, summed over the rank blocks, is O(n log* n). Combining the bounds for the good nodes and the bad nodes, we get O((m + n) log* n). At the beginning, I mentioned that the interesting case is when m is big Omega of n; in that case, this bound is just O(m log* n). Essentially, if you have a very sparse set of union operations, you can just apply this analysis to each of the directed trees separately. So, that's it — that's the full story, the complete analysis of the brilliant Hopcroft-Ullman bound: log* n on-average operation time under path compression. Brilliant as it is, you can do even better. That's the subject of the next couple of videos.
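For reference, the data structure this whole analysis bounds — union by rank with path compression — can be sketched as follows. This is a minimal generic implementation, not code from the course:

```python
class UnionFind:
    """Union by rank plus path compression. By the analysis above, any
    sequence of m find/union operations on n objects takes
    O((m + n) log* n) total time."""

    def __init__(self, n):
        self.parent = list(range(n))  # each object starts as its own root
        self.rank = [0] * n

    def find(self, x):
        root = x
        while self.parent[root] != root:   # walk up to the root
            root = self.parent[root]
        while self.parent[x] != root:      # path compression: rewire every
            self.parent[x], x = root, self.parent[x]  # visited node to the root
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:  # hang the shallower tree
            rx, ry = ry, rx                # under the deeper one
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1             # ranks only ever change at roots
```

Note the two properties the proof leaned on: a non-root's rank is frozen forever (ranks change only at roots in `union`), and compression in `find` rewires nodes to the root, a strict ancestor of their old parent, so parent ranks only go up.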