1 00:00:00,840 --> 00:00:03,600 So I don't blame you if you're feeling a little bit restless. 2 00:00:03,600 --> 00:00:06,460 You've now watched three videos about this alternative approach 3 00:00:06,460 --> 00:00:10,020 to the union-find data structure based on lazy unions, and 4 00:00:10,020 --> 00:00:12,380 frankly we don't have many tangible things to show for it. 5 00:00:12,380 --> 00:00:15,330 We already had a perfectly fine implementation of this data structure 6 00:00:15,330 --> 00:00:19,190 based on eager unions that gave us constant-time finds, and yeah, 7 00:00:19,190 --> 00:00:20,750 union could be linear in the worst case, 8 00:00:20,750 --> 00:00:23,750 but it was logarithmic on average over a sequence of unions. 9 00:00:23,750 --> 00:00:25,850 So now we have this other implementation. 10 00:00:25,850 --> 00:00:28,910 Both of our operations require logarithmic time. 11 00:00:28,910 --> 00:00:32,700 It's true it's a worst-case bound, and it's true union by rank is pretty cool. 12 00:00:32,700 --> 00:00:36,390 But still, the bottom line up to this point is not that compelling. 13 00:00:36,390 --> 00:00:39,600 But I've still got another trick up my sleeve. 14 00:00:39,600 --> 00:00:44,161 Now just watch what happens once we employ a second optimization known as 15 00:00:44,161 --> 00:00:45,439 path compression. 16 00:00:48,107 --> 00:00:49,990 So the optimization is very natural. 17 00:00:49,990 --> 00:00:54,388 I suspect many a serious programmer would come up with it on their 18 00:00:54,388 --> 00:00:59,191 own if they were tasked with implementing union-find with union by rank. 19 00:00:59,191 --> 00:01:02,542 So to motivate this optimization, let's think about what would be our worst 20 00:01:02,542 --> 00:01:05,681 nightmare as someone maintaining such a data structure and hoping for 21 00:01:05,681 --> 00:01:07,160 good performance.
22 00:01:07,160 --> 00:01:10,660 Well, remember the running time of a find is proportional to the number of parent 23 00:01:10,660 --> 00:01:11,740 pointers you have to traverse. 24 00:01:11,740 --> 00:01:15,750 That is, it's proportional to the depth of the object at which it's invoked. 25 00:01:15,750 --> 00:01:17,350 So, what does the worst-case find look like? 26 00:01:17,350 --> 00:01:18,720 Well, it's going to be at a leaf. 27 00:01:18,720 --> 00:01:19,790 And moreover, it's going to be a leaf 28 00:01:19,790 --> 00:01:22,830 which is furthest away from its corresponding root. 29 00:01:22,830 --> 00:01:25,190 Now, on the one hand we're using union by rank, so 30 00:01:25,190 --> 00:01:27,000 we know that this depth can't be too bad. 31 00:01:27,000 --> 00:01:30,210 It's going to be at most big O of log n. 32 00:01:30,210 --> 00:01:32,145 However, there will, 33 00:01:32,145 --> 00:01:36,585 in general, be leaves that are theta of log n hops away from their root. 34 00:01:36,585 --> 00:01:39,745 And, for all we know, we're just going to get this endless sequence of find 35 00:01:39,745 --> 00:01:42,475 operations, where every single one is invoked on 36 00:01:42,475 --> 00:01:45,262 a leaf that's a log n number of hops away from its root. 37 00:01:45,262 --> 00:01:49,123 So for example, in the pink tree that is shown in the upper right of the slide, 38 00:01:49,123 --> 00:01:51,578 maybe someone keeps invoking find 39 00:01:51,578 --> 00:01:55,155 from the object one, over and over and over again. 40 00:01:55,155 --> 00:01:59,106 And then we're going to be suffering a logarithmic number of steps with every single 41 00:01:59,106 --> 00:01:59,832 operation. 42 00:02:02,040 --> 00:02:02,820 But if you think about it, 43 00:02:02,820 --> 00:02:06,080 it's totally pointless to keep redoing the work of previous finds, 44 00:02:06,080 --> 00:02:09,220 to keep retraversing the same parent pointers over and over again.
45 00:02:09,220 --> 00:02:12,750 So for example, in the pink tree that I've shown in the upper right, 46 00:02:12,750 --> 00:02:15,010 imagine that find is invoked from object one. 47 00:02:15,010 --> 00:02:16,980 So then we traverse the three parent pointers. 48 00:02:16,980 --> 00:02:20,250 We discover the ancestors four and six before terminating at seven. 49 00:02:20,250 --> 00:02:21,950 And we discover that seven is the leader, or 50 00:02:21,950 --> 00:02:24,180 the root, corresponding to the object one. 51 00:02:24,180 --> 00:02:26,905 Now remember, we do not care that four and 52 00:02:26,905 --> 00:02:30,360 six are ancestors of one; that is uninteresting. 53 00:02:30,360 --> 00:02:33,650 We only visited them to discover what we actually cared about, 54 00:02:33,650 --> 00:02:36,960 that seven is the leader, the root, corresponding to one. 55 00:02:36,960 --> 00:02:40,590 Well, let's just cache that information now that we've computed it. 56 00:02:40,590 --> 00:02:44,890 So, this is basically reconstructing the eager union-find implementation. 57 00:02:44,890 --> 00:02:48,700 Let's just rewire one's parent pointer to point straight to seven. 58 00:02:48,700 --> 00:02:52,079 We don't need to rediscover four and six in some later find operation. 59 00:02:55,004 --> 00:02:59,541 And more generally, after we've invoked find from object one, we may as well 60 00:02:59,541 --> 00:03:04,302 rewire four's parent pointer as well, to point directly to its leader, seven. 61 00:03:07,147 --> 00:03:09,730 So that, then, is path compression. 62 00:03:09,730 --> 00:03:14,720 When find is invoked from some node x, and you traverse parent pointers from x 63 00:03:14,720 --> 00:03:19,710 up to its root, call it r, then for every object that you visit along 64 00:03:19,710 --> 00:03:24,930 this path from x to r, you rewire the parent pointer to point directly to r. 65 00:03:24,930 --> 00:03:28,340 So r doesn't have its parent pointer changed; it still points to itself.
66 00:03:28,340 --> 00:03:32,760 The penultimate object on this path doesn't have its parent pointer changed either. 67 00:03:32,760 --> 00:03:35,250 It already was pointing to the root r. But 68 00:03:35,250 --> 00:03:39,910 anything below the root and its immediate descendants on this path from x 69 00:03:39,910 --> 00:03:44,230 will have its parent pointer updated, and it'll be updated to point directly to r. 70 00:03:44,230 --> 00:03:47,430 Now we couldn't get away with this if the tree had to be binary. 71 00:03:47,430 --> 00:03:48,960 But it doesn't have to be binary. 72 00:03:48,960 --> 00:03:52,120 We couldn't get away with this if we had to satisfy some other constraint, like 73 00:03:52,120 --> 00:03:54,340 a search tree property, but we don't. 74 00:03:54,340 --> 00:03:57,230 So nothing's stopping us from just caching this information 75 00:03:57,230 --> 00:03:59,290 about who everyone's leader is, 76 00:03:59,290 --> 00:04:02,100 because that's really the only information we're responsible for 77 00:04:02,100 --> 00:04:04,239 exporting from this union-find data structure. 78 00:04:06,340 --> 00:04:09,280 So pictorially, if you like thinking about the trees: in effect, 79 00:04:09,280 --> 00:04:13,330 what path compression does is make the trees shallower and bushier. 80 00:04:13,330 --> 00:04:15,916 So in our example in the upper right, it rips out one and 81 00:04:15,916 --> 00:04:20,019 four and makes them immediate descendants of the root seven. 82 00:04:21,450 --> 00:04:24,860 If you prefer to think in terms of the array representation, remember that in the array 83 00:04:24,860 --> 00:04:31,430 representation, the ith entry of the array records the parent of the object i. 84 00:04:31,430 --> 00:04:35,460 So, in our pink tree, five, six, and seven all point to seven. 85 00:04:35,460 --> 00:04:36,590 They all have parent seven. 86 00:04:39,920 --> 00:04:41,920 Whereas two and three both have parent five.
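The two-pass rewiring described above fits in a few lines. Here is a sketch in Python (my own illustration, not code from the lecture; it assumes the parent-array representation just discussed, where entry i holds the parent of object i and a root points to itself):

```python
def find(parent, x):
    """Return the root of x's tree, compressing the path along the way."""
    # First pass: follow parent pointers up to the root.
    root = x
    while parent[root] != root:
        root = parent[root]
    # Second pass (path compression): rewire every object visited on the
    # path so its parent pointer points directly at the root.
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

# The pink tree from the slide: 1 -> 4 -> 6 -> 7 (the root),
# and 2, 3 -> 5 -> 7.  Index 0 is unused.
parent = [0, 4, 5, 5, 6, 7, 7, 7]
print(find(parent, 1))        # prints 7
print(parent[1], parent[4])   # prints 7 7: both now point straight to the root
```

After the call, one and four are immediate children of seven, exactly the "shallower, bushier" picture from the slide, while untouched branches (two and three under five) are left alone.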
87 00:04:45,740 --> 00:04:48,930 Four has the parent six, and one has the parent four. 88 00:04:48,930 --> 00:04:53,120 And after applying path compression following a find invoked at the object 89 00:04:53,120 --> 00:04:57,110 one, one and four are redirected to have parent seven. 90 00:04:59,630 --> 00:05:03,180 So what are the pros and cons of the path compression optimization? 91 00:05:03,180 --> 00:05:05,370 Well, the cons side is quite minor. 92 00:05:05,370 --> 00:05:08,380 Really, we're just doing a little multitasking on find, and 93 00:05:08,380 --> 00:05:11,430 that introduces a small constant-factor overhead. 94 00:05:11,430 --> 00:05:14,790 We're already doing work proportional to the number of hops 95 00:05:14,790 --> 00:05:18,040 on the path from x to its root, and we are still just doing constant work for 96 00:05:18,040 --> 00:05:19,850 each node on that path. 97 00:05:19,850 --> 00:05:21,220 We're just doing an extra constant amount of work for 98 00:05:21,220 --> 00:05:27,080 each node after the fact, to rewire the parent pointer to point to the root. 99 00:05:27,080 --> 00:05:31,540 The pro should be obvious: this is going to speed up all subsequent find operations. 100 00:05:31,540 --> 00:05:34,660 You are really making sure you don't do redundant work. 101 00:05:34,660 --> 00:05:38,250 You don't traverse the same parent pointers over and over and over again. 102 00:05:38,250 --> 00:05:41,300 So what's clear is that finds will speed up. 103 00:05:41,300 --> 00:05:43,940 What's really not clear is whether they'll speed up by a lot. 104 00:05:43,940 --> 00:05:47,140 So is this going to improve the performance of the find operation by, say, 105 00:05:47,140 --> 00:05:50,600 a factor of two, or will we get something fundamentally better 106 00:05:50,600 --> 00:05:54,070 than the logarithmic performance we were stuck with without path compression?
107 00:05:57,750 --> 00:06:02,345 So, let me spend a slide on the subtle point of how ranks interact with the path 108 00:06:02,345 --> 00:06:04,122 compression optimization. 109 00:06:08,234 --> 00:06:13,271 So, the plan is, we're going to manipulate rank fields with path 110 00:06:13,271 --> 00:06:19,920 compression in exactly the same way as we did when we didn't have path compression. 111 00:06:21,590 --> 00:06:23,660 So how did we manipulate them previously? 112 00:06:23,660 --> 00:06:27,780 Well, at the birth of the union-find data structure, everybody has rank zero. 113 00:06:27,780 --> 00:06:32,131 Ranks never change except possibly when you do a union. 114 00:06:32,131 --> 00:06:35,444 When you do a union, you look at the roots of the two objects whose groups you 115 00:06:35,444 --> 00:06:36,160 have to union. 116 00:06:36,160 --> 00:06:37,440 You look at their ranks. 117 00:06:37,440 --> 00:06:40,650 If one of them has strictly bigger rank, that becomes 118 00:06:40,650 --> 00:06:42,770 the root of your new merged tree. 119 00:06:42,770 --> 00:06:46,620 The root with the smaller rank is installed as a child underneath. 120 00:06:46,620 --> 00:06:50,340 If you union two roots that have exactly the same rank, 121 00:06:50,340 --> 00:06:55,760 you pick arbitrarily which of them is the new root, and you bump up its rank by one. 122 00:06:55,760 --> 00:06:57,530 That is how we manipulated ranks before, and 123 00:06:57,530 --> 00:06:59,810 that is exactly how we're going to manipulate them now. 124 00:07:01,910 --> 00:07:04,580 So in particular, when we apply path compression, 125 00:07:04,580 --> 00:07:10,560 when we rewire parent pointers following a find, we do not touch anybody's ranks. 126 00:07:10,560 --> 00:07:13,090 We leave them all exactly the same. 127 00:07:13,090 --> 00:07:18,920 So again, ranks only change in one specific case, in a very specific way.
128 00:07:18,920 --> 00:07:24,890 Namely, during a union when we merge two trees that have exactly the same rank. 129 00:07:24,890 --> 00:07:27,630 And when we do that kind of union, that kind of merge, 130 00:07:27,630 --> 00:07:32,060 whichever of the old roots winds up being the new root, we bump up its rank by one. 131 00:07:32,060 --> 00:07:35,580 That is the only modification we ever make to any ranks. 132 00:07:37,770 --> 00:07:40,760 Now, it might bother you that we don't recompute the ranks when we apply 133 00:07:40,760 --> 00:07:41,590 path compression. 134 00:07:41,590 --> 00:07:46,282 And in a way I sort of hope it does bother you, because by not recomputing ranks, by 135 00:07:46,282 --> 00:07:51,189 just manipulating them exactly as before, we're actually losing the semantics 136 00:07:51,189 --> 00:07:55,312 of what ranks meant in the previous video. So the key invariant we had, 137 00:07:55,312 --> 00:08:00,248 that ranks exactly represented worst-case search time to that object, is now lost. 138 00:08:02,571 --> 00:08:06,355 The easiest way to see what I'm talking about is probably through an example, so 139 00:08:06,355 --> 00:08:09,792 let me redraw the same tree we had on the previous slide, both before and 140 00:08:09,792 --> 00:08:11,900 after we apply path compression, 141 00:08:11,900 --> 00:08:15,103 following a find invoked at the object one. 142 00:08:16,499 --> 00:08:20,120 So what I've done here is, in the top tree, before the path compression's been 143 00:08:20,120 --> 00:08:23,460 applied, I've labeled ranks just as they were in the previous video. 144 00:08:23,460 --> 00:08:27,760 So for each node in the top tree, its rank is equal to the length of the longest path 145 00:08:27,760 --> 00:08:29,620 from a leaf to that node. 146 00:08:29,620 --> 00:08:32,680 Then I applied path compression from the object one, 147 00:08:32,680 --> 00:08:36,360 resulting in the bushier, shallower tree on the bottom.
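Since ranks change in only that one case, the union-by-rank bookkeeping can be sketched as follows (again my own illustration under the parent-array representation, with a separate rank array; the helper root_of is a hypothetical name, and it omits path compression deliberately, since compression plays no role in how ranks are manipulated):

```python
def root_of(parent, x):
    # Follow parent pointers up to the root.
    while parent[x] != x:
        x = parent[x]
    return x

def union(parent, rank, x, y):
    """Merge the groups containing x and y, using union by rank."""
    rx, ry = root_of(parent, x), root_of(parent, y)
    if rx == ry:
        return  # already in the same group; nothing to do
    if rank[rx] < rank[ry]:
        rx, ry = ry, rx          # make rx the root with the larger rank
    parent[ry] = rx              # install the smaller-rank root underneath
    if rank[rx] == rank[ry]:
        rank[rx] += 1            # the ONLY case in which any rank ever changes
```

Note that when the two roots have unequal ranks, no rank is touched at all; only a merge of two equal-rank roots bumps the new root's rank by one, exactly as in the lecture.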
148 00:08:36,360 --> 00:08:39,900 And as I said, I did not touch anybody's ranks when I applied 149 00:08:39,900 --> 00:08:40,810 this path compression. 150 00:08:40,810 --> 00:08:42,415 I didn't do any recomputations. 151 00:08:42,415 --> 00:08:45,260 And you'll see now we've lost the semantics for the ranks. 152 00:08:45,260 --> 00:08:48,560 Just to give a simple example, there are now leaves that don't have rank zero; 153 00:08:48,560 --> 00:08:51,030 they have ranks strictly bigger than zero. 154 00:08:51,030 --> 00:08:55,480 So what we can say is that for every node, the rank is an upper bound, 155 00:08:55,480 --> 00:08:58,017 not necessarily exactly the same as, but 156 00:08:58,017 --> 00:09:01,940 at least as big as, the length of the longest path from a leaf to that node. 157 00:09:01,940 --> 00:09:05,990 But because of the shortcuts that we've installed, we no longer have the equality 158 00:09:05,990 --> 00:09:06,780 that we had before. 159 00:09:09,190 --> 00:09:10,360 There is good news, however. 160 00:09:10,360 --> 00:09:12,910 We can definitely salvage a lot of the technology that we 161 00:09:12,910 --> 00:09:15,880 developed in the last video for the union by rank analysis, and 162 00:09:15,880 --> 00:09:18,960 apply it fruitfully here to the case with path compression. 163 00:09:18,960 --> 00:09:21,920 In particular, any statement that we made last video 164 00:09:21,920 --> 00:09:26,980 that talks only about the ranks of objects, and not about their parent pointers per se, 165 00:09:26,980 --> 00:09:30,840 will be as true with path compression as without. 166 00:09:30,840 --> 00:09:31,720 Why? 167 00:09:31,720 --> 00:09:35,320 Well, think about making two copies of a union-find data structure, 168 00:09:35,320 --> 00:09:39,450 with exactly the same set of objects, and then run them in parallel on exactly 169 00:09:39,450 --> 00:09:42,650 the same sequence of union and find operations.
170 00:09:42,650 --> 00:09:44,020 So, by definition, 171 00:09:44,020 --> 00:09:48,860 we manipulate the ranks of objects in exactly the same way in the two copies. 172 00:09:48,860 --> 00:09:51,090 It doesn't matter if there's path compression or not. 173 00:09:51,090 --> 00:09:55,130 So at all moments in time, every object will have exactly the same rank in 174 00:09:55,130 --> 00:09:57,050 one copy as it does in the other copy. 175 00:09:57,050 --> 00:09:59,200 Path compression or no, it doesn't matter. 176 00:09:59,200 --> 00:10:00,980 Exactly the same ranks. 177 00:10:00,980 --> 00:10:04,240 What's going to be different in the two copies are the parent pointers. 178 00:10:04,240 --> 00:10:06,050 They're, in some sense, going to be further along. 179 00:10:06,050 --> 00:10:10,480 They're going to point further up into the tree in the copy with path compression. 180 00:10:10,480 --> 00:10:11,780 But no matter. 181 00:10:11,780 --> 00:10:15,860 Any statement that's purely about ranks is going to be equally true with or 182 00:10:15,860 --> 00:10:17,270 without path compression. 183 00:10:19,050 --> 00:10:22,460 So in particular, remember the rank lemma that we proved in the last video, 184 00:10:22,460 --> 00:10:26,970 which says there are at most n over 2 to the r objects of rank r. 185 00:10:26,970 --> 00:10:28,490 That was true without path compression. 186 00:10:28,490 --> 00:10:30,820 It's going to be true with path compression. 187 00:10:30,820 --> 00:10:33,330 And we're going to use it in the forthcoming analysis. 188 00:10:35,790 --> 00:10:38,260 Another fact which is still true with path compression, 189 00:10:38,260 --> 00:10:42,220 in fact in some sense it's only more true with path compression, is that whenever 190 00:10:42,220 --> 00:10:46,320 you traverse a parent pointer, you will get to a node with strictly larger rank.
191 00:10:46,320 --> 00:10:48,950 Remember, in the last video we said, as you go up toward the root, 192 00:10:48,950 --> 00:10:52,570 you're going to see a strictly increasing sequence of ranks. 193 00:10:52,570 --> 00:10:56,240 Here, each pointer with path compression represents a sequence 194 00:10:56,240 --> 00:10:58,570 of pointers without path compression. 195 00:10:58,570 --> 00:11:01,777 So certainly you will still only increase in rank every time you 196 00:11:01,777 --> 00:11:02,782 traverse a pointer. 197 00:11:06,230 --> 00:11:10,238 Let's now quantify the benefit of path compression by proving new performance 198 00:11:10,238 --> 00:11:13,540 guarantees on the operations in the union-find data structure. 199 00:11:13,540 --> 00:11:17,301 And frankly, the running time bounds are going to blow away the logarithmic bound 200 00:11:17,301 --> 00:11:20,750 that we had when we were only doing union by rank without path compression. 201 00:11:22,150 --> 00:11:24,100 We're going to do the analysis in two steps, and 202 00:11:24,100 --> 00:11:26,940 it's going to be historically accurate in the sense that I'm first going to show you 203 00:11:26,940 --> 00:11:30,410 a theorem by Hopcroft and Ullman, which gives an excellent but 204 00:11:30,410 --> 00:11:34,550 not quite optimal bound on the performance of this union-find data structure. 205 00:11:34,550 --> 00:11:37,230 And then we're going to build on the Hopcroft-Ullman ideas 206 00:11:37,230 --> 00:11:40,430 to prove an even better bound on the performance of this data structure. 207 00:11:42,450 --> 00:11:45,633 So here is the performance guarantee that Hopcroft and 208 00:11:45,633 --> 00:11:48,894 Ullman proved back in 1973, about 40 years ago. 209 00:11:48,894 --> 00:11:49,421 They said: 210 00:11:49,421 --> 00:11:53,390 consider a union-find data structure implemented as we've been discussing, with 211 00:11:53,390 --> 00:11:57,570 lazy unions, union by rank, and path compression.
212 00:11:57,570 --> 00:12:02,560 Consider an arbitrary sequence of m find and union operations, and 213 00:12:02,560 --> 00:12:05,540 suppose the data structure contains n objects. 214 00:12:05,540 --> 00:12:09,710 Then the total work that the data structure does to process this entire 215 00:12:09,710 --> 00:12:15,518 sequence of m operations is big O of m times log star of n. 216 00:12:15,518 --> 00:12:18,770 What is log star of n, you ask? 217 00:12:18,770 --> 00:12:21,290 Well, it's the iterated logarithm operator. 218 00:12:21,290 --> 00:12:25,139 So, by analogy: remember when we first demystified the logarithm, way back at 219 00:12:25,139 --> 00:12:28,693 the beginning of part one, we noticed the log base two of a number is, well, 220 00:12:28,693 --> 00:12:30,966 you just type that number into your calculator, and 221 00:12:30,966 --> 00:12:34,910 you count the number of times you divide by two until the result drops below one. 222 00:12:34,910 --> 00:12:38,401 So, for log star, you type the number n into your calculator, and 223 00:12:38,401 --> 00:12:43,140 you count the number of times you need to hit log before the result drops below one. 224 00:12:44,190 --> 00:12:48,660 So it totally boggles the mind just how slowly this log star function grows. 225 00:12:48,660 --> 00:12:52,836 In fact, let me give you a quick quiz just to make sure that you appreciate 226 00:12:52,836 --> 00:12:55,451 the glacial growth of the log star function. 227 00:13:03,814 --> 00:13:06,280 So the correct answer is B. 228 00:13:06,280 --> 00:13:07,584 It is five. 229 00:13:07,584 --> 00:13:11,667 And you should probably spend a couple seconds just reflecting on how absurd 230 00:13:11,667 --> 00:13:12,190 this is. 231 00:13:12,190 --> 00:13:14,522 Right? So you take this ridiculous number, 2 232 00:13:14,522 --> 00:13:15,700 to the 65,536. 233 00:13:15,700 --> 00:13:19,334 Remember, the estimated number of atoms in the known universe is way smaller 234 00:13:19,334 --> 00:13:20,010 than this.
235 00:13:20,010 --> 00:13:20,880 The atom count is only about 10 to the 80. 236 00:13:20,880 --> 00:13:23,850 So, you take this ridiculous number, and you apply the log star function. 237 00:13:23,850 --> 00:13:25,291 You get back 5. 238 00:13:25,291 --> 00:13:26,351 Why do you get back 5? 239 00:13:26,351 --> 00:13:28,150 Well, take the logarithm once. 240 00:13:28,150 --> 00:13:29,635 What do you get back? 241 00:13:29,635 --> 00:13:33,770 65,536, also known as 2 to the 16. So you take log again. 242 00:13:33,770 --> 00:13:35,888 You get back 16, also known as 2 to the 4. 243 00:13:35,888 --> 00:13:36,750 You take log again. 244 00:13:36,750 --> 00:13:39,420 You get back 4, also known as 2 to the 2. 245 00:13:39,420 --> 00:13:40,210 You take log again. 246 00:13:40,210 --> 00:13:42,542 You get back 2, and then one more time gets you down to 1. 247 00:13:43,959 --> 00:13:46,907 So, you can think of log star as the inverse of the tower function, 248 00:13:46,907 --> 00:13:50,222 where the tower function takes, think of it as a positive integer t, and 249 00:13:50,222 --> 00:13:53,144 raises two to the two to the two to the two, t times. 250 00:13:54,581 --> 00:13:58,677 And this, in particular, shows that log star, despite its glacial growth, 251 00:13:58,677 --> 00:14:00,240 is not bounded by a constant. 252 00:14:00,240 --> 00:14:01,700 This function is not 253 00:14:01,700 --> 00:14:02,830 big O of one. 254 00:14:02,830 --> 00:14:08,270 For any constant c you can think of, I can in principle exhibit a large enough n so 255 00:14:08,270 --> 00:14:13,530 that log star of my n is bigger than your c. 256 00:14:13,530 --> 00:14:17,920 Namely, I can just set n to be two to the two to the two, and so on, c times. 257 00:14:17,920 --> 00:14:19,650 Log star of that will be at least c.
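The "keep hitting the log button" description translates directly into code. Here is a sketch (my own illustration; it uses Python's math.log2 in the role of the calculator's log button, and relies on the fact that math.log2 accepts arbitrarily large integers):

```python
import math

def log_star(n):
    """Iterated logarithm: how many times must log base 2 be applied
    before the result drops to one or below?"""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

# The quiz value: 2^65536 -> 65536 -> 16 -> 4 -> 2 -> 1, five applications.
print(log_star(2 ** 65536))  # prints 5
```

Five hits of the log button on a number dwarfing the count of atoms in the universe: that is the glacial growth being described.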
258 00:14:20,880 --> 00:14:24,220 So that's the Hopcroft-Ullman performance guarantee: that 259 00:14:24,220 --> 00:14:27,980 on average over a sequence of m union-find operations, you spend 260 00:14:27,980 --> 00:14:32,560 only this ridiculously small log star of n amount of work per operation. 261 00:14:32,560 --> 00:14:34,240 We're going to prove it in the next video.