In this video, we're going to prove our first performance guarantee on the union-find data structure with path compression. This is the bound first established by Hopcroft and Ullman.

Let me briefly remind you about the guarantee. Consider a union-find data structure where you're using lazy unions, union by rank, and the path compression optimization, and consider an arbitrary sequence of, say, m union and find operations. The guarantee is that over the course of this sequence of operations, the total work that you do is at most m, the number of operations, times the glacially growing function log* of n. Remember, log* n is defined as the number of times you have to apply the logarithm function to n before you get a result that is at most 1. Recall that for 2 raised to the 65,536, a totally absurd number, log* of that number is merely 5.

The theorem is true no matter what m is, no matter how many operations you do, whether you do very few or a whole lot of them. I'm going to focus on the interesting case where m is at least asymptotically as big as n, that is, where m is Ω(n). I'll let you think through in the privacy of your own home why the arguments that we're about to see imply the theorem no matter what m is.

One final comment about this guarantee before we turn to its proof. Notice what the theorem is not saying. The theorem is not saying that every single one of these find and union operations runs in O(log* n) time, and that's because that stronger statement is, in general, false. In general there will be operations that take more than log* n time. Now, on the one hand, we know no operation is going to be slower than logarithmic time: even when we didn't have path compression, no operation was worse than log n time, and we're not going to do any worse with path compression. So the worst-case time bound per operation is log n, and some operations may indeed run that slowly. But over a sequence of m operations, the average amount of work we do on a per-operation basis is only log* of n. This is exactly the sort of so-called amortized analysis that we did in our very first union-find implementation with eager unions. There we agreed that particular unions might take linear time, but over a sequence of unions we spent only logarithmic time per union. The same thing is happening here, except that we're getting a radically better average bound of log* n per operation over the sequence.
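To keep the object of study concrete, here is a minimal Python sketch of the data structure being analyzed, lazy unions with union by rank and path compression, along with a small helper that computes log* n exactly as defined above. This is an illustrative sketch with my own naming, not the course's reference implementation.

```python
import math

def log_star(n):
    """Number of times log2 must be applied to n before the result is at most 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

class UnionFind:
    """Lazy unions with union by rank and path compression (illustrative sketch)."""

    def __init__(self, n):
        self.parent = list(range(n))   # parent[x] == x exactly when x is a root
        self.rank = [0] * n            # an object's rank changes only while it is a root

    def find(self, x):
        # First pass: walk parent pointers up to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): rewire every object on the path to the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        # Union by rank: the root of smaller rank points to the root of larger rank.
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1

print(log_star(2 ** 65536))   # 5, as in the example above
```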
So before I launch into the proof details, let's just spend one slide talking a little bit about the proof plan. In particular, what is the intuition behind the performance gain that we're trying to make precise, that we're trying to encode in a mathematical way in the proof that's about to follow?

Well, if we're hoping to prove a bound better than log n, which is what we were stuck with without path compression, it had better be the case that installing all of these shortcuts really does qualitatively speed up finds and unions. And on some level it's sort of clear that things have to speed up: we're replacing an old chain of pointers with a single pointer, so you can only go faster. But how do we keep track of this progress? How do we convert this intuition into a rigorous guarantee?

So here's the idea. Let's focus on an object x which is no longer a root; at this point its parent is somebody other than itself. One thing I want you to remember from our union by rank analysis is that once an object is not a root, its rank is frozen forevermore. Now, we proved that in the context where there was no path compression, but remember that we manipulate ranks in exactly the same way when there is path compression, so it remains true: if you're an object and you're no longer a root, there is no way your rank will ever change again.

Now, what we're certainly hoping is that finds that originate at, say, this object x run fast. Not only that, they should be getting faster and faster as time goes on, because we're doing more and more path compression.

So here's the big idea. The way we're going to reason about the worst-case running time of the find operation, or equivalently, the longest sequence of parent pointers we might have to traverse to get to a root, is by thinking about the sequence of ranks that we observe as we traverse these parent pointers from an object x up to the root.
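As a concrete way to picture the quantity the proof tracks, here is a small helper, my own illustration rather than anything from the lecture, that records the ranks observed while walking parent pointers from an object x up to its root. The parent and rank arrays are plain lists, as in the sketch above.

```python
def ranks_along_path(parent, rank, x):
    """Ranks seen while following parent pointers from x up to its root.

    With or without path compression, this sequence is strictly increasing.
    """
    seq = [rank[x]]
    while parent[x] != x:
        x = parent[x]
        seq.append(rank[x])
    return seq

# Example: a chain 0 -> 1 -> 2 -> 3 with ranks 0, 1, 2, 3.
parent = [1, 2, 3, 3]
rank = [0, 1, 2, 3]
print(ranks_along_path(parent, rank, 0))   # [0, 1, 2, 3]
```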
Let me give you an example. What would be the worst-case situation? Well, the worst case would be a data structure where, say, the maximum rank is something like 100, and the longest sequence of ranks we could ever see is the following: we start a find operation at an object with rank 0, we traverse a parent pointer and reach a parent with rank 1, we traverse its parent pointer to a node with rank 2, then 3, then 4, and so on, all the way up to 100. Now remember, ranks have to strictly increase whenever we traverse a parent pointer; as we discussed, that is true with or without path compression. So in this worst-case situation, to get from rank 0 to rank 100, you have to traverse 100 pointers.

That would be a bummer. Wouldn't it be nice if, every time we traversed a parent pointer, the rank increased not just by one but by a much bigger number? For example, if we went from 0 to 10, to 20, to 30, to 40, and so on, that would guarantee that we reach the maximum-rank node, with rank 100, in only ten steps. So the bottom line is: a larger lower bound on the gap in rank between an object and its parent implies more rapid progress through the possible ranks of the nodes that we see, and that translates into faster finds, into fewer parent pointer traversals.

So with that idea in mind, that big gaps between ranks imply rapid progress, I want to propose a progress measure for a given non-root object x: the gap between x's rank, which remember is frozen forevermore, and the rank of its current parent.

This progress measure is a good one for two reasons. First, as we just discussed, if you have a handle on this gap, if you can lower bound it, then that gives you an upper bound on the search time. Second, this gap allows us to quantify the benefit of path compression, of installing these shortcuts. Specifically, whenever you install a new shortcut, you rewire an object's parent pointer to point higher up in the tree, and its new parent is going to have rank strictly bigger than its old one. That means this gap can only get bigger. Summarizing: path compression improves this progress measure.
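To put a number on the first of those two reasons, here is the arithmetic from the rank-100 example above written out: if every hop up the tree increases the rank by at least some amount gap, then a find starting from a rank-0 object traverses at most about max_rank / gap parent pointers. The value 100 is just the lecture's illustrative maximum rank.

```python
max_rank = 100                # illustrative maximum rank from the example above
for gap in (1, 10):
    # If each parent-pointer hop raises the rank by at least `gap`,
    # a find starting at a rank-0 object needs at most this many hops.
    print(f"gap >= {gap:2d}: at most {max_rank // gap} hops")
# gap >=  1: at most 100 hops
# gap >= 10: at most 10 hops
```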
To spell out why path compression improves the measure: if an object x previously had a parent p and then has its parent pointer rewired to a different node p prime, then the rank of p prime is strictly bigger than the rank of p.

Just to make sure this is crystal clear, let's draw a couple of cartoon examples. First, in the abstract, think about an object x. Suppose it has some parent p, and suppose the root of the tree is some p prime, an ancestor strictly further up in the tree. Remember, ranks always increase as you go up the tree, as you follow parent pointers. That means p's rank is strictly bigger than x's, and, this is the important part, p prime's rank is strictly bigger than p's. So when you rewire x's parent pointer to point no longer to p but rather to p prime, x acquires a new parent, and its new parent is an ancestor of its old one; ergo it must have strictly bigger rank. For that reason, the gap between x's rank, which is fixed forevermore, and the rank of its new parent is bigger than the gap between its rank and the rank of its old parent.

You can also see this effect in action with the running seven-object example we were using in the last video. I've shown that example tree in pink, both before and after path compression. Before path compression, I defined the ranks just as with union by rank, so the rank of each node equals the length of the longest path from a leaf to that node, and of course we don't change the ranks when we apply path compression. And what do we observe? Exactly two of the objects had their parent pointers rewired, namely objects 1 and 4, which now point directly to 7. As a consequence, the gap in rank between object 1 and its parent has leaped from merely one, the difference between its rank and the rank of 4, up to three, the difference between its rank and that of its new parent 7. Similarly, object 4's rank gap has jumped from one to two: its rank is only one less than that of its old parent 6, but it is two less than that of its new parent 7.
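Here is a small self-contained sketch that reproduces those gap jumps. It sets up only the path 1 -> 4 -> 6 -> 7 with the ranks 0, 1, 2, 3 that the lecture's numbers imply; where the remaining objects of the seven-object example sit does not affect this path, so their placement is omitted as an assumption of convenience.

```python
# Objects are named 1..7 (index 0 is unused padding).
parent = list(range(8))
rank = [0] * 8
parent[1], parent[4], parent[6] = 4, 6, 7     # the path 1 -> 4 -> 6 -> 7
rank[4], rank[6], rank[7] = 1, 2, 3           # ranks 0, 1, 2, 3 along the path

def find(x):
    """Find with path compression, as in the sketch earlier in this video."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

def gap(x):
    """The progress measure: rank of x's current parent minus x's (frozen) rank."""
    return rank[parent[x]] - rank[x]

print(gap(1), gap(4))    # before compression: 1 1
find(1)                  # rewires objects 1 and 4 to point directly to 7
print(gap(1), gap(4))    # after compression:  3 2
```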
So, believe it or not, we actually have the two crucial building blocks for the Hopcroft and Ullman analysis. Building block number one is the rank lemma that we discussed a couple of videos ago: with or without path compression, you can't have too many objects with a given rank r; there can be at most n/2^r of them. The second building block is the one we just covered: every time you update a parent pointer under path compression, the gap between the rank of that object and the rank of its new parent has to grow. The rest of the proof is just an optimal exploitation of these two building blocks. Let me now show you the details.
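As a quick, optional sanity check of building block number one, here is a small experiment of my own, not part of the lecture: after a batch of random unions by rank, no rank r is held by more than n/2^r objects. Path compression is omitted because, as noted above, it never changes any ranks.

```python
import random
from collections import Counter

def check_rank_lemma(n=1000, trials=2000, seed=0):
    """Empirically check that at most n / 2^r objects end up with rank r."""
    random.seed(seed)
    parent = list(range(n))
    rank = [0] * n

    def find(x):               # plain find; the rank lemma holds with or
        while parent[x] != x:  # without path compression
            x = parent[x]
        return x

    for _ in range(trials):    # random unions by rank
        a, b = find(random.randrange(n)), find(random.randrange(n))
        if a == b:
            continue
        if rank[a] < rank[b]:
            a, b = b, a
        parent[b] = a          # smaller-rank root points to larger-rank root
        if rank[a] == rank[b]:
            rank[a] += 1

    for r, count in sorted(Counter(rank).items()):
        assert count <= n / 2 ** r
        print(f"rank {r}: {count} objects (bound {n // 2 ** r})")

check_rank_lemma()
```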