In this video, we're going to prove our first performance guarantee on the union-find data structure with path compression, namely the bound first established by Hopcroft and Ullman. Let me briefly remind you about the guarantee. Consider a union-find data structure where you're using lazy unions, you're doing union by rank, and you're using the path compression optimization. Consider an arbitrary sequence of, say, m union and find operations. The guarantee is that over the course of this sequence of operations, the total work that you do is at most m, the number of operations, times the glacially growing function log* of n. Remember, log* of n is defined as the number of times you have to apply the logarithm function to n before you get a result that is at most one. Remember, 2 raised to the 65536 is a totally absurd number, and log* of that number is merely five.
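To make that definition concrete, here is a minimal sketch of the iterated logarithm in Python (the function name log_star is my own illustrative choice, not a standard library function):

```python
import math

def log_star(n):
    """Iterated logarithm: how many times log2 must be applied
    to n before the result drops to at most 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

print(log_star(2 ** 16))     # 4: 65536 -> 16 -> 4 -> 2 -> 1
print(log_star(2 ** 65536))  # 5: just one more application than for 65536
```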
So the theorem is true no matter what m is, no matter how many operations you do, whether you do very few or a whole lot of them. I'm going to focus on the interesting case where m is at least asymptotically as big as n, that is, m is Omega of n. I'll let you think through, in the privacy of your own home, why the arguments we're about to see imply the theorem no matter what m is.

So, one final comment about this guarantee before we turn to its proof. Notice what the theorem is not saying. The theorem is not saying that every single one of these find and union operations runs in big-O of log*(n) time. And that's because that statement, which would be stronger, is in general false: there are going to be operations, in general, that take more than log*(n) time. Now, on the one hand, we know no operation is going to be slower than logarithmic. Even when we didn't have path compression, no operation was worse than O(log n) time, and we're not going to do any worse with path compression. So the worst-case time bound is O(log n), and some operations may indeed run that slowly. But over a sequence of m operations, the average amount of work we do on a per-operation basis is only log*(n). This is exactly the sort of so-called "amortized analysis" that we did in our very first union-find implementation, with eager unions. There, we agreed that particular unions might take linear time, but over a sequence of unions, we spend only logarithmic time per union. The same thing is true here, with the exception that we're getting a radically better on-average running time bound, of log*(n), over a sequence.

So, before I launch into the proof details, let's just spend one slide talking a little bit about the proof plan, and in particular, about the intuition behind the performance that we're trying to make precise, that we're trying to encode in a mathematical way, in the proof that's about to follow. Well, if we're hoping to prove a bound better than the log n we were stuck with without path compression, it had better be the case that installing all of these shortcuts really, qualitatively speeds up finds. And on some level, it's sort of clear that things have to be sped up: we're replacing a whole chain of pointers with a single pointer, so you can only go faster.
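To fix notation for what follows, here is a minimal sketch, in Python, of the data structure we're analyzing: lazy unions with union by rank and one common two-pass implementation of path compression. The class and method names are my own, not anything fixed by the lecture:

```python
class UnionFind:
    """Lazy unions with union by rank and path compression (a sketch)."""

    def __init__(self, n):
        self.parent = list(range(n))  # each object starts as its own root
        self.rank = [0] * n           # a rank can change only while its object is a root

    def find(self, x):
        # First pass: follow parent pointers up to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): rewire every object on the path
        # to point directly at the root, an ancestor of its old parent.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        # Union by rank: install the shallower root under the deeper one.
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1  # only a root's rank ever changes
```

The second pass of find is where the shortcuts get installed: every object visited on the way to the root is rewired to point directly at that root.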
But how do we keep track of this progress? How do we compile that intuition into a rigorous guarantee? So here's the idea. Let's focus on an object x which is no longer a root; at this point, its parent is somebody other than itself. And one thing I want you to remember from our union-by-rank analysis is that once an object is not a root, its rank is frozen forever. Now, we proved that in the context where there was no path compression, but again, we manipulate ranks in exactly the same way when there is path compression: if you're an object and you're no longer a root, there is no way your rank will ever change again. Now, what we're certainly hoping is true is that finds that originate at, say, this object x are running fast. Not only that, but they should even be getting faster and faster as time goes on, because we're doing more and more path compression.

So here's the big idea. The way we're going to reason about the worst-case running time of the find operation, or equivalently, the longest sequence of parent pointers we might have to traverse to get to a root, is to think about the sequence of ranks that we observe as we traverse these parent pointers from an object x up to the root. Let me give you an example. What would be the worst-case situation? Well, the worst case would be if we have a data structure where, let's say, the maximum rank is something like 100. The worst-case sequence of ranks, the longest sequence we would ever see, would be if we start a find operation at an object with rank 0. We traverse its parent pointer and get to its parent, which has rank 1. We traverse that node's parent pointer to a node of rank 2, then 3, then 4, and so on, all the way up to 100. Now remember, ranks have to strictly increase whenever we traverse a parent pointer; as we discussed, that's true with or without path compression. So in the worst-case situation, to go from rank 0 to rank 100, you have to traverse 100 pointers. So that would be a bummer. Wouldn't it be nice if, every time we traversed a parent pointer, the rank increased not just by 1, but by a much bigger number? So for example, if we went from 0 to 10 to 20 to 30 to 40, and so on, that would guarantee that we'd get to the maximum-rank node, 100, in only 10 steps. So again, the bottom line is: a better, larger lower bound on the gap in ranks between an object and its parent implies more rapid progress through the possible ranks of the nodes we see, and that translates to faster finds, to fewer parent pointer traversals.

So with that idea in mind, that big gaps between ranks imply rapid progress, I want to propose, as a progress measure for a given non-root object x, the gap between x's rank, which again, remember, is frozen forevermore, and the rank of its current parent. This progress measure is a good one for two reasons. First, as we just discussed, if you have a handle on this gap, if you can lower bound it, then that gives you an upper bound on the search time. Second, this gap allows us to quantify the benefit of path compression, of installing these shortcuts. Specifically, whenever you install a new shortcut, you rewire an object's parent pointer to point higher up in the tree, and its new parent is going to have rank strictly bigger than its old one. That means this gap is only going to get bigger. Summarizing: path compression improves this progress measure. That is, if an object x previously had a parent p and then has its parent pointer rewired to a different node p', the rank of p' is bigger than the rank of p.
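Here's a quick sanity check of that gap-growth claim, using the UnionFind sketch from above. The particular tree built here is just a toy example of my own, not the one from the lecture:

```python
uf = UnionFind(8)
for i in range(0, 8, 2):
    uf.union(i, i + 1)   # four rank-1 trees: {0,1}, {2,3}, {4,5}, {6,7}
uf.union(0, 2)           # a rank-2 tree
uf.union(4, 6)           # another rank-2 tree
uf.union(0, 4)           # one rank-3 tree, rooted at object 0

def gap(x):
    # Rank gap between a non-root object and its current parent.
    return uf.rank[uf.parent[x]] - uf.rank[x]

print(gap(5))   # 2: object 5 (rank 0) points at object 4 (rank 2)
uf.find(5)      # path compression rewires 5 to the root, object 0
print(gap(5))   # 3: the gap to the new parent (rank 3) is strictly bigger
```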
And just to make sure this is crystal clear, let's draw a couple of cartoon examples. So first, just in the abstract, think about an object x. Suppose it has some parent p, and suppose the root of the tree is some p', some ancestor strictly further up in the tree. Remember, ranks always increase as you go up the tree, as you follow the parent pointers. So that means p's rank is strictly bigger than x's, and p's rank, and this is the important thing, is strictly smaller than p-prime's. So when you rewire x's parent pointer to point no longer to p but rather to p', x acquires a new parent, and its new parent is an ancestor of its old one; ergo, it must have strictly bigger rank. For that reason, the gap between x's rank, which is fixed forevermore, and the rank of its new parent is bigger than the gap between its rank and the rank of its old parent.

You can also see this effect in action with the running seven-object example we were using in the last video. So I've shown that example, that tree, in pink, both before and after path compression. And again, before path compression, I just defined the ranks as with union by rank, so that the rank of each node is equal to the length of the longest path from a leaf to that node. And of course, we don't change the ranks when we apply path compression. And what do we observe? Exactly two of the objects had their parent pointers rewired, namely, objects 1 and 4, which were rewired to point directly to object 7. As a consequence, the gap in rank between object 1 and its parent has leaped from merely 1, the difference between its rank and the rank of object 4, up to 3, the difference between its rank and that of its new parent, object 7. Similarly for object 4: its rank gap has jumped from 1 to 2. Its rank is only 1 less than that of its old parent, object 6, but it's 2 less than that of its new parent, object 7.

So, believe it or not, we now actually have the two crucial building blocks for the Hopcroft-Ullman analysis. Building block number 1 is the rank lemma that we discussed a couple of videos ago: the fact that, with or without path compression, you can't have too many objects with a given rank r; there are at most n/2^r of them. The second building block is the one we just covered: every time you update a parent pointer under path compression, the gap between the rank of that object and the rank of its new parent has to grow. The rest of the proof is just an optimal exploitation of these two building blocks.
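As a quick empirical illustration of building block number 1, here is a check of the rank lemma's n/2^r bound on a batch of random unions, again using the UnionFind sketch from earlier; the parameters and seed are arbitrary choices of mine:

```python
import random
from collections import Counter

random.seed(1)                      # arbitrary seed, just for reproducibility
n = 1024
uf = UnionFind(n)
for _ in range(5 * n):              # plenty of random unions (each does finds internally)
    uf.union(random.randrange(n), random.randrange(n))

counts = Counter(uf.rank[x] for x in range(n))
for r in sorted(counts):
    assert counts[r] <= n / 2 ** r  # rank lemma: at most n / 2^r objects of rank r
    print(f"rank {r}: {counts[r]} objects (bound: {n // 2 ** r})")
```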
Let me now show you the details.