So, in this video, we're going to prove Tarjan's inverse Ackermann bound on the performance of the union-find data structure with union by rank and path compression. Frankly, I hope you're kind of excited. This is one of the crown jewels in the entire history of algorithms and data structures.

Let me remind you what the bound says. You've got a union-find data structure, we're doing lazy unions, we're doing union by rank, we're doing path compression. And for an arbitrary sequence of m find and union operations, the total amount of work you do over the sequence of operations is bounded above by m times alpha of n, where n is the number of objects in the data structure and alpha is the inverse Ackermann function that we defined in the previous video.

I want to give a shout-out here to Dexter Kozen's beautiful book, The Design and Analysis of Algorithms. It is his analysis that I will be following closely here.

So, it's true that merely stating Tarjan's bound is non-trivial. We needed an entire video defining the Ackermann function and its inverse just so we could make sense of the statement. That said, we're well positioned to prove Tarjan's bound. In fact, the template of the proof is going to very much mirror what we already did for the log star analysis by Hopcroft and Ullman.

So, in that spirit, let's review the two basic workhorses, the two building blocks, that drove forward the Hopcroft-Ullman analysis. The first building block is the rank lemma, which dates back all the way to the videos before path compression. Remember, the rank lemma gives us control on how many objects can exist with a given rank, and in particular that upper bound is decreasing exponentially with the rank: there are at most n/2^r objects with a given rank r. But, of course, we need a second building block to exploit the fact that we're using the path compression optimization. So, how do we quantify the progress made by path compression? Well, we argue that every time a node has its parent pointer updated in path compression, it acquires a new parent with strictly larger rank. So, you can think of the Hopcroft-Ullman analysis as a clever and essentially optimal exploitation of these two building blocks.
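For reference, here is a minimal sketch, in Python, of the data structure these two building blocks are talking about: lazy unions, union by rank, and path compression. The class and method names are just illustrative, not code from the lecture.

class UnionFind:
    """Bare-bones union-find sketch with union by rank and path compression."""

    def __init__(self, n):
        self.parent = list(range(n))  # every object starts out as its own root
        self.rank = [0] * n           # a rank only changes while the object is a root

    def find(self, x):
        # First walk up to the root, then repoint every node on the path
        # directly at it.  Each repointed node gets a new parent of strictly
        # larger rank than its old parent, which is building block number two.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, x, y):
        # Union by rank: the root of smaller rank is pointed at the root of
        # larger rank; a rank only increases when the two ranks are tied.
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1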
How did we use them? Well, we defined the notion of rank blocks, and each rank block has size 2 raised to the size of the previous rank block. Why were rank blocks useful? Well, they allowed us to measure progress as we followed parent pointers. We defined an object to be good if following its parent pointer catapulted you into a subsequent rank block. Because there's only a log star number of rank blocks, you can only visit log star good nodes on any given find operation. So, that gave us a per-operation bound of log star on visits to good nodes. And then, for visits to bad nodes, we used the second building block to argue that every time you visit a bad node, the rank of its parent increases. And that can only happen so many times before the rank of this object's parent is so much bigger than this object's rank that the node has to be good; you have to make a lot of progress by following its parent pointer.

And the point is, if we want to do better than the Hopcroft-Ullman analysis, we're not going to do better by taking these same two building blocks and exploiting them in a better way; that can't be done. So, what we need to do is have a better version of one of the two building blocks. The key idea in Tarjan's analysis is to have a stronger version of the second building block. We're going to argue that path compression doesn't merely increase the rank of a node's parent by one, but in fact it typically increases the rank of an object's parent by a lot. And what's kind of amazing is that the length of the proof we're about to do is really basically the same as that of the Hopcroft-Ullman analysis, and the steps match up almost perfectly. So, the bound is even better, the argument is even more ingenious, and it's even harder to imagine how one would come up with this idea oneself. But as far as checking it, as far as understanding it, the complexity level is roughly the same as what we've already done in the log star analysis. So, here we go.

So, in this slide, I'm going to give you a definition. And the point of the definition is to measure how much progress we make in terms of ranks when we follow a node's parent pointer. The definition here is going to play exactly the same role that the definition of rank blocks played in the Hopcroft-Ullman analysis. What I'm going to define is a number, a statistic, measuring the difference, the gap, between the rank of an object and the rank of its parent. This definition is only going to make sense for non-root objects x.
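Because everything that follows leans on the Ackermann hierarchy from the previous video, here is a tiny reference sketch of it, assuming the recursive definition given there: A_0 is the successor function, and for k at least 1, A_k(r) applies A_{k-1} to r, r times. The function name is mine, and the values explode so fast that this is only usable for sanity checks on tiny inputs.

def ackermann(k, r):
    """A_k(r): A_0(r) = r + 1, and A_k(r) = A_{k-1} applied r times starting from r."""
    if k == 0:
        return r + 1
    value = r
    for _ in range(r):
        value = ackermann(k - 1, value)
    return value

# Sanity checks matching the values quoted in the lecture:
assert ackermann(1, 5) == 10          # A_1 is the doubling function
assert ackermann(2, 3) == 3 * 2 ** 3  # A_2(r) = r * 2^r
assert ackermann(3, 2) == 2048        # A_3(2) = A_2(A_2(2)) = A_2(8)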
One thing I want you to remember from previous videos is that once an object is not a root, it will never again be a root, and its rank will never again change. So, non-root objects have their rank fixed forevermore.

We're going to use the notation delta of x for the statistic. And the definition of delta of x is the largest value of k such that A_k (this is the Ackermann function, remember) applied to the rank of this object x is bounded above by the rank of x's parent. So, the bigger the gap between the rank of x's parent and x's rank, the bigger the value of delta of x.

Let me talk through some simple examples to make sure this makes sense, and also review the Ackermann function a little bit. First of all, let me just point out that delta of x is always non-negative for any non-root object x. Why is this true? Well, remember, and this goes all the way back to our union by rank discussion, the rank of an object's parent is always at least one more than the rank of that object. And the function A_0, recall, we defined as the successor function. So, for every object that's not a root, it's always true that its parent has rank at least one more than its own, and that means we can always at least take k to be zero and have the inequality be satisfied.

Now, when is it going to be true that an object x has a delta value of at least one? Well, that is equivalent to saying we can take k to be 1 in this inequality and have the inequality hold. So, let's remember what the function A_1 is. In the previous video, we discovered that it was just the doubling function. So, we can take k=1 and have this inequality satisfied if and only if the rank of the node's parent is at least double the rank of that object. Similarly, an object x has a delta value of at least 2 if and only if we can take k=2 on the right-hand side of the definition and have the inequality be satisfied. So, let's recall what the function A_2 is: A_2 of r is just r*2^r. So, an object x has delta value at least 2 if and only if its parent's rank is substantially bigger than its own rank, in the sense that if it has rank r, its parent's rank has to be at least r*2^r. And in general, the bigger the value of delta, the bigger the gap between the rank of the object x and the rank of its parent.
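To make the definition concrete, here is the statistic as a small illustrative calculation; the names delta and ackermann_capped are mine, not the lecture's. The capped helper exists only so that we never build the astronomically large Ackermann values, and the guard on the rank reflects the fact that for ranks 0 and 1 the maximum in the definition need not exist, which is one reason those ranks will get a free pass shortly.

def ackermann_capped(k, r, cap):
    """min(A_k(r), cap): the same recursion as A_k, but values stop growing at cap."""
    if r >= cap:
        return cap
    if k == 0:
        return min(r + 1, cap)
    value = r
    for _ in range(r):
        value = ackermann_capped(k - 1, value, cap)
        if value >= cap:
            return cap
    return value

def delta(rank_x, rank_parent):
    """Largest k with A_k(rank_x) <= rank_parent, for a non-root x of rank at least 2."""
    assert rank_x >= 2 and rank_parent > rank_x
    k = 0
    while ackermann_capped(k + 1, rank_x, rank_parent + 1) <= rank_parent:
        k += 1
    return k

# Examples matching the discussion above, for an object of rank 3:
assert delta(3, 5) == 0    # parent's rank is less than double
assert delta(3, 6) == 1    # at least double, but less than 3 * 2^3 = 24
assert delta(3, 24) == 2   # at least 3 * 2^3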
And, of course, because the Ackermann function grows so ridiculously quickly, the gap between this object's rank and its parent's rank grows massively as you take delta to be larger and larger.

So, now that we understand this better, one thing that should be clear is that the delta value of an object x can only go up over time, right? Remember, we're talking only about non-root objects x, so x's rank itself is frozen forevermore. The only thing that's changing is that it's potentially acquiring new parents through path compression, and each new parent's rank is bigger than the previous one's. So, the gap in the ranks is only going up, and therefore the delta value of this object x can only go up over time.

One final comment. You recall that in the Hopcroft-Ullman analysis, the way we defined rank blocks ensured that there weren't too many of them; there were only log star many. Similarly here, the way we're defining the statistic delta guarantees that there are not too many distinct delta values that objects can take on. We've already argued that delta is an integer that's at least zero, and we can also bound it from above, at least for any object x that has rank at least 2, which will be good enough for our purposes. If an object x has rank at least 2, its value of delta cannot be any larger than the inverse Ackermann function of n, than alpha of n. And this is really just by the way we defined the inverse Ackermann function alpha: we defined it as the smallest value of k such that applying A_k to 2 catapults you beyond n. And since ranks of objects are bounded above by n (actually, they're even bounded above by log n, but in particular by n), that means you will never see a delta value bigger than alpha of n for any object x that has rank at least two.
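Tying the sketches together, here is the inverse Ackermann function in the same illustrative style, reusing ackermann() from the earlier sketch; the name alpha and the early return are mine. The early return is there because A_4(2) = A_3(2048) is vastly larger than any input a real machine could hold, so every realistic n beyond 2048 has alpha equal to 4.

def alpha(n):
    """Smallest k such that A_k(2) >= n; reuses ackermann() from the sketch above."""
    if n > 2048:
        # A_3(2) = 2048, and A_4(2) = A_3(2048) dwarfs anything storable,
        # so every n you could actually hand this function lands at 4.
        return 4
    k = 0
    while ackermann(k, 2) < n:
        k += 1
    return k

# A few values: alpha(3) == 0, alpha(4) == 1, alpha(8) == 2, alpha(2048) == 3.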
So, that completes the definition of the statistic delta of x. And, once again, the point of this statistic is to quantify progress through rank space as we traverse a parent pointer from an object. In the Hopcroft-Ullman analysis, we used rank blocks for that purpose; here, we're using the statistic delta of x. And now, we have to redefine good and bad objects to reflect this different measurement of progress, of gaps between ranks of objects and ranks of their parents. So, that's what I'm going to provide for you on this slide, the definition of good and bad nodes.

This definition will play exactly the same role as it did in the Hopcroft-Ullman analysis. For the good nodes, we'll be able to just get a per-operation bound on how many visits we make to them. And then we'll be stuck with, and this will be the main part of the proof, a global analysis arguing that the total number of visits to bad nodes over a sequence of operations cannot be too big.

So, the bad objects will be those that meet the following set of four criteria. If an object fails to meet any of these four criteria, it will be called good. The first two criteria will be familiar from the Hopcroft-Ullman analysis. First of all, roots get a pass, and direct descendants of roots also get a pass. So, to be bad, it must be the case that you're neither a root nor the child of a root. What's the motivation for these two criteria? Well, it's exactly the same as it was in the Hopcroft-Ullman analysis: it's just to ensure that bad nodes, after they're visited on a find, get their parent pointer updated. Remember, when you do path compression, after you've done a find from an object x all the way up to its corresponding root, everybody's parent pointer gets updated except for the root and except for the child of the root; those two nodes keep the same parent pointer. So, to make sure we have progress, we want to exclude the root and the direct descendants of the root from being bad; we'll account for them separately.

The third criterion is not conceptually important; it's just for technical convenience, sort of an artifact of the way I've defined the inverse Ackermann function. We're going to also give nodes that have rank zero or one a free pass. So, to be bad, we're only going to worry about the case where the rank is at least 2.

The final criterion is really the ingenious one, and it's the one that ensures that when you do path compression, typically an object's parent's rank will increase extremely rapidly, not just by 1. And this criterion says that for an object x to be bad, it has to be the case that it has some ancestor y, that is, an object you get to by traversing parent pointers from x, that has exactly the same value of delta. So, if an object x has delta of x equal to 2, for it to be bad, there has to be some ancestor object y that also has a delta value equal to 2. And again, if x fails to meet any of these four conditions, then x gets a pass; x is called a good object.
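Just to collect the four criteria in one place, here is the test written out as an illustrative predicate, again with made-up names, reusing delta from the earlier sketch and the parent and rank arrays from the union-find sketch. Nothing in the proof ever needs to run this; it simply restates the definition.

def is_bad(x, parent, rank):
    """x is bad exactly when it meets all four criteria from the slide."""
    if parent[x] == x:                  # criterion 1: x is a root
        return False
    if parent[parent[x]] == parent[x]:  # criterion 2: x is a child of a root
        return False
    if rank[x] < 2:                     # criterion 3: ranks 0 and 1 get a pass
        return False
    # Criterion 4: some proper ancestor y of x (a non-root, so delta(y) is
    # defined) has exactly the same delta value as x.
    dx = delta(rank[x], rank[parent[x]])
    y = parent[x]
    while parent[y] != y:               # walk the ancestors, stopping at the root
        if delta(rank[y], rank[parent[y]]) == dx:
            return True
        y = parent[y]
    return False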
Now, I said that the role of this definition is going to be to split the work done into two groups, and that we're going to be able to easily bound, even on a per-operation basis, the number of visits to good nodes. So, the next quiz asks you to think about that carefully.

Alright. So, the correct answer is B: it's theta of the inverse Ackermann function of n. And this is really one of the points of how we define good and bad nodes: we guarantee that there cannot be too many good nodes whenever we do a find operation. To see why this is true, let's just count up the number of nodes along a path that can fail to meet each of the four criteria. Well, the first criterion was that we give roots a free pass, so there's at most one node that's good because it's a root. Similarly, on a given path there's at most one object that's a direct descendant of the root. Because ranks always strictly go up when you follow parent pointers, there can be at most one object of rank zero and at most one object of rank one, so at most two objects get a free pass under the third criterion.

Now let's think about the interesting case: nodes that could be good because they fail to meet the fourth criterion. How many can there be? Well, let's focus on a particular value of delta. So, let's think about all of the objects along a path that have delta of x equal to 3. There may well be lots of objects on this path with delta of x equal to 3; maybe there's like 17 of them. Call the highest such object, the one closest to the root, z. Of these 17 objects with delta of x equal to 3, z is going to be the only one that is good. The other 16 objects with delta of x equal to 3 are going to be bad. Why? Well, any object other than z has an ancestor, namely z, with exactly the same delta value. So those 16 objects do meet the fourth criterion, and they will be classified as bad. Summarizing, for each of the roughly alpha of n possible values of delta, only the uppermost object, the object closest to the root with that particular delta value, can possibly be classified as good. So, that bounds by alpha of n the number of objects that fail to meet this fourth criterion.
That gives us the overall bound on the number of good nodes on any path.
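Worked out explicitly, the tally per find operation is: at most one good node because it's the root, at most one because it's a child of the root, at most two because they have rank 0 or 1 (ranks strictly increase along the path), and at most one per possible delta value, of which there are at most alpha of n plus 1 (the integers 0 through alpha of n). That's at most alpha of n plus 5 good nodes visited on any single find, which matches the theta of alpha of n answer to the quiz.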