So, in this video, we're going to prove Tarjan's inverse Ackermann bound on the performance of the union-find data structure with union by rank and path compression. Frankly, I hope you're excited: this is one of the crown jewels in the entire history of algorithms and data structures. Let me remind you what the bound says. You've got a union-find data structure, we're doing lazy unions, we're doing union by rank, we're doing path compression. For an arbitrary sequence of m find and union operations, the total amount of work you do over the sequence is bounded above by m times alpha of n, where n is the number of objects in the data structure and alpha is the inverse Ackermann function that we defined in the previous video. I want to give a shout-out here to Dexter Kozen's beautiful book, The Design and Analysis of Algorithms; it is his analysis that I will be following closely here. So, it's true that merely stating Tarjan's bound is non-trivial; we spent an entire video defining the Ackermann function and its inverse just so we could make sense of the statement. That said, we're now well positioned to prove Tarjan's bound. In fact, the template of the proof is going to mirror very closely what we already did for the log star analysis by Hopcroft and Ullman. So, in that spirit, let's review the two basic workhorses, the two building blocks, that drove the Hopcroft-Ullman analysis forward. The first building block is the rank lemma, which dates back all the way to the videos before path compression. Remember, the rank lemma gives us control over how many objects can exist with a given rank, and in particular that upper bound decreases exponentially with the rank: there are at most n/2^r objects with a given rank r. But, of course, we need a second building block to exploit the fact that we're using the path compression optimization. So, how do we quantify the progress made by path compression? Well, we argue that every time a node has its parent pointer updated by path compression, it acquires a new parent with strictly larger rank. So, you can think of the Hopcroft-Ullman analysis as a clever and essentially optimal exploitation of these two building blocks. How did we use them? Well, we defined the notion of rank blocks, where each rank block has size 2 raised to the size of the previous rank block. Why were rank blocks useful? Well, they allowed us to measure progress as we followed parent pointers. We defined an object to be good if following its parent pointer catapulted you into a subsequent rank block. Because there are only log star many rank blocks, you can only visit log star good nodes on any given find operation. That gave us a per-operation bound of log star on the number of visits to good nodes. Then, for visits to bad nodes, we used the second building block to argue that every time you visit a bad node, the rank of its parent increases. And that can only happen so many times before the rank of the object's parent is so much bigger than the object's own rank that the node has to be good; you have to make a lot of progress by following its parent pointer. And the point is, if we want to do better than the Hopcroft-Ullman analysis, we're not going to do it by taking these same two building blocks and exploiting them in a better way; that can't be done. What we need is a better version of one of the two building blocks. So, the key idea in Tarjan's analysis is to have a stronger version of the second building block.
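Just to fix ideas before the analysis, here is a minimal sketch of the data structure the theorem is about: union-find with lazy unions, union by rank, and two-pass path compression. The class and method names are generic choices of mine; this is one standard way to implement the operations, not pseudocode taken from the earlier videos.

```python
# A minimal union-find sketch with union by rank and path compression:
# the data structure whose sequences of m operations the m * alpha(n)
# bound is about.

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))   # every object starts as its own root
        self.rank = [0] * n            # ranks start at zero

    def find(self, x):
        # First pass: walk up the parent pointers to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): repoint every object on the path
        # directly at the root, so each one acquires a parent of strictly
        # larger rank than its old parent.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        # Union by rank: the root of smaller rank points to the other root;
        # a rank only changes when two roots of equal rank are merged.
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
```

Every find compresses the path it traverses, and it is exactly the total cost of these traversals, over an arbitrary sequence of operations, that the Hopcroft-Ullman and Tarjan analyses bound.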
We're going to argue that path compression doesn't merely increase the rank of a node's parent by one; in fact, it typically increases the rank of an object's parent by a lot. And what's kind of amazing is that the length of the proof we're about to do is basically the same as that of the Hopcroft-Ullman analysis, and the steps match up almost perfectly. So, the bound is even better and the argument is even more ingenious; it's even harder to imagine how one would come up with this idea oneself. But, as far as checking it, as far as understanding it, the complexity level is roughly the same as what we've already done in the log star analysis. So, here we go. On this slide, I'm going to give you a definition, and the point of the definition is to measure how much progress we make, in terms of ranks, when we follow a node's parent pointer. The definition is going to play exactly the same role that the definition of rank blocks played in the Hopcroft-Ullman analysis. What I'm going to define is a number, a statistic, measuring the difference, the gap, between the rank of an object and the rank of its parent. This definition only makes sense for non-root objects x. One thing I want you to remember from previous videos is that once an object is not a root, it will never again be a root, and its rank will never again change. So, non-root objects have their rank fixed forevermore. We're going to use the notation delta of x for the statistic. The definition of delta of x is the largest value of k such that a sub k, this is the Ackermann function, remember, applied to the rank of this object x is bounded above by the rank of x's parent. So, the bigger the gap between the rank of x's parent and x's rank, the bigger the value of delta of x. Let me talk through some simple examples to make sure this makes sense, and also to review the Ackermann function a little bit. First of all, let me point out that delta of x is always non-negative for any non-root object x. Why is this true? Well remember, and this goes all the way back to our union by rank discussion, the rank of an object's parent is always at least one more than the rank of that object. And the function a sub 0, recall, we defined as the successor function. So, for every object that's not a root, its parent has rank at least one more than it, which means we can always take k to be zero and have the inequality be satisfied. Now, when is it going to be true that an object x has a delta value of at least one? Well, that is equivalent to saying we can take k to be 1 in this inequality and have it hold. So, let's remember what the function a sub 1 is. In the previous video, we discovered that it was just the doubling function. So, we can take k = 1 and have the inequality satisfied if and only if the rank of the node's parent is at least double the rank of that object. Similarly, an object x has a delta value of at least 2 if and only if we can take k = 2 on the right-hand side of the definition and have the inequality be satisfied. So, let's recall what the function a sub 2 is: a sub 2 of r is just r*2^r. So, an object x has delta value at least 2 if and only if its parent's rank is substantially bigger than its own rank, in the sense that if it has rank r, its parent's rank has to be at least r*2^r. And in general, the bigger the value of delta, the bigger the gap between the rank of the object x and the rank of its parent.
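To make the examples concrete, here is a small sketch of the Ackermann-type hierarchy and of the statistic delta. I'm assuming the recursion from the previous video: a sub 0 of r is r + 1, and for k at least 1, a sub k of r is a sub k-1 applied r times starting from r, which gives a sub 1 of r equal to 2r and a sub 2 of r equal to r*2^r, matching the examples above. The names ackermann_capped and delta are mine; the capping is just a computational trick so the astronomically large values never have to be written out.

```python
# Sketch of the Ackermann hierarchy A_k and the statistic delta(x).
# Assumed recursion: A_0(r) = r + 1; for k >= 1, A_k(r) is A_{k-1}
# applied r times starting from r. Values are capped as soon as they
# exceed the rank we are comparing against.

def ackermann_capped(k, r, cap):
    """A_k(r), short-circuited to cap + 1 once the value exceeds cap."""
    if k == 0:
        return r + 1
    result = r
    for _ in range(r):
        result = ackermann_capped(k - 1, result, cap)
        if result > cap:
            return cap + 1
    return result

def delta(rank_x, rank_parent):
    """Largest k such that A_k(rank(x)) <= rank(parent(x)), for a non-root x."""
    k = 0
    while ackermann_capped(k + 1, rank_x, rank_parent) <= rank_parent:
        k += 1
    return k

# Sanity checks matching the examples in the lecture:
assert ackermann_capped(1, 5, 100) == 10        # A_1 is the doubling function
assert ackermann_capped(2, 3, 100) == 3 * 2**3  # A_2(r) = r * 2^r
assert delta(3, 4) == 0    # parent's rank just one bigger: only k = 0 works
assert delta(3, 6) == 1    # parent's rank at least double
assert delta(3, 24) == 2   # parent's rank at least r * 2^r
```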
And, of course, because the Ackermann function grows so ridiculously quickly, the gap between this object's rank and its parent's rank grows massively as you take delta to be larger and larger. So, now that we understand the definition better, one thing that should be clear is that the delta value of an object x can only go up over time. Remember, we're talking only about non-root objects x, so x's rank itself is frozen forevermore. The only thing that changes is that x potentially acquires new parents through path compression, and each new parent's rank is bigger than the previous one's. So, the gap in the ranks only goes up, and therefore the delta value of this object x can only go up over time. One final comment. Recall that in the Hopcroft-Ullman analysis, the way we defined rank blocks ensured that there weren't too many of them; there were only log star many. Similarly here, the way we're defining the statistic delta guarantees that there are not too many distinct delta values that objects can take on. We've already argued that delta is an integer that's at least zero, and we can also bound it from above, at least for any object x that has rank at least 2, which will be good enough for our purposes. If an object x has rank at least 2, its value of delta cannot be any larger than the inverse Ackermann function of n, that is, alpha of n. And this is really just by the way we defined the inverse Ackermann function alpha: we defined it as the smallest value of k such that applying a sub k to 2 catapults you beyond n. And since ranks of objects are bounded above by n, in fact they're even bounded above by log n, but in particular by n, you will never see a delta value bigger than alpha of n for any object x that has rank at least 2. So, that completes the definition of the statistic delta of x. And, once again, the point of this statistic is to quantify progress through rank space as we traverse a parent pointer from an object. In the Hopcroft-Ullman analysis, we used rank blocks for that purpose; here, we're using the statistic delta of x.
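To make the bound delta of x at most alpha of n concrete, here is a minimal sketch of alpha as it was defined in the previous video: the smallest k such that a sub k applied to 2 is at least n. The helper ackermann_capped is repeated from the sketch above so this snippet runs on its own, and the recursion it uses for a sub k is the same assumption as before.

```python
# Sketch of the inverse Ackermann function: alpha(n) is the smallest k
# with A_k(2) >= n. The helper is repeated from the previous sketch.

def ackermann_capped(k, r, cap):
    """A_k(r), short-circuited to cap + 1 once the value exceeds cap."""
    if k == 0:
        return r + 1
    result = r
    for _ in range(r):
        result = ackermann_capped(k - 1, result, cap)
        if result > cap:
            return cap + 1
    return result

def inverse_ackermann(n):
    """Smallest k such that A_k(2) >= n."""
    k = 0
    while ackermann_capped(k, 2, n) < n:
        k += 1
    return k

# A_0(2) = 3, A_1(2) = 4, A_2(2) = 8, A_3(2) = 2048, and A_4(2) is already
# astronomical, so alpha(n) is at most 4 for any input you could ever store.
assert inverse_ackermann(4) == 1
assert inverse_ackermann(100) == 3
assert inverse_ackermann(10**5) == 4
```

Since an object's rank is at most log base 2 of n, and in particular at most n, an object of rank at least 2 can never have a delta value exceeding alpha of n, exactly as claimed above.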
Now, we have to redefine good and bad objects to reflect this different way of measuring progress, of measuring gaps between the ranks of objects and the ranks of their parents. So, that's what I'm going to provide for you on this slide: the definition of good and bad nodes. This definition will play exactly the same role as it did in the Hopcroft-Ullman analysis. For the good nodes, we'll be able to get a simple per-operation bound on how many visits we make to them. And then we'll be stuck with, and this will be the main part of the proof, a global analysis arguing that the total number of visits to bad nodes over a sequence of operations cannot be too big. The bad objects will be those that meet all of the following four criteria; if an object fails to meet any of them, it will be called good. The first two criteria will be familiar from the Hopcroft-Ullman analysis. First, roots get a pass, and direct descendants of roots also get a pass. So, to be bad, you must be neither a root nor the child of a root. What's the motivation for these two criteria? It's exactly the same as it was in the Hopcroft-Ullman analysis: it ensures that bad nodes, after they're visited on a find, get their parent pointer updated. Remember, when you do path compression after a find from an object x all the way up to its corresponding root, everybody's parent pointer gets updated except for the root and except for the child of the root; those two nodes keep the same parent pointer. So, to make sure we always make progress at bad nodes, we exclude the root and the direct descendants of the root from being bad, and we account for them separately. The third criterion is not conceptually important; it's just for technical convenience, sort of an artifact of the way I've defined the inverse Ackermann function. We're also going to give nodes of rank zero or one a free pass, so to be bad, an object has to have rank at least 2. The final criterion is really the ingenious one, and it's the one that ensures that when you do path compression, an object's parent's rank typically increases extremely rapidly, not just by 1. This criterion says that for an object x to be bad, it has to have some ancestor y, that is, an object you reach by traversing parent pointers from x, with exactly the same value of delta. So, if an object x has delta of x equal to 2, for it to be bad there has to be some ancestor object y that also has a delta value equal to 2. And again, if x fails to meet any of these four conditions, then x gets a pass; x is called a good object. Now, I said that the role of this definition is to split the work done into two groups, and that we'll be able to easily bound, even on a per-operation basis, the number of visits to good nodes. The next quiz asks you to think about that carefully. Alright, so the correct answer is B: it's theta of the inverse Ackermann function of n. And this is really one of the points of how we defined good and bad nodes: we guaranteed that there cannot be too many good nodes on any single find operation. To see why this is true, let's count up the number of nodes along a path that can fail to meet each of the four criteria. The first criterion gives roots a free pass, so there's at most one node that's good because it's a root. Similarly, on a given path there's at most one object that's a direct descendant of the root. Because ranks strictly increase as you follow parent pointers, there can be at most one object of rank zero and at most one object of rank one on the path, so at most two objects get a free pass because of the third criterion. Now for the interesting case: nodes that are good because they fail to meet the fourth criterion. How many can there be? Let's focus on a particular value of delta. Say we look at all of the objects along the path that have delta of x equal to 3. There may well be lots of objects on this path with delta equal to 3; maybe there are 17 of them. Call the highest such object, the one closest to the root, z. Of these 17 objects with delta equal to 3, z is the only one that can be good; the other 16 are going to be bad. Why? Well, any such object other than z has an ancestor, namely z, with exactly the same delta value. So those 16 objects do meet the fourth criterion, and they will be classified as bad. Summarizing: for each of the alpha of n possible values of delta, only the uppermost object, the object closest to the root with that particular delta value, can possibly be classified as good.
So, that bounds by alpha of n the number of objects on the path that are good because they fail to meet the fourth criterion. Adding up the four cases, there are at most alpha of n plus 4 good nodes on any path, which gives the theta of alpha of n bound on the number of good nodes visited by a single find operation.
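To make the counting concrete, here is a small sketch of the four criteria applied to a single find path. The encoding (parallel lists of ranks and delta values, ordered from the starting object up to and including the root, with None for the root's undefined delta) and the name count_good are mine for illustration, not from the lecture.

```python
# Count the good nodes on one find path, using the four criteria:
# an object is bad only if it is not the root, not a child of the root,
# has rank at least 2, and has a proper ancestor with the same delta value.

def count_good(ranks, deltas):
    n_objects = len(ranks)
    good = 0
    for i in range(n_objects):
        is_root = (i == n_objects - 1)
        is_child_of_root = (i == n_objects - 2)
        has_low_rank = ranks[i] <= 1
        # Criterion 4: some proper, non-root ancestor on this path
        # (an object strictly closer to the root) has the same delta value.
        same_delta_ancestor = any(
            deltas[j] == deltas[i] for j in range(i + 1, n_objects - 1)
        )
        # Failing any one of the four criteria makes the object good.
        if is_root or is_child_of_root or has_low_rank or not same_delta_ancestor:
            good += 1
    return good

# Example path: ranks strictly increase toward the root, and each delta
# value was computed from consecutive ranks via the definition
# (for instance, rank 2 with parent rank 4 gives delta 1, since
# A_1(2) = 4 <= 4 but A_2(2) = 8 > 4).
ranks  = [2, 4, 8, 9, 2048, 2100]
deltas = [1, 1, 0, 1, 0, None]
print(count_good(ranks, deltas))   # prints 3
```

In the example, the three good nodes are the root, the child of the root (which is also the topmost object with delta 0), and the topmost object with delta 1; every other object has a same-delta ancestor and is bad, exactly the pattern the argument above describes, giving at most 1 + 1 + 2 + alpha of n good nodes on any find path.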