In this video we'll provide, for the first time, concrete evidence that the lazy union approach to the Union-Find data structure is viable. Specifically, we'll prove that the worst-case running time of both the find and the union operations is logarithmic in n, the number of objects stored in the data structure. We're going to do even better later, once we introduce a second optimization known as path compression. But an important stepping stone is to understand why union by rank alone already gets us to a reasonable running time.
So, a quick review of the lazy union approach to implementing the Union-Find data structure. With each node we're going to maintain a parent pointer. And it's no longer the case that we insist the parent pointer point directly to the leader of a group. Rather, we just insist that the collection of parent pointers induces a collection of directed trees. The root of each tree, that is, the node which is its own parent, we're going to define as the leader of that group. So, given any old object x, how do you implement find? How do you figure out what the leader is? Well, you just traverse parent pointers up until you get to the root of that particular group. So for this implementation of the find operation, the worst-case running time is governed by the longest path of parent pointers that you ever have to traverse to get from an object to some root.
So the way we're going to quantify that is using these ranks. This is, again, a field that we maintain for each object. And for now (this will break down later, but for now, before we have path compression) we're going to maintain the invariant that the rank of an object x is exactly the largest number of pointers you have to traverse from some leaf to get to x. As a consequence, the biggest rank of any object is the length of the longest path from any leaf to any root. And that's going to be an upper bound on the worst-case running time of the find operation.
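The find operation just reviewed can be sketched in a few lines. This is only a minimal illustration, assuming objects are numbered 0 to n-1 and parent pointers live in a plain list; the function name `find` and the example array are made up for this sketch, not taken from the lecture.

```python
def find(parent, x):
    """Follow parent pointers from x until reaching a root (a node that is its own parent)."""
    while parent[x] != x:
        x = parent[x]
    return x

# Hypothetical example with objects 0..4:
# 0 -> 1 -> 2 is a chain ending at root 2, object 3 points directly at 2,
# and object 4 is a root all by itself.
parent = [1, 2, 2, 2, 4]

leader_of_0 = find(parent, 0)  # walks 0 -> 1 -> 2
leader_of_3 = find(parent, 3)  # walks 3 -> 2
leader_of_4 = find(parent, 4)  # 4 is already a root
```

The work done by one call is the number of pointers traversed, which is why the longest leaf-to-root path governs the worst case.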
So let's move on to the union operation. Here, given 2 objects x and y, you need to fuse their 2 trees, their 2 groups. So you find the roots of the 2 trees: you call find on x, you call find on y. That gives you their 2 respective roots, and now you install one as a new child of the other. Now, we saw in a quiz in the last video that if you're not careful about which root you install as a child under the other, you can wind up with these long chains, and be stuck with a linear worst-case time for both find and union. So instead we have this union by rank optimization, which says: we want to keep our trees from getting scraggly. And the way we're going to do that is, when we have a shallow tree and a deep tree, we install the root of the shallow tree as a child under the root of the deep one. That prevents the tree from getting even deeper. Now, there is a situation where the two trees have exactly the same depth, that is, where the two roots have exactly the same rank. In that case we just proceed arbitrarily. But when we merge two trees whose roots both had a common rank r, it's important that in the new tree, the rank of the root has gone up to r + 1. So we need to increment the rank of the new root to reflect that increase.
So that's where we've already been. Where are we going next? Well, the plan for this video is to show that, with the union by rank optimization, the maximum rank of any node is always bounded above by log base 2 of n, where n is the number of objects in the data structure. Now, we just said that the worst-case running time of find is governed by the maximum rank. So a logarithmic maximum rank means a logarithmic running time for find. That also carries over to the union operation. Remember, union is just 2 finds plus constant work to rewire 1 pointer, so that's going to give us a logarithmic time bound on both operations. So let's see why that's true. Let's begin the analysis with a few simple but useful properties that follow immediately from our invariant.
They also follow from the way that we change the ranks of objects as we do finds and unions. So for the first simple property, focus on your favorite object x, and watch this object's rank change over the lifetime of the data structure, as we do finds and unions. How can it change? Well, when we do a find we don't change anything; all the ranks stay the same. When we do a union, all the ranks stay the same, except there is 1 case in the union where the rank of a single node gets bumped up by 1. So ranks only go up over time, for all of the objects. That's property one.
The second property is again pretty much trivial, but really, really useful. What is the situation in which somebody's rank gets bumped up by 1? We're taking the union of two trees whose roots have a common rank. And then whichever of the two roots we pick to be the root of the new, bigger tree, that's the object whose rank gets bumped up by 1. That is, the new root of this fused tree. So in particular, the only type of object that can ever get a rank increase is a root. If you're not a root, your rank will not go up. Furthermore, once you're not a root in this data structure, you will never be a root again in the future. There is no process by which you shed your parent. Once you have a parent other than yourself, you will always have exactly that parent. Putting those two observations together, we find that as soon as an object x becomes a non-root, that is, as soon as it has a parent other than itself, its rank is frozen forevermore.
The third and final simple property follows from a formula we mentioned in the last video about computing ranks. Remember, the rank of a node in general is going to be one more than the maximum rank of any of its children. So if you have a child, and the longest path from a leaf to that child takes 5 hops, then there is a path from that leaf to you that takes 6 hops.
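The union-by-rank rule, including the single case in which a rank gets bumped, might be sketched as follows. This is only an illustration under assumptions: the class name `UnionFind`, the list-based storage, and the choice to make y's root the new root on ties are all choices made for this sketch (the lecture says ties can be broken arbitrarily).

```python
class UnionFind:
    """Lazy union with union by rank (no path compression)."""

    def __init__(self, n):
        self.parent = list(range(n))  # every object starts as its own root
        self.rank = [0] * n           # every object starts with rank 0

    def find(self, x):
        # Traverse parent pointers until reaching a root.
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, x, y):
        s1, s2 = self.find(x), self.find(y)
        if s1 == s2:
            return  # already in the same group; nothing changes
        if self.rank[s1] > self.rank[s2]:
            s1, s2 = s2, s1  # ensure s2 is the root of larger (or equal) rank
        tie = self.rank[s1] == self.rank[s2]
        self.parent[s1] = s2  # the shallower root becomes a child of the deeper one
        if tie:
            self.rank[s2] += 1  # the only case where any rank changes


uf = UnionFind(4)
uf.union(0, 1)  # tie at rank 0: root 1 gets rank 1
uf.union(2, 3)  # tie at rank 0: root 3 gets rank 1
uf.union(0, 2)  # roots 1 and 3 tie at rank 1: root 3 gets rank 2
final_ranks = list(uf.rank)
final_parents = list(uf.parent)
```

Note that only roots ever have their rank incremented, and a node that receives a parent never becomes a root again, matching the first two properties.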
As a consequence of that rank formula, as you go from a leaf up to the root you will see a strictly increasing sequence of ranks. The rank of a parent is always strictly more than the rank of all of its children. So that's it for the immediate properties.
Let's go to a property which is a little less immediate. This next lemma, which I'm going to call the rank lemma, is the best kind of lemma. On the one hand, it's just not that hard to prove. I'll give you a full proof in the following 2 slides. On the other hand, it's really powerful. It's going to play a crucial role in the analysis we're doing right now, a logarithmic running time bound with the union by rank optimization, and we'll keep using it as a workhorse once we introduce path compression and prove better bounds on the operations.
So what does the rank lemma say? Well, it controls the population size of objects that have a given rank. We want it to apply at all intermediate stages of our data structure, so we're going to consider an arbitrary sequence of unions. You can throw in some finds as well, I don't care. Finds don't change the data structure, so they're totally irrelevant. So think about a sequence of unions, a sequence of mergers. The claim is that for every non-negative integer r, the number of objects that have rank exactly r at this time is at most n / 2^r, where n is the total number of objects. So for example, if r is 0, it says that at most n objects have rank 0, which is a trivial statement because there are only n objects. But at any given time, the number of objects that have rank 1 is at most n over 2, the number of objects that have rank 2 is at most n over 4, and so on.
And if you think about it, if we succeed in proving the rank lemma, we're pretty much done showing the efficacy of the union by rank optimization. In particular, once you take r in this key lemma to be log base 2 of n, it says that there's at most 1 object that has rank log2(n).
And there can't be any objects that have rank strictly larger, because for any larger r the bound n / 2^r is less than 1. That is, this lemma implies that the maximum rank at all times is bounded above by log2(n). And remember, the maximum rank is the longest sequence of pointer traversals you ever need to get from a leaf to a root. And that means the most work we'll ever do in a find, and therefore in a union, is O(log n).
Okay, so I've now teased you with the consequences of the rank lemma, assuming that it's true. But why is it true? Let's turn to the proof. I'm going to break the proof down into two claims, claim 1 and claim 2. We'll see that the two claims easily imply the rank lemma.
Claim 1 asks you to consider 2 distinct objects, x and y, that have exactly the same rank r. And the claim asserts that the subtrees of these 2 objects have to be disjoint. They have no objects in common. And here, by the subtree of an object, I just mean the objects that can reach this one by following a sequence of parent pointers, the object itself included. So the subtree at x is the set of objects from which you can reach x. The subtree at y is the set of objects from which you can reach y.
The second claim is that if you look at any object x that has rank r, and you look at its subtree, that is, at the objects that can reach x by following parent pointers, there have to be a lot of them. There have to be at least 2^r objects in its subtree, where r is that object's rank.
Notice that if we prove claim 1 and claim 2, then the rank lemma follows easily. Why? Well, fix a value for r. 2, 10, I don't care what. Look at all the nodes that have this rank r. By claim 2, each of them has at least 2^r objects that can reach it. And by claim 1, these have to be disjoint sets of objects. Well, there are only n objects to go around, and if each of these disjoint sets has at least 2^r of them, there are going to be at most n / 2^r such groups.
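The counting argument can be written out as one chain of inequalities. The notation here is introduced only for this sketch and is not from the lecture: S_r denotes the set of objects of rank r at the current time, and T(x) denotes the subtree of x.

```latex
\[
  n \;\ge\; \Bigl|\,\bigcup_{x \in S_r} T(x)\Bigr|
    \;=\; \sum_{x \in S_r} |T(x)|
    \;\ge\; |S_r| \cdot 2^{r}
  \qquad\Longrightarrow\qquad
  |S_r| \;\le\; \frac{n}{2^{r}} .
\]
```

The equality uses claim 1 (the subtrees are pairwise disjoint) and the final inequality uses claim 2 (each subtree has at least 2^r objects).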
That is, at most n / 2^r objects have this rank r. So we've reduced the proof of the rank lemma to proving claims 1 and 2. I will do them in turn.
For claim 1, let me go via the contrapositive. That is, I will assume that the conclusion is false, and I will show that the hypothesis must then also be false. So let's assume that we have 2 distinct objects x and y whose subtrees are not disjoint. That is, there exists an object z from which you can reach x, and from the same object z you can also reach y, by a sequence of parent pointers. Well, now let's use the fact that we're dealing with a directed tree. If you start at an object z, there's only one parent pointer to follow each time. So all of the objects reachable from z form a directed path leading up to the root of z's group. So the only way for both x and y to be reachable from z is for both of them to be on this path. And if they're both on this path, then one has to be an ancestor of the other. So now we're going to use the third of the simple properties that we observed: on every path to the root, ranks strictly go up each time. So whichever of x or y is an ancestor of the other has strictly higher rank. Therefore x and y do not have the same rank. That completes the proof of claim 1.
So let's move on to claim 2. Remember, claim 2 asserts that an object x of rank r necessarily has 2^r objects or more in its subtree. That's how many objects can reach x by following parent pointers. For this proof we're going to proceed by induction on the number of operations. And again, remember that find operations have no effect on the data structure, so we can ignore them. So it's just induction on the number of union operations that have happened. For the base case, before we've done any unions whatsoever, we're doing just fine. Every object has a rank of 0, and the subtree size of every object is equal to 1.
Namely, just the object itself, and 1 is also known as 2^0.
Now, for the inductive step, there's an easy case and a hard case. The easy case is where nobody's rank changes, where we do a union and everybody's rank stays exactly the same. In this case, we're golden. Why? Well, when you do a union, subtree sizes only go up. There are only more pointers, so there are only more objects that can reach any given object. So subtree sizes go up, ranks stay the same. If we had this inequality of subtree size being at least 2^r before, we have it equally well now.
So the interesting case is when somebody's rank actually changes. How can that happen? Well, it happens in only one particular way, which we understand well. We're looking at a union operation between objects x and y. Suppose the roots of these objects are s1 and s2 respectively. It's only when these two roots have the same rank, let's call that common rank r, that somebody's rank gets changed. In particular, we're going to break ties as we did in the previous video: s2 will be the root of the fused tree, and s1 will become a child of it. And in that case, s2's rank gets bumped up by 1. It goes from r to r + 1.
Now notice, in this case, we do have something to prove. What are we trying to establish? We're trying to establish that every subtree size is big, as a function of the rank. So s2's rank has gone up, and therefore the lower bound, the bar that we have to meet for the subtree size, has also gone up. It's doubled. So in this case we actually have to scrutinize s2's new subtree. What is its new subtree? Well, it's really just composed of its old subtree, plus s1 and all of s1's subtree, which it inherits. So we know that s2's new subtree size, the number of objects that can reach it, is just its old subtree size plus the old subtree size of s1. But then we're in good shape, because we have the inductive hypothesis to rescue us.
So remember, before this union, by the inductive hypothesis, every object with a given rank, say r, had at least 2^r objects in its subtree. Now, s1 and s2 both had rank r before this union, so both of their subtree sizes were at least 2^r. So s2's new subtree size is bounded below by 2^r + 2^r, a quantity also known as 2^(r+1). Quite conveniently, r + 1 is s2's new rank. So even with s2's new, bigger rank, its subtree size is still meeting the lower bound, meeting the target of 2 raised to the new rank, 2^(r+1).
So that completes the inductive step, and therefore it completes the proof of claim 2, that objects of rank r have subtree sizes at least 2^r. That in turn completes the proof of the rank lemma, that for every rank r, there are at most n / 2^r nodes of rank r. And remember, the rank lemma implies that the maximum rank at all times is bounded by log base 2 of n, as long as you're using union by rank. And that implies that with this first optimization, the worst-case running times of union and find are both O(log n), where n is the number of objects in the data structure.
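As a sanity check on the results above, here is a small experiment that performs unions with union by rank and verifies claim 2, the rank lemma, and the log2(n) bound after every union. It is a sketch under assumptions, not part of the lecture: the helper names (`subtree_sizes`, `invariants_hold`), the random sequence of unions, and the list-based representation are all illustrative choices. The second half merges equal-rank trees in a tournament pattern to show that the bound log2(n) can actually be reached.

```python
import random
from collections import Counter

def find(parent, x):
    while parent[x] != x:
        x = parent[x]
    return x

def union(parent, rank, x, y):
    s1, s2 = find(parent, x), find(parent, y)
    if s1 == s2:
        return
    if rank[s1] > rank[s2]:
        s1, s2 = s2, s1          # s2 is the root of larger (or equal) rank
    tie = rank[s1] == rank[s2]
    parent[s1] = s2
    if tie:
        rank[s2] += 1

def subtree_sizes(parent):
    """For each node, count the objects that can reach it (itself included)."""
    sizes = [0] * len(parent)
    for x in range(len(parent)):
        node = x
        sizes[node] += 1
        while parent[node] != node:
            node = parent[node]
            sizes[node] += 1
    return sizes

def invariants_hold(parent, rank):
    n = len(parent)
    sizes = subtree_sizes(parent)
    claim2 = all(sizes[x] >= 2 ** rank[x] for x in range(n))
    # Rank lemma: (number of objects of rank r) * 2^r <= n for every r.
    lemma = all(count * 2 ** r <= n for r, count in Counter(rank).items())
    max_rank_ok = 2 ** max(rank) <= n   # equivalent to max rank <= log2(n)
    return claim2 and lemma and max_rank_ok

# Experiment 1: random unions on 64 objects, checking after every union.
n = 64
parent, rank = list(range(n)), [0] * n
rng = random.Random(0)
all_ok = True
for _ in range(200):
    union(parent, rank, rng.randrange(n), rng.randrange(n))
    all_ok = all_ok and invariants_hold(parent, rank)

# Experiment 2: tournament-style merging of 8 objects, always fusing
# two trees whose roots have equal rank.
parent8, rank8 = list(range(8)), [0] * 8
for step in (1, 2, 4):
    for i in range(0, 8, 2 * step):
        union(parent8, rank8, i, i + step)
tight_max_rank = max(rank8)   # log2(8) = 3
```

An experiment like this is no substitute for the proof, but it is a quick way to catch an implementation that forgets the rank increment or installs the deeper tree under the shallower one.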