In this video, we're going to prove our first performance guarantee on the union-find data structure with path compression; it's the one first established by Hopcroft and Ullman. Let me briefly remind you of the guarantee. Consider a union-find data structure that uses lazy unions, union by rank, and the path compression optimization. Consider an arbitrary sequence of, say, m union and find operations. The guarantee is that over the course of this sequence of operations, the total work you do is at most m, the number of operations, times the glacially growing function log* of n. Remember, log* of n is defined as the number of times you have to apply the logarithm function to n before you get a result which is at most one. Recall that for 2 raised to the 65536, a totally absurd number, log* of that number is merely 5. Now, the theorem is true no matter what m is, no matter how many operations you do, whether you do very few or a whole lot of them. I'm going to focus on the interesting case where m is at least asymptotically as big as n, that is, m is Omega(n). I'll let you think through, in the privacy of your own home, why the arguments we're about to see imply the theorem no matter what m is. One final comment about this guarantee before we turn to its proof. Notice what the theorem is not saying. The theorem is not saying that every single one of these find and union operations runs in O(log* n) time. That's because that statement, which would be stronger, is in general false. There will, in general, be operations that take more than log* n time. On the other hand, we know no operation is going to be slower than logarithmic. Even when we didn't have path compression, no operation took worse than O(log n) time, and we're not going to do any worse with path compression. So the worst-case time per operation is O(log n), and some operations may indeed run that slowly.
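To make the log* function concrete, here is a small sketch (not from the video) that computes it by repeatedly applying the base-2 logarithm until the result drops to at most one:

```python
import math

def log_star(n):
    """Number of times log2 must be applied to n before the result is <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

# log* grows glacially: for 2^65536, the iterates are
# 2^65536 -> 65536 -> 16 -> 4 -> 2 -> 1, so log* is only 5.
```

For example, `log_star(16)` is 3 (16 → 4 → 2 → 1), and even `log_star(2**65536)` is just 5.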
But over a sequence of m operations, the average amount of work we do on a per-operation basis is only O(log* n). This is exactly the sort of so-called "amortized analysis" that we did in our very first union-find implementation, with eager unions. There we agreed that particular unions might take linear time, but over a sequence of unions, we spend only logarithmic time per union on average. The same thing is happening here, except that we're getting a radically better average running time bound, log* n, over the sequence. So, before I launch into the proof details, let's just spend one slide talking about the proof plan, and in particular about the intuition behind the performance that we're trying to make precise, that we're trying to encode in a mathematical way, in the proof that's about to follow. Well, if we're hoping to prove a bound better than the O(log n) we were stuck with without path compression, it had better be the case that installing all these shortcuts really does qualitatively speed up finds. And on some level it's sort of clear that things have to be sped up: we're replacing an old chain of pointers with a single pointer, so you can only go faster. But how do we keep track of this progress? How do we take that intuition and make it into a rigorous guarantee? So here's the idea. Let's focus on an object x which is no longer a root. At this point its parent is somebody other than itself. And one thing I want you to remember from our union-by-rank analysis is that once an object is not a root, its rank is frozen forever. Now, we proved that in the context where there was no path compression, but we manipulate ranks in exactly the same way when there is path compression: if you're an object and you're no longer a root, there is no way your rank will ever change again. Now what we're certainly hoping is true is that finds that originate at, say, this object x are running fast.
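To keep the objects of the analysis concrete, here is a minimal sketch (my own, not code from the video) of union-find with lazy unions, union by rank, and path compression. Note how `rank` is only ever updated at a root, so once an object acquires a parent its rank is frozen, exactly as the analysis requires:

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))  # each object starts as its own root
        self.rank = [0] * n           # all ranks start at 0

    def find(self, x):
        # first pass: walk up to the root
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # second pass (path compression): point every visited
        # object directly at the root
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        # union by rank: install the shallower root under the deeper one
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1  # only a root's rank ever changes
```

For example, after `union(0, 1)`, `union(2, 3)`, `union(0, 2)` on four objects, `find(1)` and `find(3)` return the same root.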
Not only that, but they should even be getting faster and faster as time goes on, because we're doing more and more path compressions. So here's the big idea. The way we're going to reason about the worst-case running time of the find operation, or equivalently, the longest sequence of parent pointers we might have to traverse to get to a root, is to think about the sequence of ranks that we observe as we traverse these parent pointers from an object x up to the root. Let me give you an example. What would be the worst-case situation? Well, suppose we have a data structure where the maximum rank is something like 100. The worst-case sequence of ranks, the longest sequence we would ever see, would be if we start a find operation at an object with rank 0; we traverse its parent pointer and get to its parent, which has rank 1; we traverse its parent pointer to a node of rank 2, then 3, then 4, and so on, all the way up to 100. Now remember, ranks have to strictly increase whenever we traverse a parent pointer. As we discussed, that was true with or without path compression. So, in this worst-case situation, to go from rank 0 to rank 100, you have to traverse 100 pointers. That would be a bummer. Wouldn't it be nice if, every time we traversed a parent pointer, the rank increased not just by 1 but by a much bigger number? For example, if we went from 0 to 10 to 20 to 30 to 40, and so on, that would guarantee that we'd get to the max-rank node, at rank 100, in only 10 steps. So again, the bottom line is: a better, larger lower bound on the gap in rank between an object and its parent implies more rapid progress through the possible ranks of the nodes that we can see, and that translates to faster finds, to fewer parent pointer traversals. So keep that idea in mind: big gaps between ranks imply rapid progress.
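The arithmetic here is simply that the number of parent-pointer traversals equals the number of rank increases along the path, so bigger guaranteed rank jumps mean fewer hops. A tiny sketch of the two scenarios just described:

```python
def hops(rank_sequence):
    # one parent-pointer traversal per consecutive pair of ranks observed
    return len(rank_sequence) - 1

worst_case = list(range(0, 101))      # ranks 0, 1, 2, ..., 100
big_gaps = list(range(0, 101, 10))    # ranks 0, 10, 20, ..., 100

assert hops(worst_case) == 100  # rank +1 per hop: 100 traversals
assert hops(big_gaps) == 10     # rank +10 per hop: only 10 traversals
```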
I want to propose, as a progress measure for a given non-root object x, the gap between x's rank, which again, remember, is frozen forevermore, and the rank of its current parent. This progress measure is a good one for two reasons. First, as we just discussed, if you have a handle on this gap, if you can lower bound it, then that gives you an upper bound on the search time. Second, this gap allows us to quantify the benefit of path compression, of installing these shortcuts. Specifically, whenever you install a new shortcut, you rewire an object's parent pointer to point higher up in the tree; its new parent has rank strictly bigger than its old one. That means this gap only gets bigger. Summarizing: path compression improves this progress measure. That is, if an object x previously had a parent p and then has its parent pointer rewired to a different node p', the rank of p' is bigger than the rank of p. And just to make sure this is crystal clear, let's draw a couple of cartoon examples. So first, just in the abstract, think about an object x. Suppose it has some parent p, and suppose the root of the tree is some p', an ancestor strictly further up in the tree. Remember, ranks always increase as you go up the tree, as you follow the parent pointers. So that means p's rank is strictly bigger than x's, and p's rank, this is the important thing, is strictly smaller than p''s. So, when you rewire x's parent pointer to point no longer to p but rather to p', x acquires a new parent, and its new parent is an ancestor of its old one; ergo, it must have strictly bigger rank. For that reason, the gap between x's rank, which is fixed forevermore, and the rank of its new parent is bigger than the gap between its rank and the rank of its old parent. You can also see this effect in action with the running seven-object example we were using in the last video.
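The cartoon example above can be sketched in a few lines (with hypothetical ranks chosen for illustration, not the video's numbers): rewiring x's parent from p to an ancestor p' of p strictly increases x's rank gap.

```python
rank = {'x': 0, 'p': 3, 'p_prime': 7}  # ranks strictly increase up the tree
parent = {'x': 'p', 'p': 'p_prime'}

gap_before = rank[parent['x']] - rank['x']  # 3 - 0 = 3
parent['x'] = 'p_prime'                     # path compression rewires x
gap_after = rank[parent['x']] - rank['x']   # 7 - 0 = 7

assert gap_after > gap_before  # the progress measure only improves
```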
So I've shown that example, that tree, in pink, both before and after path compression. Before path compression, I defined the ranks just as with union by rank, so that the rank of each node equals the length of the longest path from a leaf to that node. And of course we don't change the ranks when we apply path compression. And what do we observe? Exactly two of the objects had their parent pointers rewired. Namely, objects 1 and 4 had their parent pointers rewired to point directly to 7. As a consequence, the gap in rank between object 1 and its parent has leaped from merely 1, the difference between its rank and the rank of 4, up to 3, the difference between its rank and that of its new parent, 7. Similarly, object 4's rank gap has jumped from 1 to 2: its rank is only 1 less than that of its old parent, 6, but it's 2 less than that of its new parent, 7. So, believe it or not, we now actually have the two crucial building blocks for the Hopcroft-Ullman analysis. Building block number one is the rank lemma that we discussed a couple of videos ago: the fact that, with or without path compression, you can't have too many objects with a given rank r; there are at most n/2^r of them. The second building block is the one we just covered: every time a parent pointer is updated under path compression, the gap between the rank of that object and the rank of its new parent has to grow. The rest of the proof is just an optimal exploitation of those two building blocks. Let me now show you the details.
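As a quick empirical sanity check of the rank lemma, here is a self-contained sketch (a hypothetical implementation, not course code) that runs random unions and then verifies that the number of objects with rank r never exceeds n/2^r:

```python
import random
from collections import Counter

class UF:
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:  # union by rank
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1

n = 256
uf = UF(n)
random.seed(0)
for _ in range(1000):
    uf.union(random.randrange(n), random.randrange(n))

# rank lemma: at most n / 2^r objects have rank r
counts = Counter(uf.rank)
assert all(counts[r] <= n / 2**r for r in counts)
```

This doesn't prove the lemma, of course, but it illustrates exactly the quantity the upcoming proof will exploit.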