1 00:00:00,012 --> 00:00:04,197 In this video we'll provide for the first time concrete evidence that the lazy 2 00:00:04,197 --> 00:00:07,397 union approach to the Union-Find data structure is viable. 3 00:00:07,397 --> 00:00:11,666 Specifically, we'll prove that the worst case running time of both the find and 4 00:00:11,666 --> 00:00:16,019 the union operation is logarithmic in n, the number of objects stored in the data 5 00:00:16,019 --> 00:00:18,981 structure. We are going to do even better later 6 00:00:18,981 --> 00:00:22,302 once, we introduce a second optimization known as path impression. 7 00:00:22,302 --> 00:00:25,963 But an important stepping stone is to understand just is why, just union by 8 00:00:25,963 --> 00:00:29,102 rank, already gets us to a reasonable algorithmic run-time. 9 00:00:29,102 --> 00:00:33,420 So a quick review of the lazy union approach to implementing the union fine 10 00:00:33,420 --> 00:00:36,166 data structure. So, with each note we're going to 11 00:00:36,166 --> 00:00:39,755 maintain a parent pointer. And it's no longer the case that we 12 00:00:39,755 --> 00:00:43,600 insist the parent pointer point directly to the leader of a group. 13 00:00:43,600 --> 00:00:48,142 Rather we just insist that the collection Collection of parent pointers, induces a 14 00:00:48,142 --> 00:00:51,616 collection of directed trees. The root of each tree, that is, the node 15 00:00:51,616 --> 00:00:55,196 which is it's own parent, we're going to define as the leader of that group. 16 00:00:55,196 --> 00:00:59,060 So, given any old object, x, how do you implement find? How do you, figure out 17 00:00:59,060 --> 00:01:03,036 what the leader vertex is? Well, you just traverse parent pointers, up until you 18 00:01:03,036 --> 00:01:07,311 get to the root of that particular group. So for this implementation of the Find 19 00:01:07,311 --> 00:01:11,220 operation, the worst case running time is just going to be the longest path of 20 00:01:11,220 --> 00:01:15,211 parent pointers that you ever have to traverse, to get from an object to some 21 00:01:15,211 --> 00:01:17,431 root. So the way we're going to quantify that 22 00:01:17,431 --> 00:01:20,289 is using these ranks. So this is again, a field that we 23 00:01:20,289 --> 00:01:23,799 maintain for each object. And for now, this will break down later, 24 00:01:23,799 --> 00:01:28,658 but for now before we have path -pression, we're going to maintain the 25 00:01:28,658 --> 00:01:34,049 invariant that the rank of an object x is exactly the largest number of pointers 26 00:01:34,049 --> 00:01:37,439 you have to traverse, from some leaf, to get to x. 27 00:01:37,439 --> 00:01:42,956 As a consequence, the biggest rank of any object is the longest path from any leaf 28 00:01:42,956 --> 00:01:46,296 to any root. And that's going to be an upper bound on 29 00:01:46,296 --> 00:01:49,942 the worst case running time of the find op- Operation. 30 00:01:49,942 --> 00:01:54,751 So let's move on to the union operation. So here, given 2 objects, x and y, you 31 00:01:54,751 --> 00:01:59,170 need to fuse their 2 trees, their 2 groups, so you find the roots of the 2 32 00:01:59,170 --> 00:02:02,315 trees, so you call a find on x, you call a find on y. 33 00:02:02,315 --> 00:02:07,232 That gives you their 2 respective roots, and now you install 1 as a new child. 34 00:02:07,232 --> 00:02:09,698 Of the other. Now we saw in a quiz in the last video, 35 00:02:09,698 --> 00:02:13,362 if you're not careful about which root you install as a child under the other, 36 00:02:13,362 --> 00:02:16,949 you can wind up with these long chains. And be stuck with a linear worse case 37 00:02:16,949 --> 00:02:20,078 time for both find and union. So instead we have this union by rank 38 00:02:20,078 --> 00:02:22,571 optimization. Which says, well, we want to keep our 39 00:02:22,571 --> 00:02:25,752 trees from getting scraggly. And the way we're going to do that is. 40 00:02:25,752 --> 00:02:29,889 When we have a shallow tree and a deep tree we make the shallow tree shall under 41 00:02:29,889 --> 00:02:33,827 the root of the deep one, that prevents the tree from getting even deeper. 42 00:02:33,827 --> 00:02:37,914 Now there is a situation where the two trees have exactly the same depths that 43 00:02:37,914 --> 00:02:42,051 is where the two roots have exactly the same rank, in that case we just proceed 44 00:02:42,051 --> 00:02:44,795 arbitrarily. Then when we merge two trees that both 45 00:02:44,795 --> 00:02:48,891 had a common rank r, its important that in the new tree, the rank is gone up to 46 00:02:48,891 --> 00:02:50,964 r+1. So we need the update, we need the 47 00:02:50,964 --> 00:02:54,182 incremental rank of the new root to reflect that increase. 48 00:02:54,182 --> 00:02:59,017 So that's where we've already been. Where are we going to next? Well the plan 49 00:02:59,017 --> 00:03:03,407 for this video is to show that, with the union by rank, optimization. 50 00:03:03,407 --> 00:03:08,177 The maximum rank of any node, is always, bounded above, by log, base 2 (n). 51 00:03:08,177 --> 00:03:11,627 Where n is the number of objects in the data structure. 52 00:03:11,627 --> 00:03:16,792 Now we just said, the worst case running time of find, is governed, by the maximum 53 00:03:16,792 --> 00:03:19,862 rank. So the logarithmic maximum rank means 54 00:03:19,862 --> 00:03:23,027 logarithm run time of find. that also carries over to the union 55 00:03:23,027 --> 00:03:25,592 operation. Remember union is just 2 finds plus 56 00:03:25,592 --> 00:03:29,602 constant work to rewire 1 pointer, so that's going to give us algorithm time 57 00:03:29,602 --> 00:03:32,672 value on both operations. So let's see why that's true. 58 00:03:32,672 --> 00:03:37,938 So let's begin the analyses with a few simple, but useful properties that follow 59 00:03:37,938 --> 00:03:42,668 immediately from our invariant. From the way that we change the ranks of 60 00:03:42,668 --> 00:03:45,425 objects as we do finds and as we do unions. 61 00:03:45,425 --> 00:03:49,632 So the first simple property is focus on your favorite object. 62 00:03:49,632 --> 00:03:52,053 x. And, watch this objects rank change, over 63 00:03:52,053 --> 00:03:55,245 the course of the data structure, as we do finds and unions. 64 00:03:55,245 --> 00:03:59,258 How can it change? Well, when we do a find we don't change anything, all the 65 00:03:59,258 --> 00:04:02,359 ranks stay the same. When we do a union, all the ranks stay 66 00:04:02,359 --> 00:04:04,747 the same. Well, except there is 1 case in the 67 00:04:04,747 --> 00:04:08,901 union, where the rank, of a single node, gets bumped up by 1, gets increased. 68 00:04:08,901 --> 00:04:13,212 So ranks only go up, over time, for all of the Objects, that's property one. 69 00:04:13,212 --> 00:04:17,632 So the second property is again pretty much trivial, but really, really useful. 70 00:04:17,632 --> 00:04:21,547 So what is the situation in which somebody's rank gets bumped up by 1? 71 00:04:21,547 --> 00:04:25,197 We're going to take the union of two trees that have a common rank. 72 00:04:25,197 --> 00:04:29,487 And then whichever of the two roots that we pick to be the root of the new bigger 73 00:04:29,487 --> 00:04:32,997 tree That's the object whose rank gets bumped up by 1. 74 00:04:32,997 --> 00:04:37,231 So new roots of this fused tree. So in particular, the only type of 75 00:04:37,231 --> 00:04:40,513 objects that can ever get a rank increase is a root. 76 00:04:40,513 --> 00:04:43,592 If you're not a root, your rank will not go up. 77 00:04:43,592 --> 00:04:48,827 Furthermore, once you're not a root in this data structure, you will never be a 78 00:04:48,827 --> 00:04:53,050 root again in the future. There is no process by which, you shed 79 00:04:53,050 --> 00:04:56,112 your parent. Once you have a parent other than 80 00:04:56,112 --> 00:04:59,618 yourself, you will always have exactly that parent. 81 00:04:59,618 --> 00:05:04,782 Putting those two observations together we find that, as soon as an object X. 82 00:05:04,782 --> 00:05:09,667 Becomes a non root but as soon as it has a in parent other than itself it rank is 83 00:05:09,667 --> 00:05:14,522 frozen for the rest of time forever more. The third and final simple property 84 00:05:14,522 --> 00:05:19,352 follows from a formula we mentioned in the last video about computing ranks. 85 00:05:19,352 --> 00:05:24,352 So remember the rank of a node in general is going to be one more than the maximum 86 00:05:24,352 --> 00:05:28,557 rank of any of its children. So if you have a child and there is some 87 00:05:28,557 --> 00:05:31,742 path from a leaf to that child, it takes 5 hops. 88 00:05:31,742 --> 00:05:34,953 The path to you from that child is going to take 6 hops. 89 00:05:34,953 --> 00:05:39,586 As a consequence as you go from the leaf up to the root you will see a strictly 90 00:05:39,586 --> 00:05:43,797 increasing sequence of ranks. The rank of a parent is always strictly 91 00:05:43,797 --> 00:05:46,552 more than the rank of all of those children. 92 00:05:46,552 --> 00:05:48,840 So that's it for the immediate properties. 93 00:05:48,840 --> 00:05:51,808 Let's go to a property which is a little less immediate. 94 00:05:51,808 --> 00:05:55,771 But still this next lemma, which I'm going to call the rank lemma, it's the 95 00:05:55,771 --> 00:05:58,814 best kind of lemma. So on the one hand, it's just not that 96 00:05:58,814 --> 00:06:01,318 hard to prove. I'll give you a full proof in the 97 00:06:01,318 --> 00:06:04,732 following 2 slides. On the other hand, it's really powerful. 98 00:06:04,732 --> 00:06:08,787 It's going to play a crucial role in the analysis were doing right now. 99 00:06:08,787 --> 00:06:12,696 A logarithmic run time bound, with a union by rank optimization, and we'll 100 00:06:12,696 --> 00:06:16,972 keep using it again as a workforce, once we introduce path compression, and prove 101 00:06:16,972 --> 00:06:20,653 better bounds on the operations. So what's the rank limit say? Well it 102 00:06:20,653 --> 00:06:24,912 controls the population size of objects, that have a given (no period) Rank, so we 103 00:06:24,912 --> 00:06:28,387 want it to apply at all intermediate stages of our data structure, so we're 104 00:06:28,387 --> 00:06:30,637 going to consider an arbitrary sequence of unions. 105 00:06:30,637 --> 00:06:32,962 You can throw in some finds as well. I don't care. 106 00:06:32,962 --> 00:06:36,662 Finds don't change the data structure, so they're totally irrelevant, so think 107 00:06:36,662 --> 00:06:39,037 about a sequence of unions, a sequence of mergers. 108 00:06:39,037 --> 00:06:41,412 The claim is, for every non-negative integer, r. 109 00:06:41,412 --> 00:06:45,491 The number of objects that have rank exactly r at this time is at must n. 110 00:06:45,491 --> 00:06:48,311 the total number of objects divided by 2 to the r. 111 00:06:48,311 --> 00:06:52,440 So for example, if our rank is 0. It says that at must n objects have rank 112 00:06:52,440 --> 00:06:55,600 0, so it is a trivial statement because only n objects. 113 00:06:55,600 --> 00:06:59,616 But at any given time the number of objects that have rank 1 is at most n 114 00:06:59,616 --> 00:07:04,002 over 2, the number of objects that have rank 2 is at most no over 4 and so on. 115 00:07:04,002 --> 00:07:09,327 And if you think about it, if we succeed in proving the rank Lemma, we're pretty 116 00:07:09,327 --> 00:07:13,852 much done, showing the efficacy of the union by rank optimization. 117 00:07:13,852 --> 00:07:19,252 So in particular, once you take r, the, in this, key Lemma, to be log base 2 (n), 118 00:07:19,252 --> 00:07:24,197 it says that there's at most 1 object. That has rank log2(n). 119 00:07:24,197 --> 00:07:28,137 And there can't be any objects that have rank strictly larger. 120 00:07:28,137 --> 00:07:32,712 That is, this limit implies that the maximum rank at all times is bounded 121 00:07:32,712 --> 00:07:36,232 above by log2(n). And remember, the maximum rank is the 122 00:07:36,232 --> 00:07:41,327 longest path of pointers, traversals, you ever need to get from a leaf to a root. 123 00:07:41,327 --> 00:07:46,187 And that means the most amount of work we'll ever do in a find, and therefore, 124 00:07:46,187 --> 00:07:51,857 in a union is O (log n) Okay, so I've now teased you with the consequences of the 125 00:07:51,857 --> 00:07:57,967 rank lemma, assuming that it's true, but why is it true? Let's turn to the proof. 126 00:07:57,967 --> 00:08:03,382 I'm going to break the proof down into two claims, claim one and claim two. 127 00:08:03,382 --> 00:08:08,057 We'll see that the two claims easily imply the rank Rank Lemma. 128 00:08:08,057 --> 00:08:12,625 So claim 1 asks you to consider 2 objects, X and Y, that have exactly the 129 00:08:12,625 --> 00:08:16,024 same rank R. And the claim asserts that the sub-trees 130 00:08:16,024 --> 00:08:20,542 of these 2 objects have to be disjoint. They have no objects in common. 131 00:08:20,542 --> 00:08:25,369 And here by the sub-tree of an object, I just mean the other objects that can 132 00:08:25,369 --> 00:08:29,313 reach this one by following a sequence. Of, parent pointers. 133 00:08:29,313 --> 00:08:33,235 So that is the subtree at x, is the objects from which you can reach x. 134 00:08:33,235 --> 00:08:36,499 The subtree at y is the objects from which you can reach y. 135 00:08:36,499 --> 00:08:41,054 The second claim, is that, if you look at any object that has rank r, and you look 136 00:08:41,054 --> 00:08:45,435 at it's subtree, that is, if you look at the number of objects that can reach, 137 00:08:45,435 --> 00:08:49,245 this object x by following pointers, there have to be a lot of them. 138 00:08:49,245 --> 00:08:54,289 There have to be at least 2 raised to that objects rank Are, objects in it's 139 00:08:54,289 --> 00:09:01,317 subtree? Notice that, if we prove claim 1 and claim 2, then the Rank Lemma, follows 140 00:09:01,317 --> 00:09:04,618 easily. Why? Well, fix a value for, r. 141 00:09:04,618 --> 00:09:10,212 2, 10, I don't care what. Look at all the nodes that have this rank 142 00:09:10,212 --> 00:09:12,553 R. By claim 2, each of them has at least 2 143 00:09:12,553 --> 00:09:17,123 to the R objects that could reach them. And by claim 1, these have to be disjoint 144 00:09:17,123 --> 00:09:20,069 sets of objects. Well, there's only N objects to go 145 00:09:20,069 --> 00:09:24,261 around, and if each of these disjoint sets has at least 2 to the R of them, 146 00:09:24,261 --> 00:09:27,559 there are going to be at most N over 2 to the R such groups. 147 00:09:27,559 --> 00:09:31,232 That is at most N over 2 to the R nodes, objects with this rank R. 148 00:09:31,232 --> 00:09:36,580 So we've reduced the proof of the rank Lemma to proving claims 1 and 2, I will 149 00:09:36,580 --> 00:09:40,352 do them in turn. So for claim 1 let me go via the contra 150 00:09:40,352 --> 00:09:46,199 positive, that is, I will assume that the conclusion is false, and I will show that 151 00:09:46,199 --> 00:09:51,389 the hypothesis must then also be false. So, lets assume that we have 2 no, 152 00:09:51,389 --> 00:09:55,269 objects x and y, and their subtrees are not, disjoint. 153 00:09:55,269 --> 00:10:00,322 That is, there exists an object z, from which You can reach X and from the same 154 00:10:00,322 --> 00:10:03,672 object Z, you can also reach Y by a sequence parent pointers. 155 00:10:03,672 --> 00:10:08,057 Well now let's use the fact that we're dealing with the directed tree, right, so 156 00:10:08,057 --> 00:10:12,306 if you start with an object Z. There's only a unique parent point, or 2, 157 00:10:12,306 --> 00:10:15,799 follow each time. So that is, all of the objects reachable 158 00:10:15,799 --> 00:10:20,011 from z, they form a directed path, leading up to the root of z's group. 159 00:10:20,011 --> 00:10:24,562 So the only way for both x and y to be reachable from z, they have to both be on 160 00:10:24,562 --> 00:10:27,593 this path. If they're both on this path, then 1 has 161 00:10:27,593 --> 00:10:31,906 to be an ancestor of the other. So now we're going to use the third of 162 00:10:31,906 --> 00:10:37,247 our simple properties that we observed. That is, on every path to the root, ranks 163 00:10:37,247 --> 00:10:41,639 strictly go up, each time. So, whichever of x or y is an ancestor of 164 00:10:41,639 --> 00:10:46,834 the other, that has strictly higher rank. Therefore x and y do not have the same 165 00:10:46,834 --> 00:10:49,952 rank. That completes the proof, of claim 1. 166 00:10:49,952 --> 00:10:55,681 So lets move on to claim 2. Remember, claim 2 is search that, an 167 00:10:55,681 --> 00:11:02,292 object of rank r, necessarily has 2 ^ r objects, or more, in its subtree. 168 00:11:02,292 --> 00:11:09,521 That's how many objects can actually reach this object x, by following Parent 169 00:11:09,521 --> 00:11:12,106 pointers. So for this proof we're going to proceed 170 00:11:12,106 --> 00:11:16,527 by induction on the number of operations, and again remember fine operations have 171 00:11:16,527 --> 00:11:19,507 no effect on the data structure, so we can ignore them. 172 00:11:19,507 --> 00:11:23,397 So it's just by induction on the number of union operations that happen. 173 00:11:23,397 --> 00:11:27,775 So for the base case, when, before we've done any unions whatsoever, we're doing 174 00:11:27,775 --> 00:11:30,258 just fine. Every object has a rank of 0 and the 175 00:11:30,258 --> 00:11:32,632 sub-tree size of every object is equal to 1. 176 00:11:32,632 --> 00:11:35,212 That object itself, also known as 2 to the 0. 177 00:11:35,212 --> 00:11:38,052 Zero. Now for the inductive step, there's an 178 00:11:38,052 --> 00:11:41,972 easy case and a hard case. The easy case is where nobody's rank 179 00:11:41,972 --> 00:11:46,712 changes, where we do a union, and everybody's rank stays exactly the same. 180 00:11:46,712 --> 00:11:51,082 In this case, we're golden. Why? Well, when you do a union, sub-tree 181 00:11:51,082 --> 00:11:53,412 sizes only go up. There's only more. 182 00:11:53,412 --> 00:11:57,272 Pointers so there's only more objects that can reach any given other objects. 183 00:11:57,272 --> 00:11:59,545 So sub-tree sizes go up, ranks stay the same. 184 00:11:59,545 --> 00:12:03,289 If we had this inequality of sub-tree size as being at least 2 ^ r before, we 185 00:12:03,289 --> 00:12:05,026 have it equally well now. Now. 186 00:12:05,026 --> 00:12:09,709 So the interesting case is when somebody's ranked actually changes. 187 00:12:09,709 --> 00:12:14,492 How can that happen? Well it happens in only one particular way that we 188 00:12:14,492 --> 00:12:18,213 understand well. Looking at a union operation between 189 00:12:18,213 --> 00:12:22,110 objects X and Y. Suppose the roots of these objects are S1 190 00:12:22,110 --> 00:12:25,312 and S2 respectively. It's only when these. 191 00:12:25,312 --> 00:12:29,824 Two roots have the same rank, let's call that common rank R, that somebodies rank 192 00:12:29,824 --> 00:12:32,905 gets changed. In particular, we're going to break ties 193 00:12:32,905 --> 00:12:36,918 as we did in the previous video. S2 will be the root of the fused tree, S1 194 00:12:36,918 --> 00:12:40,576 will become a child of it. And, in that case, S2s rank gets bumped 195 00:12:40,576 --> 00:12:42,381 up by 1. It goes from R To r + 1. 196 00:12:42,381 --> 00:12:45,675 Now notice, in this case, we do have something to prove. 197 00:12:45,675 --> 00:12:50,185 What are we trying to establish? We're trying to establish that every subtree 198 00:12:50,185 --> 00:12:54,806 size is big, as a function of the rank. So, s2's rank has gone up, and therefore 199 00:12:54,806 --> 00:12:59,767 the lowerbound, the bar that we have to meet, for the subtree size, has also Gone 200 00:12:59,767 --> 00:13:03,532 up, it's doubled. So in this case we actually have to 201 00:13:03,532 --> 00:13:08,547 scrutinize s2's new sub-tree. So what is its new sub-tree? Well, it's 202 00:13:08,547 --> 00:13:14,012 really just composed from its old sub-tree, and it inherits s1 and all of 203 00:13:14,012 --> 00:13:18,348 its sub-trees. Well, in that case, we know that s2's new 204 00:13:18,348 --> 00:13:24,105 subtree size, the nubmer of no objects that can reach it, is just, it's old 205 00:13:24,105 --> 00:13:27,832 subtree size, plus the old subtree size, of s1. 206 00:13:27,832 --> 00:13:34,032 But then w're in good shape because we have the inductive hypothesis to rescue 207 00:13:34,032 --> 00:13:37,457 us. So remember, before this union, by the 208 00:13:37,457 --> 00:13:43,847 inductive hypothesis, for every object with a given rank, say r, it had at least 209 00:13:43,847 --> 00:13:48,977 2^r objects in its sub tree. So S1, and S2, both had rank r before 210 00:13:48,977 --> 00:13:52,287 this. Unions, before this union, both of their 211 00:13:52,287 --> 00:13:57,287 subtree were at least two to the r. So as two subtree sizes bounded below by 212 00:13:57,287 --> 00:14:02,037 two to the r plus two to the r, a quantity also known as two raised to the 213 00:14:02,037 --> 00:14:05,587 r plus one. Quite conveniently, r plus one is S two's 214 00:14:05,587 --> 00:14:10,987 new rank, so S two's new bigger rank, its subtree size is still meeting the lower 215 00:14:10,987 --> 00:14:15,307 bound, meeting the target of two raised to, New rank, 2 ^ r + 1. 216 00:14:15,307 --> 00:14:21,017 So that completes the inductive step, therefore it completes the proof of claim 217 00:14:21,017 --> 00:14:25,467 2, that objects of rank r have subtree sizes at least 2 ^ r. 218 00:14:25,467 --> 00:14:31,332 Therefore completes the proof of the rank Lemma, that for every rank r, there's at 219 00:14:31,332 --> 00:14:35,379 most n / 2 ^ r nodes of rank r. And remember the rank Lemma, implies that 220 00:14:35,379 --> 00:14:39,217 the maximum rank, at all times, is bounded by log base 2 (n), as long as 221 00:14:39,217 --> 00:14:42,812 you're using union by rank. And that implies, that with this first 222 00:14:42,812 --> 00:14:47,215 optimization, the worst case running time of union, and find, are both O (log n), 223 00:14:47,215 --> 00:14:50,158 where n is the number of objects, in the data structure.