1 00:00:00,012 --> 00:00:03,486 So let's try thinking about this alternative lay, lazy union based 2 00:00:03,486 --> 00:00:06,650 approach of union-find data structure in some detail. 3 00:00:06,650 --> 00:00:10,659 so get this to work as quickly as we want to, we're going to need two 4 00:00:10,659 --> 00:00:13,686 optimizations. This video will introduce the first of 5 00:00:13,686 --> 00:00:17,735 those optimizations, union by rank. So let's have a quick slide making 6 00:00:17,735 --> 00:00:20,339 precise what we learned in the previous video. 7 00:00:20,339 --> 00:00:24,643 So, in our new implementation, it's still the case that each object has one extra 8 00:00:24,643 --> 00:00:26,906 field. You can think of it logically as a 9 00:00:26,906 --> 00:00:30,977 pointer to some other object,. Really, it's just recording the name of some 10 00:00:30,977 --> 00:00:33,801 other object, but the semantics have changed. 11 00:00:33,801 --> 00:00:38,824 In our first implementation of union-find with eager unions, this field we called a 12 00:00:38,824 --> 00:00:42,608 leader pointer, and the invariant was that every object's 13 00:00:42,608 --> 00:00:45,680 pointer points directly to the leader of its group. 14 00:00:45,680 --> 00:00:49,229 In this Lazy Union Implementation, we are relaxing that invariant, 15 00:00:49,229 --> 00:00:52,675 so, I'm going to change the name. I'm not going to call it a leader 16 00:00:52,675 --> 00:00:55,426 pointer, I'm just going to call it a parent pointer. 17 00:00:55,426 --> 00:00:59,798 And the new relaxed invariant is that the collection of everybody's parent pointers 18 00:00:59,798 --> 00:01:04,550 induces a collection of directed trees. Each tree in this collection represents 19 00:01:04,550 --> 00:01:08,066 one of the groups in the partition that we're maintaining. 20 00:01:08,066 --> 00:01:12,603 Each of these trees has a root vertex, that is, a vertex that points back to 21 00:01:12,603 --> 00:01:15,399 itself. It is these root vertices, these root 22 00:01:15,399 --> 00:01:19,392 objects, that we think of as the leaders of these componenets. 23 00:01:19,392 --> 00:01:24,297 So as before, we're going to refer to a group by the name of its leader vertex 24 00:01:24,297 --> 00:01:28,937 and that leader is by definition the roots of the corresponding directed tree. 25 00:01:28,937 --> 00:01:33,017 One way to think about things is that when we had eager unions, we again 26 00:01:33,017 --> 00:01:37,617 associated a directed tree with each of the groups, but we also insisted that it 27 00:01:37,617 --> 00:01:42,187 was a super shallow bushy tree, that it had depth only one, that is all you had 28 00:01:42,187 --> 00:01:47,032 was a root, the leader and then everybody else who pointed directly to the leader. 29 00:01:47,032 --> 00:01:51,619 In this lazy union implementation, we're allowing the trees to grow deeper. 30 00:01:51,619 --> 00:01:54,864 We don't insist that they have only one level below the root. 31 00:01:54,864 --> 00:01:59,331 So as far as the initialization, at birth, in a, a union-find data structure 32 00:01:59,331 --> 00:02:04,342 just consists of n degenerate trees, each object just initially points to itself 33 00:02:04,342 --> 00:02:07,681 and is a root at the beginning. In general, how do you implement find 34 00:02:07,681 --> 00:02:10,400 from some object? Well, you just have to return the leader 35 00:02:10,400 --> 00:02:14,024 of its group, and we know you can get to the leader just by traversing parent 36 00:02:14,024 --> 00:02:17,492 pointers from x until you get to the root, until you get to some object that 37 00:02:17,492 --> 00:02:20,960 points back to itself, that's where you return at the end of the find call. 38 00:02:20,960 --> 00:02:23,546 How do you do union? Well, given two objects, x and y, you 39 00:02:23,546 --> 00:02:27,328 have to find its respective roots, so you just invoke the find operation twice. 40 00:02:27,328 --> 00:02:32,847 Once for x to get to its root s1, once from y, to get its root s2, and then you 41 00:02:32,847 --> 00:02:35,997 install either s1 or s2 as a child of the other. 42 00:02:35,997 --> 00:02:38,887 So you do one pointer update after the two finds. 43 00:02:38,887 --> 00:02:42,077 So for now, I'm going to leave union underdetermined. 44 00:02:42,077 --> 00:02:46,587 I'm not going to give you a rule to decide which of s1 and s2 becomes the 45 00:02:46,587 --> 00:02:50,017 child of the other. On the next quiz, I want you to think 46 00:02:50,017 --> 00:02:55,899 through, you know, what would be the worst case running time of these 47 00:02:55,899 --> 00:03:02,081 operations, if we make that choice arbitrarily? If we just pick s1 or s2 48 00:03:02,081 --> 00:03:06,030 arbitrarily to become the child of the other. 49 00:03:06,030 --> 00:03:11,426 Alright. So the correct answer unfortunately is the last one, 50 00:03:11,426 --> 00:03:16,987 both find and union can actually blow up to linear time operations, if you're not 51 00:03:16,987 --> 00:03:19,167 careful about how you implement the union. 52 00:03:19,167 --> 00:03:23,107 So why is this true? Well, first let me just point out that B is definitely not 53 00:03:23,107 --> 00:03:25,997 the correct answer. Whatever, finding the unions worst case 54 00:03:25,997 --> 00:03:29,017 running time is, it's definitely the same for both of them. 55 00:03:29,017 --> 00:03:32,117 Remember, union reduces to two calls to find plus constant work, 56 00:03:32,117 --> 00:03:35,957 so that's going to tell you that they're both going to have the same operation 57 00:03:35,957 --> 00:03:39,787 time in the worst case. Now, why is it linear? The reason you're 58 00:03:39,787 --> 00:03:44,617 stuck with linear time is because the trees that you build through a sequence 59 00:03:44,617 --> 00:03:47,972 of unions can in the worst case become very scraggly. 60 00:03:47,972 --> 00:03:52,832 And to see what I mean, imagine you just do a sequence of unions, and in each time 61 00:03:52,832 --> 00:03:57,617 you're unioning a group with just one object into the group that you built up 62 00:03:57,617 --> 00:04:00,235 so far. So if you union together a group of size 63 00:04:00,235 --> 00:04:04,377 two and a group of size one, well if you choose arbitrarily, maybe you make the 64 00:04:04,377 --> 00:04:08,131 group fo size one the new root. So now you just got a chain of the three 65 00:04:08,131 --> 00:04:10,632 objects. Maybe you union in another singleton 66 00:04:10,632 --> 00:04:14,640 group and arbitrarily you choose the singleton group to be the new leader, 67 00:04:14,640 --> 00:04:17,435 that gives you a chain of four objects, and so on. 68 00:04:17,435 --> 00:04:21,252 So if you do a linear number of unions and you're not careful about who you make 69 00:04:21,252 --> 00:04:24,829 a child of the other, you might well end up with a chain of linear length, 70 00:04:24,829 --> 00:04:28,543 and then if you do finds for the objects toward the bottom of that chain, they're 71 00:04:28,543 --> 00:04:31,677 going to require linear work. And again, union has a subroutine of 72 00:04:31,677 --> 00:04:34,252 finds, so it's going to then take linear work as well. 73 00:04:34,252 --> 00:04:38,313 So when you see this example you know, I hope your immediate reaction is, well, 74 00:04:38,313 --> 00:04:40,435 geez, we can certainly do better than that. 75 00:04:40,435 --> 00:04:44,161 We can just be somewhat intelligent with when we do a union, which of the old 76 00:04:44,161 --> 00:04:48,119 roots do we install as the child of the other one? There's a rough analogy with 77 00:04:48,119 --> 00:04:52,073 how we made a natural optimization with our eager union implementation and union 78 00:04:52,073 --> 00:04:54,256 find. There, we had to make some kind of choice 79 00:04:54,256 --> 00:04:56,598 between which of the two old leaders do we retain. 80 00:04:56,598 --> 00:05:00,549 We retain the one for the bigger root to minimize the number of leader updates. 81 00:05:00,549 --> 00:05:04,201 So here, the natural thing to do is, well, you know, look at which of the two 82 00:05:04,201 --> 00:05:07,806 trees is already deeper and we want to keep the root of that deeper tree, 83 00:05:07,806 --> 00:05:10,045 that's to avoid it from getting deeper still. 84 00:05:10,045 --> 00:05:14,005 So we install the root of the shallower tree as a child of the deeper one. 85 00:05:14,005 --> 00:05:18,542 Let's make that precise on the next slide with the notion of the rank of an object. 86 00:05:18,542 --> 00:05:22,932 So the rank is just going to be a second field along with the parent pointer that 87 00:05:22,932 --> 00:05:26,737 we maintain for each object. It's just going to be some integer. 88 00:05:26,737 --> 00:05:30,377 Let me tell you how I want you to think about the ranks, at least for now. 89 00:05:30,377 --> 00:05:33,837 Okay? So I'm going to give you some semantics, pouring out to semantics, 90 00:05:33,837 --> 00:05:37,402 but I want to warn you, they're only going to hold valid for the next couple 91 00:05:37,402 --> 00:05:39,992 of videos. When we introduce something else called 92 00:05:39,992 --> 00:05:43,431 path compression, these semantics are going to break down, 93 00:05:43,431 --> 00:05:47,093 but initially, this is how I want you to think about ranks. 94 00:05:47,093 --> 00:05:51,920 It's going to be the maximum number of hops required to get from a leaf of xs 95 00:05:51,920 --> 00:05:55,540 tree up to x itself. So for example if, X is a root, this is 96 00:05:55,540 --> 00:06:00,200 going to be the worst case length, the maximum length of any leaf to root path 97 00:06:00,200 --> 00:06:03,365 in the tree. So just to make sure this is clear, let 98 00:06:03,365 --> 00:06:07,491 me draw a quick example. So here in blue, I've shown you one 99 00:06:07,491 --> 00:06:11,830 particular union-find data structure that has two groups. 100 00:06:11,830 --> 00:06:17,761 Now, it's very easy to compute the rank of leaves, that's 0, because the only 101 00:06:17,761 --> 00:06:21,479 path from a leaf to that node is the empty path. 102 00:06:21,479 --> 00:06:26,870 There are two nodes that have rank one, then the root of the bigger tree has rank 103 00:06:26,870 --> 00:06:29,655 two. And, if you think about it, in general, 104 00:06:29,655 --> 00:06:33,726 the rank of a node is going to be one plus the largest rank of any of its 105 00:06:33,726 --> 00:06:36,795 children. For example, if you're in node x, you 106 00:06:36,795 --> 00:06:41,864 have a child y and there's some path from a leaf z up to y that requires five 107 00:06:41,864 --> 00:06:46,503 pointer traversals, while going from z all the way up to u at x is going to 108 00:06:46,503 --> 00:06:49,356 require six pointer traversals. And of course, when you initialize the 109 00:06:49,356 --> 00:06:53,128 data structure, you want, you want to set everybody's rank to 0, because everybody 110 00:06:53,128 --> 00:06:56,260 is just is in a tree by themselves at the beginning. 111 00:06:56,260 --> 00:07:00,066 This notion of rank is going to be a super crucial concept throughout this 112 00:07:00,066 --> 00:07:02,390 entire sequence of lectures on union-find. 113 00:07:02,390 --> 00:07:04,832 So make sure you understand this definition, 114 00:07:04,832 --> 00:07:09,210 I encourage you to draw out some of your own examples, do some computations to get 115 00:07:09,210 --> 00:07:12,973 a better feel for the concept. For the moment, we're just going to use 116 00:07:12,973 --> 00:07:16,893 rank to avoid creating scraggly trees. So let's go back to the bad example in 117 00:07:16,893 --> 00:07:19,357 the quiz. Let's remember what the intuition was. 118 00:07:19,357 --> 00:07:23,188 We wound up with this long chain and therefore really bad, worst case running 119 00:07:23,188 --> 00:07:26,994 time for our operations, because we were stupidly taking a tree that was already 120 00:07:26,994 --> 00:07:30,897 pretty deep and installing it as a child under a single node, under a super 121 00:07:30,897 --> 00:07:33,641 shallow tree, thereby producing a tree which is even 122 00:07:33,641 --> 00:07:36,178 deeper. So the obvious way to avoid that is when 123 00:07:36,178 --> 00:07:40,066 faced with merging together a deep tree and a shallow tree, you install the 124 00:07:40,066 --> 00:07:41,807 shallow tree as a child under the deep. one. 125 00:07:42,839 --> 00:07:46,876 That's going to slow down this deepening process as much as possible. 126 00:07:46,876 --> 00:07:50,209 So, we can make that precise, just using this rank notion. 127 00:07:50,209 --> 00:07:54,054 Notice that the rank is exactly quantifying the depth of the tree. 128 00:07:54,054 --> 00:07:58,561 So the rank of the roots of a tree by this invariant is equal to the depth of 129 00:07:58,561 --> 00:08:01,306 the tree. So if you want to make the shallower 130 00:08:01,306 --> 00:08:05,691 tree, the child of the other one, that means you want to make it the smaller 131 00:08:05,691 --> 00:08:08,358 rank tree, the child of the bigger rank tree. 132 00:08:08,358 --> 00:08:11,604 This optimization is what is meant by union by rank. 133 00:08:11,604 --> 00:08:16,177 There still remains the case where the two roots have equal rank that is, with a 134 00:08:16,177 --> 00:08:20,664 two trees that were fusing, have exactly the same maximum path length, and in that 135 00:08:20,664 --> 00:08:23,293 case, we do just arbitrarily choose one of them. 136 00:08:23,293 --> 00:08:27,633 So in the pseudocode I've written here, I've arbitrarily chosen y's root to be 137 00:08:27,633 --> 00:08:31,621 the one that remains the root, x's root, S1 is going to be installed as 138 00:08:31,621 --> 00:08:34,436 a child of S2. But again, you could do it either way, 139 00:08:34,436 --> 00:08:37,489 it's not going to matter. Now, we're not off the hook yet. 140 00:08:37,489 --> 00:08:41,373 Remember, this is a data structure where we're intending for an invariant to, to 141 00:08:41,373 --> 00:08:45,282 hold to the semantics for the ranks as quantifying worst case path length to the 142 00:08:45,282 --> 00:08:47,481 node. And, we made a modification to the data 143 00:08:47,481 --> 00:08:49,917 structure. We rewired somebody's parent pointer. 144 00:08:49,917 --> 00:08:54,132 So now it's our responsibility to check, does the invariance still hold? And if it 145 00:08:54,132 --> 00:08:58,442 doesn't still hold, then we have to restore it, hopefully, using minimal 146 00:08:58,442 --> 00:09:01,134 extra work. So in this quiz, we're going to think 147 00:09:01,134 --> 00:09:03,851 through how ranks change, when we do a union. 148 00:09:03,851 --> 00:09:08,393 So remember the node notation, we're in the middle of a union of the objects x 149 00:09:08,393 --> 00:09:11,276 and y, s1 and s2 refer the ranks, that is the 150 00:09:11,276 --> 00:09:15,762 leaders, of the groups of the trees that contain x and y respectively. 151 00:09:15,762 --> 00:09:19,609 The first thing I want you to think through and convince yourself of, is that 152 00:09:19,609 --> 00:09:23,761 for objects other than s1 and s2, other than the pair of objects they're involved 153 00:09:23,761 --> 00:09:26,668 in the actual point of rewiring, nobody's rank changes. 154 00:09:26,668 --> 00:09:30,722 So make sure you think that through and convince yourself ranks are invariant, 155 00:09:30,722 --> 00:09:34,212 except possibly for the objects, s1 and s2. 156 00:09:34,212 --> 00:09:39,947 In this quiz, we're going to drill down on how did the ranks of s1 and s2 change, 157 00:09:39,947 --> 00:09:45,632 given that we took one of them and rewired its parent pointer to point to 158 00:09:45,632 --> 00:09:49,712 the other. Okay. So the correct answer to this quiz 159 00:09:49,712 --> 00:09:53,657 is the fourth one. So remarkably, in the, in many cases, 160 00:09:53,657 --> 00:09:57,717 there's no change at all to anybody's ranks, including s1 or s2. 161 00:09:57,717 --> 00:10:02,622 If the ranks were different before the union, both of the ranks stay the same. 162 00:10:02,622 --> 00:10:07,071 The one exception is when you take the union of two ranks which are equal, then, 163 00:10:07,071 --> 00:10:11,130 whichever one is the new root, and then the pseudocode that I showed you s2, is 164 00:10:11,130 --> 00:10:14,498 chosen to be the new root. It's rank is incremented, it goes up by 165 00:10:14,498 --> 00:10:16,646 1. I think the easiest way to see this is 166 00:10:16,646 --> 00:10:19,136 just with a couple pictures, a couple examples. 167 00:10:19,136 --> 00:10:22,802 So let me just draw two directed trees in the corresponding ranks. 168 00:10:22,802 --> 00:10:27,738 And in this one, S1 has strictly a bigger rank. 169 00:10:27,738 --> 00:10:33,218 Its rank is 2, so we install s2 as a child under s1. 170 00:10:33,218 --> 00:10:37,040 So recomputing the ranks in the fused together tree, we see that, again, the 171 00:10:37,040 --> 00:10:40,801 rank of the root at s1 is 2. Same thing as it was before, it didn't go 172 00:10:40,801 --> 00:10:42,836 up. So what's the general principle here? 173 00:10:42,836 --> 00:10:47,229 Well, in general, suppose you have these two trees and suppose the first ones a 174 00:10:47,229 --> 00:10:51,685 root, s1 has a bigger rank, say rank r. What does that mean? So that means that 175 00:10:51,685 --> 00:10:56,029 there exists a path from the leaf to the root s1, that requires a full r hops. 176 00:10:56,029 --> 00:11:00,683 And in the second tree, by virtue of its rank being less than r, [INAUDIBLE] minus 177 00:11:00,683 --> 00:11:03,649 1, it says that every single path from a 178 00:11:03,649 --> 00:11:06,812 leaf to the root s2 of that tree uses only r minus 1 hops. 179 00:11:06,812 --> 00:11:11,503 So when you install s2 as a child under s1, well, you know the length of every 180 00:11:11,503 --> 00:11:15,353 leaf to root path, now the root is s1, it's only gone up one. 181 00:11:15,353 --> 00:11:19,646 You just have to get to s2 and then one more hop, you make it all the way to s1. 182 00:11:19,646 --> 00:11:24,346 So if all those paths took at most r - 1 before, paths all the way up to s1, take 183 00:11:24,346 --> 00:11:27,290 at most r hops now. So, the worst case path link has not gone 184 00:11:27,290 --> 00:11:29,239 up. That's exactly what happens in this 185 00:11:29,239 --> 00:11:33,343 example so it's true that when we install s2 under s1, we have yet another path 186 00:11:33,343 --> 00:11:37,026 that requires 2 hops, that goes through s2, but we had one previously, so the 187 00:11:37,026 --> 00:11:40,722 rank doesn't go up, the worst case path links exactly the same as before. 188 00:11:40,722 --> 00:11:45,777 And this is same kind of reasoning explains why the rank has to get bumped 189 00:11:45,777 --> 00:11:50,052 up by one when you fuse together two trees that have the same rank. 190 00:11:50,052 --> 00:11:55,129 So suppose s1 and s2 both have rank r, well then, both of them have a leaf to 191 00:11:55,129 --> 00:12:00,674 root path that requires a full or halves. And now, when you install s1 is child 192 00:12:00,674 --> 00:12:05,949 under s2, that path where you needed r hops from the leaf to get to s1, now to 193 00:12:05,949 --> 00:12:10,377 get all the way up to s2 you need one more hop. So now, suddenly there exists a 194 00:12:10,377 --> 00:12:14,964 path from a leaf to the new root s2 that requires a full r + 1 hops, that's why 195 00:12:14,964 --> 00:12:16,149 the ranks gotta go up.