1 00:00:00,820 --> 00:00:04,570 So let's start thinking about this alternative lazy union based approach to 2 00:00:04,570 --> 00:00:07,490 the Union-Find data structure in some detail. 3 00:00:07,490 --> 00:00:11,880 To get this to work as quickly as we want to we're going to need two optimizations. 4 00:00:11,880 --> 00:00:15,570 This video will introduce the first of those optimizations, Union by Rank. 5 00:00:17,260 --> 00:00:21,540 So let's have a quick slide making precise what we learned in the previous video. 6 00:00:21,540 --> 00:00:24,240 So in our new implementation it's still the case 7 00:00:24,240 --> 00:00:26,380 that each object has one extra field. 8 00:00:26,380 --> 00:00:29,290 You can think of it logically as a pointer to some other object. 9 00:00:29,290 --> 00:00:32,490 Really it's just recording the name of some other object but 10 00:00:32,490 --> 00:00:34,210 the semantics have changed. 11 00:00:34,210 --> 00:00:37,850 In our first implementation of union find with eager unions, 12 00:00:37,850 --> 00:00:42,050 this field we called it a leader pointer and the invariate was that every 13 00:00:42,050 --> 00:00:46,450 object's pointer points directly to the leader of its group. 14 00:00:46,450 --> 00:00:49,730 In this Lazy Union Implementation, we are relaxing that variant. 15 00:00:49,730 --> 00:00:50,720 So I'm going to change the name. 16 00:00:50,720 --> 00:00:51,980 I'm not going to call it a leader pointer. 17 00:00:51,980 --> 00:00:53,590 I'm just going to call it a parent pointer. 18 00:00:53,590 --> 00:00:58,480 And the new relaxed invariant is that the collection of everybody's parent pointers 19 00:00:58,480 --> 00:01:01,780 induces a collection of directed trees. 20 00:01:04,060 --> 00:01:07,510 Each tree in this collection represents one of the groups 21 00:01:07,510 --> 00:01:09,560 in the partition that we're maintaining. 22 00:01:09,560 --> 00:01:11,740 Each of these trees has a root vertex, 23 00:01:11,740 --> 00:01:14,400 that is a vertex which points back to itself. 24 00:01:14,400 --> 00:01:15,700 It is these root vertices, 25 00:01:15,700 --> 00:01:20,040 these root objects that we think of as the leaders of these components. 26 00:01:20,040 --> 00:01:24,980 So as before, we're going to refer to a group by the name of its leader vertex and 27 00:01:24,980 --> 00:01:29,200 that leader is, by definition, the root of the corresponding directed tree. 28 00:01:32,030 --> 00:01:35,240 One way to think about things is that, when we had eager unions, we, again, 29 00:01:35,240 --> 00:01:38,980 associated a directed tree with each of the groups, but we also insisted 30 00:01:38,980 --> 00:01:42,940 that it was super shallow, bushy tree, that it had depth only one. 31 00:01:42,940 --> 00:01:44,810 That is, all you had was a root, the leader, 32 00:01:44,810 --> 00:01:47,710 and then everybody else who pointed directly to the leader. 33 00:01:47,710 --> 00:01:52,280 In this lazy union implementation we're allowing the trees to grow deeper. 34 00:01:52,280 --> 00:01:54,780 We don't insist that they have only one level below the root. 35 00:01:56,660 --> 00:02:00,530 So as far as the initialization at birth in a union defined data structure just 36 00:02:00,530 --> 00:02:04,430 consists of n degenerate trees each object initially points to itself and 37 00:02:04,430 --> 00:02:05,430 is a root at the beginning. 38 00:02:06,830 --> 00:02:09,157 In general how do you implement find from some object? 39 00:02:09,157 --> 00:02:11,575 Well you just have to return the leader of its group and 40 00:02:11,575 --> 00:02:15,122 we know you can get to the leader just by traversing parent pointers from x until 41 00:02:15,122 --> 00:02:18,371 you get to a root until you get to some object that points back to itself. 42 00:02:18,371 --> 00:02:20,440 That's what you return at the end of the find call. 43 00:02:20,440 --> 00:02:21,970 How do you do a union? 44 00:02:21,970 --> 00:02:25,820 Well given two objects x and y you have to find their respective roots so 45 00:02:25,820 --> 00:02:28,340 you just invoke the find operation twice once for 46 00:02:28,340 --> 00:02:32,180 x to get its root s1 once from y to get its root as two. 47 00:02:32,180 --> 00:02:36,490 And then you install either s1 or s2 as a child of the other. 48 00:02:36,490 --> 00:02:39,580 So you do one pointer update after the two finds. 49 00:02:41,600 --> 00:02:43,660 So for now I'm going to leave Union under determined. 50 00:02:43,660 --> 00:02:46,771 I'm not going to give you a rule to decide which of s1 and 51 00:02:46,771 --> 00:02:48,857 s2 becomes the child of the other. 52 00:02:48,857 --> 00:02:52,800 In the next quiz, I want you to think through what would be the worst case 53 00:02:52,800 --> 00:02:56,680 running time of these operations if we make that choice arbitrarily, 54 00:02:56,680 --> 00:03:00,454 if we just pick s1 s2 arbitrarily to become the child of the other. 55 00:03:08,873 --> 00:03:12,544 All right so the correct answer unfortunately is the last one, 56 00:03:12,544 --> 00:03:16,906 both find in union can actually blow up to linear time operations if you're 57 00:03:16,906 --> 00:03:19,690 not careful about how you implement the union. 58 00:03:21,520 --> 00:03:22,170 So why is this true? 59 00:03:22,170 --> 00:03:25,420 Well first just let me point out that B is definitely not the correct answer. 60 00:03:25,420 --> 00:03:27,700 Whatever finding union's worst case running time is, 61 00:03:27,700 --> 00:03:29,500 it's definitely the same for both of them. 62 00:03:29,500 --> 00:03:33,130 Remember, union reduces to calls to fine plus constant work. 63 00:03:33,130 --> 00:03:36,340 So it's going to tell you that they're both going to have the same operation time 64 00:03:36,340 --> 00:03:38,210 in the worst case. 65 00:03:38,210 --> 00:03:39,120 Now, why is it linear? 66 00:03:39,120 --> 00:03:42,780 The reason you're stuck with linear time is because the trees that you 67 00:03:42,780 --> 00:03:47,660 build through a sequence of unions can in the worst case become very scraggly. 68 00:03:49,670 --> 00:03:52,720 And to see what I mean imagine you just do a sequence of unions and 69 00:03:52,720 --> 00:03:56,080 at each time you're unioning a group with just one object 70 00:03:56,080 --> 00:03:57,920 into the group that you built up so far. 71 00:03:59,410 --> 00:04:02,280 So if you union together a group of size two and group of size one, 72 00:04:02,280 --> 00:04:03,480 we'll have you choose arbitrarily. 73 00:04:03,480 --> 00:04:05,810 Maybe you make the group of size one the new root. 74 00:04:06,880 --> 00:04:08,790 So now you just got a chain of the three objects. 75 00:04:08,790 --> 00:04:11,200 Maybe you union in another singleton group, 76 00:04:11,200 --> 00:04:14,860 and arbitrarily you choose the singleton group to be the new leader. 77 00:04:14,860 --> 00:04:18,310 That gives you a chain of four objects and so on. 78 00:04:18,310 --> 00:04:21,890 So if you do a linear number of unions, and you're not careful about who you make 79 00:04:21,890 --> 00:04:25,164 a child of the other, you might well end up with a chain of linear length. 80 00:04:25,164 --> 00:04:28,274 And then if you fines for the objects towards the bottom of that chain, 81 00:04:28,274 --> 00:04:29,972 they're going to require linear work. 82 00:04:29,972 --> 00:04:32,084 And again, union has a subroutine of fines so 83 00:04:32,084 --> 00:04:33,980 its going to then take linear work as well. 84 00:04:35,870 --> 00:04:38,910 So when you see this example, I hope your immediate reaction is well jeez, 85 00:04:38,910 --> 00:04:40,140 we can certainly do better than that. 86 00:04:40,140 --> 00:04:43,900 We can just be somewhat intelligent when we do a union which 87 00:04:43,900 --> 00:04:46,820 of the old roots do we install as a child of the other one? 88 00:04:48,470 --> 00:04:51,330 There's a rough analogy with how we made a natural optimization with our 89 00:04:51,330 --> 00:04:53,380 eager union implementation of union find. 90 00:04:53,380 --> 00:04:56,240 There we had to make some kind of choice between which of the two old leaders do 91 00:04:56,240 --> 00:04:56,810 we retain? 92 00:04:56,810 --> 00:04:59,670 We retain the one for the bigger group to minimize the number of leader updates. 93 00:04:59,670 --> 00:05:02,835 So here, the natural thing to do is well look at what you did, 94 00:05:02,835 --> 00:05:06,995 two trees is already deeper, and we want to keep the root of that deeper tree. 95 00:05:06,995 --> 00:05:09,405 That's to avoid it from getting deeper still. 96 00:05:09,405 --> 00:05:14,225 So we install the root of the shallower tree as a child of the deeper one. 97 00:05:14,225 --> 00:05:18,129 Let's make that precise on the next slide with the notion of the rank of an object. 98 00:05:20,380 --> 00:05:23,260 So the rank is just going to be a second field along with a parent 99 00:05:23,260 --> 00:05:25,290 pointer that we maintain for each object. 100 00:05:25,290 --> 00:05:26,440 It's just going to be some integer. 101 00:05:28,770 --> 00:05:31,790 Let me tell you how I want you to think about the ranks at least for now. 102 00:05:31,790 --> 00:05:34,508 Okay, so I'm going to give you some semantics, very natural semantics. 103 00:05:34,508 --> 00:05:37,490 But I want to warn you, they're only going to hold valid for 104 00:05:37,490 --> 00:05:38,930 the next couple of videos. 105 00:05:38,930 --> 00:05:42,010 When we introduce something else called path compression these 106 00:05:42,010 --> 00:05:43,560 semantics are going to break down. 107 00:05:43,560 --> 00:05:46,120 But initially this is how I want you to think about ranks. 108 00:05:46,120 --> 00:05:49,800 It's going to be the maximum number of hops required to get 109 00:05:49,800 --> 00:05:53,570 from a leaf of x's tree up to x itself. 110 00:05:53,570 --> 00:05:56,730 So for example if x is a root, this is going to be the worst case link, 111 00:05:56,730 --> 00:06:00,480 the maximum link of any leaf to root path in the tree. 112 00:06:02,420 --> 00:06:04,936 So just to make sure that this is clear let me draw a quick example. 113 00:06:10,683 --> 00:06:14,267 So here in blue I've shown you one particular union fine data structure that 114 00:06:14,267 --> 00:06:18,075 has two groups, now it's very easy to compute the rank of leafs, that's zero, 115 00:06:18,075 --> 00:06:21,060 because the only path from a leaf to that node is the empty path. 116 00:06:22,440 --> 00:06:24,316 There are two nodes that have rank 1. 117 00:06:26,533 --> 00:06:30,897 Then the root of the bigger tree has rank 2. 118 00:06:30,897 --> 00:06:32,801 And if you think about it, in general, 119 00:06:32,801 --> 00:06:36,575 the rank of a node is going to be 1 plus the largest rank of any of its children. 120 00:06:36,575 --> 00:06:40,122 For example, if you're in node X, you have a child Y and 121 00:06:40,122 --> 00:06:45,031 their sum path from a leaf Z up to Y that requires five pointed traversals while 122 00:06:45,031 --> 00:06:49,889 going from Z all the way up to U and X is going to require six pointed traversals. 123 00:06:49,889 --> 00:06:52,292 And of course, when you initialize the data structure, 124 00:06:52,292 --> 00:06:55,446 you want to set everybody's rank to zero, because everybody's just in a tree, 125 00:06:55,446 --> 00:06:56,965 by themselves, at the beginning. 126 00:06:59,260 --> 00:07:02,710 This notion of rank is going to be a super crucial concept throughout this entire 127 00:07:02,710 --> 00:07:06,180 sequence of lectures on union fine so make sure you understand this definition. 128 00:07:06,180 --> 00:07:08,160 I encourage you to draw out some of your own examples, 129 00:07:08,160 --> 00:07:10,730 do some computations to get a better feel for the concept. 130 00:07:10,730 --> 00:07:16,700 For the moment we're just going to use rank to avoid creating scraggly trees. 131 00:07:16,700 --> 00:07:18,410 So let's go back to the bad example on the quiz. 132 00:07:18,410 --> 00:07:20,000 Let's remember what the intuition was. 133 00:07:20,000 --> 00:07:21,460 We wound up with this long chain and 134 00:07:21,460 --> 00:07:24,920 therefore really bad worse case running time for our operations 135 00:07:24,920 --> 00:07:28,060 because we were stupidly taking a tree that was already pretty deep and 136 00:07:28,060 --> 00:07:32,670 installing it as a child under a single note under a super shallow tree. 137 00:07:32,670 --> 00:07:35,010 Thereby producing a tree which is even deeper. 138 00:07:35,010 --> 00:07:38,360 So the obvious way to avoid that is when faced with merging together a deep 139 00:07:38,360 --> 00:07:39,510 tree and a shallow tree, 140 00:07:39,510 --> 00:07:42,680 you install the shallow tree as a child under the deep one. 141 00:07:42,680 --> 00:07:47,590 That's going to slow down this deepening process as much as possible. 142 00:07:47,590 --> 00:07:50,980 So we can make that precise just to using this rank notion. 143 00:07:50,980 --> 00:07:54,100 Notice that the rank is exactly quantifying the depth of a tree. 144 00:07:54,100 --> 00:07:57,330 So the rank of the root of a tree by this invariant 145 00:07:57,330 --> 00:07:59,520 is equal to the depth of the tree. 146 00:07:59,520 --> 00:08:02,550 So if you want to make the shallower tree the child of the other one, that means you 147 00:08:02,550 --> 00:08:06,150 want to make the smaller rank tree the child of the bigger rank tree. 148 00:08:08,220 --> 00:08:11,580 This optimization is what is meant by Union by Rank. 149 00:08:13,970 --> 00:08:17,600 This still remains the case where the two roots have equal rank, that is where 150 00:08:17,600 --> 00:08:22,060 the two trees that were fusing have exactly the same maximum path length. 151 00:08:22,060 --> 00:08:24,830 And in that case we do just arbitrarily choose one of them. 152 00:08:24,830 --> 00:08:29,650 So in the pseudocode I've written here I've arbitrarily chosen y's root to be 153 00:08:29,650 --> 00:08:33,980 the one that remains the root, x's root s1 is going to be installed as a child of s2. 154 00:08:33,980 --> 00:08:35,509 But again, you can do it either way it's not going to matter. 155 00:08:37,290 --> 00:08:38,510 Now we are not off the hook yet. 156 00:08:38,510 --> 00:08:41,040 Remember this is a data structure where we're intending for 157 00:08:41,040 --> 00:08:44,360 a invariant to hold these semantics for the ranks that's quantifying 158 00:08:44,360 --> 00:08:49,000 worst case path link to the node and we made a modification to the data structure. 159 00:08:49,000 --> 00:08:50,900 We rewired somebody's parent pointer. 160 00:08:50,900 --> 00:08:55,809 So now it's our responsibility to check does the invariant still hold and 161 00:08:55,809 --> 00:08:57,472 if doesn't still hold, 162 00:08:57,472 --> 00:09:01,688 then we have to restore it hopefully using minimal extra work. 163 00:09:01,688 --> 00:09:04,275 So in this quiz we're going to think through how ranks 164 00:09:04,275 --> 00:09:05,665 change when we do a unions. 165 00:09:05,665 --> 00:09:09,196 Remember the intonation, we're in the middle of a union of the objects x and y. 166 00:09:09,196 --> 00:09:13,814 s1 and s2 refer the ranks that is the leaders of the groups of the trees that 167 00:09:13,814 --> 00:09:16,200 contain x and y respectively. 168 00:09:16,200 --> 00:09:17,590 The first thing I want you to think through and 169 00:09:17,590 --> 00:09:22,580 convince yourself of is that for objects other than s1 and s2, other than the pair 170 00:09:22,580 --> 00:09:27,790 of objects that are involved in the actual point of re-wiring, nobody's rank changes. 171 00:09:27,790 --> 00:09:29,750 So make sure you think that through and convince yourself. 172 00:09:29,750 --> 00:09:33,900 Ranks are invariant, except possibly for the objects S1 and S2. 173 00:09:33,900 --> 00:09:38,750 In this quiz, we're going to drill down on how do the ranks of S1 and S2 change given 174 00:09:38,750 --> 00:09:43,416 that we took one of them and rewired it's parent pointer to point to the other. 175 00:09:47,946 --> 00:09:53,647 Okay so the correct answer to this quiz is the fourth one. 176 00:09:53,647 --> 00:09:58,281 So remarkably in many cases there's no change at all to anybody's ranks including 177 00:09:58,281 --> 00:09:58,953 S1 or S2. 178 00:09:58,953 --> 00:10:03,510 If the ranks were different before the union both of the ranks stay the same. 179 00:10:03,510 --> 00:10:07,630 The one exception is when you take the union of two ranks which are equal. 180 00:10:07,630 --> 00:10:09,168 Then whichever one is the new root and 181 00:10:09,168 --> 00:10:11,951 in the pseudo code that I showed you s2 is chosen to be the new root. 182 00:10:11,951 --> 00:10:14,235 Its rank gets incremented, it goes up by one. 183 00:10:14,235 --> 00:10:16,776 I think the easiest way to see this is is just with a couple pictures, 184 00:10:16,776 --> 00:10:17,590 couple of examples. 185 00:10:19,170 --> 00:10:22,734 So let me just draw two directed trees in the corresponding ranks. 186 00:10:26,361 --> 00:10:28,870 And in this one s1 has strictly bigger rank. 187 00:10:28,870 --> 00:10:33,090 It's rank is 2 so we install s2 as a child under s1. 188 00:10:35,680 --> 00:10:39,670 So re-computing the ranks in the fused together tree, we see that again the rank 189 00:10:39,670 --> 00:10:43,470 at the route at s1 is 2., same thing as it was before it didn't go up. 190 00:10:43,470 --> 00:10:44,900 So, what's the general principle here? 191 00:10:44,900 --> 00:10:47,300 Well, in general, suppose you have these two trees and 192 00:10:47,300 --> 00:10:50,260 suppose the first ones route s1 has a bigger rank, say rank r. 193 00:10:50,260 --> 00:10:51,580 What does that mean? 194 00:10:51,580 --> 00:10:54,530 So, that means that there exists a path from a leaf to the root 195 00:10:54,530 --> 00:10:57,180 s1 that requires a full r hops. 196 00:10:57,180 --> 00:11:01,230 And in the second tree by virtue of its rank being less than r most r minus 1. 197 00:11:01,230 --> 00:11:05,583 It says that every single path from a leaf to the root s2 of that 198 00:11:05,583 --> 00:11:07,770 tree uses only r minus 1 hops. 199 00:11:07,770 --> 00:11:12,680 So when you install s2 as a child under s1 the length of every leaf to root 200 00:11:12,680 --> 00:11:16,073 path now the root is S1, it's only gone up by 1. 201 00:11:16,073 --> 00:11:19,500 You just have to get to s2 and then 1 more hop you make it all the way to s1. 202 00:11:19,500 --> 00:11:22,110 So if all those paths took almost r minus 1 before, 203 00:11:22,110 --> 00:11:26,260 all the paths all the way to s minus 1 to s1 now, taken most are hops now. 204 00:11:26,260 --> 00:11:28,850 So the worst case path link has not gone up. 205 00:11:28,850 --> 00:11:32,460 That's exactly what happens in this example so it's true that when we install 206 00:11:32,460 --> 00:11:36,940 s2 under s1 we have yet another path that requires two hops that goes through s2. 207 00:11:36,940 --> 00:11:39,080 But we had one previously so the rank doesn't go up. 208 00:11:39,080 --> 00:11:41,103 The worst case path length is exactly the same as before. 209 00:11:43,416 --> 00:11:47,446 And this same kind of reasoning explains why the rank has to get bumped up by one 210 00:11:47,446 --> 00:11:50,364 when you fuse together two trees that had the same rank. 211 00:11:50,364 --> 00:11:53,253 So suppose S1 and S2 both have rank R, 212 00:11:53,253 --> 00:11:58,630 well then both of them have a leaf to root path that requires a full R hops. 213 00:11:58,630 --> 00:12:03,618 And now when you install S1 as a child under S2 that path where you needed R hops 214 00:12:03,618 --> 00:12:08,760 to get from the leaf to S1, now to get all the way up to S2 you need one more hop. 215 00:12:08,760 --> 00:12:12,759 So now suddenly there exists a path from a leaf to the new root S2 that 216 00:12:12,759 --> 00:12:16,495 requires a full R plus 1 hops that's why the ranks gotta go up.