So welcome to this optional sequence of videos on state-of-the-art implementations of the union-find data structure. Now, as advanced optional material, let me make a few comments before we get started. The first comment is, I'm going to assume that you're in a particularly motivated mood. Now, no one's forcing you to take these algorithms courses, so I'm always assuming that you're a very motivated individual. But that's going to hold even more true than usual in these advanced videos. And the way that plays out is, while I'm going to hold myself to the same exacting standards of clarity that I usually do, I'm going to ask a little bit more of you, the viewer. I'm going to perhaps dot a few less i's and cross a few less t's than I usually do. And so you may find yourself periodically needing to pause and think through some of the details that I'm glossing over.

The second comment is that I'm going to give short shrift to applications of the union-find data structure. That's not because there aren't any; there are plenty of applications of this data structure. But for these videos, we're really just going to immerse ourselves in the beauty and the depth of ideas behind the design of the data structures, and especially the analysis of their performance.
The final comment is to keep in mind that this is some seriously next-level material. It is totally normal that the first time you see this stuff, you find it confusing, you find it difficult to understand. So confusion should not discourage you. It does not represent any intellectual failing on your part. Rather, keep in mind, it represents an opportunity to get even smarter. So, with those comments in mind, let's go ahead and proceed to a different approach to the union-find data structure, via lazy unions.

So let's have one slide with a quick review of what we already know about the union-find data structure. If you recall, we discussed this in the context of a fast implementation of Kruskal's minimum spanning tree algorithm. The raison d'ĂȘtre of a union-find data structure is to maintain a partition of a universe of objects. So in the context of Kruskal's algorithm, the objects were the vertices, and the groups we wanted to maintain were the connected components with respect to the edges we had committed ourselves to so far. The data structure should support two operations. No prizes for guessing the names of those two operations. First is the Find operation: given an object from the universe, return the name of that object's group.
So for example, if x was in the middle of this square that I've put on the right, representing the universe X, Find would return C3, the name of the group that contains x. So in the context of Kruskal's algorithm, we used the Find operation to check for cycles. How do we know whether adding a new edge would create a cycle with respect to the edges we've already chosen? Well, it will if and only if the endpoints of that edge are already in the same connected component, that is, if and only if the two Find operations return the same answer.

In the Union operation, you're given not one object but two, call them x and y, and the responsibility of the operation is to merge the groups that contain x and y. So for example, in the picture, if y was in C4 and x was, as before, in C3, then your job is to fuse the groups C3 and C4 into a single group. In the context of Kruskal's algorithm, we needed this operation when we added a new edge; this fused together two of the existing connected components into one. So, we already went through one implementation of union-find; that was in order to get a blazingly fast implementation of Kruskal's algorithm, so let's just review how that worked.
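As an aside, here's a minimal sketch of how Kruskal's algorithm consumes these two operations. This is my own illustrative code, not from the lecture: it uses the simplest possible (unoptimized) eager union-find internally, and the function and variable names are invented for the example.

```python
def kruskal_mst(num_vertices, edges):
    """edges: list of (weight, u, v) tuples, vertices numbered 0..num_vertices-1.
    Returns the list of edges in a minimum spanning tree."""
    leader = list(range(num_vertices))       # simplest eager union-find

    def find(x):
        return leader[x]                     # constant-time group lookup

    def union(x, y):
        old, new = find(x), find(y)
        for v in range(num_vertices):        # relabel x's entire group (slow, but simple)
            if leader[v] == old:
                leader[v] = new

    mst = []
    for weight, u, v in sorted(edges):       # scan edges in increasing weight order
        if find(u) != find(v):               # same answer => edge would close a cycle
            mst.append((weight, u, v))
            union(u, v)
    return mst
```

For instance, on a 4-cycle with edge weights 1, 2, 3, 4, the weight-4 edge is the one skipped by the cycle check.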
With each group, we associated a linked structure, so each object had one pointer associated with it, and the invariant was: in a given group, there was some representative leader object, and everybody in that group pointed to the leader of that group. So for example, on the right I'm showing you a group with three objects x, y, and z, and x would be the leader of this group. All three of the objects point directly to x, and that would just be the output of the Find operation: the leader of the group. We use that as the group name.

Now the part which is cool and obvious about this approach is that Find operations take constant time. All you do is return the leader of the given object. Now the tricky part was analyzing the cost of Union operations. The problem here is that, to maintain the invariant that every object of a group points to its leader, when you fuse two groups you have to update a bunch of the objects' leaders; you can only have one leader for the one new group. The simple but totally crucial optimization that we discussed was: when two groups merge, you update the leader pointers of all objects in the smaller group to point to the leader of the bigger group. That is, the new fused group inherits the leader from the bigger of its constituent parts.
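The eager, leader-pointer scheme just reviewed can be sketched as follows. This is a minimal illustration under my own naming conventions, not the lecture's exact code; the auxiliary `members` lists exist only so that the smaller group can be found and relabeled quickly.

```python
class EagerUnionFind:
    """Eager union-find: every object stores its group's leader directly."""

    def __init__(self, n):
        self.leader = list(range(n))            # leader[i] = leader of i's group
        self.members = [[i] for i in range(n)]  # members[l] = objects currently led by l

    def find(self, x):
        # Constant time: just read the leader field.
        return self.leader[x]

    def union(self, x, y):
        a, b = self.find(x), self.find(y)
        if a == b:
            return                              # already in the same group
        # Crucial optimization: relabel only the smaller group.
        if len(self.members[a]) < len(self.members[b]):
            a, b = b, a
        for obj in self.members[b]:
            self.leader[obj] = a                # smaller group inherits bigger's leader
        self.members[a].extend(self.members[b])
        self.members[b] = []
```

Each relabel doubles (at least) the size of the group an object belongs to, which is exactly the O(n log n) bound for n unions discussed next.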
If we do that optimization, it's still the case that a single Union might take linear time, theta(n) time, but a sequence of n Unions takes only O(n log n) time. And that's because each object endures at most a logarithmic number of leader updates, because every time its leader pointer gets updated, the population of the group it inhabits doubles.

So in this sequence of videos we are going to discuss a different approach to implementing the union-find data structure. And let's just go all-in and try to get away with updating only a single pointer in each Union. All right, well how are we going to do that? Well, I think if we look at a picture, it'll be clear what the approach is going to be. So, let's look at a simple example: there are just six objects in the world, in two groups. One, two, and three are in one group, with leader one; four, five, and six are in the second group, with leader four. So, with our previous implementation of union-find, since the two groups have equal size, we pick one arbitrarily to supply the new leader. Let's say four is going to be the new leader, and now we update objects one, two, and three so that they point directly to their new leader, four. And so the new idea is very simple.
Let's just, sort of as a shorthand for rewiring all of one, two, and three to point to four, rewire one's pointer to point to four. And then it's understood that two and three, as descendants of one, also now have the new leader four as well. So as a result, we do again get a directed tree afterwards, but we don't get one with just depth one; we don't get this shallow, bushy tree. We get a deeper one that has two levels below the root.

So let me give you another way of thinking about this, in terms of an array representation. Here I also want to make the point that, you know, while conceptually it's going to be very useful to think of this union-find data structure in terms of these directed trees, when you're actually implementing this data structure, this is not how you do it. You wouldn't bother with actual pointers between the different objects. You'd just have a simple array, indexed by the nodes, and in the entry corresponding to a node i, you would just store the name of i's parent. For instance, if we reframe this exact same example in the array representation, before we do the union, what have we got? Well, objects 1, 2, and 3 all have parent 1, and objects 4, 5, and 6 all have parent 4. So in the array, we'd have a 1 in the first three entries, and a 4 in the second three entries.
So in the old solution, we'd have to update the parent pointers of objects 1, 2, and 3; they'd all point directly to 4, so that gives us an array entirely of 4's. Whereas in the new solution, the only node whose parent pointer gets updated is that of object 1: it gets updated from itself to 4. And in particular, objects 2 and 3 still point to object 1, not directly to object 4.

So that's how Union works in a simple example; how does it work in general? Well, in general, you're given two objects, each of which belongs to some group. You can think of these two groups conceptually as directed trees. If you follow the parent pointers, all parent pointers lead to the root vertex, and we identify those root vertices with the leaders of these two groups. Now, when you have to merge the two trees together, how do you do it? Well, you pick one of the groups and you look at its root. Currently, it points to itself. You change its parent pointer to point to the other group's leader; that is, you install one of the two roots as a new child of the other tree's root. So, I hope the pros and cons of this alternative approach to the union-find data structure are intuitively clear. The pro comes in the Union operation, which is now very simple.
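The array picture for this six-object example is easy to reproduce in code. Below is a small sketch of my own (index 0 is left unused so the objects are numbered 1 through 6, as in the lecture), contrasting the old eager update with the new lazy one.

```python
# Parent-array representation of the six-object example.
# Index 0 is a dummy so that objects are numbered 1..6 as in the lecture.
parent = [0, 1, 1, 1, 4, 4, 4]   # objects 1,2,3 have parent 1; 4,5,6 have parent 4

# Old (eager) union: every member of {1,2,3} is rewired directly to 4.
eager = parent[:]
for obj in (1, 2, 3):
    eager[obj] = 4               # entries 1..6 all become 4

# New (lazy) union: only the root of one tree is rewired.
lazy = parent[:]
lazy[1] = 4                      # objects 2 and 3 still point to 1, which points to 4
```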
Now you might be tempted to just say Union takes constant time, because all you do is update one pointer, but it's actually not so simple. Right, because remember, you are given two objects x and y, and in general you aren't linking x and y together; they might be somewhere deep in a tree. You are linking roots together. So what the Union operation reduces to is two invocations of Find: you have to find x's root r1, you have to find y's root r2, and then you just do constant-time linking, either r1 to r2 or vice versa.

Now the issue with this lazy union approach is that it's not at all obvious, and in fact it's not going to be true, that the Find operation takes constant time. Remember, previously, when we actually did all this hard work of updating all these leader pointers every time we had a Union, it guaranteed that whenever there was a Find, boom, we just look in the field, we return the leader pointer, we're done. Here, parent pointers do not point directly to roots. Rather, you have to traverse, in general, a sequence of parent pointers from a given object x to go all the way up to the root of the corresponding tree.
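Putting the pieces together, a lazy-union implementation, before any optimizations, might look like the sketch below. This is my own minimal rendering of the idea; in particular, which root gets installed under which is still an arbitrary choice here, and that is exactly the choice the upcoming union-by-rank discussion is about.

```python
class LazyUnionFind:
    """Lazy unions: each object stores one parent pointer; roots point to themselves."""

    def __init__(self, n):
        self.parent = list(range(n))   # parent[i] = i's parent; initially all roots

    def find(self, x):
        # NOT constant time: walk parent pointers all the way up to the root.
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, x, y):
        # Two invocations of Find, then one constant-time pointer update.
        r1, r2 = self.find(x), self.find(y)
        if r1 != r2:
            self.parent[r1] = r2       # arbitrary choice: install r1 under r2
```

The worry, of course, is that with arbitrary linking these trees can grow deep, making `find` slow; controlling the depth is the point of the optimizations to come.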
So because of these trade-offs, these pros and cons, it's really not obvious at all whether this lazy union approach to the union-find data structure is a good idea, whether it's going to have any payoffs. So that's going to take some really quite subtle analysis, which is the main point of the lectures to come. It's also going to take a couple of optimizations, and in the next video, we'll start with the first one. You might be wondering, okay, so when you do a Union and you have to install one root under the other, how do you make that choice? The right answer is something called union by rank. That's coming up next.