1 00:00:01,520 --> 00:00:05,280 So, welcome to this optional sequence of videos on state of the art implementations 2 00:00:05,280 --> 00:00:07,600 of the union to find data structure. 3 00:00:07,600 --> 00:00:09,710 Now as advanced in optional material, 4 00:00:09,710 --> 00:00:12,260 let me make a few comments before we get started. 5 00:00:14,860 --> 00:00:17,790 So the first comment is I'm going to assume that you're in a particularly 6 00:00:17,790 --> 00:00:18,910 motivated mood. 7 00:00:18,910 --> 00:00:21,550 Now, no one's forcing you to take these algorithm courses, so 8 00:00:21,550 --> 00:00:23,800 I'm always assuming that you're very motivated individual, 9 00:00:23,800 --> 00:00:27,640 but that's going to hold even more true than usual in these advanced videos. 10 00:00:27,640 --> 00:00:31,590 And the way that plays out is while I'm going to hold myself to the same exacting 11 00:00:31,590 --> 00:00:35,362 standards of clarity that I usually do, I'm going to ask a little bit more of you, 12 00:00:35,362 --> 00:00:35,890 the viewer. 13 00:00:35,890 --> 00:00:41,180 I'm going to perhaps dot a few less Is, cross a few less Ts than I usually do. 14 00:00:41,180 --> 00:00:43,820 So you may find yourself periodically needing to pause and 15 00:00:43,820 --> 00:00:46,250 think through some of the details that I'm glossing over. 16 00:00:46,250 --> 00:00:50,311 The second comment is I'm going to give is sort of short shrift to applications 17 00:00:50,311 --> 00:00:52,103 of the union find data structure. 18 00:00:52,103 --> 00:00:55,922 That's not because there aren't any, there's plenty of applications of this 19 00:00:55,922 --> 00:00:59,910 data structure but for these videos we're really just going to immerse ourselves in 20 00:00:59,910 --> 00:01:03,502 the beauty and the depth of ideas behind the design of the data structures and 21 00:01:03,502 --> 00:01:05,830 especially the analysis of their performance. 22 00:01:06,990 --> 00:01:10,190 The final comment is to keep in mind that this is some seriously 23 00:01:10,190 --> 00:01:12,040 next level material. 24 00:01:12,040 --> 00:01:15,910 It is totally normal that the first time you see this stuff, you find it confusing. 25 00:01:15,910 --> 00:01:18,070 You find it difficult to understand. 26 00:01:18,070 --> 00:01:20,220 So confusion should not discourage you. 27 00:01:20,220 --> 00:01:23,550 It does not represent any intellectual failing on your part. 28 00:01:23,550 --> 00:01:28,410 Rather, keep in mind, it represents an opportunity to get even smarter. 29 00:01:28,410 --> 00:01:31,098 So, with those comments in mind, let's go ahead and 30 00:01:31,098 --> 00:01:35,357 proceed to a different approach to the Union-find data structure via lazy unions. 31 00:01:38,011 --> 00:01:40,962 So let's have one slide with a quick review of what we already know about 32 00:01:40,962 --> 00:01:42,400 the Union-find data structure. 33 00:01:42,400 --> 00:01:45,990 Recall we discussed this in the context of a fast implementation of 34 00:01:45,990 --> 00:01:50,760 Kruskal's minimum spanning tree algorithm The raison d'etre of a union-find 35 00:01:50,760 --> 00:01:55,980 data structure is to maintain a partition of a universe of objects. 36 00:01:55,980 --> 00:02:00,005 So in the context of Kruskal's algorithm, the objects were the vertices, and 37 00:02:00,005 --> 00:02:03,969 the groups we wanted to maintain were the connected components with respect to 38 00:02:03,969 --> 00:02:06,205 the edges we committed ourselves to so far. 39 00:02:08,482 --> 00:02:10,970 The data structure should support two operations. 40 00:02:10,970 --> 00:02:14,450 No prizes for guessing the names of those two operations. 41 00:02:14,450 --> 00:02:15,900 First is the find operation. 42 00:02:15,900 --> 00:02:20,720 This is given an object from the universe return the name of that object's group. 43 00:02:23,370 --> 00:02:26,670 So for example if X was in the middle of this square that I put 44 00:02:26,670 --> 00:02:30,160 on the right representing the universe capital X we would return a C3. 45 00:02:30,160 --> 00:02:32,040 The name of the group that contains X. 46 00:02:34,530 --> 00:02:38,510 So in the context of Kruskal's algorithm, we use find operation to check for cycles. 47 00:02:38,510 --> 00:02:41,290 How did we know whether adding a new edge would create a cycle with respect to 48 00:02:41,290 --> 00:02:42,350 the edges we've already chosen? 49 00:02:42,350 --> 00:02:43,330 Well, it would be if and 50 00:02:43,330 --> 00:02:47,040 only if the end points of that edge were already in the same connected component. 51 00:02:47,040 --> 00:02:50,309 That is, if and only if the two find operations returned the same answer. 52 00:02:52,270 --> 00:02:55,810 In the Union Operation you're given out one object but you call them x and y, 53 00:02:55,810 --> 00:02:59,630 and then the responsibility of the operation is to merge the groups 54 00:02:59,630 --> 00:03:00,790 that contain x and y. 55 00:03:00,790 --> 00:03:04,690 So for example in the picture y was in C4, x was as before in C3. 56 00:03:04,690 --> 00:03:08,592 Then your job is to fuse the groups C3 and C4 into a single group. 57 00:03:11,024 --> 00:03:12,954 In the context of Kruskal's algorithm, 58 00:03:12,954 --> 00:03:15,280 we needed this operation when we added a new edge. 59 00:03:15,280 --> 00:03:19,580 This fused together two of the existing connecting components into one. 60 00:03:19,580 --> 00:03:21,410 So that was exactly the union operation. 61 00:03:23,130 --> 00:03:25,960 So we all ready went through one implementation of union and find. 62 00:03:25,960 --> 00:03:29,070 That was order to get a blazingly fast implementation of Kruskal's algorithm. 63 00:03:29,070 --> 00:03:30,876 So let's just review how that works. 64 00:03:33,546 --> 00:03:36,682 With each group, we associated a linked structure, so 65 00:03:36,682 --> 00:03:39,770 each object had one pointer associated with it. 66 00:03:39,770 --> 00:03:41,720 And the invariant was in a given group, 67 00:03:41,720 --> 00:03:44,440 there was some representative leader object. 68 00:03:44,440 --> 00:03:47,620 And everybody in that group pointed to the leader of that group. 69 00:03:50,070 --> 00:03:53,620 So for example, on the right I'm showing you a group with three objects, x, y, and 70 00:03:53,620 --> 00:03:56,080 z, and x would be the leader of this group. 71 00:03:56,080 --> 00:03:58,370 All three of the objects point directly to x, and 72 00:03:58,370 --> 00:04:01,950 that would just be the output of the find operation, the leader of the group. 73 00:04:01,950 --> 00:04:03,540 We use that as the group name. 74 00:04:05,030 --> 00:04:06,130 Now the part which is cool and 75 00:04:06,130 --> 00:04:09,860 obvious about this approach is our find operations take constant time. 76 00:04:09,860 --> 00:04:12,510 All you do is return the leader of the given object. 77 00:04:12,510 --> 00:04:15,910 Now the tricky part was analyzing the cost of Union operations. 78 00:04:15,910 --> 00:04:18,710 So the problem here is that to maintain the invariant that 79 00:04:18,710 --> 00:04:20,540 every object of a group points to its leader. 80 00:04:20,540 --> 00:04:24,130 When you fuse two groups, you have to update a bunch of the objects' leaders. 81 00:04:24,130 --> 00:04:25,950 We'd only have one leader for the one new group. 82 00:04:27,630 --> 00:04:31,830 The simple, but totally crucial, optimization that we discussed was 83 00:04:31,830 --> 00:04:36,160 when two groups merge you update the leader pointers of all objects in 84 00:04:36,160 --> 00:04:39,450 the smaller group to point to the leader of the bigger group. 85 00:04:39,450 --> 00:04:42,400 That is the new fused group inherits the leader 86 00:04:42,400 --> 00:04:45,140 from the bigger of its constituent parts. 87 00:04:45,140 --> 00:04:49,290 If we do that optimization, it's still the case that a single union might take linear 88 00:04:49,290 --> 00:04:55,820 time fade event time but a sequence of n unions takes only big o of n log in time. 89 00:04:55,820 --> 00:04:59,934 And that's because each object endures at most a logarithmic number of leader 90 00:04:59,934 --> 00:05:03,988 updates, cause every time it's leader pointer gets updated the population 91 00:05:03,988 --> 00:05:06,059 of the group that it inhabits doubles. 92 00:05:07,831 --> 00:05:11,751 So in this sequence of videos we are going to discuss a different approach to 93 00:05:11,751 --> 00:05:14,191 implementing the union find data structure. 94 00:05:15,710 --> 00:05:17,260 And let's just go all in and 95 00:05:17,260 --> 00:05:22,300 try to get away with updating only a single pointer in each union. 96 00:05:23,960 --> 00:05:25,380 All right, well how are we going to do that? 97 00:05:25,380 --> 00:05:27,061 Well, I think if we look at a picture, 98 00:05:27,061 --> 00:05:29,230 it'll be clear what the approach is going to be. 99 00:05:31,173 --> 00:05:32,670 So let's look at a simple example. 100 00:05:32,670 --> 00:05:35,680 There's just six objects in the world and currently they're in two groups. 101 00:05:35,680 --> 00:05:38,190 One, two, and three are in one group with leader one. 102 00:05:38,190 --> 00:05:41,640 Four, five, and six are in the second group with leader four. 103 00:05:41,640 --> 00:05:45,940 So, with our previous implementation of union find and the two groups have equal 104 00:05:45,940 --> 00:05:49,230 size, so we pick one arbitrarily to represent the new leader. 105 00:05:49,230 --> 00:05:51,230 Let's say four is going to be the new leader. 106 00:05:51,230 --> 00:05:53,652 And now we update objects one, two, and three so 107 00:05:53,652 --> 00:05:56,324 that they point directly to their new leader, four. 108 00:05:57,650 --> 00:06:01,635 And so the new idea is very simple let's just sort of as a short hand for 109 00:06:01,635 --> 00:06:05,110 re-wiring all of one, two and three to point to four, 110 00:06:05,110 --> 00:06:09,890 let's just re-wire one pointer to point to four and then it's understood that two and 111 00:06:09,890 --> 00:06:14,340 three as descendants of one also now have the new leader four as well. 112 00:06:16,060 --> 00:06:19,880 So as a result, we do again get a directed tree afterwards, 113 00:06:19,880 --> 00:06:21,900 but we don't get one with just depth of one. 114 00:06:21,900 --> 00:06:24,110 We don't get as shallow, pushy a tree. 115 00:06:24,110 --> 00:06:26,980 We get a deeper one that has two levels below the root. 116 00:06:29,295 --> 00:06:31,635 So, let me give you another way of thinking about this in terms of an array 117 00:06:31,635 --> 00:06:32,725 representation. 118 00:06:32,725 --> 00:06:35,965 And here I also want to make a point that why conceptually it's going 119 00:06:35,965 --> 00:06:38,925 to be very useful to think of these union find data structure in terms of these 120 00:06:38,925 --> 00:06:40,270 directed trees. 121 00:06:40,270 --> 00:06:44,015 You're actually implementing this data structure this is not how you'd do it. 122 00:06:44,015 --> 00:06:47,450 You wouldn't bother with actually pointers between the different objects. 123 00:06:47,450 --> 00:06:50,500 You'd just have a simply array indexed by the nodes. 124 00:06:50,500 --> 00:06:54,420 And in the entry corresponding to a node i you would just store the name 125 00:06:54,420 --> 00:06:55,930 of i's parent. 126 00:06:55,930 --> 00:07:00,710 For instance if we reframe this exact same example in an array representation 127 00:07:00,710 --> 00:07:02,370 before we do the union what do we got? 128 00:07:02,370 --> 00:07:05,480 Well objects one, two, and three all have parent one. 129 00:07:05,480 --> 00:07:08,420 Objects four, five, and six all have parent four. 130 00:07:08,420 --> 00:07:10,890 So in the array we'd have a one. 131 00:07:10,890 --> 00:07:13,820 In the first three entries and a four in the second three entries. 132 00:07:14,830 --> 00:07:18,730 So in the old solution we have to update the parent pointers of objects one two and 133 00:07:18,730 --> 00:07:20,530 three, they're all going to point directly to four, so 134 00:07:20,530 --> 00:07:22,100 that gives us an array entirely of fours. 135 00:07:22,100 --> 00:07:27,330 Whereas in the new solution the only node whose parent 136 00:07:27,330 --> 00:07:32,050 pointer gets updated is that of object one, it gets updated from itself to four. 137 00:07:34,500 --> 00:07:36,388 And in a particular objects two and 138 00:07:36,388 --> 00:07:39,720 three still point to object one not directly to object four. 139 00:07:42,090 --> 00:07:43,960 So that's how union works so in simple example. 140 00:07:43,960 --> 00:07:45,125 How does it work in general? 141 00:07:45,125 --> 00:07:48,980 Well, in general, you're given two objects, each belongs to some group. 142 00:07:48,980 --> 00:07:52,300 You can think of these two groups conceptually as directed trees. 143 00:07:52,300 --> 00:07:56,660 If you follow all the parent corners, all parent corners lead to the root vertex. 144 00:07:56,660 --> 00:08:01,388 We identify those root vertices with the leaders of these two groups. 145 00:08:01,388 --> 00:08:05,110 Now when you have to merge the two trees together, how do you do it? 146 00:08:05,110 --> 00:08:06,580 Well, you pick one of the groups and 147 00:08:06,580 --> 00:08:08,960 you look at its roots currently it points to itself. 148 00:08:08,960 --> 00:08:12,490 You can change its parent pointer to point to the other groups leader. 149 00:08:12,490 --> 00:08:17,834 That is you install one of the two roots as a new child of the other trees root. 150 00:08:19,742 --> 00:08:23,013 So I hope the pros and cons of this alternative approach to the union find 151 00:08:23,013 --> 00:08:25,290 data structure are intuitively clear. 152 00:08:25,290 --> 00:08:30,050 The win, the pro, comes in the union operation, which is now very simple. 153 00:08:30,050 --> 00:08:32,630 Now you might be tempted to just say union takes constant time, 154 00:08:32,630 --> 00:08:35,900 because all you do is update one pointer, It's actually not so simple, right? 155 00:08:35,900 --> 00:08:38,920 Because remember you're given two objects, x and y, and 156 00:08:38,920 --> 00:08:40,750 in general you're not linking x and y together. 157 00:08:40,750 --> 00:08:42,550 They might be somewhere deep in a tree. 158 00:08:42,550 --> 00:08:44,140 You're linking roots together. 159 00:08:44,140 --> 00:08:47,880 So what is true is that the union operation reduces two 160 00:08:47,880 --> 00:08:49,310 two indications of find. 161 00:08:49,310 --> 00:08:51,110 You have to find x's root. 162 00:08:51,110 --> 00:08:53,950 R sub 1, you have to y's root r sub 2, 163 00:08:53,950 --> 00:08:59,110 and then you just do constant time linking either r1 to r 2 or vice versa. 164 00:08:59,110 --> 00:09:03,080 Now the issue with this lazy union approach is that it's not at all obvious, 165 00:09:03,080 --> 00:09:07,010 and in fact it's not going to be true, that find operation takes constant time. 166 00:09:07,010 --> 00:09:10,520 Remember, previously when we actually did this hard work of updating all these 167 00:09:10,520 --> 00:09:13,860 leader pointers every time we had a union it guaranteed that whenever there is 168 00:09:13,860 --> 00:09:17,680 a find boom, we just look in the field, we return the leader pointer, we're done. 169 00:09:17,680 --> 00:09:21,690 Here, parent pointers do not point directly to roots rather 170 00:09:21,690 --> 00:09:26,430 you have to traverse in general a sequence of parent pointers from a given object x 171 00:09:26,430 --> 00:09:29,170 to go all the way up to the root of the correspondence tree. 172 00:09:32,070 --> 00:09:33,350 So because of these trade offs. 173 00:09:33,350 --> 00:09:36,260 These pros and cons, it's really not obvious at all whether this 174 00:09:36,260 --> 00:09:39,570 lazy union approach to the union find data structure is a good idea. 175 00:09:39,570 --> 00:09:41,038 Whether it's going to have any pay offs. 176 00:09:41,038 --> 00:09:44,090 So that's going to take really some quite subtle analysis which is the main 177 00:09:44,090 --> 00:09:45,550 point of the lectures to come. 178 00:09:45,550 --> 00:09:47,790 It's also going to take a couple of optimizations and 179 00:09:47,790 --> 00:09:49,800 the next video will start with the first one. 180 00:09:49,800 --> 00:09:51,920 You might be wondering, okay, so when you do a union and 181 00:09:51,920 --> 00:09:55,430 you have to install one root under the other, how do you make that choice? 182 00:09:55,430 --> 00:09:58,270 So the right answer is something called union by rank. 183 00:09:58,270 --> 00:09:59,514 That's coming up next.