So let me now introduce you to the union-find data structure. Let's not lose sight of our goal: our goal is to be able to check for cycles in Kruskal's algorithm in constant time.

The first and most basic idea behind this union-find implementation is that we're going to maintain a linked structure for each group, that is, for each connected component with respect to the edges Kruskal has chosen thus far. By a linked structure, I just mean that each vertex of the graph is going to have an extra pointer field. In addition, within each group, each connected component, we're going to designate one vertex, and we don't care which one, as the leader vertex of that component.

All right, so what is the point of these extra pointers at each vertex, and what is the point of having these leaders? Well, a key invariant that we're going to maintain is that a given vertex, with its extra pointer, points to the leader vertex of its connected component. So, for example, maybe we have two different connected components with three vertices each, one containing the vertices u, v, and w, another containing the vertices x, y, and z. Any of the three vertices in a component could be that component's leader.
So perhaps u happens to be the leader vertex of the first component, and x happens to be the leader vertex of the second component. The invariant then says that every vertex should be pointing to the leader of its component. So v and w should be pointing to u, and u, which has its own pointer, should be pointing to itself. Similarly, in the second component, x points to itself, y points to x, and z also points to x. So, in blue here are the actual edges of the graph, and in green are these extra pointers that we've invented, where everybody points back to the leader of their component.

One very simple thing in this setup, which turns out to be a good idea, is that each component in effect inherits the name of its leader vertex. So we refer to a group, to a component, via the vertex that happens to be its representative, its leader.

And what's kind of amazing is that even this very simple scaffolding on the connected components is enough to support constant-time cycle checks, provided the invariant is satisfied. Well, how do we do that? Remember that checking whether adding the edge (u, v) is going to create a cycle boils down to checking whether there's already a path between u and v.
Well, there's already a path between u and v if and only if they're in the same connected component. Given two vertices u and v, how do we know whether they're in the same connected component? We just follow the respective leader pointers and see if we wind up at the same place. If they're in the same component, we get the same leader; if they're in different components, we get different leaders. So checking for a cycle just involves comparing the leader pointers of u and v for equality, and that is clearly constant time.

More generally, the way you implement the Find operation for this flavor of disjoint-union data structure is that you simply return the leader pointer of the object you are handed. So if you're given a vertex, you just follow the leader pointer of that vertex, and you return wherever you wind up.

So that's pretty excellent: as long as the invariant of this simple data structure is satisfied, we have our desired implementation of constant-time cycle checks. But, and this is certainly a recurring theme in our data structure discussions, whenever you have a data structure that needs to maintain an invariant, you have to think about the operations that change the data structure, like, in this case, a union fusing two groups together.
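To make this concrete, here is a minimal Python sketch of the leader-pointer idea and the constant-time cycle check. This is an illustration added to the transcript, not the course's reference code; the class and function names are my own.

```python
# A minimal sketch of the leader-pointer idea (illustrative names,
# not the course's reference implementation).
class UnionFind:
    def __init__(self, vertices):
        # Invariant: leader[v] is the leader of v's connected component.
        # Initially every vertex is its own component, hence its own leader.
        self.leader = {v: v for v in vertices}

    def find(self, v):
        # Constant time: just return the leader pointer.
        return self.leader[v]

def creates_cycle(uf, a, b):
    # Adding edge (a, b) creates a cycle iff a and b already share a leader.
    return uf.find(a) == uf.find(b)

# The two-component example from the lecture: {u, v, w} led by u,
# and {x, y, z} led by x.
uf = UnionFind(["u", "v", "w", "x", "y", "z"])
uf.leader["v"] = uf.leader["w"] = "u"
uf.leader["y"] = uf.leader["z"] = "x"
```

With this setup, `creates_cycle(uf, "v", "w")` reports a cycle, since both vertices lead to u, while `creates_cycle(uf, "w", "y")` does not, since the leaders u and x differ.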
You have to worry: does the invariant get destroyed when you do that operation? And if it does, how are you going to restore the invariant without doing undue work?

So in the present context of Kruskal's algorithm, here's how this plays out. We're happily doing our constant-time cycle checks. Whenever an edge creates a cycle, we don't do anything: we skip the edge, we don't change our data structure, and we move on. The issue is when we have a new edge that doesn't create a cycle, that is, when the cycle check comes back negative. Now Kruskal's algorithm dictates that we add this edge to the set capital T that we're building, and that fuses two connected components together. But remember, we have this invariant: every vertex should point to the leader of its component. Well, if we had component A and component B, every vertex was pointing to the leader of its respective component. Now, when these components fuse into one, we've got to do some updating of leader pointers. In particular, there used to be two leaders; now there has to be just one, and we have to rewire the leader pointers to restore the invariant.

So, to make sure you're clear on this important problem, let me ask you the following question.
Consider the case when, at some point in Kruskal's algorithm, a new edge is added and two connected components fuse into one. Now, to restore the invariant, you have to do some leader pointer updates. In the worst case, asymptotically, how many leader pointer updates might be required in order to restore the invariant?

The answer to this question, somewhat alarmingly, is the third answer: restoring the invariant might require a number of pointer updates linear in the number of vertices n. One easy way to see that is to imagine the very last edge that Kruskal adds to the set T, the one that fuses the final two connected components into one. For all you know, those two components have exactly the same size, n/2 vertices each. You're going down from two leader pointers to one, and one of those sets of n/2 vertices is going to have to inherit the leader pointer from the other side. So one of the two sets is going to have n/2 vertices get their leader pointer updated.

So that's a bummer. We were hoping for a near-linear time bound, but if each one of our linear number of edge additions might trigger a linear number of leader pointer updates, that seems to give rise to a quadratic time bound.
But remember, when I introduced the union-find data structure, I said that first idea was only idea number one, so presumably there's an idea number two. And here it is, and it's very natural: if you were coding up an implementation of the union-find data structure, you would probably do this optimization yourself.

All right, so consider the moment in time at which some component A merges with some component B. Each of these two components currently has its own respective leader vertex, and all vertices in each group point to that leader vertex. Now, when they fuse, what are you going to do? The first obvious thing is to say: let's not bother computing some totally new leader; let's either reuse the leader from group A or the leader from group B. That way, if, for example, we retain the leader from group A, the only leader pointers that need to be rewired are the ones that come from component B. The vertices in component A can keep their same leader vertex and their same leader pointers as before. So that's the first point: let's just have the new union of two components inherit the leader of one of its constituent parts.

Now, if you're going to retain one of the two leaders, which one are you going to retain? Maybe one of the components has 1,000 vertices.
The other component only has 100 vertices. Well, given the choice, you're certainly going to keep the leader from the bigger, first component. That way, you only have to rewire the 100 leader pointers of the second group. If you instead kept the leader of the second group, you would have to rewire the 1,000 pointers from the first group, and that seems silly and wasteful. So the obvious way to implement a merge is to keep the leader of the bigger group and rewire everybody from the second, smaller group.

You should notice that, in order to actually implement this optimization, where you always retain the leader of the larger group, you have to be able to quickly determine which of the two groups is larger. But you can augment the data structure we've discussed to facilitate that: with each group, you just keep a count of how many vertices are in that group. So you maintain a size field for each group. That allows you to check, in constant time, the populations of two different groups and figure out, again in constant time, which one is bigger. And notice that when you fuse two groups together, it's easy to maintain the size field: it's just the sum of the sizes of the two constituent parts.

So let me now revisit the question from the previous slide.
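The merge rule just described, keep the larger group's leader and rewire only the smaller group's pointers, might be sketched as follows. This is again my own illustrative Python, not the course's code; here each group's member list doubles as its size field.

```python
# Union-find with the "rewire the smaller group" optimization
# (illustrative sketch, not the course's reference implementation).
class UnionFind:
    def __init__(self, vertices):
        self.leader = {v: v for v in vertices}
        # members[l] lists the vertices in the group led by l;
        # its length serves as the group's size field.
        self.members = {v: [v] for v in vertices}

    def find(self, v):
        return self.leader[v]  # constant-time Find

    def union(self, a, b):
        la, lb = self.find(a), self.find(b)
        if la == lb:
            return  # already the same group; nothing to do
        # Keep the leader of the larger group (constant-time size check)...
        if len(self.members[la]) < len(self.members[lb]):
            la, lb = lb, la
        # ...and rewire only the smaller group's leader pointers.
        for v in self.members[lb]:
            self.leader[v] = la
        # The fused group's size is the sum of the two parts' sizes.
        self.members[la].extend(self.members.pop(lb))
```

So after merging a 1,000-vertex group with a 100-vertex group, only the 100 pointers of the smaller group get rewired, exactly as in the discussion above.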
In the worst case, given this optimization, how many leader pointers, asymptotically, might you have to rewire when you merge two components together? Well, unfortunately, the song remains the same. The answer is still the third one, and for exactly the same reason as on the previous slide. It still might be the case that, say, in the final iteration of Kruskal's algorithm, you're merging two components that both have size n/2. So no matter which leader you choose, you're going to be stuck updating the leader pointers of n/2, that is, theta of n, vertices. So while this is clearly a smart practical optimization, it doesn't seem to be buying us anything in our asymptotic analysis of the running time.

However, what if we think about the work done in all of these leader pointer updates in the following different way? Rather than asking how many updates a single merging of two components might trigger, let's adopt a vertex-centric view. Suppose you're one of the vertices of this graph. Initially, at the beginning of Kruskal's algorithm, you're in your own isolated connected component: you just point to yourself; you're your own leader. Then, as Kruskal's algorithm runs its course, your leader pointer will periodically get updated.
At some point, you're no longer pointing to yourself; you're pointing to some other vertex. Then at some point your pointer gets updated again, and you're pointing to yet another vertex, and so on. How many times, over the entire trajectory of Kruskal's algorithm, in light of our new optimization, might you, as some vertex of this graph, have your leader pointer updated?

Well, here's the very cool answer: the answer is the second one. While it remains true that, even if you always take the union of two groups and inherit the leader pointer of the larger one, a given fusion might trigger a linear number of leader pointer updates, each vertex will only see its leader pointer updated a logarithmic number of times over the course of Kruskal's algorithm.

What is the reason for this? Well, suppose you're a vertex in some group that has, say, twenty vertices, so you're one of twenty. Now suppose at some point your leader pointer gets updated. Why did that happen? It means that your group of twenty vertices merged with some other group that has to be at least as big. Remember, your leader pointer only gets rewired in a fusion if you were in the smaller group.
So you're joining a group at least as big as yours, and therefore the size of the union, the size of your new community, your new connected component, is at least double the size of your previous one. So the bottom line is: every time you, as a vertex, have your leader pointer updated, the population of the component to which you belong is at least twice as large as before. Now, you start in a connected component of size one, and no connected component is going to have more than n vertices. So the number of doublings you might have to endure is at most log base two of n, and that bounds how many leader pointer updates you will see as a vertex in this graph.

So, in light of that very cool observation, we can now give a good running time analysis of Kruskal's algorithm using the union-find data structure, using this scaffolding on the component structure to do cycle checks. We have not, of course, changed the preprocessing step: we're sorting the edges from cheapest to most expensive at the beginning of the algorithm, and that still takes big O of m log n time. So what work do we have to do beyond this sorting preprocessing step? Well, fundamentally, Kruskal's algorithm is just about these cycle checks.
In each iteration of the for loop, we have to check whether adding a given edge would create a cycle with the edges we've already added. And the whole point of the union-find scaffolding, these linked structures, is that we can check for cycles in constant time: given an edge, we look at the leader pointers of its endpoints, and a cycle will be created if and only if those leader pointers are identical. So for the cycle checking, we do only constant time in each of the O(m) iterations.

The final source of work is maintaining this union-find data structure, restoring the invariant each time we add a new edge to the set capital T. And here the good idea is that we're not going to bound the worst-case number of leader pointer updates per iteration, because that could be too expensive: that could be up to linear in just a single iteration. Rather, we're going to do a global analysis, thinking about the total number of leader pointer updates that ever occur. On the previous slide, we observed that a single vertex will only endure a logarithmic number of leader pointer updates. So, summing over all n vertices, the total work done for leader pointer updates is only O(n log n).
So even though we might do a linear amount of pointer updates in just one iteration of this for loop, we still have this global upper bound of n log n on the total number of leader pointer updates. Very cool.

So, looking over this tally, we observe the stunning fact that the bottleneck for this implementation of Kruskal's algorithm is actually the sorting. We do more work in the preprocessing step, m log n, than we do in the entire for loop, which is only m plus n log n. So that gives us an overall running time bound of big O of m log n, matching the theoretical performance we achieved with Prim's algorithm implemented with heaps. So, just as with our heap-based implementation of Prim's algorithm, we get a running time which is almost linear, only a logarithmic factor slower than reading the input. And just as with Prim's algorithm, you should find Kruskal's algorithm to be very competitive in practice.
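Putting the whole discussion together, here is a sketch of Kruskal's algorithm on top of this union-find structure. Again, this is my own illustrative Python under the assumption that vertices are labeled 0 through n-1 and edges arrive as (cost, u, v) triples; it is not the course's reference code.

```python
def kruskal(n, edges):
    """Return a minimum spanning tree of an undirected connected graph.

    n: number of vertices, labeled 0..n-1.
    edges: iterable of (cost, u, v) triples.
    """
    leader = list(range(n))               # leader pointer per vertex
    members = {v: [v] for v in range(n)}  # group member lists, keyed by leader
    T = []
    for cost, u, v in sorted(edges):      # O(m log n) sorting preprocessing
        if leader[u] == leader[v]:        # constant-time cycle check
            continue                      # skip: edge would create a cycle
        T.append((cost, u, v))
        la, lb = leader[u], leader[v]
        if len(members[la]) < len(members[lb]):
            la, lb = lb, la               # keep the larger group's leader
        for w in members[lb]:             # rewire only the smaller group
            leader[w] = la
        members[la].extend(members.pop(lb))
    return T

# A 4-vertex example: the square 0-1-2-3-0 plus one diagonal.
example = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 3, 0), (5, 0, 2)]
mst = kruskal(4, example)  # picks the edges of cost 1, 2, and 3
```

The sort dominates, just as in the analysis above: everything inside the loop is constant-time cycle checks plus the globally bounded leader pointer rewiring.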