So let me now introduce you to the union-find data structure. Let's not lose sight of our goal: we want to check for cycles in Kruskal's algorithm in constant time. The first and most basic idea behind this union-find implementation is that we're going to maintain a linked structure for each group, that is, for each connected component with respect to the edges Kruskal has chosen so far. By a linked structure, I just mean that each vertex of the graph gets an extra pointer field. In addition, for each group, for each connected component, we're going to designate one vertex, and we don't care which one, as the leader vertex of that component. So what is the point of these extra pointers at each vertex, and what is the point of having these leaders? Well, the key invariant we maintain is that a given vertex, with its extra pointer, points to the leader vertex of its connected component. So, for example, maybe we have two connected components with three vertices each, one containing the vertices U, V, and W, the other containing the vertices X, Y, and Z. Any of the three vertices could be the leader of its component; perhaps U happens to be the leader of the first component and X happens to be the leader of the second. The invariant then says that every vertex should be pointing to the leader of its component. So V and W should point to U, and U, with its own pointer, should point to itself. Similarly, in the second component, X points to itself, and Y and Z both point to X. So in blue here are the actual edges of the graph, and in green are these extra pointers that we've invented, where everybody points back to the leader of its component. One very simple thing in this setup, which turns out to be a good idea, is that each component in effect inherits the name of its leader vertex: we refer to a group, to a component, via the vertex that happens to be its representative, its leader. And what's kind of amazing is that even this very simple scaffolding on the connected components is enough to get constant-time cycle checks, provided the invariant is satisfied. How do we do that? Remember that checking whether adding the edge (u, v) would create a cycle boils down to checking whether there's already a path between u and v, and there's already a path between u and v if and only if they're in the same connected component. Given two vertices u and v, how do we know if they're in the same connected component? We just follow their respective leader pointers and see if we wind up at the same place. If they're in the same component, we get the same leader; if they're in different components, we get different leaders. So checking for a cycle just involves, for each of u and v, comparing their leader pointers for equality, and that is clearly constant time. More generally, the way you implement the Find operation for this flavor of disjoint-union data structure is to simply return the leader pointer of the object you're handed. So if you're given a vertex, you follow the leader pointer of that vertex and return wherever you wind up.
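To make this concrete, here is a minimal sketch of the leader-pointer idea in Python. The class and method names are my own, not from the lecture, and it assumes vertices are hashable labels; the point is just that Find and the cycle check are single dictionary lookups.

```python
# Minimal sketch of the "leader pointer" idea (hypothetical names, not the lecture's code).
class LeaderUnionFind:
    def __init__(self, vertices):
        # Initially every vertex is its own one-element component, hence its own leader.
        self.leader = {v: v for v in vertices}

    def find(self, v):
        # Constant time: just return the leader pointer of v.
        return self.leader[v]

    def creates_cycle(self, u, v):
        # Adding edge (u, v) would create a cycle iff u and v already share a leader.
        return self.find(u) == self.find(v)

uf = LeaderUnionFind(['u', 'v', 'w', 'x', 'y', 'z'])
print(uf.creates_cycle('u', 'v'))  # False: every vertex starts in its own component
```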
So that's pretty excellent: as long as the invariant is satisfied in this simple data structure, we have our desired implementation of constant-time cycle checks. But, and this is certainly a recurring theme in our data structure discussions, whenever you have a data structure that needs to maintain an invariant, and you do an operation that changes the data structure, like a union here, fusing two groups together, you have to worry: does the invariant get destroyed by that operation? And if it does, how are you going to restore it without doing undue work? In the present context of Kruskal's algorithm, here's how this plays out. We're happily doing our constant-time cycle checks. Whenever an edge creates a cycle, we don't do anything: we skip the edge, we don't change our data structure, and we move on. The issue is when we have a new edge that does not create a cycle. Then Kruskal's algorithm dictates that we add this edge to the set capital T that we're building, and that fuses two connected components into one. But remember, we have this invariant: every vertex should point to the leader of its component. If we had component A and component B, each vertex was pointing to the leader of its respective component. Now, when these components fuse into one, we've got to do some updating of leader pointers. In particular, there used to be two leaders, now there has to be just one, and we have to rewire the leader pointers to restore the invariant. So, to make sure you're clear on this important problem, let me ask you the following question. Consider the moment when, at some point in Kruskal's algorithm, a new edge is added and two connected components fuse into one. To restore the invariant, you have to do some leader pointer updates. In the worst case, asymptotically, how many leader pointer updates might be required to restore the invariant? The answer to this question, somewhat alarmingly, is the third answer: it might require a number of pointer updates linear in the number of vertices n to restore the invariant. One easy way to see that is to imagine the very last edge that Kruskal adds to the set T, the one that fuses the final two connected components into one. For all you know, those two components have exactly the same size, n/2 vertices each. You're going down from two leaders to one, and one of those sets of n/2 vertices is going to have to inherit the leader pointer from the other side. So one of the two sets will have n/2 vertices get their leader pointer updated. So that's a bummer: we were hoping for a near-linear time bound, but if each of our linear number of edge additions might trigger a linear number of leader pointer updates, that seems to give rise to a quadratic time bound. But remember, when I introduced the union-find data structure, I said that first idea was idea number one, so presumably there's an idea number two. And here it is; it's very natural, and if you were coding up an implementation of a union-find data structure, you would probably do this optimization yourself. Alright. So, consider the moment in time at which some component A merges with some component B. Each of these two components currently has its own leader vertex, and all vertices of each group point to that leader vertex.
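To make that worst case concrete before we get to the fix, here is what an unoptimized merge might look like; this is my own sketch, not the lecture's code. When two components fuse, every vertex that followed one leader gets rewired to the other, which in the worst case, two components of n/2 vertices each, is already Theta(n) work for a single merge.

```python
# Unoptimized merge sketch: 'leader' maps each vertex to the leader of its component.
def naive_union(leader, u, v):
    lu, lv = leader[u], leader[v]
    if lu == lv:
        return                      # already the same component; nothing to do
    for w in leader:                # rewire every vertex that pointed to lv...
        if leader[w] == lv:
            leader[w] = lu          # ...so it now points to lu
```

(This version even scans all vertices just to locate the absorbed group; but even with extra bookkeeping to avoid that scan, the rewiring itself can be forced to touch n/2 vertices, which is the point of the question above.)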
Now, when they fuse, what are you going to do? The first obvious thing is to say: let's not bother computing some totally new leader; let's reuse either the leader from group A or the leader from group B. That way, if, for example, we retain the leader from group A, the only leader pointers that need to be rewired are the ones coming from component B; the vertices in component A keep the same leader vertex and the same leader pointers as before. So that's the first point: let's have the new union of two components inherit the leader of one of its constituent parts. Now, if you're going to retain one of the two leaders, which one are you going to retain? Maybe one of the components has 1,000 vertices and the other component only has 100 vertices. Given the choice, you're certainly going to keep the leader of the bigger component. That way, you only have to rewire the 100 leader pointers of the second group; if you kept the leader of the second group, you'd have to rewire the 1,000 pointers of the first group, and that seems silly and wasteful. So the obvious way to implement a merge is to keep the leader of the bigger group and rewire everybody from the smaller group. You should notice that, in order to actually implement this optimization, where you always retain the leader of the larger group, you have to be able to quickly determine which of the two groups is larger, but you can augment the data structure we've discussed to facilitate that. With each group, you just keep a count of how many vertices are in that group. So you maintain a size field for each group, which lets you check in constant time the populations of two groups and figure out, again in constant time, which one is bigger. And notice that when you fuse two groups together, it's easy to maintain the size field: it's just the sum of the sizes of the two constituent parts (there's a sketch of this merge below). So let me now revisit the question from the previous slide. In the worst case, given this optimization, how many leader pointers, asymptotically, might you have to rewire when you merge two components together? Well, unfortunately, the song remains the same. The answer is still the third one, and for exactly the same reason as on the previous slide: it still might be the case that, say, in the final iteration of Kruskal's algorithm, you're merging two components that both have size n/2. So no matter which leader you choose, you're going to be stuck updating the leader pointers of n/2, that is, Theta(n), vertices. So this is clearly a smart practical optimization, but it doesn't seem to be buying us anything in our asymptotic analysis of the running time. However, what if we think about the work done in all of these leader pointer updates in the following different way? Rather than asking how many updates a merging of two components might trigger, let's adopt a vertex-centric view. Suppose you're one of the vertices of this graph. Initially, at the beginning of Kruskal's algorithm, you're in your own isolated connected component: you point to yourself, you're your own leader. Then, as Kruskal's algorithm runs its course, your leader pointer will periodically get updated. At some point, you're no longer pointing to yourself, you're pointing to some other vertex. Then at some point your pointer gets updated again, you're pointing to yet another vertex, and so on.
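Before we run with that vertex-centric view, here is a sketch of the size-aware merge just described, extending the hypothetical LeaderUnionFind class from before. Keeping an explicit member list per leader doubles as the size field and lets the merge touch only the smaller group's vertices; again, the names are mine, not the lecture's.

```python
# Leader-pointer union-find with "union by size" (a sketch under the assumptions above).
class LeaderUnionFind:
    def __init__(self, vertices):
        self.leader = {v: v for v in vertices}
        # Member list per current leader; its length serves as the component's size field.
        self.members = {v: [v] for v in vertices}

    def find(self, v):
        return self.leader[v]

    def union(self, u, v):
        lu, lv = self.find(u), self.find(v)
        if lu == lv:
            return                                    # already one component
        if len(self.members[lu]) < len(self.members[lv]):
            lu, lv = lv, lu                           # retain the leader of the larger group
        for w in self.members[lv]:                    # rewire only the smaller group's pointers
            self.leader[w] = lu
        self.members[lu].extend(self.members[lv])     # new size = sum of the two old sizes
        del self.members[lv]
```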
How many times, over the entire trajectory of Kruskal's algorithm and in light of our new optimization, might you, as some vertex of this graph, have your leader pointer updated? Well, here's the very cool answer: the answer is the second one. So while it remains true that, even if you always have the union of two groups inherit the leader pointer of the larger one, a given fusion might trigger a linear number of leader pointer updates, each vertex will only see its leader pointer updated a logarithmic number of times over the course of Kruskal's algorithm. What is the reason for this? Well, suppose you're a vertex in some group that has, say, twenty vertices, so you're one of twenty. Now suppose at some point your leader pointer gets updated. Why did that happen? It means your group of twenty vertices merged with some other group that has to be at least as big; remember, your leader pointer only gets rewired in a fusion if you were in the smaller group. So you're joining a group at least as big as yours, and therefore the size of the union, the size of your new connected component, is at least double the size of your previous one. So the bottom line is: every time you, as a vertex, have your leader pointer updated, the population of the component to which you belong is at least twice as large as before. Now, you start in a connected component of size one, and a connected component is never going to have more than n vertices, so the number of doublings you might have to endure is at most log base two of n. And that bounds how many leader pointer updates you will see as a vertex in this graph. So, in light of that very cool observation, we can now give a good running time analysis of Kruskal's algorithm using the union-find data structure, using this scaffolding on the component structure to do cycle checks. We have not, of course, changed the preprocessing step: we sort the edges from cheapest to most expensive at the beginning of the algorithm, and that still takes O(m log n) time. So what work do we have to do beyond this sorting preprocessing step? Well, fundamentally, Kruskal's algorithm is just about these cycle checks. In each iteration of the for loop, we have to check whether adding a given edge would create a cycle with the edges we've already added. And the whole point of the union-find scaffolding, these linked structures, is that we can check for cycles in constant time: given an edge, we look at the leader pointers of its endpoints, and a cycle would be created if and only if those leader pointers are identical. So for the cycle checking, we only do constant work in each of the O(m) iterations. The final source of work is maintaining the union-find data structure, restoring the invariant each time we add a new edge to the set capital T. And here, the good idea is that we're not going to bound the worst-case number of leader pointer updates in each iteration, because that could be too expensive; it could be up to linear in just a single iteration. Rather, we're going to do a global analysis, thinking about the total number of leader pointer updates that ever occur. On the previous slide, we observed that a single vertex will only endure a logarithmic number of leader pointer updates. So, summing over all n vertices, the total work done for leader pointer updates is only O(n log n).
So even though we might do a linear number of pointer updates in just one iteration of this for loop, we still have this global upper bound of O(n log n) on the total number of leader pointer updates, which is very cool. So, looking over this tally, we observe the stunning fact that the bottleneck for this implementation of Kruskal's algorithm is actually the sorting: we do more work in the preprocessing step, O(m log n), than we do in the entire for loop, which is only O(m + n log n). So that gives us an overall running time bound of O(m log n), matching the theoretical performance that we achieved with Prim's algorithm implemented with heaps. So, just like with our heap-based implementation of Prim's algorithm, we get a running time which is almost linear, only a logarithmic factor slower than reading the input. And just like with Prim's algorithm, you should find Kruskal's algorithm to be very competitive in practice.
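Finally, to tie the pieces together, here is a self-contained sketch of Kruskal's algorithm with these leader pointers; the edge format and function name are my own assumptions, not from the lecture. Sorting dominates at O(m log n), each cycle check is a constant-time leader comparison, and the rewiring work summed over all merges is O(n log n).

```python
def kruskal_mst(vertices, edges):
    """Sketch of Kruskal's MST; 'edges' is assumed to be a list of (cost, u, v) tuples."""
    leader = {v: v for v in vertices}            # leader pointer per vertex
    members = {v: [v] for v in vertices}         # member list per leader (its length is the size)
    mst = []
    for cost, u, v in sorted(edges):             # sorting pre-processing: O(m log n)
        lu, lv = leader[u], leader[v]            # constant-time cycle check
        if lu == lv:
            continue                             # edge (u, v) would create a cycle; skip it
        mst.append((u, v, cost))
        if len(members[lu]) < len(members[lv]):
            lu, lv = lv, lu                      # retain the leader of the larger group
        for w in members[lv]:                    # rewire only the smaller group's pointers
            leader[w] = lu
        members[lu].extend(members[lv])
        del members[lv]
    return mst

# Tiny example graph: the cost-3 edge (a, c) is skipped because it would create a cycle.
edges = [(1, 'a', 'b'), (2, 'b', 'c'), (3, 'a', 'c'), (4, 'c', 'd')]
print(kruskal_mst(['a', 'b', 'c', 'd'], edges))  # [('a', 'b', 1), ('b', 'c', 2), ('c', 'd', 4)]
```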