So now we understand why Kruskal's algorithm is correct, why it always computes a minimum cost spanning tree. In this video, we'll turn our attention to implementation issues. We'll begin with a straightforward implementation of Kruskal's algorithm. That will give us a polynomial running time bound, which is good, but we'd like to do better. So then we'll show how deploying a suitable data structure, something you haven't seen before, the Union-Find data structure, allows us to speed up Kruskal's algorithm to be competitive with Prim's algorithm. That is, we'll get a near-linear running time bound of O(M log N).

So let's just briefly review the very elegant pseudocode for Kruskal's algorithm. It's a greedy algorithm: it considers the cheapest edges first, all the way up to the most expensive. So we begin with a sorting preprocessing step to put the edges in sorted order. For notational convenience, let's just rename the edges so that edge 1 is the cheapest and edge M is the most expensive. We then have our single linear scan in this for loop, and we just grab edges whenever we can, okay? So we maintain this evolving set capital T, which at the end of the algorithm will be our spanning tree. Now, what forces us to exclude an edge from this set capital T? Well, if it creates a cycle with the edges already chosen, obviously that's a no-go; we can't have cycles in our final output. But as long as including an edge doesn't create a cycle, we go ahead and optimistically include it. And as we've seen, this is a correct algorithm: it always outputs the minimum cost spanning tree.
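To make the discussion concrete, here is a minimal Python sketch of this straightforward implementation. The function name kruskal_mst, the representation of edges as (cost, u, v) triples over vertices 0..n-1, and the helper creates_cycle are my own illustrative choices, not part of the lecture's pseudocode; the cycle-check helper itself is sketched a bit further below, after we discuss how to do it.

```python
def kruskal_mst(n, edges):
    """Straightforward Kruskal: `edges` is a list of (cost, u, v) triples
    over vertices 0..n-1; returns the chosen tree edges."""
    edges.sort()                  # sorting preprocessing step: O(M log N)
    T = []                        # the evolving set capital T
    for cost, u, v in edges:      # single linear scan, cheapest edge first
        # only include the edge if it doesn't close a cycle with T
        if not creates_cycle(T, u, v):
            T.append((u, v))      # optimistically include the edge
    return T
```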
So, what would be the running time of this algorithm, if we just straightforwardly implement the pseudocode on this slide? Well, let's just consider the algorithm step by step. In the first step we sort the edges, and that's going to take O(M log N) time. Now don't forget, whenever we're speaking about graphs, we have the convention that M denotes the number of edges and N denotes the number of vertices. So you might justifiably wonder why I wrote M log N for the running time of the sorting step instead of M log M, since after all, what we're sorting are the edges, and there are M of them. Well, what I'm using here is that, in this context, we can use log N and log M interchangeably with each other inside big-O notation. Why is that true? Well, recall when we first discussed graphs in part one, we noticed that there can't be too many edges. The number of edges M is at most quadratic in the number of vertices; it's at most big-O of N squared. So if M is at most N squared, then log M is at most 2 log N, and the 2 is suppressed in the big-O notation. So log M and log N are interchangeable in this context. Notice that for the minimum cost spanning tree problem, you may as well assume that there are no parallel edges, that the graph is simple. If you have a bunch of parallel edges between a given pair of vertices, you can just throw out all but the cheapest one; that's the only one you'll ever need.

So, moving on to the main loop, it's pretty obvious how many iterations we have there: we have M iterations. So all we need to figure out is how much work we have to do in each iteration. So what is it each iteration needs to accomplish? It needs to check whether or not adding the current edge to the edges we've already chosen creates a cycle. I claim that can be done in time linear in the number of vertices; that is, it can be done in O(N) time. So how do we accomplish this? Well, we need two quick observations. First of all, and this is something we've seen in arguments in the previous videos, checking whether or not a new edge, say with endpoints U and V, is going to create a cycle boils down to checking whether or not there's already a path between U and V in the edges capital T chosen so far. If there is already a U-V path, adding this edge will close the loop and create a cycle. If there currently is no U-V path, then adding this edge will not create a cycle. The second observation is, well, how do we check if there's a path from U to V in the edges we've already chosen? We already know how to do that, just using graph search. So you can use breadth-first search, you can use depth-first search, it doesn't matter. You just start at the vertex U and you see whether you reach V or not. If you reach it, there's a path; if you don't reach it, there's no path. Breadth-first search, depth-first search, whatever: it takes time linear in the size of the graph that you're searching.
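Here is one hedged way the creates_cycle helper from the earlier sketch could be implemented, using breadth-first search restricted to the edges already in T. The adjacency-dictionary representation is an assumption of mine, not something specified in the lecture.

```python
from collections import deque, defaultdict

def creates_cycle(T, u, v):
    """Return True if adding edge (u, v) to the edges in T would create a
    cycle, i.e., if v is already reachable from u using only T's edges."""
    adj = defaultdict(list)          # adjacency lists for the edges in T
    for a, b in T:
        adj[a].append(b)
        adj[b].append(a)
    # breadth-first search from u, over the chosen edges only
    seen, queue = {u}, deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True              # a u-v path exists; the edge would close a cycle
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False                     # no u-v path, so the edge is safe to add
```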
And since we only need to search among the edges that are in capital T, and there are going to be at most N minus 1 of them, linear time in this context means O(N), O of the number of vertices, because that also bounds the number of edges in capital T among which we're searching for a path.

So, adding up all of this work, what do we have? We have the sorting preprocessing step, which takes O(M log N) time, and then we have these M iterations of the for loop, each taking O(N) time. The latter term dominates, so the overall running time is O(M times N). So this, coincidentally, is the same running time we got from the straightforward implementation of Prim's algorithm, and I'll make the same comments here. This is a reasonable running time; it's polynomial in the input size. It's way better than checking all of the exponentially many spanning trees that the graph might have. But we certainly would like to do better. We'd certainly love to have an implementation of Kruskal's algorithm that gets us a near-linear running time bound, and that's the plan. How are we going to do it? Well, really, the work that we're doing here over and over again, which is kind of a bummer, is these cycle checks. In every single iteration, we're spending time linear in the number of vertices to check for a cycle, and the question is, can we speed that up? And the Union-Find data structure will actually, believe it or not, allow us to check for a cycle in constant time.

So if we actually had a data structure that could implement constant-time cycle checks, then we'd have to spend only constant time for each iteration of this loop. So the loop overall would take only time linear in the number of edges, O(M). If we got that, then, believe it or not, the sorting preprocessing step would become the bottleneck in the running time of Kruskal's algorithm. Our running time would drop from O(M times N) down to near-linear, down to O(M log N).

So let me now tell you a little bit about this magical data structure that's going to give us constant-time cycle checks. I'm just going to give you the high-level picture, and how it connects to Kruskal's algorithm, on this slide. We'll look at the details of the data structure in the next video. I also want to warn you that I'm not going to discuss, in this pair of videos, the state of the art in Union-Find data structures.
I'm going to give you a fairly primitive version, but one that is nevertheless sufficient to give us our desired O(M log N) running time for Kruskal's algorithm. So if you're interested, there is some optional material about different implementations of Union-Find that use some super cool ideas, like union by rank and path compression, that give you different, and in some senses better, operation times. But the quick and dirty version of Union-Find that I'm going to discuss here is sufficient for our present needs.

And so the raison d'ĂȘtre of a Union-Find data structure is to maintain a partition of a set of objects. So in this picture, the overall rectangle is meant to denote a set of objects, and then C1, C2, C3, and C4 are disjoint subsets whose union is the entire set. That's what I mean by a partition of a group of objects. We're not going to ask too much of this data structure; we're only going to ask it to support two operations. No prizes for guessing what those two operations are called. In the find operation, we give it an object from this universe, and we ask the data structure to return to us the name of the group to which that object belongs. So for example, if we handed it an object in the middle of this rectangle, we'd expect it to return to us the name C3. The union operation, by contrast, takes as input the names of two groups, and what we want the data structure to do is to fuse those two groups together. That is, the objects in the first group and the objects in the second group should all coalesce and now be in one single group.
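As a rough sketch of the kind of quick-and-dirty Union-Find the lecture has in mind: each object can simply store the name of its group, so find is a single constant-time lookup. The class name UnionFind, the integer group labels, and the choice to relabel the smaller group on a union are my own assumptions here; the lecture's actual implementation details are deferred to the next video.

```python
class UnionFind:
    """Quick-and-dirty Union-Find. leader[x] is the name of the group
    containing object x, so find is a single O(1) lookup."""

    def __init__(self, n):
        self.leader = list(range(n))               # every object starts in its own group
        self.members = {i: [i] for i in range(n)}  # the objects in each group

    def find(self, x):
        return self.leader[x]                      # constant-time group lookup

    def union(self, g1, g2):
        # relabel the smaller group's members into the larger group, so any
        # single object is relabeled at most O(log n) times across all unions
        if len(self.members[g1]) < len(self.members[g2]):
            g1, g2 = g2, g1
        for x in self.members[g2]:
            self.leader[x] = g1
        self.members[g1].extend(self.members.pop(g2))
```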
So why might such a data structure be useful for speeding up Kruskal's algorithm? To see the connection, think of Kruskal's algorithm as working conceptually in the following way. Initially, when the algorithm starts and the set capital T is empty, each of the vertices is on its own, an isolated component. And then, each time Kruskal's algorithm adds a new edge to the set capital T, what that does is take two current connected components and fuse them into a single connected component.

So for example, toward the end of Kruskal's algorithm, maybe it's included enough edges that the tree capital T constructed so far has only four different connected components. And maybe it's about to add a new edge (U, V), where of course U and V must be in different connected components with respect to the edges chosen so far. So this new edge addition, at this iteration of Kruskal's algorithm, is going to fuse the connected components of U and V into a single one. That corresponds to taking the union of the groups to which U and V respectively belong.

So, to be a little more precise about it: what are going to be the objects contained in the Union-Find data structure in Kruskal's algorithm? They're going to correspond to the vertices; it's the vertices coalescing that we want to keep track of. And what are going to be the groups in the partition that we maintain? They're just going to correspond to the connected components with respect to the edges that Kruskal's algorithm has already committed to. And with these semantics, it's clear that every time Kruskal's algorithm adds a new edge to its set capital T, we have to invoke the union operation to fuse two connected components into one.
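Putting the pieces together, here is a hedged sketch of how the faster Kruskal's algorithm might look, with the Union-Find structure sketched above standing in for the graph-search cycle check; as before, the function and variable names are my own. The cycle check becomes two constant-time find calls: the new edge creates a cycle exactly when its endpoints already have the same leader.

```python
def kruskal_mst_fast(n, edges):
    """Kruskal with Union-Find cycle checks: two O(1) find calls per
    iteration, so the O(M log N) sort becomes the bottleneck."""
    edges.sort()                       # still the O(M log N) sorting step
    uf = UnionFind(n)                  # one singleton group per vertex
    T = []
    for cost, u, v in edges:
        g_u, g_v = uf.find(u), uf.find(v)
        if g_u == g_v:                 # same component already: would create a cycle
            continue
        T.append((u, v))
        uf.union(g_u, g_v)             # fuse the two connected components
    return T
```

For instance, on a triangle with edge costs 1, 2, and 3, this sketch returns the two cheapest edges and skips the third, since by then its endpoints share a leader.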