So in these next few videos, we're going to continue our discussion of the minimum cost spanning tree problem, and I'm going to focus on a second good greedy algorithm for the problem, namely Kruskal's algorithm. Now, we already have an excellent algorithm for computing the minimum cost spanning tree in Prim's algorithm, so you might wonder why bother spending the time learning a second one. Well, let me give you three reasons.

The first reason is that this is just a cool algorithm, definitely a candidate for the greatest hits; it's something I think you should know. It's competitive with Prim's algorithm both in theory and in practice, so it's another great greedy solution for the minimum cost spanning tree problem. The second reason is that it'll give us an opportunity to learn a new data structure, one we haven't discussed yet in this course, called the union-find data structure. In much the same way that we used heaps to get a super fast implementation of Prim's algorithm, we'll use a union-find data structure to get a super fast implementation of Kruskal's algorithm. So that'll be a fun topic just in its own right.
The third reason is that there are some very cool connections between Kruskal's algorithm and certain types of clustering algorithms, so that's a nice application we'll spend some time talking about. I'll discuss how natural greedy algorithms in a clustering context are best understood as variants of Kruskal's minimum spanning tree algorithm.

Let me just briefly review some of the things I expect you to remember about the minimum cost spanning tree problem. The input, of course, is an undirected graph G in which each edge has a cost. The responsibility of the algorithm is to output a spanning tree, that is, a subgraph which has no cycles and is connected, meaning there's a path between each pair of vertices. Amongst the potentially exponentially many spanning trees, the algorithm is supposed to output the one with the smallest cost, the smallest sum of edge costs.

Let me also reiterate the standing assumptions that we've had throughout the minimum spanning tree lectures. First of all, we're assuming that the graph is connected; that's obviously necessary for it to have any spanning trees. That said, Kruskal's algorithm actually extends in a really easy, elegant way to the case where G is disconnected, but I'm not going to talk about that here.
Secondly, remember, we're going to assume that all of the edge costs are distinct, that is, there are no ties. Now don't worry, Kruskal's algorithm is just as correct if there are ties amongst the edge costs; I'm just not going to give you a proof that covers that case, but a proof does indeed exist.

Finally, of the various machinery that we developed to prove the correctness of Prim's algorithm, perhaps the most important and most subtle point was what's called the cut property. This is a condition which guarantees you're not making a mistake in a greedy algorithm; it guarantees that a given edge is indeed in the minimum spanning tree. Remember, the cut property says: if you have an edge of a graph and you can find just a single cut for which this edge is the cheapest one crossing it, that is, the edge e crosses the cut and every other edge that crosses it is more expensive, then that certifies the presence of this edge in the minimum spanning tree and guarantees that it's a safe edge to include. We'll definitely be using that again when we prove Kruskal's algorithm correct.

So, as with Prim's algorithm, before I hit you with the pseudocode, let me just show you how the algorithm works in an example. I think you'll find it very natural.
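The hypothesis of the cut property can be checked directly: gather the edges crossing a given cut and see whether the candidate edge is the cheapest among them. Here's a minimal sketch of that check; the function name and the (cost, u, v) edge representation are illustrative choices, not something from the lecture.

```python
# A direct check of the cut property's hypothesis: given one side A of a
# cut, is edge e the cheapest edge crossing the cut? If so, the cut
# property certifies that e belongs to the minimum spanning tree.
# The (cost, u, v) edge representation is an illustrative choice.

def is_cheapest_crossing(edges, A, e):
    """edges: list of (cost, u, v) tuples; A: set of vertices on one side."""
    crossing = [(c, u, v) for (c, u, v) in edges
                if (u in A) != (v in A)]   # exactly one endpoint inside A
    # With distinct edge costs, min() picks the unique cheapest crossing edge.
    return bool(crossing) and min(crossing) == e
```

Note that certifying an edge only requires exhibiting *some* cut for which this check passes; different edges of the tree will generally need different cuts.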
So let's look at the following graph with five vertices. The graph has seven edges, and I've annotated them with their edge costs in blue. Here's the big difference in philosophy between Prim's algorithm and Kruskal's algorithm. In Prim's algorithm, you insisted on growing like a mold from a starting point, always maintaining connectivity and spanning one new vertex in each iteration. Kruskal's algorithm is going to throw out the desire to have a connected subgraph at each step of the iteration; it will be totally content to grow a tree in parallel, in lots of simultaneous little pieces, having them coalesce only at the very end of the algorithm.

So while in Prim's algorithm we were only allowed to pick the cheapest edge subject to the constraint of spanning some new vertex, in Kruskal's algorithm we're just going to pick the cheapest edge that we haven't looked at yet. Now, there is an issue, of course: we want to construct a spanning tree at the end, so we certainly don't want to create any cycles, and we'll skip over edges that would create cycles. But other than that constraint, Kruskal's algorithm just looks at the cheapest remaining edge and picks it if it creates no cycle. So let's look at this five-vertex example. Again, there is no starting point.
We're just going to look at the cheapest edge overall. That's obviously this unit-cost edge, and we're going to include it in our tree. Right? Why not pick the cheapest edge? It's a greedy algorithm. So what do we do next? Well, now we have this edge of cost 2; that looks good, so let's go ahead and pick that one. Notice these two edges are totally disjoint, so we are not maintaining connectivity of our subgraph at each iteration of Kruskal's algorithm.

Now, it just so happens that when we look at the next edge, the edge of cost 3, we fuse together the two disjoint pieces that we had previously, and we happen to have one connected piece. Here's where it gets interesting. When we look at the next edge, the edge of cost 4, we notice that we're not allowed to pick it. Why? Well, that would create a triangle with the edges of costs 2 and 3, and that of course is a no-no: we want a spanning tree at the end of the day, so we can't have any cycles. So we skip over the 4 because we have no choice, we can't pick it, and we move on to the 5, and the 5 is fine. When we pick the edge of cost 5, there are no cycles, so we go ahead and include it.
And now we have a spanning tree and we stop. Or, if you prefer, you could think of it as if we do consider the edge of cost 6: that would create a triangle with the edges of costs 3 and 5, so we skip the 6. Then, for completeness, we think about considering the 7, but that would form a triangle with the edges of costs 1 and 5, so we skip that as well. So after this single scan through the edges in sorted order, we find ourselves with these four pink edges. In this case, it's a spanning tree, and as we'll see, not just in this graph but in every graph, it's actually the minimum cost spanning tree.

So, with the intuition hopefully solidly in place, I don't think the following pseudocode will surprise you. We want to get away with a single scan through the edges in sorted order, so obviously, in a preprocessing step, we take the unsorted array of edges and sort them by edge cost. To keep the notation in the pseudocode simple, let me, for the purposes of the algorithm description only, rename the edges 1, 2, 3, 4, all the way up to m, conforming to this sorted order. The algorithm is just going to scan through the edges in this newly found sorted order. We're going to call the tree-in-progress capital T, like we did in Prim's algorithm, and now we're just going to zip through the edges once in sorted order.
We take an edge unless it's obviously a bad idea, and here a bad idea means it creates a cycle; that's a no-no. As long as there's no cycle, we go ahead and include the edge. And that's it: after you finish the for loop, you just return the tree capital T. It's easy to imagine various optimizations you could do. For example, once you've added enough edges to have a spanning tree, that is, n - 1 edges, where n is the number of vertices, you can go ahead and abort the for loop and terminate the algorithm. But let's just keep things simple and analyze this three-line version of Kruskal's algorithm in this lecture.

So, just like in our discussion of Prim's algorithm, what I want to do is first focus on why Kruskal's algorithm is correct. Why does it output a spanning tree at all? And the spanning tree that it outputs, why on earth should it have the minimum cost amongst all spanning trees? Then, once we're convinced of the correctness, we'll move on to a naive running time implementation and finally a fast implementation using suitable data structures.
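The three-line algorithm just described can be sketched in code, jumping ahead slightly by using a union-find structure for the cycle check (two endpoints already in the same piece means the edge would close a cycle). This is a sketch under my own representational choices, not the lecture's code: the (cost, u, v) tuple format and all names are illustrative, and this union-find uses the standard union-by-rank and path-compression optimizations.

```python
# A sketch of Kruskal's algorithm as described above, using a union-find
# data structure for the cycle check. The graph is represented as a list
# of (cost, u, v) tuples; this and all names are illustrative choices.

class UnionFind:
    """Tracks the connected pieces of the tree-in-progress T."""

    def __init__(self, n):
        self.parent = list(range(n))  # each vertex starts as its own piece
        self.rank = [0] * n

    def find(self, x):
        # Follow parent pointers to the piece's leader, compressing the path.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        # Merge the pieces containing x and y; return False if already merged.
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True

def kruskal(n, edges):
    """Return the MST of a connected graph on vertices 0..n-1.

    edges: a list of (cost, u, v) tuples with distinct costs.
    """
    T = []                             # the tree-in-progress, capital T
    uf = UnionFind(n)
    for cost, u, v in sorted(edges):   # single scan in sorted order
        if uf.union(u, v):             # False means (u, v) would close a cycle
            T.append((cost, u, v))
        if len(T) == n - 1:            # the optional early exit mentioned above
            break
    return T
```

A naive implementation would instead answer "would this edge close a cycle?" with a graph search through T at every iteration; swapping in the union-find check is exactly the speedup the later lectures develop.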