1 00:00:03,240 --> 00:00:07,275 So this algorithm will prove the correctness of Kruskal's minimum cost 2 00:00:07,275 --> 00:00:10,765 spanning tree algorithm. So to prove this correctness theorem, 3 00:00:10,765 --> 00:00:13,464 let's fix an arbitrary connected input graph G. 4 00:00:13,464 --> 00:00:18,403 And let's let T star denote the output of Kruskal's algorithm when we invoke it on 5 00:00:18,403 --> 00:00:21,322 this input graph. So, just like with our high level proof 6 00:00:21,322 --> 00:00:24,361 plan for Prim's algorithm, we're going to proceed in three steps. 7 00:00:24,361 --> 00:00:28,395 We're first going to just establish the more modest goal, the Kruskal's algorithm 8 00:00:28,395 --> 00:00:31,235 outputs a spanning tree. We're not going to make any initial 9 00:00:31,235 --> 00:00:34,422 claims about optimality. So those are the first two steps, one to 10 00:00:34,422 --> 00:00:37,660 argue there's no cycles, one to argue that the output's connected. 11 00:00:37,660 --> 00:00:40,400 And then, in the third step, relying on the cut property. 12 00:00:40,400 --> 00:00:44,385 We'll say, it's not just a spanning tree, it's actually the minimum cost spanning 13 00:00:44,385 --> 00:00:46,427 tree. For this proof, I am going to assume 14 00:00:46,427 --> 00:00:50,462 you're familiar with the machinery we developed to, to prove Prim's algorithm 15 00:00:50,462 --> 00:00:52,666 is correct. So the first step of the proof is going 16 00:00:52,666 --> 00:00:55,491 to be very easy. So one important property of a Spanish 17 00:00:55,491 --> 00:00:59,339 [INAUDIBLE] there's no cycles and it's quite obvious looking at the pseudocode 18 00:00:59,339 --> 00:01:02,311 of Kruskal's algorithm that its not going to produce any cycles. 19 00:01:02,311 --> 00:01:05,770 Every edge that creates a cycle is explicitly excluded from the output. 20 00:01:05,770 --> 00:01:10,314 What's a little less obvious is that as long as the, input graph is connected, 21 00:01:10,314 --> 00:01:13,737 then Kruskal's algorithm will output a connected sub-graph. 22 00:01:13,737 --> 00:01:17,927 And therefore a spanning tree. So to argue the output's connected, we're 23 00:01:17,927 --> 00:01:22,353 going to have two sub-parts of the proof. The first thing I want you to do is 24 00:01:22,353 --> 00:01:27,327 recall what we call the empty cut lemma. So this was a way of talking about when a 25 00:01:27,327 --> 00:01:31,208 graph is disconnected or equivalently when a graph is connected. 26 00:01:31,208 --> 00:01:36,000 And you'll recall a graph is connected if and only if for every cut there's at 27 00:01:36,000 --> 00:01:39,700 least one crossing edge. So to prove this second step of this 28 00:01:39,700 --> 00:01:44,491 proof that T star is connected we just have to prove that for every single cut 29 00:01:44,491 --> 00:01:47,100 it had at least one edge crossing that cut. 30 00:01:48,260 --> 00:01:52,769 So, that's what we're going to show. To get started let's fix an arbitrary cut 31 00:01:52,769 --> 00:01:55,816 A, B. Using the assumption the input graph is 32 00:01:55,816 --> 00:02:00,204 connected, it certainly contains at least one edge that crosses this cut. 33 00:02:00,204 --> 00:02:04,896 The question is whether or not T star contains a crossing edge but certainly 34 00:02:04,896 --> 00:02:07,980 the original graph G contains a crossing edge. 35 00:02:07,980 --> 00:02:12,709 So, here's the key point of the argument. Kruskal's algorithm, by definition, it 36 00:02:12,709 --> 00:02:15,509 makes a single scan through all of the edges. 37 00:02:15,509 --> 00:02:20,052 That is, it considers every edge of the original input graph exactly once. 38 00:02:20,052 --> 00:02:24,968 So, what I want you to do is, I want you to think about this cut A, B which has at 39 00:02:24,968 --> 00:02:29,261 least one edge of G crossing. And I want you to fast forward Kruskals 40 00:02:29,261 --> 00:02:34,680 algorithm until the moment that it first considers an edge crossing this cut AB. 41 00:02:34,680 --> 00:02:39,348 The key claim is that this first edge seen by Kruskal's algorithm, is 42 00:02:39,348 --> 00:02:43,340 definitely be, going to be included in the final output, T star. 43 00:02:43,340 --> 00:02:46,511 So why is this true? Well, let me remind you of the lonely cut 44 00:02:46,511 --> 00:02:49,206 corollary. The lonely cut corollary says that if an 45 00:02:49,206 --> 00:02:53,223 edge is the sole edge crossing a cut, if it's lonely in a cut, then it cannot 46 00:02:53,223 --> 00:02:54,862 participate in any cycle, right? 47 00:02:54,862 --> 00:02:58,984 If it was in a cycle, that cycle would loop back around and cross the cut a 48 00:02:58,984 --> 00:03:01,680 second time. Now, what does that have to do with the 49 00:03:01,680 --> 00:03:04,270 picture here? Well, if this is the first edge that 50 00:03:04,270 --> 00:03:08,413 Kruskal sees crossing this cut, then certainly the tree so far T star contains 51 00:03:08,413 --> 00:03:11,867 nothing crossing this cut, right? It hasn't even seen anything crossing 52 00:03:11,867 --> 00:03:14,530 this cut yet. So at the moment it encounters the first 53 00:03:14,530 --> 00:03:18,230 edge, there's no way including that first edge will create a cycle because that 54 00:03:18,230 --> 00:03:20,450 first edge is going to be lonely in this cut AB. 55 00:03:20,450 --> 00:03:24,298 So again summarising, the first edge crossing a cut is guaranteed to be chosen 56 00:03:24,298 --> 00:03:27,110 by a cross because the algorithmic cannot create a cycle. 57 00:03:27,110 --> 00:03:31,942 That's why there's at least one edge of Kruskal's output crossing this particular 58 00:03:31,942 --> 00:03:34,300 cut AB. Since AB was arbitrary, all edges, 59 00:03:34,300 --> 00:03:38,259 all cuts have some edge of T star crossing them, that's why T star is 60 00:03:38,259 --> 00:03:41,079 connected. So this completes the part of the proof 61 00:03:41,079 --> 00:03:44,645 where we argue that Kruskal's algorithm outputs a spanning tree. 62 00:03:44,645 --> 00:03:48,881 Now we have to move on and argue that it's actually a minimum cost spanning 63 00:03:48,881 --> 00:03:51,756 tree. We're going to argue this in the same way 64 00:03:51,756 --> 00:03:55,483 as we did for Prim's algorithm. We're going to argue that every edge ever 65 00:03:55,483 --> 00:03:58,944 selected by Kruskal's algorithm is justified by the cut property. 66 00:03:58,944 --> 00:04:02,060 Satisfies the hypotheses of the cut property. 67 00:04:02,060 --> 00:04:04,701 Recall from our correctness proof for Prim's Algorithm, 68 00:04:04,701 --> 00:04:08,062 this is enough to argue correctness. That the output is a minimum cost 69 00:04:08,062 --> 00:04:09,023 spanning tree, right? 70 00:04:09,023 --> 00:04:11,616 An edge justified by the cut properties, not a mistake. 71 00:04:11,616 --> 00:04:14,113 It's got to belong to the minimum cost spanning tree. 72 00:04:14,113 --> 00:04:17,667 If you have an algorithm that outputs a spanning tree, and it never made a 73 00:04:17,667 --> 00:04:20,308 mistake, that's got to be the minimum cost spanning tree. 74 00:04:20,308 --> 00:04:22,470 And that's going to be the case for Kruskal. 75 00:04:22,470 --> 00:04:26,832 Now this statement was easy to prove for Prim's algorithm and that's because the 76 00:04:26,832 --> 00:04:30,925 definition of Prim's algorithm, it selects edges based on being the cheapest 77 00:04:30,925 --> 00:04:33,725 in some cut. So it's tailor made for the application 78 00:04:33,725 --> 00:04:36,472 of the cut property. Not so for Kruskal's algorithm. 79 00:04:36,472 --> 00:04:40,835 If you look at the pseudocode, nowhere does the pseudocode discuss taking cheap 80 00:04:40,835 --> 00:04:43,635 edges across cuts. So we have to show that Kruskal's 81 00:04:43,635 --> 00:04:47,782 algorithm in effect is inadvertently at every edge picking the cheapest edge 82 00:04:47,782 --> 00:04:51,013 crossing some cut. And we actually have to exhibit what that 83 00:04:51,013 --> 00:04:54,730 cut is in our proof of correctness. So, that's what we have to do here. 84 00:04:54,730 --> 00:04:58,504 So to argue that, let's just sort of freeze Kruskal's algorithm at some 85 00:04:58,504 --> 00:05:00,896 arbitrary iteration, where it adds a new edge. 86 00:05:00,896 --> 00:05:04,298 We need to show that this edge is justified by the cut property. 87 00:05:04,298 --> 00:05:08,391 So, maybe this edge has end points U and V, and let's let capital T denote the 88 00:05:08,391 --> 00:05:11,581 current value of the edges selected by the algorithm so far. 89 00:05:11,581 --> 00:05:15,514 So let's think about what this state of the world looks like, at a, sort of, 90 00:05:15,514 --> 00:05:17,960 intermediate iteration of Kruskal's algorithm. 91 00:05:17,960 --> 00:05:22,170 So Kruskal's algorithm maintains the invariant there's no cycles but remember 92 00:05:22,170 --> 00:05:26,712 it doesn't maintain any invariant of the current edges forming a connected set so 93 00:05:26,712 --> 00:05:31,254 in general in an intermediate iteration of Kruskal's algorithm, you've got a 94 00:05:31,254 --> 00:05:35,700 bunch of pieces, a bunch of little mini trees floating around the graph. 95 00:05:35,700 --> 00:05:40,383 Connect the components if you like, with respect to the current edge shed, capital 96 00:05:40,383 --> 00:05:42,811 T. And there can be some, you know, isolated 97 00:05:42,811 --> 00:05:46,784 vertices floating around as well. What more can we say? 98 00:05:46,784 --> 00:05:51,229 Well, if the current edge has endpoints U and V and Kruskal is deciding to add this 99 00:05:51,229 --> 00:05:55,621 edge UV to the current set capital T, we certainly know that capital T has no path 100 00:05:55,621 --> 00:05:56,799 between U and V, right? 101 00:05:56,799 --> 00:06:01,137 If it did have a path, then this new edge would create a cycle, then Kruskal would 102 00:06:01,137 --> 00:06:04,083 skip the edge. Since it isn't skipping the edge, U and V 103 00:06:04,083 --> 00:06:07,886 are currently in different pieces. They're in different components with 104 00:06:07,886 --> 00:06:12,930 respect to the current edge set. Now remember, ultimately, if we're going 105 00:06:12,930 --> 00:06:17,531 to apply the cut property, we have to somehow exhibit some cut from somewhere, 106 00:06:17,531 --> 00:06:21,833 justifying this particular edge. So here's where we get the cut from, and 107 00:06:21,833 --> 00:06:26,374 it's going to be a very similar argument to when we proved the empty cut limit 108 00:06:26,374 --> 00:06:29,660 characterizing disconnectedness in terms of empty cuts. 109 00:06:29,660 --> 00:06:33,962 We're going to say look, with respect to the tree edges we have so far, with 110 00:06:33,962 --> 00:06:37,248 respect to the green edges, there's no path from U to V. 111 00:06:37,248 --> 00:06:41,849 That means we can find a cut such that U is on one side, V is on the other side, 112 00:06:41,849 --> 00:06:46,883 and there's no edges crossing the cut. So, here's the cool part of the proof. 113 00:06:46,883 --> 00:06:50,782 And this is also the part where we actually use the fact that Kruskal was a 114 00:06:50,782 --> 00:06:53,804 greedy algorithm. That it considers the edges from cheapest 115 00:06:53,804 --> 00:06:56,728 to most expensive. Notice that we haven't actually used that 116 00:06:56,728 --> 00:06:59,165 fact up to this point and we better use that fact. 117 00:06:59,165 --> 00:07:02,869 So, the claim is that not only is this edge we're adding UV, not only does it 118 00:07:02,869 --> 00:07:06,329 cross this cut AB, but actually it's the cheapest edge crossing the cut. 119 00:07:06,329 --> 00:07:09,790 Nothing from the original input graph G that crosses it can be cheaper. 120 00:07:09,790 --> 00:07:12,832 So why is that true? Well, let's remember how we wrapped up 121 00:07:12,832 --> 00:07:17,132 the previous slide when we were arguing that the output of Kruskal's algorithm is 122 00:07:17,132 --> 00:07:18,601 connected. What did we argue? 123 00:07:18,601 --> 00:07:21,486 We said, well, fix any cut you like. Any cut of the graph. 124 00:07:21,486 --> 00:07:25,367 And freeze Kruskal's algorithm the very first time it considers some edge 125 00:07:25,367 --> 00:07:28,357 crossing that cut. We noticed that Kruskal's algorithm is 126 00:07:28,357 --> 00:07:32,395 guaranteed to include that first edge crossing the cut in its final solution. 127 00:07:32,395 --> 00:07:36,119 There's no way that, that first edge considered can create a cycle with 128 00:07:36,119 --> 00:07:39,581 respect to the edges already chosen. So, it's not going to trigger the 129 00:07:39,581 --> 00:07:41,836 exclusion criterion in Kruskal's algorithm. 130 00:07:41,836 --> 00:07:46,270 And that edge is going to get picked. So what's the significance about being 131 00:07:46,270 --> 00:07:50,475 the first edge crossing a cut well because of the gradient criterian because 132 00:07:50,475 --> 00:07:53,752 Kruskal considers edges from cheapest and most expensive. 133 00:07:53,752 --> 00:07:57,958 The first edge it encounters crossing a cut is also necessarily the cheapest edge 134 00:07:57,958 --> 00:08:05,099 that's crossing the cut. So here's how we weave these different 135 00:08:05,099 --> 00:08:08,053 strands together to complete the correctness proof. 136 00:08:08,053 --> 00:08:12,630 Alright, so let's remember where we are. We're focusing on a single iteration of 137 00:08:12,630 --> 00:08:16,048 Kruskal's algorithm. It's about to add this edge U, V to the 138 00:08:16,048 --> 00:08:18,655 edges, capital T, that it's chosen in the past. 139 00:08:18,655 --> 00:08:22,838 We've exhibited a cut. A, B with a property that no previously 140 00:08:22,838 --> 00:08:27,944 chosen edges, no edges of capital T cross this cut, and UV is going to be the first 141 00:08:27,944 --> 00:08:32,598 chosen edge crossing this cut. We just argued, that Kruskal's algorithm 142 00:08:32,598 --> 00:08:35,959 is guaranteed to pick the first edge crossing a cut. 143 00:08:35,959 --> 00:08:40,871 So by virtue of their not yet being any chosen edges, crossing the cut AB, it 144 00:08:40,871 --> 00:08:46,174 must be the case that this current edge UV is the first one that's ever been seen 145 00:08:46,174 --> 00:08:49,281 by Kruskal's algorithm that crosses this cut AB. 146 00:08:49,281 --> 00:08:54,266 Now, in being the first edge it's ever seen crossing this cut. It must also be 147 00:08:54,266 --> 00:08:58,086 the cheapest edge of the input graph that crosses this cut. 148 00:08:58,086 --> 00:09:00,869 That is the hypothesis of the cut property. 149 00:09:00,869 --> 00:09:06,178 That is why this edge UV and this current arbitrary iteration is justified by the 150 00:09:06,178 --> 00:09:07,020 cut property.