1 00:00:00,000 --> 00:00:03,089 Having mastered computing the connected components of undirected graph 2 00:00:03,089 --> 00:00:07,054 in linear time, let's now turn our attention to directed graphs which also arise, in all 3 00:00:07,054 --> 00:00:11,044 kinds of different applications. Now the good news is, is we'll be able to get just 4 00:00:11,044 --> 00:00:15,038 a blazingly fast primitive for computing connectivity information for directed 5 00:00:15,038 --> 00:00:19,027 graphs. The bad news if you wanna call it that, is we'll have to think a little bit 6 00:00:19,027 --> 00:00:23,026 harder. So it won't be as obvious how to do it, but by the end of this lecture you 7 00:00:23,026 --> 00:00:27,010 will know a linear time algorithm with very good constants, really just based on 8 00:00:27,010 --> 00:00:31,002 depth-first search for computing all of the pieces of a directed graph. In 9 00:00:31,002 --> 00:00:35,005 fact it's not even so obvious how to define pieces, how to define connected 10 00:00:35,005 --> 00:00:38,907 components in a directed graph, Certainly not as obvious as it was with undirected 11 00:00:38,907 --> 00:00:43,059 graphs. So to see what I mean, let's consider the following four node directed 12 00:00:43,059 --> 00:00:48,995 graph. So on the one hand, this graph is in some sense in one piece. If this was an 13 00:00:48,995 --> 00:00:53,399 actual physical object made, say, of a bunch of strings connected to each other 14 00:00:53,399 --> 00:00:56,884 and we'd picked it up with our hands, it wouldn't fall apart into two pieces. It 15 00:00:56,884 --> 00:01:00,001 would hang together in one piece. On the other hand, when you think about moving 16 00:01:00,001 --> 00:01:04,009 around this network, it's not connected in the sense that we might think about. You 17 00:01:04,009 --> 00:01:07,389 cannot get from any one point A to any other point B. For example, if you start 18 00:01:07,389 --> 00:01:11,080 at the right most node in this graph, certainly there's no directed path that 19 00:01:11,080 --> 00:01:15,074 will get you to the left most node. So what's typically studied and most useful 20 00:01:15,074 --> 00:01:19,073 for directed graphs, is what you call strong connectivity. The graph is strongly 21 00:01:19,073 --> 00:01:23,087 connected if you can get from any one point to any other point, And vice versa. 22 00:01:23,087 --> 00:01:28,031 And the strongly connected components then informally, are the maximal portions 23 00:01:28,031 --> 00:01:32,049 of this graph, the maximal regions which are internally, strongly connected. So, 24 00:01:32,049 --> 00:01:36,789 the maximal regions from within which you can get from any one point A to any other 25 00:01:36,789 --> 00:01:41,045 point B along a directed graph. For the record, let me go ahead and give you a, a 26 00:01:41,045 --> 00:01:46,011 formal definition although, you know, this intuition is perfectly valid just regions 27 00:01:46,011 --> 00:01:51,028 where you can get from anywhere to anywhere else. So we say that the strongly 28 00:01:51,028 --> 00:01:56,018 connected components of the directed graph or the SCCs for short. And as in the 29 00:01:56,018 --> 00:01:59,795 undirected case, we're going to give a somewhat slick definition, rather than 30 00:01:59,795 --> 00:02:03,218 talking about maximal regions satisfying some property, we're going to talk about 31 00:02:03,218 --> 00:02:06,631 the equivalence classes of a particular equivalence relation. But really it means 32 00:02:06,631 --> 00:02:10,557 the same thing. This is just sort of the more mathematically mature way of writing 33 00:02:10,557 --> 00:02:14,386 it down. So the equivalence relation we're going to use it on nodes of the graph and 34 00:02:14,386 --> 00:02:19,240 we'll say that U, a node, is related to a node V if you can get from U to V via a 35 00:02:19,240 --> 00:02:25,509 directed path and also from V to U on some other directed path. [sound] I'm not gonna 36 00:02:25,509 --> 00:02:29,699 bore you with the verification that this is an equivalent relation. That's 37 00:02:29,699 --> 00:02:32,845 something that you can work out in the privacy of your own home. Just remember 38 00:02:32,845 --> 00:02:36,006 what it means to be in an equivalence relation. It's reflexive. That is, 39 00:02:36,006 --> 00:02:39,080 everybody's related to itself but, of course, there is a path from every node to 40 00:02:39,080 --> 00:02:43,072 itself. It also means that it's symmetric. So if U is related to V, then V is related 41 00:02:43,072 --> 00:02:47,026 to U. Well, again, by definition, we're saying that the paths are mutually, the 42 00:02:47,026 --> 00:02:51,019 vertices are mutually reachable from each other. So that's clearly symmetric by 43 00:02:51,019 --> 00:02:54,049 definition and then it has to be transitive. And the way you prove it's 44 00:02:54,049 --> 00:02:57,794 transitive, is you just paste paths together and it just works the way you'd 45 00:02:57,794 --> 00:03:01,049 expect it to. So let's illustrate this concretely with a, a somewhat more 46 00:03:01,049 --> 00:03:06,032 complicated directed graph. So here's a directed graph and I claim that it has 47 00:03:06,032 --> 00:03:11,028 exactly four strongly connected components. There's a triangle on the 48 00:03:11,028 --> 00:03:16,059 left, a triangle on the right. Has this single node on top and then it has this 49 00:03:16,059 --> 00:03:21,056 directed four-cycle with a diagonal at the bottom. So what I hope is pretty clear is 50 00:03:21,056 --> 00:03:25,187 that each of these circled regions is indeed strongly connected that is if you 51 00:03:25,187 --> 00:03:29,211 start from one node in one of these circled regions you can have a directed 52 00:03:29,211 --> 00:03:33,089 path to any other node, so that's certainly true cause on a directed cycle you can 53 00:03:33,089 --> 00:03:36,558 get from anyone starting point and the other place and all of these have directed 54 00:03:36,558 --> 00:03:40,053 cycles and there's the one strong component that just has one node which is 55 00:03:40,053 --> 00:03:44,020 obviously strongly connected. What I also claim is true is that all of these 56 00:03:44,020 --> 00:03:47,077 regions are maximal subject to being strongly connected. That's why they're the 57 00:03:47,077 --> 00:03:51,034 strongly connected components. That's why they're the equivalence classes of this equivalence 58 00:03:51,034 --> 00:03:54,091 relation we just defined. So, if you take any two pairs of nodes which lie in two 59 00:03:54,091 --> 00:03:58,070 different circles you either won't have a path from one... From the first one to the 60 00:03:58,070 --> 00:04:01,912 second one or you won't have a directed path from the second one back to the first 61 00:04:01,912 --> 00:04:06,014 one. In fact, the structure of the strong components in this black graph exactly 62 00:04:06,014 --> 00:04:10,054 mirrors, the directed acyclic graph that we started with in red. So in the same 63 00:04:10,054 --> 00:04:14,355 way, in the red four node graph, you can't move from right to left. Here in this 64 00:04:14,355 --> 00:04:19,019 bigger graph, you can't move from any of the circled SCC's to the right to any of 65 00:04:19,019 --> 00:04:23,026 the circled SCCs to the left. So, for example, from the rightmost nodes, there 66 00:04:23,026 --> 00:04:27,078 are no directed paths to the leftmost nodes. So that's a recap of the definition 67 00:04:27,078 --> 00:04:31,089 of the strongly connected components, I've motivated in a separate video some reasons 68 00:04:31,089 --> 00:04:35,006 why you might care about computing strongly connected components. In 69 00:04:35,006 --> 00:04:38,061 particular, on extremely large graphs, which motivates the need for blazingly 70 00:04:38,061 --> 00:04:42,834 fast subroutines so for-free primitives that will let you compute connectivity 71 00:04:42,834 --> 00:04:46,540 information. So you'd love to be able to know the pieces of a directed graph, maybe 72 00:04:46,540 --> 00:04:49,507 you don't even know why they're going to be useful, but you just compute them 73 00:04:49,507 --> 00:04:53,084 because why not? It's a for-free primitive. So that's what I'm gonna give you in this 74 00:04:53,084 --> 00:04:58,009 lecture. So the algorithm we're gonna present is based on depth-first search, and 75 00:04:58,009 --> 00:05:01,816 that's gonna be the main reason why it's going to be so blazingly fast because 76 00:05:01,816 --> 00:05:07,003 depth-first search is blazingly fast. Now you might be wondering what on earth does 77 00:05:07,003 --> 00:05:11,030 graph search have to do with computing components? They don't seem obviously 78 00:05:11,030 --> 00:05:15,040 related, so let's return to the same directed graph that I showed you on 79 00:05:15,040 --> 00:05:20,093 the previous slide. And to see why something like depth-first search might 80 00:05:20,093 --> 00:05:25,097 conceivably have some use for computing strong components. Suppose we call 81 00:05:25,097 --> 00:05:30,064 depth-first search, starting from this red node as a starting point. What would it 82 00:05:30,064 --> 00:05:34,072 explore? So remember what the guarantee of something like depth-first search or 83 00:05:34,072 --> 00:05:38,081 breadth-first search, for that matter, is. You find everything that's findable, but 84 00:05:38,081 --> 00:05:42,048 naturally, nothing else. So what is findable from this red node? Where, by 85 00:05:42,048 --> 00:05:46,067 findable, I mean is you can reach it from a directed path emanating from this red 86 00:05:46,067 --> 00:05:50,060 node. Well, there's not much you can do, so, from here, you can explore this arc, 87 00:05:50,060 --> 00:05:55,005 and you can explore this arc, and then you can go backward. And so if you do DFS 88 00:05:55,005 --> 00:05:59,090 or BFS from this node you're going to find precisely the nodes in this triangle. 89 00:05:59,090 --> 00:06:03,547 All of the other arcs involved go the wrong direction and they won't be 90 00:06:03,547 --> 00:06:08,147 traversed by say a depth-first search call. So, why is that interesting? What's 91 00:06:08,147 --> 00:06:12,268 interesting is that if we invoke DFS from this red node or any of the three nodes 92 00:06:12,268 --> 00:06:17,147 from this triangle then it's going to discover precisely this strongly connected 93 00:06:17,147 --> 00:06:23,034 component. Precisely the three nodes in this circled SCC. So that seems 94 00:06:23,034 --> 00:06:27,071 really cool. It seems like maybe we just do a DFS, and boom, we get an SCC. So 95 00:06:27,071 --> 00:06:31,723 maybe if we can do that over and over again, we'll get all the SCCs. So that's a 96 00:06:31,723 --> 00:06:36,079 good initial intuition. But something can go wrong. Suppose that instead of, an, 97 00:06:36,096 --> 00:06:40,952 initiating DFS from one of these three nodes on the triangle, we, say, initiate 98 00:06:40,952 --> 00:06:46,075 it from this bottom-most node in green. So remember, what is the guarantee of a graph 99 00:06:46,075 --> 00:06:50,814 search subroutine like DFS? It will find everything findable but of course nothing 100 00:06:50,814 --> 00:06:57,055 more. So what's findable from this green node? Well naturally, everything in its 101 00:06:57,055 --> 00:07:02,097 own SCC. Right, so the four nodes here, it will certainly find those four nodes. On 102 00:07:02,097 --> 00:07:07,081 the other hand, if we start from this green node, since there are arcs that go 103 00:07:07,081 --> 00:07:13,068 from this bottom most SCC to the right most SCC, not only will this DFS call find 104 00:07:13,068 --> 00:07:18,536 the four nodes in the green node's strong component, but it will also traverse these 105 00:07:18,536 --> 00:07:24,099 blue arcs and discover the three nodes in the red triangle. So, if we call DFS from 106 00:07:24,099 --> 00:07:30,036 this green node, we'll capture all seven of these. So the point is if we call DFS, 107 00:07:30,036 --> 00:07:35,029 it looks like we're going to get a union of possibly multiple SCCs. In fact, 108 00:07:35,029 --> 00:07:40,047 in the worse case, if we invoke DFS from the left-most node, what's it going to 109 00:07:40,047 --> 00:07:45,039 discover? It's going to discover the entire graph, and that didn't give us any 110 00:07:45,039 --> 00:07:50,013 insight into the strong component structure at all. So, what's the take away 111 00:07:50,013 --> 00:07:55,031 point is, the take away point is if you call DFS from just the right place, you'll 112 00:07:55,031 --> 00:08:00,029 actually uncover an SCC. If you call it from the wrong place, it will give you no 113 00:08:00,029 --> 00:08:05,052 information at all, so the magic of the algorithm that we're gonna discuss next is 114 00:08:05,052 --> 00:08:10,005 we'll show how in a super slick pre-processing step which ironically is 115 00:08:10,005 --> 00:08:15,035 itself a call to depth-first search, we can in linear time compute exactly where we want 116 00:08:15,035 --> 00:08:20,046 to start the subsequent depth-first searches from, so that each invocation gets us 117 00:08:20,046 --> 00:08:27,031 exactly one strongly connected component, and nothing more. So 118 00:08:27,031 --> 00:08:32,687 the algorithm that, that I'm going to show you is due to Kosaraju. And it will show 119 00:08:32,687 --> 00:08:37,100 the following theorem. That the strongly connected components of a directed graph 120 00:08:37,100 --> 00:08:41,009 can be computed in linear time. And as we'll see, the constants are 121 00:08:41,009 --> 00:08:45,045 also very small. It's really just gonna be two passes of depth-first search. And again, 122 00:08:45,045 --> 00:08:48,096 let me remind you that for many problems it's natural to assume that the 123 00:08:48,096 --> 00:08:52,060 number of edges is at least the number of nodes because, if you're only interested 124 00:08:52,060 --> 00:08:55,830 in cases where the graph is connected. Of course, when you're computing connected 125 00:08:55,830 --> 00:08:58,903 components, that's one of the most natural cases where you might have a super sparse, 126 00:08:58,903 --> 00:09:02,759 broken up graph. So we're not going to assume m is at least n. So that's why 127 00:09:02,759 --> 00:09:06,093 linear time is going to be m + n, because that's the size of the input. And we don't 128 00:09:06,093 --> 00:09:11,003 know, either M could be bigger than N, or N could be bigger than M. We have no idea. 129 00:09:11,041 --> 00:09:17,027 Kosaraju's algorithm is shocking in its simplicity. It has three steps, let me 130 00:09:17,027 --> 00:09:23,007 tell you what they are. First, very mysteriously, we're going to reverse all 131 00:09:23,007 --> 00:09:28,035 of the arcs of the given graph. Totally not clear why that would be an interesting 132 00:09:28,035 --> 00:09:33,064 thing to do yet. Then, were going to do our first pass, our first depth-first search, 133 00:09:33,064 --> 00:09:38,062 and we're going to do it on the reverse graph. Now, the naive way to implement 134 00:09:38,062 --> 00:09:42,042 this would be you literally construct a new copy of the input graph, with all of the 135 00:09:42,042 --> 00:09:46,002 arcs in the reverse direction. And then just run depth-first search on it. Of 136 00:09:46,002 --> 00:09:49,067 course, the sophisti--, the sort of obvious optimization would be to just run 137 00:09:49,067 --> 00:09:53,013 DFS on the original graph, but going across arcs backward. So I'll let you 138 00:09:53,013 --> 00:09:56,088 think through the details of how you'd do that, but that just works. You run DFS, 139 00:09:56,088 --> 00:10:00,039 and instead of going forward along edges, you go backward along edges. That 140 00:10:00,039 --> 00:10:03,099 simulates depth-first search on the reverse graph. Now, I've written here, DFS 141 00:10:03,099 --> 00:10:07,079 loop. And that just means the usual trick where, to make sure that you see all of 142 00:10:07,079 --> 00:10:11,062 the nodes of, of the graph, even if it's disconnected, you have an outer loop, 143 00:10:11,062 --> 00:10:15,069 where you just try each starting point separately. If you haven't already seen 144 00:10:15,069 --> 00:10:19,091 it, then you run DFS from that given node. I'll be more detailed on the next slide. 145 00:10:20,037 --> 00:10:25,024 And the third step, it's just you run depth-first search again, but this time on 146 00:10:25,024 --> 00:10:29,659 the original graph. Now at this point, you should be thinking that I'm totally crazy, 147 00:10:29,659 --> 00:10:33,584 right? So what are we trying to do? We're trying to compute these strongly connected 148 00:10:33,584 --> 00:10:36,493 components. We're trying to actually compute real objects, these maximal 149 00:10:36,493 --> 00:10:40,027 strongly connected regions. And all I'm doing is searching the graph. And I do it once 150 00:10:40,027 --> 00:10:43,071 forward, I do it once backward. I mean, I'm not, it doesn't really seem like I'm 151 00:10:43,071 --> 00:10:47,194 computing anything. So, here's the catch, and it's a very minor catch. So we're 152 00:10:47,194 --> 00:10:50,224 gonna need, to do a little bit of bookkeeping, it's gonna be very little overhead, so we'll still 153 00:10:50,224 --> 00:10:53,712 have a blazingly fast algorithm. So, but with a little bit of bookkeeping, 154 00:10:53,712 --> 00:10:59,450 here's what's gonna happen. The second depth-first search, which searches the graph, will, 155 00:10:59,450 --> 00:11:04,017 in its search process, discover the strongly connected components one at a time 156 00:11:04,017 --> 00:11:08,247 in a very natural way, and that'll be really obvious when we do an example, which we'll do in just a second. 157 00:11:08,247 --> 00:11:14,594 Now, for the second depth-first search to work in this magical way, where it just discovers the strongly connected 158 00:11:14,594 --> 00:11:20,097 components one at a time, it's really important that it executes the depth first 159 00:11:20,097 --> 00:11:23,309 searches in a particular order, that it goes through the nodes of the graph in a 160 00:11:23,309 --> 00:11:28,633 particular order. And that is exactly the job of the first pass. The depth-first 161 00:11:28,633 --> 00:11:33,243 search on the reverse graph is going to compute an ordering of the nodes, which, 162 00:11:33,243 --> 00:11:37,174 when the second depth-first search goes through them in that order, it will just 163 00:11:37,174 --> 00:11:43,049 discover the SCCs one at a time in linear time. So let me say a little bit more 164 00:11:43,049 --> 00:11:47,069 about the form of the bookkeeping and then I'll show you how that bookkeeping is 165 00:11:47,069 --> 00:11:51,515 kept in as we do depth-first search. So we're gonna have a notion of a 166 00:11:51,515 --> 00:11:55,093 finishing time of a vertex and that's gonna be computed in the first pass when 167 00:11:55,093 --> 00:12:00,026 we do depth-first search in the reversed graph. And we're going to make use of this 168 00:12:00,026 --> 00:12:04,052 data in the second pass. So, rather than just going through the nodes of the graph 169 00:12:04,052 --> 00:12:08,063 in an arbitrary order, like we usually do when we sort of have a looped 170 00:12:08,063 --> 00:12:12,079 depth-first or breadth-first search, we're going to make sure that we go through the 171 00:12:12,079 --> 00:12:16,078 vertices in decreasing order of these finishing times. Now there's still the 172 00:12:16,078 --> 00:12:21,011 question in what sense does this second depth-first search discover and report the 173 00:12:21,011 --> 00:12:25,039 strongly connected components that it finds? So we're going to label each node in this 174 00:12:25,039 --> 00:12:29,959 second pass with what we call a leader and the idea is that the nodes in the same 175 00:12:29,959 --> 00:12:33,479 strongly connected components will be labeled with exactly the same leader node. And 176 00:12:33,479 --> 00:12:38,012 again all of this will be much more clear once we do a concrete example. But I want 177 00:12:38,012 --> 00:12:43,062 to have it down for the record right now. So that's the algorithm at a high level 178 00:12:43,062 --> 00:12:47,019 it's really just two passes of DFS with some bookkeeping. But this is under 179 00:12:47,019 --> 00:12:50,349 specified. You really, shouldn't, understand how to implement the 180 00:12:50,349 --> 00:12:54,417 algorithm just yet. So what do I owe you? I owe you exactly what I mean by DFS loop 181 00:12:54,417 --> 00:12:58,334 although this you've seen more or less in the past. It's just a loop over all the 182 00:12:58,334 --> 00:13:02,052 vertices the graph and then when if you haven't seen something yet you DFS from 183 00:13:02,052 --> 00:13:06,038 that starting point. I need to tell you what finishing times are and how they get 184 00:13:06,038 --> 00:13:09,085 computed. They're just gonna be integers one to N. Which is basically when 185 00:13:09,085 --> 00:13:13,095 depth-first search gets finished with one of the nodes. And then I need to tell you 186 00:13:13,095 --> 00:13:18,002 how to compute these leaders. So let me show you all three of those things on 187 00:13:18,002 --> 00:13:22,366 the next slide. So the workhorse for Kosaraju's strongly connected components 188 00:13:22,366 --> 00:13:27,416 algorithm is this DFS loop subroutine and it takes as input in graph. Okay, so it 189 00:13:27,416 --> 00:13:31,294 does not take as input a starting node, it's gonna loop over possible starting 190 00:13:31,294 --> 00:13:35,523 nodes. Now for the bookkeeping, to compute finishing nodes, we're going to 191 00:13:35,523 --> 00:13:42,051 keep track of a global variable that I'll call T, which we initialize to zero. The 192 00:13:42,051 --> 00:13:48,017 point of T is to count how many nodes we have totally finished exploring at this 193 00:13:48,017 --> 00:13:52,037 point. So this is the variable we use to compute those finishing times in the first 194 00:13:52,037 --> 00:13:56,028 pass. That magical ordering that I was talking about. Now we're also gonna have a 195 00:13:56,028 --> 00:14:00,015 second global variable to compute these things I was calling leaders, and these 196 00:14:00,015 --> 00:14:04,054 are only relevant for the second pass. So what S is going to keep track of is the 197 00:14:04,054 --> 00:14:10,768 most recent vertex from which a DFS was initiated. So to keep the code simple I'm 198 00:14:10,768 --> 00:14:15,308 just going to do all of the bookkeeping in DFS loop, but really DFS loop gets 199 00:14:15,308 --> 00:14:19,113 called twice, once in the reverse graph, once in the forward graph and we only need 200 00:14:19,113 --> 00:14:23,861 to compute these finishing times in the first pass, on the reverse graph. And we 201 00:14:23,861 --> 00:14:28,016 only need to compute these leaders of the second pass for the forward graph. But 202 00:14:28,016 --> 00:14:32,035 let's just keep them both in there, just for kicks. Now, we're going to need to 203 00:14:32,035 --> 00:14:36,002 loop through the vertices. And so the question is, in what order are we going to 204 00:14:36,002 --> 00:14:39,042 loop through the vertices? And that's gonna happen differently in the two 205 00:14:39,042 --> 00:14:42,535 different passes. But let me just use some common notation. Let's just assume in this 206 00:14:42,535 --> 00:14:47,007 subroutine, that the nodes are somehow labeled from one to N. In our first depth-first 207 00:14:47,007 --> 00:14:51,045 search it's gonna be labeled totally arbitrary. So these are basically just the 208 00:14:51,045 --> 00:14:55,056 names of the node, or their position in the node array, whatever, you just do it 209 00:14:55,056 --> 00:14:59,078 in some arbitrary order. Now the second time we run DFS loop, as indicated on the 210 00:14:59,078 --> 00:15:03,069 previous slide, we're gonna use the finishing times as the labeling. And as 211 00:15:03,069 --> 00:15:08,011 we'll see the finishing times are indeed numbers between one and N. So now what we 212 00:15:08,011 --> 00:15:12,349 do is we just iterate through the nodes in decreasing order. And if we haven't 213 00:15:12,349 --> 00:15:17,292 already seen node I, then we initiate a DFS from it. So, as usual, we're gonna be 214 00:15:17,292 --> 00:15:22,018 maintaining a local boolean to keep track of whether we've already seen a node yet 215 00:15:22,018 --> 00:15:26,082 in one of the DFS passes. Now remember, the global variable S is responsible for 216 00:15:26,082 --> 00:15:31,034 keeping track of the most recent node from which depth-first search had been 217 00:15:31,034 --> 00:15:36,021 initiated. So if I is not explored and we initiate a depth-first search from it, we'd 218 00:15:36,021 --> 00:15:42,894 better reset S. And then we do the usual DFS in G, starting from the source node I. 219 00:15:42,894 --> 00:15:47,546 So for completeness let me just remind you what the depth-first search subroutine 220 00:15:47,546 --> 00:15:53,075 looks like. So now we're given a graph and a starting node. So the first time we see 221 00:15:53,075 --> 00:15:59,024 a node, we mark it as explored. And just a side note that once a node is marked 222 00:15:59,024 --> 00:16:03,082 explored it's explored for this entire invocation of DFS loop. Okay, so even if 223 00:16:03,082 --> 00:16:08,065 this DFS from a given node I finishes and then the outer for loop marches on, and 224 00:16:08,065 --> 00:16:12,062 encounters I again it's still gonna be marked as explored. Now 225 00:16:12,062 --> 00:16:18,677 one of our bookkeeping jobs is to keep track of, from which vertex did the DFS 226 00:16:18,677 --> 00:16:23,515 that discovered I get called, so when I is first encountered, we remember 227 00:16:23,515 --> 00:16:29,263 that S was the node from which this DFS originated, and that by definition is the 228 00:16:29,263 --> 00:16:34,491 leader of I. And then we do what we always do with depth-first search. We immediately 229 00:16:34,491 --> 00:16:38,946 look at the arcs going out of I, and we try to recursively DFS on any of those. 230 00:16:38,946 --> 00:16:42,906 Although, of course, we don't bother to do it if we've already seen those nodes. Now 231 00:16:42,906 --> 00:16:46,990 once this for loop has completed, once we've examined every outgoing arc from I and 232 00:16:46,990 --> 00:16:51,174 for each node J either we already saw in the past or we've recursively explored 233 00:16:51,174 --> 00:16:56,990 from J and have returned, at that point we call ourselves done with node I. There's 234 00:16:56,990 --> 00:17:00,905 no more outgoing arcs to explore, we think of it being finished. Remember T is the 235 00:17:00,905 --> 00:17:04,989 global variable that's keeping track of how many nodes we're done with. So we 236 00:17:04,989 --> 00:17:10,715 increment T because now we're done with I and we also remember that I was the Tth 237 00:17:10,715 --> 00:17:16,841 vertex with which we finished. And that is, we set I's finishing time to be T. 238 00:17:16,841 --> 00:17:21,060 Because depth-first search first is guaranteed to visit every node exactly 239 00:17:21,060 --> 00:17:25,774 once, and therefore finish with every node exactly once. This global counter T, well, 240 00:17:25,774 --> 00:17:29,689 when the first node is finished, it'll be value one. Then when the next node gets 241 00:17:29,689 --> 00:17:32,726 finished, it'll have value two, then it'll have value three and so on. When the final 242 00:17:32,726 --> 00:17:37,103 node gets finished with, it'll have value N. So the finishing times of the nodes are 243 00:17:37,103 --> 00:17:41,906 gonna be exactly the integers from one to N. Let's make this much more concrete by 244 00:17:41,906 --> 00:17:47,642 going through a careful example. In fact, I think it will be better for everybody if 245 00:17:47,642 --> 00:17:52,413 you yourself trace through part of this algorithm on a concrete example. So let me 246 00:17:52,413 --> 00:17:59,420 draw a nine-node graph for you. So to be clear let's assume that we've already 247 00:17:59,420 --> 00:18:02,134 executed step one of the algorithm. That is, we've already reversed the 248 00:18:02,134 --> 00:18:05,687 graph. So that is. This blue graph that I've drawn on the slide. This is the 249 00:18:05,687 --> 00:18:09,636 reversal. We've already reversed the arcs. Moreover the nodes are labeled in some 250 00:18:09,636 --> 00:18:13,517 arbitrary way from one to nine. Just assume these are how they show up in the 251 00:18:13,517 --> 00:18:17,815 node array for example. And remember in the DFS loop routine we're supposed to 252 00:18:17,815 --> 00:18:23,923 process the nodes, from the, from top to bottom. From N down to 1. So my 253 00:18:23,923 --> 00:18:27,541 question for you then is, in the second step of the algorithm when we run DFS loop 254 00:18:27,541 --> 00:18:31,839 and we process the nodes from the highest name 9 in order down to the lowest name 255 00:18:31,839 --> 00:18:35,675 1. What are the finishing times that we're going to compute as we run DFS loop? 256 00:18:35,675 --> 00:18:40,903 Now, it is true that you can get different finishing times depending on the different 257 00:18:40,903 --> 00:18:45,578 choices that the DFS loop has to make about which outgoing arc to explore next. But 258 00:18:45,578 --> 00:18:49,696 I've given you four options for what the finishing times of the nodes one, two, 259 00:18:49,696 --> 00:18:53,419 three, all the way up to nine respectively might be, and only one of these four could 260 00:18:53,419 --> 00:18:59,949 conceivably be an output of the finishing times of DFS loop on this graph. So, which one is it? 261 00:19:02,318 --> 00:19:07,080 Alright. So the answer is the fourth option. That is the only one of 262 00:19:07,080 --> 00:19:12,525 these four sets of finishing times that you might see being computed by DFS loop 263 00:19:12,525 --> 00:19:16,902 on this blue graph. So let's go ahead and trace through DFS loop and see how we 264 00:19:16,902 --> 00:19:21,143 might get this set of finishing times. So remember in the main loop we start from 265 00:19:21,143 --> 00:19:24,991 the highest node, nine, and then we descend down to the, to the lowest node, 266 00:19:24,991 --> 00:19:31,301 one. So we start by invoking DFS from the node nine. So now, from here there's only 267 00:19:31,301 --> 00:19:35,172 one outgoing arc we have to go to, so we mark nine as explored. And then there's 268 00:19:35,172 --> 00:19:39,750 only one place we can go. We can go to six so we mark six as explored. Now there's 269 00:19:39,750 --> 00:19:43,515 two places we can go next. We can either go to three or we can go to eight. And you 270 00:19:43,515 --> 00:19:48,071 know, in general DFS could do either one. Now to generate this fourth set of 271 00:19:48,071 --> 00:19:52,650 finishing times, I'm going to need to assume that I go to three first. Okay, so 272 00:19:52,650 --> 00:19:56,328 again, what DFS does, what we're assuming it does is it starts at nine, 273 00:19:56,328 --> 00:20:00,221 then it has to go to six. It marks those as explored. Then it goes to three. It 274 00:20:00,221 --> 00:20:04,777 does not go to eight first. It goes to three first. Now from three there's only 275 00:20:04,777 --> 00:20:08,265 one outgoing arc which goes to nine but nine you've already marked as explored. So 276 00:20:08,265 --> 00:20:11,786 it's not going to re-explore nine, it's going to skip that arc. Since that, that's 277 00:20:11,786 --> 00:20:15,259 3's only outgoing arc, then that for loop completes. 278 00:20:15,259 --> 00:20:19,909 And so three is the first node to finish. So when we finish with three we increment 279 00:20:19,909 --> 00:20:24,603 T. It started at zero, now it's one. And we set the finishing time of three to be 280 00:20:24,603 --> 00:20:32,115 one. Just like we said it was in the example. So now we go back. Now we 281 00:20:32,115 --> 00:20:35,520 backtrack to six. Now we have another outgoing arc from six to explore, so now 282 00:20:35,520 --> 00:20:39,413 we go to eight. From eight, we have to go to two. From two, we have to go to five. 283 00:20:39,413 --> 00:20:42,507 From five, we have to go to eight. Eight, we've already seen, so then we're gonna be 284 00:20:42,507 --> 00:20:46,253 done with five, because that was its only outgoing arc. So then we increment T. Now 285 00:20:46,253 --> 00:20:53,791 it's two, and the finishing time of five is going to be two. As promised. So now we 286 00:20:53,791 --> 00:21:00,211 backtrack to two. There's no more outgoing marks from two. So 2's gonna be third one 287 00:21:00,211 --> 00:21:04,759 that we finish, as promised. Then we finish with eight. So the finishing time 288 00:21:04,759 --> 00:21:10,631 for eight is gonna be the fourth node to be done, as promised. Now we backtrack to 289 00:21:10,631 --> 00:21:17,551 six, we're done with six. That's the fifth node to be completed, as promised. And 290 00:21:17,551 --> 00:21:22,905 finally we got all the way back to where we started at nine and nine is the sixth 291 00:21:22,905 --> 00:21:30,437 node to be completed, as promised. Now if we were computing those leaders, all of 292 00:21:30,437 --> 00:21:33,428 these nodes would get the leader nine. But again the leaders are only relevant for 293 00:21:33,428 --> 00:21:36,646 the second pass, so we're just going to ignore the leaders as we're doing this 294 00:21:36,646 --> 00:21:39,706 trace. We're just going to keep track of the finishing times. So now we're not 295 00:21:39,706 --> 00:21:44,532 done, so all we did is we finished with the DFS that is invoked from the node nine 296 00:21:44,532 --> 00:21:50,596 and we found six of the nodes total in that depth-first search. So now we return 297 00:21:50,596 --> 00:21:55,132 to the outer for loop and we decrement I, so it started at nine, we're done with 298 00:21:55,132 --> 00:21:57,833 that, now we go down to eight, we say that we've already seen eight. Yes, eight's 299 00:21:57,833 --> 00:22:01,793 already explored. So we skip it. We go, we decrement I down to seven, we say have we 300 00:22:01,793 --> 00:22:06,630 already seen node seven. No we have not. Okay, seven is not yet explored, so we 301 00:22:06,630 --> 00:22:12,199 invoke DFS now from node seven. Seven has two outgoing arcs. It can either go to 302 00:22:12,199 --> 00:22:16,474 four or it can go to nine. Let's say it checks the outgoing arc to nine first. Now 303 00:22:16,474 --> 00:22:20,163 nine we already explored. Granted that was in an earlier part of the for loop, but 304 00:22:20,163 --> 00:22:23,905 we remember that. We, we're going to keep track of who got explored on previous 305 00:22:23,905 --> 00:22:26,799 iterations of the for loop. So we don't bother to re-explore nine. So we skip 306 00:22:26,799 --> 00:22:30,511 that. So now from seven we have to go to four. From four we have to go to one. 307 00:22:30,511 --> 00:22:34,078 From one we have to go back to seven. 7's already been explored, so we backtrack, 308 00:22:34,078 --> 00:22:38,172 now we're down to, we're done with one. So one is the next one we're completed with. 309 00:22:38,172 --> 00:22:43,595 And the finishing time of one is going to be seven, as promised. We backtrack to 310 00:22:43,595 --> 00:22:47,982 four. There's no more outgoing arcs from four to explore, so that's going to be the 311 00:22:47,982 --> 00:22:54,439 eighth one to finish as promised. And then the last one to finish is poor node seven. 312 00:22:54,439 --> 00:23:01,716 It is last. So that would be an example of how the DFS loop subroutine computes 313 00:23:01,716 --> 00:23:07,048 finishing times on a reversed graph. So now let's work through the second pass on 314 00:23:07,048 --> 00:23:10,794 the forward version of the graph using the same example. Now remember the point of 315 00:23:10,794 --> 00:23:14,923 the first pass is to compute a magical ordering. And the magical ordering is 316 00:23:14,923 --> 00:23:19,833 these finishing times. So now we're going to throw out the original node name and we're 317 00:23:19,833 --> 00:23:23,804 going to replace the node names in blue by the finishing times in red. We're also 318 00:23:23,804 --> 00:23:26,672 going to work with the original graph which means we have to reverse the arcs 319 00:23:26,672 --> 00:23:29,969 back where they were originally. So, those are the two changes that you're going to 320 00:23:29,969 --> 00:23:33,040 see when I redraw this graph. First of all, all the arcs will reverse 321 00:23:33,040 --> 00:23:37,394 orientation. Second of all, all the nodes will change names from their original 322 00:23:37,394 --> 00:23:44,942 ones to the finishing times that we just computed. So here's our new graph with the 323 00:23:44,942 --> 00:23:48,824 new node names and all the arcs with their orientation reversed. And now we run DFS 324 00:23:48,824 --> 00:23:52,581 again on this graph. And again we're going to process the nodes in order from the 325 00:23:52,581 --> 00:23:56,845 highest label nine down to the lowest label one. Moreover we don't need to 326 00:23:56,845 --> 00:24:00,681 compute finishing times in the second pass. We only need to do that in the first 327 00:24:00,681 --> 00:24:03,550 pass. In the second pass we have to keep track of the leaders. And remember the 328 00:24:03,550 --> 00:24:09,710 leader of a vertex is the vertex from which DFS was called. That first, that 329 00:24:09,710 --> 00:24:12,466 first discovered that node. Alright, so what's going to happen. Well, in the outer 330 00:24:12,466 --> 00:24:18,009 for loop, we're going to start with I equal to nine. And we invoke DFS from the 331 00:24:18,009 --> 00:24:25,469 node nine. So that's gonna be the current leader. Cause that's, where the current 332 00:24:25,469 --> 00:24:29,147 DFS got initiated. Now from nine there's only one choice. We have to go to 333 00:24:29,147 --> 00:24:32,916 seven. From seven there's only one choice. We have to go to eight. From eight there's 334 00:24:32,916 --> 00:24:36,043 only one choice. We have to go back to nine. And then nine's already been seen 335 00:24:36,043 --> 00:24:40,138 so we backtrack. We go back to eight. We go back to seven. We go back to nine and 336 00:24:40,138 --> 00:24:45,273 that's it. So when we invoke DFS from node nine the only things that we 337 00:24:45,273 --> 00:24:50,653 encounter are the nodes seven, eight and nine. And these are all going to be given 338 00:24:50,653 --> 00:24:57,048 the leader vertex nine. You will notice that this is indeed one of the strongly 339 00:24:57,048 --> 00:25:02,358 connected components of the graph. We just sort of found it with this invocation of 340 00:25:02,358 --> 00:25:07,976 DFS from the node nine. So now we go back to the outer for loop. And we say, okay, 341 00:25:07,976 --> 00:25:11,184 let's go to node eight. Have you already seen eight? Yes. What about seven? Have 342 00:25:11,184 --> 00:25:14,806 you already seen seven? Yes. What about six, have you already seen six? We have 343 00:25:14,806 --> 00:25:19,441 not, we have not yet discovered six. So we invoke DFS from node six, we reset 344 00:25:19,441 --> 00:25:26,607 the global source vertex s to six. From six we can go to nine or we can go to 345 00:25:26,607 --> 00:25:29,634 one. So let's say we explore nine first. Well we already saw nine in the earlier 346 00:25:29,634 --> 00:25:33,819 iteration of the for loop so we don't explore it again. So we don't discover 347 00:25:33,819 --> 00:25:38,555 nine now. So we backtrack to six. We go to one. From one we have to go to five. From 348 00:25:38,555 --> 00:25:42,110 five we have to go to six and then we start backtracking again. So the only new 349 00:25:42,110 --> 00:25:47,810 nodes that we encounter when we invoke DFS from the node six are the vertices six, 350 00:25:47,810 --> 00:25:55,362 one, and five. And all of these will have a leader vertex of six. Because that's 351 00:25:55,362 --> 00:25:58,974 where we called DFS from, when we first discovered these three nodes. And you'll 352 00:25:58,974 --> 00:26:06,095 notice this is another SCC of this directed graph. So we invoke DFS again 353 00:26:06,095 --> 00:26:09,444 now from a new node, the new node six, and what it discovered, the new nodes 354 00:26:09,444 --> 00:26:14,247 discovered, is exactly an SCC of the graph; nothing more, nothing less. So now 355 00:26:14,247 --> 00:26:18,212 we return to the outer for loop. We go to net five. Have we already seen five? Yes. 356 00:26:18,212 --> 00:26:23,106 Have we already seen four? No. We haven't seen four yet. So now we invoke DFS from 357 00:26:23,106 --> 00:26:27,041 four. Again, we could try to explore five, but we've already seen that before, we're 358 00:26:27,041 --> 00:26:30,123 not going to explore it again. So from four then, we have to go to two. From two, 359 00:26:30,123 --> 00:26:32,295 we have to go to three. From three, we have to go back to four, and then after 360 00:26:32,295 --> 00:26:36,863 all the back tracking, we're done. So the final call to DFS will be from the node 361 00:26:36,863 --> 00:26:44,220 four. And that DFS will discover precisely, newly discover precisely the 362 00:26:44,220 --> 00:26:49,912 nodes 2,3 and 4. They will all have the leader vertex four because that was where 363 00:26:49,912 --> 00:26:53,557 this DFS was called from. It's true we will go back to the for loop and we will 364 00:26:53,557 --> 00:26:56,822 check, have we seen three yet, yes, have we seen two yet, yes, have we seen one 365 00:26:56,822 --> 00:27:00,982 yet, yes, and then the whole thing completes. And what we see is that using 366 00:27:00,982 --> 00:27:06,269 the finishing times computed from that first depth-first search pass, somehow the 367 00:27:06,269 --> 00:27:10,883 strongly connected components of this graph just showed up and presented 368 00:27:10,883 --> 00:27:15,720 themselves to us, one at a time, on a silver platter. Every time we invoked DFS, 369 00:27:15,720 --> 00:27:21,297 the nodes we discovered newly were precisely one of the SCCs. Nothing more, 370 00:27:21,297 --> 00:27:25,437 nothing less. And that's really what's going on in this algorithm. It turns out 371 00:27:25,437 --> 00:27:29,296 this is true in general. The first pass, DFS on the reverse graph, computes 372 00:27:29,296 --> 00:27:34,414 finishing times so that if you then process nodes according to decrease ordering 373 00:27:34,414 --> 00:27:39,454 of finishing times in the second pass, each invocation to DFS will discover 374 00:27:39,454 --> 00:27:44,381 one new SCC and exactly one SCC. So they'll just present themselves to you one 375 00:27:44,381 --> 00:27:51,199 per DFS call in that second pass's for loop. This is, of course, merely an 376 00:27:51,199 --> 00:27:55,530 example. You should not just take a single example as proof that this algorithm 377 00:27:55,530 --> 00:27:59,445 always works. I will give you a general argument in the next video. But hopefully 378 00:27:59,445 --> 00:28:03,399 there's at least a plausibility argument. No longer does this three step algorithm 379 00:28:03,399 --> 00:28:08,850 seem totally insane, and maybe you could imagine perhaps it works. At least there's 380 00:28:08,850 --> 00:28:12,822 some principles going on where you first compute the right ordering to process 381 00:28:12,822 --> 00:28:18,313 the nodes, and, and then the second pass peels off SCCs one at a time like layers 382 00:28:18,313 --> 00:28:22,622 from an onion. One thing that I hope is pretty clear is that this algorithm 383 00:28:22,622 --> 00:28:30,214 correct or not, is blazingly fast. Pretty much all you do is two depth-first searches. 384 00:28:30,214 --> 00:28:34,635 And since depth-first search, as we've seen in the past runs in time linear in the 385 00:28:34,635 --> 00:28:39,773 size of the graph so does Kosaraju's two-pass algorithm. There are a couple of 386 00:28:39,773 --> 00:28:42,440 subtleties and I encourage you to think about this so you'll be forced to think 387 00:28:42,440 --> 00:28:47,085 about this in the programming project for week four. So for example, in the second 388 00:28:47,085 --> 00:28:51,540 pass how do you process the nodes in decreasing order of finishing time? You 389 00:28:51,540 --> 00:28:54,318 don't want to sort the notes by their finishing time cause that would take 390 00:28:54,318 --> 00:28:58,705 n log n time, so you need to make sure that you remember in the first pass that you 391 00:28:58,705 --> 00:29:01,901 sort of remember the nodes in a way that you can just do a linear scan through them 392 00:29:01,901 --> 00:29:04,961 in the second pass. So there are some details but if your intuition is that this 393 00:29:04,961 --> 00:29:08,437 is really just double DFS properly implemented that's pretty much exactly 394 00:29:08,437 --> 00:29:12,454 right. So having spelled out the full implementation, argued that it's 395 00:29:12,454 --> 00:29:15,806 definitely a linear time algorithm, and given at least a plausibility argument via 396 00:29:15,806 --> 00:29:21,939 an example, that it might conceivably be correct, let's now turn to the general argument.