Having mastered computing the connected components of an undirected graph in linear time, let's now turn our attention to directed graphs, which also arise in all kinds of different applications. Now the good news is we'll be able to get just a blazingly fast primitive for computing connectivity information for directed graphs. The bad news, if you wanna call it that, is we'll have to think a little bit harder. So it won't be as obvious how to do it, but by the end of this lecture you will know a linear time algorithm, with very good constants, based really just on depth-first search, for computing all of the pieces of a directed graph. In fact it's not even so obvious how to define pieces, how to define connected components, in a directed graph. It's certainly not as obvious as it was with undirected graphs. So to see what I mean, let's consider the following four-node directed graph. On the one hand, this graph is in some sense in one piece. If this were an actual physical object, made, say, of a bunch of strings tied to each other, and we picked it up with our hands, it wouldn't fall apart into two pieces; it would hang together in one piece. On the other hand, when you think about moving around this network, it's not connected in the sense that we might think about: you cannot get from any one point A to any other point B. For example, if you start at the rightmost node in this graph, certainly there's no directed path that will get you to the leftmost node. So what's typically studied, and most useful, for directed graphs is what's called strong connectivity. A graph is strongly connected if you can get from any one point to any other point, and vice versa. And the strongly connected components, informally, are the maximal portions of the graph, the maximal regions which are internally strongly connected. So, the maximal regions within which you can get from any one point A to any other point B along a directed path. For the record, let me go ahead and give you a formal definition, although, you know, this intuition is perfectly valid: just regions where you can get from anywhere to anywhere else. So we're going to define the strongly connected components of a directed graph, or the SCCs for short. And as in the undirected case, we're going to give a somewhat slick definition: rather than talking about maximal regions satisfying some property, we're going to talk about the equivalence classes of a particular equivalence relation. But really it means the same thing; this is just the more mathematically mature way of writing it down. So the equivalence relation is on the nodes of the graph, and we'll say that a node U is related to a node V if you can get from U to V via a directed path and also from V to U on some other directed path. I'm not gonna bore you with the verification that this is an equivalence relation. That's something that you can work out in the privacy of your own home. Just remember what it means to be an equivalence relation. It's reflexive, that is, everybody's related to itself, and of course there is a path from every node to itself. It also has to be symmetric, so if U is related to V, then V is related to U. Well, again, by definition we're saying that the vertices are mutually reachable from each other, so that's clearly symmetric. And then it has to be transitive.
And the way you prove it's transitive is you just paste paths together, and it works the way you'd expect it to. So let's illustrate this concretely with a somewhat more complicated directed graph. Here's a directed graph, and I claim that it has exactly four strongly connected components: there's a triangle on the left, a triangle on the right, this single node on top, and then this directed four-cycle with a diagonal at the bottom. So what I hope is pretty clear is that each of these circled regions is indeed strongly connected. That is, if you start from one node in one of these circled regions, you have a directed path to any other node in that region. That's certainly true because on a directed cycle you can get from any starting point to any other point, and each of these regions contains a directed cycle through all of its nodes; and there's the one strong component that has just a single node, which is obviously strongly connected. What I also claim is true is that all of these regions are maximal subject to being strongly connected. That's why they're the strongly connected components; that's why they're the equivalence classes of this equivalence relation we just defined. So, if you take any pair of nodes which lie in two different circles, either you won't have a directed path from the first one to the second one, or you won't have a directed path from the second one back to the first one. In fact, the structure of the strong components in this black graph exactly mirrors the directed acyclic graph that we started with in red. So in the same way that, in the red four-node graph, you can't move from right to left, here in this bigger graph you can't move from any of the circled SCCs on the right to any of the circled SCCs on the left. So, for example, from the rightmost nodes, there are no directed paths to the leftmost nodes. So that's a recap of the definition of the strongly connected components. I've motivated in a separate video some reasons why you might care about computing strongly connected components, in particular on extremely large graphs, which motivates the need for blazingly fast subroutines, for-free primitives that will let you compute connectivity information. You'd love to be able to know the pieces of a directed graph; maybe you don't even know why they're going to be useful, but you just compute them, because why not? It's a for-free primitive. So that's what I'm gonna give you in this lecture. The algorithm we're gonna present is based on depth-first search, and that's gonna be the main reason why it's going to be so blazingly fast, because depth-first search is blazingly fast. Now you might be wondering, what on earth does graph search have to do with computing components? They don't seem obviously related, so let's return to the same directed graph that I showed you on the previous slide, to see why something like depth-first search might conceivably have some use for computing strong components. Suppose we call depth-first search with this red node as the starting point. What would it explore? Remember what the guarantee of something like depth-first search, or breadth-first search for that matter, is: you find everything that's findable, but naturally nothing else. So what is findable from this red node? Where by findable I mean you can reach it via a directed path emanating from this red node. Well, there's not much you can do: from here, you can explore this arc, and you can explore this arc, and then you can go backward.
And so if you do DFS or BFS from this node, you're going to find precisely the nodes in this triangle. All of the other arcs involved go in the wrong direction, and they won't be traversed by, say, a depth-first search call. So, why is that interesting? What's interesting is that if we invoke DFS from this red node, or from any of the three nodes of this triangle, then it's going to discover precisely this strongly connected component, precisely the three nodes in this circled SCC. So that seems really cool. It seems like maybe we just do a DFS, and boom, we get an SCC. So maybe if we can do that over and over again, we'll get all the SCCs. That's a good initial intuition, but something can go wrong. Suppose that instead of initiating DFS from one of these three nodes on the triangle, we, say, initiate it from this bottommost node in green. So remember, what is the guarantee of a graph search subroutine like DFS? It will find everything findable, but of course nothing more. So what's findable from this green node? Well, naturally, everything in its own SCC; the four nodes here, it will certainly find those four nodes. On the other hand, if we start from this green node, since there are arcs that go from this bottommost SCC to the rightmost SCC, not only will this DFS call find the four nodes in the green node's strong component, but it will also traverse these blue arcs and discover the three nodes in the red triangle. So, if we call DFS from this green node, we'll capture all seven of these nodes. The point is, if we call DFS, it looks like we're going to get a union of possibly multiple SCCs. In fact, in the worst case, if we invoke DFS from the leftmost node, what's it going to discover? It's going to discover the entire graph, and that gives us no insight into the strong component structure at all. So the takeaway point is: if you call DFS from just the right place, you'll actually uncover an SCC; if you call it from the wrong place, it will give you no information at all. So the magic of the algorithm that we're gonna discuss next is we'll show how, in a super slick preprocessing step, which ironically is itself a call to depth-first search, we can in linear time compute exactly where we want to start the subsequent depth-first searches from, so that each invocation gets us exactly one strongly connected component, and nothing more. So the algorithm that I'm going to show you is due to Kosaraju, and we'll show the following theorem: the strongly connected components of a directed graph can be computed in linear time. And as we'll see, the constants are also very small; it's really just gonna be two passes of depth-first search. And again, let me remind you that for many problems it's natural to assume that the number of edges is at least the number of nodes, if you're only interested in cases where the graph is connected. Of course, when you're computing connected components, that's one of the most natural cases where you might have a super sparse, broken-up graph. So we're not going to assume m is at least n. That's why linear time is going to be m + n, because that's the size of the input, and we don't know which is bigger: m could be bigger than n, or n could be bigger than m. We have no idea. Kosaraju's algorithm is shocking in its simplicity. It has three steps; let me tell you what they are. First, very mysteriously, we're going to reverse all of the arcs of the given graph.
Totally not clear why that would be an interesting thing to do yet. Then, we're going to do our first pass, our first depth-first search, and we're going to do it on the reversed graph. Now, the naive way to implement this would be to literally construct a new copy of the input graph, with all of the arcs in the reverse direction, and then just run depth-first search on it. Of course, the obvious optimization would be to just run DFS on the original graph, but going across arcs backward. So I'll let you think through the details of how you'd do that, but it just works: you run DFS, and instead of going forward along edges, you go backward along edges. That simulates depth-first search on the reversed graph. Now, I've written here DFS loop, and that just means the usual trick where, to make sure that you see all of the nodes of the graph even if it's disconnected, you have an outer loop where you just try each starting point separately; if you haven't already seen it, then you run DFS from that given node. I'll be more detailed on the next slide. And the third step is just to run depth-first search again, but this time on the original graph. Now at this point, you should be thinking that I'm totally crazy, right? What are we trying to do? We're trying to compute these strongly connected components. We're trying to actually compute real objects, these maximal strongly connected regions. And all I'm doing is searching the graph, once forward and once backward; it doesn't really seem like I'm computing anything. So, here's the catch, and it's a very minor catch. We're gonna need to do a little bit of bookkeeping, but it's gonna be very little overhead, so we'll still have a blazingly fast algorithm. With a little bit of bookkeeping, here's what's gonna happen: the second depth-first search, which searches the original graph, will, in its search process, discover the strongly connected components one at a time in a very natural way, and that'll be really obvious when we do an example, which we'll do in just a second. Now, for the second depth-first search to work in this magical way, where it just discovers the strongly connected components one at a time, it's really important that it executes the depth-first searches in a particular order, that it goes through the nodes of the graph in a particular order. And that is exactly the job of the first pass. The depth-first search on the reversed graph is going to compute an ordering of the nodes such that, when the second depth-first search goes through them in that order, it will just discover the SCCs one at a time, in linear time. So let me say a little bit more about the form of the bookkeeping, and then I'll show you how that bookkeeping is maintained as we do depth-first search. So we're gonna have a notion of a finishing time of a vertex, and that's gonna be computed in the first pass, when we do depth-first search on the reversed graph. And we're going to make use of this data in the second pass. So, rather than just going through the nodes of the graph in an arbitrary order, like we usually do when we have a looped depth-first or breadth-first search, we're going to make sure that we go through the vertices in decreasing order of these finishing times. Now there's still the question: in what sense does this second depth-first search discover and report the strongly connected components that it finds?
So we're going to label each node in this second pass with what we call a leader, and the idea is that nodes in the same strongly connected component will be labeled with exactly the same leader node. Again, all of this will be much clearer once we do a concrete example, but I want to have it down for the record right now. So that's the algorithm at a high level: it's really just two passes of DFS with some bookkeeping. But this is underspecified; you really shouldn't understand how to implement the algorithm just yet. So what do I owe you? I owe you exactly what I mean by DFS loop, although you've seen this more or less in the past: it's just a loop over all the vertices of the graph, and if you haven't seen a vertex yet, you DFS from that starting point. I need to tell you what finishing times are and how they get computed; they're just gonna be the integers one to N, recording basically when depth-first search gets finished with each of the nodes. And then I need to tell you how to compute these leaders. So let me show you all three of those things on the next slide. The workhorse for Kosaraju's strongly connected components algorithm is this DFS loop subroutine, and it takes as input a graph. Okay, so it does not take as input a starting node; it's gonna loop over possible starting nodes. Now for the bookkeeping, to compute finishing times, we're going to keep track of a global variable that I'll call T, which we initialize to zero. The point of T is to count how many nodes we have totally finished exploring so far. This is the variable we use to compute those finishing times in the first pass, that magical ordering that I was talking about. Now we're also gonna have a second global variable, S, to compute these things I was calling leaders, and these are only relevant for the second pass. What S is going to keep track of is the most recent vertex from which a DFS was initiated. To keep the code simple I'm just going to do all of the bookkeeping in DFS loop, but really DFS loop gets called twice, once on the reversed graph and once on the forward graph; we only need to compute the finishing times in the first pass, on the reversed graph, and we only need to compute the leaders in the second pass, on the forward graph. But let's just keep them both in there, just for kicks. Now, we're going to need to loop through the vertices, and so the question is, in what order are we going to loop through the vertices? That's gonna happen differently in the two different passes, but let me just use some common notation. Let's just assume in this subroutine that the nodes are somehow labeled from one to N. In our first depth-first search, the labeling is totally arbitrary; these are basically just the names of the nodes, or their positions in the node array, whatever, you just do it in some arbitrary order. The second time we run DFS loop, as indicated on the previous slide, we're gonna use the finishing times as the labeling, and as we'll see, the finishing times are indeed numbers between one and N. So now what we do is we just iterate through the nodes in decreasing order, and if we haven't already seen node I, then we initiate a DFS from it. As usual, we're gonna maintain a boolean for each node to keep track of whether we've already seen it in one of the DFS calls. Now remember, the global variable S is responsible for keeping track of the most recent node from which a depth-first search has been initiated.
So if I is not explored and we initiate a depth-first search from it, we'd better reset S to I. And then we do the usual DFS in G, starting from the source node I. So for completeness, let me just remind you what the depth-first search subroutine looks like. We're given a graph and a starting node, and the first time we see a node, we mark it as explored. Just a side note: once a node is marked explored, it stays explored for the entire invocation of DFS loop. So even if the DFS from a given node I finishes and then the outer for loop marches on and encounters I again, it's still gonna be marked as explored. Now, one of our bookkeeping jobs is to keep track of the vertex from which the DFS that discovered I was called. So when I is first encountered, we remember that S was the node from which this DFS originated, and that, by definition, is the leader of I. And then we do what we always do with depth-first search: we immediately look at the arcs going out of I, and we try to recursively DFS on any of those, although, of course, we don't bother to do it if we've already seen those nodes. Now once this for loop has completed, once we've examined every outgoing arc from I, and for each such node J we've either already seen it in the past or we've recursively explored from J and returned, at that point we call ourselves done with node I. There are no more outgoing arcs to explore; we think of I as being finished. Remember, T is the global variable that's keeping track of how many nodes we're done with. So we increment T, because now we're done with I, and we also remember that I was the Tth vertex we finished with. That is, we set I's finishing time to be T. Because depth-first search is guaranteed to visit every node exactly once, and therefore to finish with every node exactly once, this global counter T will have value one when the first node is finished, value two when the next node gets finished, then value three, and so on. When the final node gets finished, it'll have value N. So the finishing times of the nodes are gonna be exactly the integers from one to N. Let's make this much more concrete by going through a careful example. In fact, I think it will be better for everybody if you yourself trace through part of this algorithm on a concrete example. So let me draw a nine-node graph for you. To be clear, let's assume that we've already executed step one of the algorithm, that is, we've already reversed the graph. This blue graph that I've drawn on the slide is the reversal; we've already reversed the arcs. Moreover, the nodes are labeled in some arbitrary way from one to nine. Just assume this is how they show up in the node array, for example. And remember, in the DFS loop routine we're supposed to process the nodes from top to bottom, from N down to one. So my question for you is: in the second step of the algorithm, when we run DFS loop and process the nodes from the highest label, nine, in order down to the lowest label, one, what are the finishing times that we're going to compute? Now, it is true that you can get different finishing times depending on the different choices that DFS loop has to make about which outgoing arc to explore next. But I've given you four options for what the finishing times of the nodes one, two, three, all the way up to nine, respectively, might be, and only one of these four could conceivably be output as the finishing times of DFS loop on this graph.
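While you think that over, here is one way the DFS loop and DFS routines just described might look when written out. This is a minimal sketch in Python with names of my own choosing, not code from the lecture; it assumes the graph is given as a dict mapping each node label 1..N to the list of heads of its outgoing arcs, and for very deep graphs the recursion would need to be replaced by an explicit stack.

```python
# Sketch of DFS-Loop with the bookkeeping described above (t plays the role of
# the lecture's T, s the role of S).  Graph format: {node: [heads of outgoing arcs]}.
def dfs_loop(graph, n):
    explored = set()
    finishing_time = {}   # node -> finishing time in 1..n (used in the first pass)
    leader = {}           # node -> leader vertex (used in the second pass)
    t = 0                 # number of nodes completely finished so far
    s = None              # most recent vertex from which a DFS was initiated

    def dfs(i):
        nonlocal t
        explored.add(i)
        leader[i] = s
        for j in graph.get(i, []):          # examine every outgoing arc of i
            if j not in explored:
                dfs(j)
        t += 1                              # i is finished: no arcs left to explore
        finishing_time[i] = t

    for i in range(n, 0, -1):               # process nodes in decreasing label order
        if i not in explored:
            s = i
            dfs(i)

    return finishing_time, leader
```

Which finishing times come out depends on the order in which each adjacency list happens to be stored, and that is exactly the tie-breaking freedom the quiz question is getting at.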
So, which one is it? Alright, so the answer is the fourth option. That is the only one of these four sets of finishing times that you might see computed by DFS loop on this blue graph. So let's go ahead and trace through DFS loop and see how we might get this set of finishing times. Remember, in the main loop we start from the highest node, nine, and then we descend down to the lowest node, one. So we start by invoking DFS from the node nine, and we mark nine as explored. From here there's only one outgoing arc, so there's only one place we can go: we go to six and mark six as explored. Now there are two places we can go next: we can either go to three or we can go to eight, and in general DFS could do either one. To generate this fourth set of finishing times, I'm going to need to assume that we go to three first. So again, what we're assuming DFS does is it starts at nine, then it has to go to six, marking those as explored, and then it goes to three. It does not go to eight first; it goes to three first. Now from three there's only one outgoing arc, which goes to nine, but nine we've already marked as explored, so it's not going to re-explore nine; it's going to skip that arc. Since that's three's only outgoing arc, that for loop completes, and so three is the first node to finish. When we finish with three we increment T: it started at zero, now it's one. And we set the finishing time of three to be one, just like we said it was in the example. So now we backtrack to six. We have another outgoing arc from six to explore, so now we go to eight. From eight, we have to go to two. From two, we have to go to five. From five, we have to go to eight. Eight we've already seen, so then we're done with five, because that was its only outgoing arc. So we increment T, now it's two, and the finishing time of five is going to be two, as promised. Now we backtrack to two. There are no more outgoing arcs from two, so two is gonna be the third one that we finish, as promised. Then we finish with eight, so eight is gonna be the fourth node to be done, as promised. Now we backtrack to six; we're done with six, the fifth node to be completed, as promised. And finally we get all the way back to where we started, at nine, and nine is the sixth node to be completed, as promised. Now, if we were computing those leaders, all of these nodes would get the leader nine. But again, the leaders are only relevant for the second pass, so we're just going to ignore the leaders as we do this trace; we're just going to keep track of the finishing times. Now we're not done: all we did is finish the DFS invoked from the node nine, and we found six of the nodes total in that depth-first search. So now we return to the outer for loop and decrement I. It started at nine, and we're done with that, so now we go down to eight and ask: have we already seen eight? Yes, eight's already explored, so we skip it. We decrement I down to seven and ask: have we already seen node seven? No, we have not; seven is not yet explored, so we invoke DFS now from node seven. Seven has two outgoing arcs: it can either go to four or it can go to nine. Let's say it checks the outgoing arc to nine first. Now, nine we already explored. Granted, that was in an earlier part of the for loop, but we remember that.
We keep track of who got explored in previous iterations of the for loop, so we don't bother to re-explore nine; we skip that arc. So now from seven we have to go to four. From four we have to go to one. From one we have to go back to seven. Seven's already been explored, so we backtrack, and now we're done with one. So one is the next node we complete, and the finishing time of one is going to be seven, as promised. We backtrack to four. There are no more outgoing arcs from four to explore, so four is going to be the eighth one to finish, as promised. And then the last one to finish is poor node seven; it is last. So that's an example of how the DFS loop subroutine computes finishing times on a reversed graph. Now let's work through the second pass, on the forward version of the graph, using the same example. Remember, the point of the first pass is to compute a magical ordering, and the magical ordering is these finishing times. So now we're going to throw out the original node names, and we're going to replace the node names in blue by the finishing times in red. We're also going to work with the original graph, which means we have to reverse the arcs back to where they were originally. So those are the two changes that you're going to see when I redraw this graph: first of all, all the arcs will reverse orientation; second of all, all the nodes will change names from their original ones to the finishing times that we just computed. So here's our new graph with the new node names and all the arcs with their orientation reversed. And now we run DFS again on this graph, and again we're going to process the nodes in order from the highest label, nine, down to the lowest label, one. Moreover, we don't need to compute finishing times in the second pass; we only need to do that in the first pass. In the second pass we have to keep track of the leaders, and remember, the leader of a vertex is the vertex from which the DFS that first discovered it was called. Alright, so what's going to happen? Well, in the outer for loop, we're going to start with I equal to nine, and we invoke DFS from the node nine. So nine is gonna be the current leader, because that's where the current DFS got initiated. Now from nine there's only one choice: we have to go to seven. From seven there's only one choice: we have to go to eight. From eight there's only one choice: we have to go back to nine. And then nine's already been seen, so we backtrack. We go back to eight, we go back to seven, we go back to nine, and that's it. So when we invoke DFS from node nine, the only things that we encounter are the nodes seven, eight, and nine, and these are all going to be given the leader vertex nine. You will notice that this is indeed one of the strongly connected components of the graph; we just found it with this invocation of DFS from the node nine. So now we go back to the outer for loop, and we say, okay, let's go to node eight. Have we already seen eight? Yes. What about seven? Have we already seen seven? Yes. What about six? Have we already seen six? We have not; we have not yet discovered six. So we invoke DFS from node six, and we reset the global source vertex S to six. From six we can go to nine or we can go to one. Let's say we explore nine first. Well, we already saw nine in an earlier iteration of the for loop, so we don't explore it again; we don't discover nine now. So we backtrack to six and go to one. From one we have to go to five.
From five we have to go to six, and then we start backtracking again. So the only new nodes that we encounter when we invoke DFS from the node six are the vertices six, one, and five, and all of these will have a leader vertex of six, because that's where we called DFS from when we first discovered these three nodes. And you'll notice this is another SCC of this directed graph. So we invoked DFS again, now from the new node six, and the new nodes it discovered are exactly an SCC of the graph, nothing more, nothing less. So now we return to the outer for loop. We go to node five. Have we already seen five? Yes. Have we already seen four? No, we haven't seen four yet. So now we invoke DFS from four. Again, we could try to explore five, but we've already seen that before, so we're not going to explore it again. So from four, then, we have to go to two. From two, we have to go to three. From three, we have to go back to four, and then, after all the backtracking, we're done. So the final call to DFS is from the node four, and that DFS will newly discover precisely the nodes two, three, and four. They will all have the leader vertex four, because that's where this DFS was called from. It's true we will go back to the for loop and check: have we seen three yet? Yes. Have we seen two yet? Yes. Have we seen one yet? Yes. And then the whole thing completes. And what we see is that, using the finishing times computed from that first depth-first search pass, somehow the strongly connected components of this graph just showed up and presented themselves to us, one at a time, on a silver platter. Every time we invoked DFS, the nodes we newly discovered were precisely one of the SCCs, nothing more, nothing less. And that's really what's going on in this algorithm; it turns out this is true in general. The first pass, DFS on the reversed graph, computes finishing times so that if you then process nodes in decreasing order of finishing times in the second pass, each invocation of DFS will discover exactly one new SCC. So they'll just present themselves to you, one per DFS call, in that second pass's for loop. This is, of course, merely an example; you should not take a single example as proof that this algorithm always works. I will give you a general argument in the next video, but hopefully this is at least a plausibility argument. No longer does this three-step algorithm seem totally insane, and maybe you can imagine that perhaps it works. At least there's some principle going on, where you first compute the right ordering in which to process the nodes, and then the second pass peels off SCCs one at a time, like layers from an onion. One thing that I hope is pretty clear is that this algorithm, correct or not, is blazingly fast. Pretty much all you do is two depth-first searches, and since depth-first search, as we've seen in the past, runs in time linear in the size of the graph, so does Kosaraju's two-pass algorithm. There are a couple of subtleties, and I encourage you to think about them; you'll be forced to think about them in the programming project for week four. For example, in the second pass, how do you process the nodes in decreasing order of finishing time? You don't want to sort the nodes by their finishing times, because that would take n log n time, so in the first pass you need to remember the nodes in a way that lets you just do a linear scan through them in the second pass.
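To make that concrete, here is a minimal sketch of the whole two-pass algorithm, again in Python with illustrative names of my own; the dict-of-adjacency-lists representation is an assumption, not something fixed by the lecture. Instead of sorting by finishing time, the first pass simply appends each node to a list at the moment it finishes, and the second pass scans that list backward, which is exactly a linear scan in decreasing order of finishing time.

```python
def kosaraju_scc(graph, n):
    # Step 1: build the reversed graph (or, equivalently, traverse arcs backward).
    rev = {v: [] for v in range(1, n + 1)}
    for u in range(1, n + 1):
        for v in graph.get(u, []):
            rev[v].append(u)

    # First pass: DFS loop on the reversed graph, recording nodes in finishing order.
    explored = set()
    finish_order = []                 # i-th entry is the node with finishing time i+1

    def dfs1(i):
        explored.add(i)
        for j in rev[i]:
            if j not in explored:
                dfs1(j)
        finish_order.append(i)        # i just finished; no sorting needed later

    for i in range(n, 0, -1):
        if i not in explored:
            dfs1(i)

    # Second pass: DFS loop on the original graph, in decreasing finishing time.
    explored = set()
    leader = {}

    def dfs2(i, s):
        explored.add(i)
        leader[i] = s
        for j in graph.get(i, []):
            if j not in explored:
                dfs2(j, s)

    for i in reversed(finish_order):
        if i not in explored:
            dfs2(i, i)

    return leader                     # nodes sharing a leader form one SCC
```

On the nine-node example from this lecture, this groups the vertices into the same three SCCs we just traced by hand, each tagged with whichever of its members the second pass happened to start from. As with the earlier sketch, for graphs deep enough to exhaust Python's recursion limit, the recursive calls would need to be replaced by an explicit stack.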
So there are some details, but if your intuition is that this is really just double DFS, properly implemented, that's pretty much exactly right. So having spelled out the full implementation, argued that it's definitely a linear-time algorithm, and given at least a plausibility argument, via an example, that it might conceivably be correct, let's now turn to the general argument.