So the goal of this video is to prove the
correctness of Kasaraju two-pass,
depth-first-search based, linear time
algorithm that computes the strongly
connected components of a directed graph.
So I've given you the full specification
of the algorithm. I've also given you a
plausibility argument of why it might
work, in that at least it does something
sensible on an example. Namely, it first
does a pass of depth first search on the
reverse graph. It computes this magical
ordering. And what's so special about this
ordering is then when we do a depth first
search using this ordering on the forward
graph; it seems to do exactly what we
want. Every indication of depth first
search to some new node discovers exactly
the nodes of the strong component and no
extra stuff. Remember that was our first
observation, but that was unclear whether
depth for search would be useful or not
for computing strong components. If you
call depth first search from just the
right place, you're gonna get exactly the
nodes of an SCC and nothing more. If you
call it from the wrong place, you might
get all of the nodes of the graph, and get
no information at all about the structure of the strong components and at least in
this example this first pass with the
finishing time seems to be accomplishing
seems to be leading us to invoking
DFS from exactly the right
places. So remember how this worked in the
example so in the top graph, I have
shown you the graph with the arch
reversed. This is where we first invoked
DFS loop with the loop over the nodes
going from the highest node name nine all
the way down to the node name one. And
here we compute finishing time that's the
bookkeeping that we do in the first pass
so we just keep the running count of how
many nodes we've finished processing. That
is how many we've both explored that node
as well as explore all of the outgoing
arches and so that gives us these numbers
in the red, these finishing times between
one and nine for the various
nodes. Those became the new node names in
the second graph and then we reverse the
arches again and get the original graphs
back and then we saw that every time we
invoked DFS in our second pass we
uncovered exactly the nodes of an SCC. So
when we invoked it from the node 9 we
discovered that 9, 8 and 7 those
have a leader vortex 9. Then when we
next invoked DFS from 6, we discovered
6, 5, and 1, and nothing else. And
then finally we invoked it from 4, and
we discovered 2, 3, and 4, and
nothing else. And those are exactly the
three, SCCs of this graph. So let's now
understand why this works in any directed
graph, not just this, in this one example.
So let's begin with a simple observation
about directed graphs, which is actually
interesting in its own right. The claim is
that every directed graph has two levels
of granularity. If you squint, if you sort
of zoom out, then what you see is a
directed acyclic graph, of course
comprising its strongly connective
components. And if you want, you can zoom
in and focus on the fine grain structure
with one SCC. A little bit more precisely.
The claim is of the strongly connected
components of a directed graph induce in
a natural way an acyclic metagraph. So
what is this metagraph? What are the nodes
and what are the arcs, what are the edges?
Well, the metanodes are just the SCCs, so
we think of every strong connected
component as being a single node in this
metagraph. So call them say 'C1' up to 'Ck'. So what are the arcs in this metagraph?
Well, they're basically just the ones
corresponding to the arcs between SCCs in
the original graph. That is, we include in
the meta graph an arc from the strong
component 'C' to 'C-hat'
if and only if there's an arc from a node
in 'C' to a node in 'C-hat' in the original
graph 'G'. So for example if this is your 'C'
and so the triangle is your 'C-hat' and you
have one or maybe multiple edges going
from 'C' to 'C-hat', then in the corresponding
metagraph your just gonna have a node for
'C', a node for 'C-hat' and the directed
arch from 'C' to 'C-hat'. So if we go back to
some of the directed graphs that we've
used as running examples so we go back to
the beginning of the
previous video and it's look maybe something
like this the corresponding directed acyclic
graph has four nodes and four
arches. And for the running example we used
to illustrate Kosaraju's algorithm with
the three triangles, the corresponding
metagraph would just be a path with three
nodes. So why is this meta-graph
guaranteed to be acyclic? Well, remember
metanodes correspond to strong
components, and in a strong component you
can get from anywhere to anywhere else.
So, if you had a cycle that involved two
different metanodes, that is two
different strong connected components,
remember on a directed cycle you can also
get from anywhere to anywhere else. So if
you had two supposedly distinct SCCs,
that you could get from the one to the
other and vice versa, they would collapse
into a single SCC. You can get from
anywhere to anywhere in one, anywhere from
anywhere in the other one, and you can
also go between them at will, so you can get from
anywhere in this union to anywhere in the
union. So not just in this context of
competing strong components but also just
more generally, this is a useful fact to
know about directed graphs. On the one
hand, they can have very complex structure
within a strong components. You have paths
going from everywhere to everywhere else,
and it may be sort of complicated looking.
But at a higher level, if you abstract out
to the level of SCCs, you are guaranteed
to have this simple DAG, this simple
directed acyclic graph structure. So, to
reinforce these concepts, and also segue
into thinking about Kosaraju's algorithm
in particular, let me ask you a question
about how reversing arcs affects the
strong components of a directed graph. So
the correct answer to this quiz is the
fourth one. The strong components are
exactly the same as they were before, in
fact the relation that we described is
exactly the same as it was before so
therefore the equivalence classes of the
strong components is exactly the same. So
if two nodes were related in
the original graph, that is a path from U
to V and a path from V to U, that's still
true after you reverse all the arcs, you
just use the reversal of the two paths
that you had before. Similarly if the two
nodes weren't related before, for example
because you could not get from U to V,
well that after you reverse everything,
then you can't get from V to U, so again you don't have this relation holding, so
the SCCs are exactly the same in the
forward or the backward graph. So in
particular in Kazarogi's algorithm, the
strong component structure is exactly the
same in the first pass of DFS and in the
second pass of DFS. So now that we understand
how every directed graph has a meta graph with the nodes correspond to a strong
connected components, and you have an arch
from one SCC to another if there's any
arch from any node in that SCC to the
other SCC in the original graph, I'm in a
position to state what's the key lemma.
That drives the correctness of Kosaraju's
two pass algorithm for computing the
strong connected component of a directed graph.
So here's the lemma statement. It
considers two strongly connecting
components that are adjacent, in the sense
that there's an arc from one node in one
of them to one node in the other one. So
let's say we have one SCC - 'C1', with a
node I, and another, SCC 'C2' with a node
J, and that in G, in the graph, there's an
arc directly from I to J. So in this sense
we say that these SCCs are adjacent, with
the second one being in some sense after
the first one. Now let's suppose we've
already run the first pass of the DFS loop
subroutine. And remember that works on the
reverse graph. So we've invoked it on the
reverse graph. We've computed these
finishing times. As usual we'll let f(v)
denote the finishing times computed in
that depth first search subroutine on the
reverse graph. The lemma then asserts the
following. It says first, amongst all the
nodes in 'C1' look at the one with the
largest finishing time. Similarly amongst
all nodes in 'C2' look at the one with the
biggest finishing time. Amongst all of
these the claim is that the biggest
finishing time will be in 'C2' not in 'C1'. So
what I wanna do next is I wanna assume that
this lemma is true temporarily. I wanna
explore the consequences of that
assumption and in particular what I wanna
show you is that if this lemma holds, then
we can complete the proof of correctness
of Kosaraju's two-pass SCC computation
algorithm. Okay, so if the lemma is true
then after... I'll give you the argument
about why we're done. About why we just
peel off the SCC one at
a time with the second pass of depth first
search. Now of course a proof with a hole
in it, isn't a proof. So at the end of the
lecture I'm gonna fill in the hole. That
is, I'm gonna supply a proof of this key
lemma. But for now, as a working
hypothesis, let's assume that it's true.
Let's begin with a corollary, that is a
statement which follows essentially
immediately, from the statement of a lema.
So for the corollary, let's forget about
just trying to find the maximum
maximum finishing time in a single SCC.
Let's think about the maximum finishing
time in the entire graph. Now, why do we
care about the maximum finishing time in
the entire graph? Well, notice that's
exactly where the second pass of DFS
is going to begin. Right, so
it processes nodes in order from largest
finishing time to smallest finishing time.
So equivalently, let's think about the
node at which the second pass of depth
first search is going to begin, i.e., the
node with the maximum finishing time.
Where could it be? Well, the corollary is
that it has to be in what I'm gonna call a
sink, a strongly connected component, that
is a strongly connected component without
any outgoing arcs. So for example
let's go back to the, meta graph of SCCs
for the very first directed graph we
looked at. You recall in the very first
direc ted graph we looked at in when we
started talking about this algorithm there
were four SCCs. So there was a 'C1', a 'C2',
a 'C3', and a 'C4'. And of course within each
of these components, there could be
multiple nodes but they are all strongly
connected to each other. Now, let's use
F1, F2, F3 and F4 to denote the maximum
finishing time in each of these SCCs. So
we have F1, F2, F3 and F4. So, now we
have four different opportunities to apply
this lemma. Right? Those four different
pairs of adjacent SCCs. And so, what do
we find? We find that well, comparing F1 and F2, because C2 comes after C1,
that is there's an arc from C1 to
C2, the max finishing time in C2 has
to be bigger than that in C1. That is F2 is bigger than F1. For the same
reasoning F3 has to be bigger than F1.
Symmetrically we can apply the limit to
the pair C2, C4 and C3, C4 and we get that
F4 has to dominate both of them. Now
notice we actually have no idea whether F2
or F3 is bigger. So that pair we can't
resolver. But we do know these
relationships. Okay F1 is the smallest and
F4 is the smallest [biggest!!]. And you also notice
that C4 is a sink SCC and the sink has no
outgoing arches and you think about it
that's a totally general consequence of
this lema. So in a simple group of
contradiction will go as follows. Consider
this SCC with
the maximum F value. Suppose it was not a
sink SCC that it has an outgoing arch,
follow that outgoing arch to get some
other SCC by the lema the SCC you've got
into has even bigger maximum finishing
time. So that contradicts the fact that
you started in the SCC with a maximum
finishing time. Okay. So just like in this
cartoon, where the unique sink SCC has
to have the largest finishing time, that's
totally general. As another sanity check,
we might return to the nine node graph,
where we actually ran Kasaraja's
algorithm and looking at the ford
version of the graph, which is the one on
the bottom, we see that the maximum
finishing times in the three FCC are 4,6
and 9. And it turns out that the same
as the leader nodes which is not an
accident if you think about it for a
little while and again you'll observe the
maximum finishing time in this graph
namely 9 is indeed in the left most SCC
which is the only SCC with no outgoing
arks. Okay but it's totally general
basically you can keep following arks and
you keep seeing bigger and bigger
finishing times so the biggest one of all
it has to be somewhere where you get stuck
where you can't go forward but there's no
outgoing arks and that's what I'm calling
a sink SCC. Okay. So assuming the lemma is true we know that the corollary
is true. Now using this corollary
let's finish the proof of correctness, of
Kasaraja's algorithm, module over proof
of the key lima. So I'm not going to do
this super rigorously, although everything
I say is correct and can be made, made
rigorous. And if you want a more rigorous
version I'll post some notes on the course
website which you can consult for more
details. So what the previous corollary
accomplished, it allows us to locate, the
node with maximum finishing time. We can
locate it in somewhere in some sink
SCC. Let me remind you about the
discussion we had at the very beginning of
talking about computing strong components.
We're tryna understand depth-first search
would be a useful workhorse for finding
the strong components. And the key
observation was that it depends, where you
begin that depth-first search. So for
example in this, graph with four SCC's
shown in blue on the right. A really bad
place to start. DFS called Depth For
Search would be somewhere in C1. Somewhere
in this source SCC, so this is a bad DFS.
Why is it bad? Well remember what Depth
For Search does; it finds everything
findable from its starting point. And from
C1 you can get to the entire world, you
can get to all the nodes in the entire
graph. So you can discover everything. And
this is totally useless because we wanted
to discover much more fine-grain
structure. We wanted to discover C1, C2,
C3 and C4 individually. So that would be
an disaster if we invoked depth first
search somewhere from C1. Fortunately
that's not what's going to happen, right?
We computed this magical ordering in the
first pass to insure that we look at the
node with the maximum finishing time and,
by the corollary, the maximum finishing
time is going to be somewhere in C4.
That's gonna be a good DFS, in the sense
that, when we start exploring from
anywhere in C4, there's no outgoing arcs.
So, of course, we're gonna find everything
in C4. Everything in C4's strongly
connected to each other. But we can't get
out. We will not have the option of
trespassing on other strong components,
and we're not gonna find'em. So we're only
gonna find C4, nothing more. Now, here's
where I'm gonna be a little informal.
Although, again, everything I'm gonna say
is gonna be correct. So what happens now,
once we've discovered everything in C4?
Well, all the nodes in C4 get marked as
explored, as we're doing depth first
search. And then they're basically dead to
us, right? The rest of our depth first
search loop will never explore them again.
They're already marked as explored. If we
ever see'em, we don't even go there. So
the way to think about that is when we
proceed with the rest of our for loop in
DFS loop it's as if we're starting
afresh. We're doing depth first search
from scratch on a smaller graph, on the
residual graph. The graph G with this
newly discovered strong component 'C<i>'</i>
deleted. So in this example on the right,
all of the nodes in C4 are dead to us and
it's as if we run DFS anew, just on the
graph containing the strong components C1,
C2 and C3. So in particular, where is the
next indication of depth first search
going to come from? It's going to come
from some sink SCC in the residual graph,
right? It's going to start at the node
that remains and that has the largest
finishing time left. So there's some
ambiguity in this picture. Again recall we
don't know whether F2 is bigger or F3 is
bigger. It could be either one. Maybe F2
is the largest remaining finishing time in
which case the next DFS indication's gonna
begin somewhere more from C2. Again, the
only things outgoing from C2 are these
already explored nodes. Their effectively
deleted. We're not gonna go there again.
So this is essentially a sink FCC. We
discover, we newly discover the nodes in
C2 and nothing else. Those are now
effectively deleted. Now, the next
indication of DFS will come from somewhere
in F3, somewhere in C3. That's the only
remaining sink SCC in the residual graph.
So the third call, the DFS will discover
this stuff. And now, of course, we're left
only with C1. And so the final indication
of DFS will emerge from and discover the
nodes in C1. And in this sense because
we've ordered the nodes by finishing times
when DFS was reverse graph, that ordering
has this incredible property that when you
process the nodes in the second pass we'll
just peel off the strongly connected
components one at a time. If you think
about it, it's in reverse topological
order with respect to the directed asypric
graph of the strongly connected
components. So we've constructed a proof
of correctness of Kosaraju's, algorithm
for computing strongly connected
components. But again, there's a hole in
it. So we completed the argument assuming
a statement that we haven't proved. So
let's fill in that last gap in the proof,
and we'll we done. So what we need to do
is prove the key lemma. Let me remind you
what it says. It says if you have two
adjacent SCCs, C1 and C2 and is an arc from a
node in C1, call it 'I' to a node in C2, say
J. Then the max finishing time in C2 is
bigger than the max finishing time in C1. Where, as always, these finishing
times are computed in that first pass of
depth-first search loop in the reversed
graph. All right, now the finishing times
are computed in the reversed graph, so
let's actually reverse all the arcs and
reason about what's happening there. We
still have C1. It still contains the
node I. We still have C2, which still
contains the node J. But now of course the
orientation of the arc has reversed. So
the arc now points from J to I. Recall we
had a quiz which said, asked you to
understand the effect of reversing all
arcs on the SCC's and in particular there
is no effect. So the SCC's in the reverse
graph are exactly the same as in the
forward graph. So now we're going to have
two cases in this proof and the cases
correspond to where we first encounter a
node of C1 and union C2. Now remember,
when we do this DFS loop, this second
pass, because we have this outer four loop
that iterates over all of the nodes we're
guaranteed to explore every single node of
the graph at some point. So in particular
we're gonna have to explore at some point
every node in C1 and C2. What I want you
to do is pause the algorithm. When it
first, for the first time, explores some
node that's in either C1 or C2. There's
going to be two cases, of course, because
that node might be in C1, you might see
that first. Or it might be in C2, you
might see something from C2 first. So our
case one is going to be when the first
node that we see from either one happens
to lie in C1. And the second case is where
the first node V that we see happens to
lie in C2. So clearly exactly one of these
will occur. So let's think about case one.
When we see a node of C1 before we see
any nodes of C2. So in this case where we
encounter a node in C1 before we encounter
any node in C2, the claim is that we're
going to explore everything in C1 before
we ever see anything in C2. Why is that
true? The reason is there cannot be a path
that starts somewhere in C1, for example,
like the vertex V, and reaches C2. This is
where we are using the fact that the
meta-graph on SCC is a cyclic. Right C1 is strong
connected, C2 is strong connected, you can
get from C2 to C1 and, if you can also get
from C1 back to C2 this all collapses into
a single strongly connected component. But
that would be a contraction, we're
assuming C1 and C2 are distinct strongly
connected components, therefore you can't
have paths in both directions. We already
have a path from right to left, via JI, so
there's no path from left to right. That's
why if you originate a depth first search
from somewhere inside C1 like this vertex
V, you would finish exploring all of C1
before you ever are going to see C2,
you're. Only gonna see C2 at some later
point in the outer for loop. So, what's
the consequence that you completely finish
with C1 before you ever see C2? Well it
means every single finishing time in C1 is
going to be smaller than every single
finishing time in C2. So that's even
stronger that what we're claiming, we're
just claiming that the biggest thing in C2
is bigger than the biggest of C1. But
actually finishing times in C2 totally
dominate those in C1, because you finish
C1 before you ever see C2. So let's now
have a look at case one actually in
action. Let's return to the nine-node
graph, the one that we actually ran
Kosaraju's algorithm to completion. So if
we go back to this graph which has the
three connected components, then remember
that the bottom version is the forward
version, the top version is the reversed
version. So if, if you think about the
middle SCC as being c1, pulling the row of
c1 and the left most. Scc playing the role
of C2, then what we have exactly is case
one of the key lemma. So, which was
the first of these six vertices visited
during the DFS loop in the reversed graph?
Well that would just be the node with the
highest name, so the node nine. So this
was the first of these six vertices that
depth first search looked at in the first
pass, that lies in what we're calling C1.
And indeed everything in C1 was discovered
in that pass before anything in C2 and
that's why all of the finishing times in
C2, the 7,8,9 are bigger than
all of the finishing times in C1 -
the 1,5, and 6. So we're good to go in
case two. We've proven sorry, in case one,
we've proven the lemma. When it's the case
that, amongst the vertices in C1
and C2, depth first search in the first pass
sees something from C1 first. So
now, let's look at this other case, this
grey case, which could also happen,
totally possible. Well, the first thing we
see when depth first searching in the
first pass, is something from C2. And here
now is where we truly use the fact that
we're using a depth first search rather
than some other graph search algorithm
like breadth first search. There's a lot
of places in this algorithm you could swap
in breadth first search but in this case
two, you'll see why it's important we're
using depth first search to compute the
finishing times. And what's the key point?
The key point is that, when we invoke
depth first search beginning from this
node V, which is now assuming the line C2.
Remember depth first search will not
complete. We won't be done with V until
we've found everything there is to find
from it, right? So we recursively explore
all of the outgoing arcs. They recursively
explore all of the outgoing arcs, and so
on. It's only when all paths going out of
V have been totally explored and
exhausted, that we finally backtrack all
the way to V, and we consider ourselves
done with it. That is, depth first search.
In the reverse graph initiated at v. Won't
finish until everything findable has been
completely explored. Because there's an
arc from C2 to C1, obviously everything to
C2 is findable from V, that's strongly
connected. We can from C2 to C1 just using
this arc from J to I. C1 being strongly
connected we can then find all of that.
Maybe we can find other strongly connected
components, as well, but for sure
depth-first search starting from V will
find everything in C1 to C2, maybe some
other things. And we won't finish with V
until we finish with everything else,
that's the depth-first search property.
For that reason the finishing time of this
vertex V will be the largest of anything
reachable from it. So in particular it'll
be larger than everything in C two but
more to the point, it'll be larger than
everything in C1 which is what we are
trying to prove. Again let's just see
this quickly in action in the nine node
network on which we traced through
Kosaraju's algorithm. So to show the rule
that case two is playing in this concrete
example let's think of the right most
strongly connected component as being C1.
And let's think of the middle SCC as being C.2. Now
the last time. We called the middle one C1
and the leftmost one C2. Now we're calling
the rightmost one C1 and the middle one
C2. So again, we have to ask the question,
you know, of the six nodes in C1 and in
C2, what is the first one encountered in
the depth first search that we do in the
first pass. And then again, is the node
nine? The, the node which is originally
labeled not. So it's the same node that
was relevant in the previous case, but now
with this relabeling of the components,
nine appears in the strongly connected
component C-2, not in the one labeled C-1.
So that's the reason now we're in case
two, not in case one. And what you'll see
is, what is the finishing time that this
originally labeled nine node gets. It gets
the finishing time six. And you'll notice
six is bigger than any of the other
finishing times of any of the other nodes
in C1 or C2. All, the other five nodes
have the finishing times one through five.
And that's exactly because when we ran
depth first search in the first pass, and
we started it at the node originally
labeled nine, it discovered these other
five nodes and finished exploring them
first before finally back tracking all the
way back to nine, and deeming nine fully
explored. And it was only at that point
that nine got its finishing time after
everything reachable from it had gotten
there lower finishing times. So that
wraps it up. We had two cases depending on
whether in these two adjacent SCC's, the
first vertex encountered was in the C1, or
in C2. Either way it doesn't matter, the
largest finishing time has to be in C2.
Sometimes it's bigger than everything,
sometimes it's just bigger than the
biggest in C-1, but it's all the same to
us. And to re cap how the rest of the
proof goes, we have a corollary based on
this lemma, which is maximum finishing
time have to lie in sink SCC
And that's exactly
where we want our depth first search to
initiate. If you're initiated in a strong
component with no outgoing arcs, you do
DFS. The stuff you find is just the stuff
and that strongly connected component. You
do not have any avenues by which to
trespass on other strong components. So
you find exactly one SCC. In effect, you
can peel that off and recurse on the rest
of the graph. And our slick way of
implementing this recursion is to just do
the single second, DFS pass, where you
just treat the nodes in decreasing order
of finishing times, that in effect,
unveil, unveils all of the SCCs in
reverse topological ordering. So that's
it, Kosaraju's algorithm, and the complete
proof of correctness. A blazingly fast
graph primitive that, in any directed
graph, will tell you its strong
components.