1
00:00:00,000 --> 00:00:03,089
Having mastered computing the connected
components of undirected graph

2
00:00:03,089 --> 00:00:07,054
in linear time, let's now turn our attention to
directed graphs which also arise, in all

3
00:00:07,054 --> 00:00:11,044
kinds of different applications. Now the
good news is, is we'll be able to get just

4
00:00:11,044 --> 00:00:15,038
a blazingly fast primitive for computing
connectivity information for directed

5
00:00:15,038 --> 00:00:19,027
graphs. The bad news if you wanna call it
that, is we'll have to think a little bit

6
00:00:19,027 --> 00:00:23,026
harder. So it won't be as obvious how to
do it, but by the end of this lecture you

7
00:00:23,026 --> 00:00:27,010
will know a linear time algorithm with
very good constants, really just based on

8
00:00:27,010 --> 00:00:31,002
depth-first search for computing
all of the pieces of a directed graph. In

9
00:00:31,002 --> 00:00:35,005
fact it's not even so obvious how to
define pieces, how to define connected

10
00:00:35,005 --> 00:00:38,907
components in a directed graph, Certainly
not as obvious as it was with undirected

11
00:00:38,907 --> 00:00:43,059
graphs. So to see what I mean, let's
consider the following four node directed

12
00:00:43,059 --> 00:00:48,995
graph. So on the one hand, this graph is
in some sense in one piece. If this was an

13
00:00:48,995 --> 00:00:53,399
actual physical object made, say, of a
bunch of strings connected to each other

14
00:00:53,399 --> 00:00:56,884
and we'd picked it up with our hands, it
wouldn't fall apart into two pieces. It

15
00:00:56,884 --> 00:01:00,001
would hang together in one piece. On the
other hand, when you think about moving

16
00:01:00,001 --> 00:01:04,009
around this network, it's not connected in
the sense that we might think about. You

17
00:01:04,009 --> 00:01:07,389
cannot get from any one point A to any
other point B. For example, if you start

18
00:01:07,389 --> 00:01:11,080
at the right most node in this graph,
certainly there's no directed path that

19
00:01:11,080 --> 00:01:15,074
will get you to the left most node. So
what's typically studied and most useful

20
00:01:15,074 --> 00:01:19,073
for directed graphs, is what you call
strong connectivity. The graph is strongly

21
00:01:19,073 --> 00:01:23,087
connected if you can get from any one
point to any other point, And vice versa.

22
00:01:23,087 --> 00:01:28,031
And the strongly connected components
then informally, are the maximal portions

23
00:01:28,031 --> 00:01:32,049
of this graph, the maximal regions which
are internally, strongly connected. So,

24
00:01:32,049 --> 00:01:36,789
the maximal regions from within which you
can get from any one point A to any other

25
00:01:36,789 --> 00:01:41,045
point B along a directed graph. For the
record, let me go ahead and give you a, a

26
00:01:41,045 --> 00:01:46,011
formal definition although, you know, this
intuition is perfectly valid just regions

27
00:01:46,011 --> 00:01:51,028
where you can get from anywhere to
anywhere else. So we say that the strongly

28
00:01:51,028 --> 00:01:56,018
connected components of the directed graph
or the SCCs for short. And as in the

29
00:01:56,018 --> 00:01:59,795
undirected case, we're going to give a
somewhat slick definition, rather than

30
00:01:59,795 --> 00:02:03,218
talking about maximal regions satisfying
some property, we're going to talk about

31
00:02:03,218 --> 00:02:06,631
the equivalence classes of a particular
equivalence relation. But really it means

32
00:02:06,631 --> 00:02:10,557
the same thing. This is just sort of the
more mathematically mature way of writing

33
00:02:10,557 --> 00:02:14,386
it down. So the equivalence relation we're
going to use it on nodes of the graph and

34
00:02:14,386 --> 00:02:19,240
we'll say that U, a node, is related to a
node V if you can get from U to V via a

35
00:02:19,240 --> 00:02:25,509
directed path and also from V to U on some
other directed path. [sound] I'm not gonna

36
00:02:25,509 --> 00:02:29,699
bore you with the verification that this
is an equivalent relation. That's

37
00:02:29,699 --> 00:02:32,845
something that you can work out in the
privacy of your own home. Just remember

38
00:02:32,845 --> 00:02:36,006
what it means to be in an equivalence
relation. It's reflexive. That is,

39
00:02:36,006 --> 00:02:39,080
everybody's related to itself but, of
course, there is a path from every node to

40
00:02:39,080 --> 00:02:43,072
itself. It also means that it's symmetric.
So if U is related to V, then V is related

41
00:02:43,072 --> 00:02:47,026
to U. Well, again, by definition, we're
saying that the paths are mutually, the

42
00:02:47,026 --> 00:02:51,019
vertices are mutually reachable from each
other. So that's clearly symmetric by

43
00:02:51,019 --> 00:02:54,049
definition and then it has to be
transitive. And the way you prove it's

44
00:02:54,049 --> 00:02:57,794
transitive, is you just paste paths
together and it just works the way you'd

45
00:02:57,794 --> 00:03:01,049
expect it to. So let's illustrate this
concretely with a, a somewhat more

46
00:03:01,049 --> 00:03:06,032
complicated directed graph. So here's a
directed graph and I claim that it has

47
00:03:06,032 --> 00:03:11,028
exactly four strongly connected
components. There's a triangle on the

48
00:03:11,028 --> 00:03:16,059
left, a triangle on the right. Has this
single node on top and then it has this

49
00:03:16,059 --> 00:03:21,056
directed four-cycle with a diagonal at the
bottom. So what I hope is pretty clear is

50
00:03:21,056 --> 00:03:25,187
that each of these circled regions is
indeed strongly connected that is if you

51
00:03:25,187 --> 00:03:29,211
start from one node in one of these
circled regions you can have a directed

52
00:03:29,211 --> 00:03:33,089
path to any other node, so that's
certainly true cause on a directed cycle you can

53
00:03:33,089 --> 00:03:36,558
get from anyone starting point and the
other place and all of these have directed

54
00:03:36,558 --> 00:03:40,053
cycles and there's the one strong
component that just has one node which is

55
00:03:40,053 --> 00:03:44,020
obviously strongly connected. What I
also claim is true is that all of these

56
00:03:44,020 --> 00:03:47,077
regions are maximal subject to being
strongly connected. That's why they're the

57
00:03:47,077 --> 00:03:51,034
strongly connected components. That's why 
they're the equivalence classes of this equivalence

58
00:03:51,034 --> 00:03:54,091
relation we just defined. So, if you take
any two pairs of nodes which lie in two

59
00:03:54,091 --> 00:03:58,070
different circles you either won't have a
path from one... From the first one to the

60
00:03:58,070 --> 00:04:01,912
second one or you won't have a directed
path from the second one back to the first

61
00:04:01,912 --> 00:04:06,014
one. In fact, the structure of the strong
components in this black graph exactly

62
00:04:06,014 --> 00:04:10,054
mirrors, the directed acyclic graph that
we started with in red. So in the same

63
00:04:10,054 --> 00:04:14,355
way, in the red four node graph, you can't
move from right to left. Here in this

64
00:04:14,355 --> 00:04:19,019
bigger graph, you can't move from any of
the circled SCC's to the right to any of

65
00:04:19,019 --> 00:04:23,026
the circled SCCs to the left. So, for
example, from the rightmost nodes, there

66
00:04:23,026 --> 00:04:27,078
are no directed paths to the leftmost
nodes. So that's a recap of the definition

67
00:04:27,078 --> 00:04:31,089
of the strongly connected components, I've
motivated in a separate video some reasons

68
00:04:31,089 --> 00:04:35,006
why you might care about computing
strongly connected components. In

69
00:04:35,006 --> 00:04:38,061
particular, on extremely large graphs,
which motivates the need for blazingly

70
00:04:38,061 --> 00:04:42,834
fast subroutines so for-free primitives
that will let you compute connectivity

71
00:04:42,834 --> 00:04:46,540
information. So you'd love to be able to
know the pieces of a directed graph, maybe

72
00:04:46,540 --> 00:04:49,507
you don't even know why they're going to
be useful, but you just compute them

73
00:04:49,507 --> 00:04:53,084
because why not? It's a for-free primitive.
So that's what I'm gonna give you in this

74
00:04:53,084 --> 00:04:58,009
lecture. So the algorithm we're gonna
present is based on depth-first search, and

75
00:04:58,009 --> 00:05:01,816
that's gonna be the main reason why it's
going to be so blazingly fast because

76
00:05:01,816 --> 00:05:07,003
depth-first search is blazingly fast. Now you
might be wondering what on earth does

77
00:05:07,003 --> 00:05:11,030
graph search have to do with computing
components? They don't seem obviously

78
00:05:11,030 --> 00:05:15,040
related, so let's return to the same
directed graph that I showed you on

79
00:05:15,040 --> 00:05:20,093
the previous slide. And to see why something
like depth-first search might

80
00:05:20,093 --> 00:05:25,097
conceivably have some use for computing
strong components. Suppose we call

81
00:05:25,097 --> 00:05:30,064
depth-first search, starting from this red node
as a starting point. What would it

82
00:05:30,064 --> 00:05:34,072
explore? So remember what the guarantee of
something like depth-first search or

83
00:05:34,072 --> 00:05:38,081
breadth-first search, for that matter, is.
You find everything that's findable, but

84
00:05:38,081 --> 00:05:42,048
naturally, nothing else. So what is
findable from this red node? Where, by

85
00:05:42,048 --> 00:05:46,067
findable, I mean is you can reach it from
a directed path emanating from this red

86
00:05:46,067 --> 00:05:50,060
node. Well, there's not much you can do,
so, from here, you can explore this arc,

87
00:05:50,060 --> 00:05:55,005
and you can explore this arc, and then you
can go backward. And so if you do DFS

88
00:05:55,005 --> 00:05:59,090
or BFS from this node you're going to find
precisely the nodes in this triangle.

89
00:05:59,090 --> 00:06:03,547
All of the other arcs involved go the
wrong direction and they won't be

90
00:06:03,547 --> 00:06:08,147
traversed by say a depth-first search
call. So, why is that interesting? What's

91
00:06:08,147 --> 00:06:12,268
interesting is that if we invoke DFS from
this red node or any of the three nodes

92
00:06:12,268 --> 00:06:17,147
from this triangle then it's going to
discover precisely this strongly connected

93
00:06:17,147 --> 00:06:23,034
component. Precisely the three nodes in
this circled SCC. So that seems

94
00:06:23,034 --> 00:06:27,071
really cool. It seems like maybe we just
do a DFS, and boom, we get an SCC. So

95
00:06:27,071 --> 00:06:31,723
maybe if we can do that over and over
again, we'll get all the SCCs. So that's a

96
00:06:31,723 --> 00:06:36,079
good initial intuition. But something can
go wrong. Suppose that instead of, an,

97
00:06:36,096 --> 00:06:40,952
initiating DFS from one of these three
nodes on the triangle, we, say, initiate

98
00:06:40,952 --> 00:06:46,075
it from this bottom-most node in green. So
remember, what is the guarantee of a graph

99
00:06:46,075 --> 00:06:50,814
search subroutine like DFS? It will find
everything findable but of course nothing

100
00:06:50,814 --> 00:06:57,055
more. So what's findable from this green
node? Well naturally, everything in its

101
00:06:57,055 --> 00:07:02,097
own SCC. Right, so the four nodes here, it
will certainly find those four nodes. On

102
00:07:02,097 --> 00:07:07,081
the other hand, if we start from this
green node, since there are arcs that go

103
00:07:07,081 --> 00:07:13,068
from this bottom most SCC to the right
most SCC, not only will this DFS call find

104
00:07:13,068 --> 00:07:18,536
the four nodes in the green node's strong
component, but it will also traverse these

105
00:07:18,536 --> 00:07:24,099
blue arcs and discover the three nodes in
the red triangle. So, if we call DFS from

106
00:07:24,099 --> 00:07:30,036
this green node, we'll capture all seven
of these. So the point is if we call DFS,

107
00:07:30,036 --> 00:07:35,029
it looks like we're going to get a
union of possibly multiple SCCs. In fact,

108
00:07:35,029 --> 00:07:40,047
in the worse case, if we invoke DFS from
the left-most node, what's it going to

109
00:07:40,047 --> 00:07:45,039
discover? It's going to discover the
entire graph, and that didn't give us any

110
00:07:45,039 --> 00:07:50,013
insight into the strong component
structure at all. So, what's the take away

111
00:07:50,013 --> 00:07:55,031
point is, the take away point is if you
call DFS from just the right place, you'll

112
00:07:55,031 --> 00:08:00,029
actually uncover an SCC. If you call it
from the wrong place, it will give you no

113
00:08:00,029 --> 00:08:05,052
information at all, so the magic of the
algorithm that we're gonna discuss next is

114
00:08:05,052 --> 00:08:10,005
we'll show how in a super slick
pre-processing step which ironically is

115
00:08:10,005 --> 00:08:15,035
itself a call to depth-first search, we can in
linear time compute exactly where we want

116
00:08:15,035 --> 00:08:20,046
to start the subsequent depth-first searches
from, so that each invocation gets us

117
00:08:20,046 --> 00:08:27,031
exactly one strongly connected component, and nothing more. So

118
00:08:27,031 --> 00:08:32,687
the algorithm that, that I'm going to show
you is due to Kosaraju. And it will show

119
00:08:32,687 --> 00:08:37,100
the following theorem. That the strongly
connected components of a directed graph

120
00:08:37,100 --> 00:08:41,009
can be computed in linear
time. And as we'll see, the constants are

121
00:08:41,009 --> 00:08:45,045
also very small. It's really just gonna be
two passes of depth-first search. And again,

122
00:08:45,045 --> 00:08:48,096
let me remind you that for many
problems it's natural to assume that the

123
00:08:48,096 --> 00:08:52,060
number of edges is at least the number of
nodes because, if you're only interested

124
00:08:52,060 --> 00:08:55,830
in cases where the graph is connected. Of
course, when you're computing connected

125
00:08:55,830 --> 00:08:58,903
components, that's one of the most natural
cases where you might have a super sparse,

126
00:08:58,903 --> 00:09:02,759
broken up graph. So we're not going to
assume m is at least n. So that's why

127
00:09:02,759 --> 00:09:06,093
linear time is going to be m + n, because
that's the size of the input. And we don't

128
00:09:06,093 --> 00:09:11,003
know, either M could be bigger than N, or
N could be bigger than M. We have no idea.

129
00:09:11,041 --> 00:09:17,027
Kosaraju's algorithm is shocking in its
simplicity. It has three steps, let me

130
00:09:17,027 --> 00:09:23,007
tell you what they are. First, very
mysteriously, we're going to reverse all

131
00:09:23,007 --> 00:09:28,035
of the arcs of the given graph. Totally not
clear why that would be an interesting

132
00:09:28,035 --> 00:09:33,064
thing to do yet. Then, were going to do
our first pass, our first depth-first search,

133
00:09:33,064 --> 00:09:38,062
and we're going to do it on the reverse
graph. Now, the naive way to implement

134
00:09:38,062 --> 00:09:42,042
this would be you literally construct a new
copy of the input graph, with all of the

135
00:09:42,042 --> 00:09:46,002
arcs in the reverse direction. And then
just run depth-first search on it. Of

136
00:09:46,002 --> 00:09:49,067
course, the sophisti--, the sort of
obvious optimization would be to just run

137
00:09:49,067 --> 00:09:53,013
DFS on the original graph, but going
across arcs backward. So I'll let you

138
00:09:53,013 --> 00:09:56,088
think through the details of how you'd do
that, but that just works. You run DFS,

139
00:09:56,088 --> 00:10:00,039
and instead of going forward along edges,
you go backward along edges. That

140
00:10:00,039 --> 00:10:03,099
simulates depth-first search on the
reverse graph. Now, I've written here, DFS

141
00:10:03,099 --> 00:10:07,079
loop. And that just means the usual trick
where, to make sure that you see all of

142
00:10:07,079 --> 00:10:11,062
the nodes of, of the graph, even if it's
disconnected, you have an outer loop,

143
00:10:11,062 --> 00:10:15,069
where you just try each starting point
separately. If you haven't already seen

144
00:10:15,069 --> 00:10:19,091
it, then you run DFS from that given node.
I'll be more detailed on the next slide.

145
00:10:20,037 --> 00:10:25,024
And the third step, it's just you run
depth-first search again, but this time on

146
00:10:25,024 --> 00:10:29,659
the original graph. Now at this point, you
should be thinking that I'm totally crazy,

147
00:10:29,659 --> 00:10:33,584
right? So what are we trying to do? We're
trying to compute these strongly connected

148
00:10:33,584 --> 00:10:36,493
components. We're trying to actually
compute real objects, these maximal

149
00:10:36,493 --> 00:10:40,027
strongly connected regions. And all I'm doing is
searching the graph. And I do it once

150
00:10:40,027 --> 00:10:43,071
forward, I do it once backward. I mean,
I'm not, it doesn't really seem like I'm

151
00:10:43,071 --> 00:10:47,194
computing anything. So, here's the catch,
and it's a very minor catch. So we're

152
00:10:47,194 --> 00:10:50,224
gonna need, to do a little bit of bookkeeping,
it's gonna be very little overhead, so we'll still

153
00:10:50,224 --> 00:10:53,712
have a blazingly fast algorithm. So, but with a little bit of bookkeeping,

154
00:10:53,712 --> 00:10:59,450
here's what's gonna happen. The second depth-first search, which searches the graph, will,

155
00:10:59,450 --> 00:11:04,017
in its search process, discover the strongly connected components one at a time

156
00:11:04,017 --> 00:11:08,247
in a very natural way, and that'll be really obvious when we do an example, which we'll do in just a second.

157
00:11:08,247 --> 00:11:14,594
Now, for the second depth-first search to work in this magical way, where it just discovers the strongly connected

158
00:11:14,594 --> 00:11:20,097
components one at a time, it's really
important that it executes the depth first

159
00:11:20,097 --> 00:11:23,309
searches in a particular order, that it
goes through the nodes of the graph in a

160
00:11:23,309 --> 00:11:28,633
particular order. And that is exactly the
job of the first pass. The depth-first

161
00:11:28,633 --> 00:11:33,243
search on the reverse graph is going to
compute an ordering of the nodes, which,

162
00:11:33,243 --> 00:11:37,174
when the second depth-first search goes
through them in that order, it will just

163
00:11:37,174 --> 00:11:43,049
discover the SCCs one at a time in linear
time. So let me say a little bit more

164
00:11:43,049 --> 00:11:47,069
about the form of the bookkeeping and then
I'll show you how that bookkeeping is

165
00:11:47,069 --> 00:11:51,515
kept in as we do depth-first
search. So we're gonna have a notion of a

166
00:11:51,515 --> 00:11:55,093
finishing time of a vertex and that's
gonna be computed in the first pass when

167
00:11:55,093 --> 00:12:00,026
we do depth-first search in the reversed
graph. And we're going to make use of this

168
00:12:00,026 --> 00:12:04,052
data in the second pass. So, rather than
just going through the nodes of the graph

169
00:12:04,052 --> 00:12:08,063
in an arbitrary order, like we usually do
when we sort of have a looped

170
00:12:08,063 --> 00:12:12,079
depth-first or breadth-first search, we're
going to make sure that we go through the

171
00:12:12,079 --> 00:12:16,078
vertices in decreasing order of these
finishing times. Now there's still the

172
00:12:16,078 --> 00:12:21,011
question in what sense does this second
depth-first search discover and report the

173
00:12:21,011 --> 00:12:25,039
strongly connected components that it finds?
So we're going to label each node in this

174
00:12:25,039 --> 00:12:29,959
second pass with what we call a leader and
the idea is that the nodes in the same

175
00:12:29,959 --> 00:12:33,479
strongly connected components will be labeled
with exactly the same leader node. And

176
00:12:33,479 --> 00:12:38,012
again all of this will be much more clear
once we do a concrete example. But I want

177
00:12:38,012 --> 00:12:43,062
to have it down for the record right now.
So that's the algorithm at a high level

178
00:12:43,062 --> 00:12:47,019
it's really just two passes of DFS with
some bookkeeping. But this is under

179
00:12:47,019 --> 00:12:50,349
specified. You really, shouldn't,
understand how to implement the

180
00:12:50,349 --> 00:12:54,417
algorithm just yet. So what do I owe you?
I owe you exactly what I mean by DFS loop

181
00:12:54,417 --> 00:12:58,334
although this you've seen more or less in
the past. It's just a loop over all the

182
00:12:58,334 --> 00:13:02,052
vertices the graph and then when if you
haven't seen something yet you DFS from

183
00:13:02,052 --> 00:13:06,038
that starting point. I need to tell you
what finishing times are and how they get

184
00:13:06,038 --> 00:13:09,085
computed. They're just gonna be integers
one to N. Which is basically when

185
00:13:09,085 --> 00:13:13,095
depth-first search gets finished with one
of the nodes. And then I need to tell you

186
00:13:13,095 --> 00:13:18,002
how to compute these leaders. So let me
show you all three of those things on

187
00:13:18,002 --> 00:13:22,366
the next slide. So the workhorse for
Kosaraju's strongly connected components

188
00:13:22,366 --> 00:13:27,416
algorithm is this DFS loop subroutine and
it takes as input in graph. Okay, so it

189
00:13:27,416 --> 00:13:31,294
does not take as input a starting node,
it's gonna loop over possible starting

190
00:13:31,294 --> 00:13:35,523
nodes. Now for the bookkeeping, to
compute finishing nodes, we're going to

191
00:13:35,523 --> 00:13:42,051
keep track of a global variable that I'll
call T, which we initialize to zero. The

192
00:13:42,051 --> 00:13:48,017
point of T is to count how many nodes we
have totally finished exploring at this

193
00:13:48,017 --> 00:13:52,037
point. So this is the variable we use to
compute those finishing times in the first

194
00:13:52,037 --> 00:13:56,028
pass. That magical ordering that I was
talking about. Now we're also gonna have a

195
00:13:56,028 --> 00:14:00,015
second global variable to compute these
things I was calling leaders, and these

196
00:14:00,015 --> 00:14:04,054
are only relevant for the second pass. So
what S is going to keep track of is the

197
00:14:04,054 --> 00:14:10,768
most recent vertex from which a DFS was
initiated. So to keep the code simple I'm

198
00:14:10,768 --> 00:14:15,308
just going to do all of the bookkeeping in
DFS loop, but really DFS loop gets

199
00:14:15,308 --> 00:14:19,113
called twice, once in the reverse graph,
once in the forward graph and we only need

200
00:14:19,113 --> 00:14:23,861
to compute these finishing times in the
first pass, on the reverse graph. And we

201
00:14:23,861 --> 00:14:28,016
only need to compute these leaders of the
second pass for the forward graph. But

202
00:14:28,016 --> 00:14:32,035
let's just keep them both in there, just
for kicks. Now, we're going to need to

203
00:14:32,035 --> 00:14:36,002
loop through the vertices. And so the
question is, in what order are we going to

204
00:14:36,002 --> 00:14:39,042
loop through the vertices? And that's
gonna happen differently in the two

205
00:14:39,042 --> 00:14:42,535
different passes. But let me just use some
common notation. Let's just assume in this

206
00:14:42,535 --> 00:14:47,007
subroutine, that the nodes are somehow
labeled from one to N. In our first depth-first

207
00:14:47,007 --> 00:14:51,045
search it's gonna be labeled totally
arbitrary. So these are basically just the

208
00:14:51,045 --> 00:14:55,056
names of the node, or their position in
the node array, whatever, you just do it

209
00:14:55,056 --> 00:14:59,078
in some arbitrary order. Now the second
time we run DFS loop, as indicated on the

210
00:14:59,078 --> 00:15:03,069
previous slide, we're gonna use the
finishing times as the labeling. And as

211
00:15:03,069 --> 00:15:08,011
we'll see the finishing times are indeed
numbers between one and N. So now what we

212
00:15:08,011 --> 00:15:12,349
do is we just iterate through the nodes in
decreasing order. And if we haven't

213
00:15:12,349 --> 00:15:17,292
already seen node I, then we initiate a
DFS from it. So, as usual, we're gonna be

214
00:15:17,292 --> 00:15:22,018
maintaining a local boolean to keep track
of whether we've already seen a node yet

215
00:15:22,018 --> 00:15:26,082
in one of the DFS passes. Now remember,
the global variable S is responsible for

216
00:15:26,082 --> 00:15:31,034
keeping track of the most recent node from
which depth-first search had been

217
00:15:31,034 --> 00:15:36,021
initiated. So if I is not explored and we
initiate a depth-first search from it, we'd

218
00:15:36,021 --> 00:15:42,894
better reset S. And then we do the usual
DFS in G, starting from the source node I.

219
00:15:42,894 --> 00:15:47,546
So for completeness let me just remind you
what the depth-first search subroutine

220
00:15:47,546 --> 00:15:53,075
looks like. So now we're given a graph and
a starting node. So the first time we see

221
00:15:53,075 --> 00:15:59,024
a node, we mark it as explored. And just a
side note that once a node is marked

222
00:15:59,024 --> 00:16:03,082
explored it's explored for this entire
invocation of DFS loop. Okay, so even if

223
00:16:03,082 --> 00:16:08,065
this DFS from a given node I finishes and
then the outer for loop marches on, and

224
00:16:08,065 --> 00:16:12,062
encounters I again it's still gonna be
marked as explored. Now

225
00:16:12,062 --> 00:16:18,677
one of our bookkeeping jobs is to keep
track of, from which vertex did the DFS

226
00:16:18,677 --> 00:16:23,515
that discovered I get called, so
when I is first encountered, we remember

227
00:16:23,515 --> 00:16:29,263
that S was the node from which this DFS
originated, and that by definition is the

228
00:16:29,263 --> 00:16:34,491
leader of I. And then we do what we always
do with depth-first search. We immediately

229
00:16:34,491 --> 00:16:38,946
look at the arcs going out of I, and we
try to recursively DFS on any of those.

230
00:16:38,946 --> 00:16:42,906
Although, of course, we don't bother to do
it if we've already seen those nodes. Now

231
00:16:42,906 --> 00:16:46,990
once this for loop has completed, once we've
examined every outgoing arc from I and

232
00:16:46,990 --> 00:16:51,174
for each node J either we already saw in
the past or we've recursively explored

233
00:16:51,174 --> 00:16:56,990
from J and have returned, at that point we
call ourselves done with node I. There's

234
00:16:56,990 --> 00:17:00,905
no more outgoing arcs to explore, we think
of it being finished. Remember T is the

235
00:17:00,905 --> 00:17:04,989
global variable that's keeping track of how
many nodes we're done with. So we

236
00:17:04,989 --> 00:17:10,715
increment T because now we're done with I and
we also remember that I was the Tth

237
00:17:10,715 --> 00:17:16,841
vertex with which we finished. And that
is, we set I's finishing time to be T.

238
00:17:16,841 --> 00:17:21,060
Because depth-first search first is
guaranteed to visit every node exactly

239
00:17:21,060 --> 00:17:25,774
once, and therefore finish with every node
exactly once. This global counter T, well,

240
00:17:25,774 --> 00:17:29,689
when the first node is finished, it'll be
value one. Then when the next node gets

241
00:17:29,689 --> 00:17:32,726
finished, it'll have value two, then it'll
have value three and so on. When the final

242
00:17:32,726 --> 00:17:37,103
node gets finished with, it'll have value
N. So the finishing times of the nodes are

243
00:17:37,103 --> 00:17:41,906
gonna be exactly the integers from one to
N. Let's make this much more concrete by

244
00:17:41,906 --> 00:17:47,642
going through a careful example. In fact,
I think it will be better for everybody if

245
00:17:47,642 --> 00:17:52,413
you yourself trace through part of this
algorithm on a concrete example. So let me

246
00:17:52,413 --> 00:17:59,420
draw a nine-node graph for you. So to
be clear let's assume that we've already

247
00:17:59,420 --> 00:18:02,134
executed step one of the algorithm.
That is, we've already reversed the

248
00:18:02,134 --> 00:18:05,687
graph. So that is. This blue graph that
I've drawn on the slide. This is the

249
00:18:05,687 --> 00:18:09,636
reversal. We've already reversed the arcs.
Moreover the nodes are labeled in some

250
00:18:09,636 --> 00:18:13,517
arbitrary way from one to nine. Just
assume these are how they show up in the

251
00:18:13,517 --> 00:18:17,815
node array for example. And remember in
the DFS loop routine we're supposed to

252
00:18:17,815 --> 00:18:23,923
process the nodes, from the, from top to
bottom. From N down to 1. So my

253
00:18:23,923 --> 00:18:27,541
question for you then is, in the second
step of the algorithm when we run DFS loop

254
00:18:27,541 --> 00:18:31,839
and we process the nodes from the highest
name 9 in order down to the lowest name

255
00:18:31,839 --> 00:18:35,675
1. What are the finishing times that
we're going to compute as we run DFS loop?

256
00:18:35,675 --> 00:18:40,903
Now, it is true that you can get different
finishing times depending on the different

257
00:18:40,903 --> 00:18:45,578
choices that the DFS loop has to make about
which outgoing arc to explore next. But

258
00:18:45,578 --> 00:18:49,696
I've given you four options for what the
finishing times of the nodes one, two,

259
00:18:49,696 --> 00:18:53,419
three, all the way up to nine respectively
might be, and only one of these four could

260
00:18:53,419 --> 00:18:59,949
conceivably be an output of the finishing
times of DFS loop on this graph. So, which one is it?

261
00:19:02,318 --> 00:19:07,080
Alright. So the answer is the
fourth option. That is the only one of

262
00:19:07,080 --> 00:19:12,525
these four sets of finishing times that
you might see being computed by DFS loop

263
00:19:12,525 --> 00:19:16,902
on this blue graph. So let's go ahead and
trace through DFS loop and see how we

264
00:19:16,902 --> 00:19:21,143
might get this set of finishing times. So
remember in the main loop we start from

265
00:19:21,143 --> 00:19:24,991
the highest node, nine, and then we
descend down to the, to the lowest node,

266
00:19:24,991 --> 00:19:31,301
one. So we start by invoking DFS from the
node nine. So now, from here there's only

267
00:19:31,301 --> 00:19:35,172
one outgoing arc we have to go to, so we
mark nine as explored. And then there's

268
00:19:35,172 --> 00:19:39,750
only one place we can go. We can go to six
so we mark six as explored. Now there's

269
00:19:39,750 --> 00:19:43,515
two places we can go next. We can either
go to three or we can go to eight. And you

270
00:19:43,515 --> 00:19:48,071
know, in general DFS could do either
one. Now to generate this fourth set of

271
00:19:48,071 --> 00:19:52,650
finishing times, I'm going to need to
assume that I go to three first. Okay, so

272
00:19:52,650 --> 00:19:56,328
again, what DFS does, what we're
assuming it does is it starts at nine,

273
00:19:56,328 --> 00:20:00,221
then it has to go to six. It marks those
as explored. Then it goes to three. It

274
00:20:00,221 --> 00:20:04,777
does not go to eight first. It goes to
three first. Now from three there's only

275
00:20:04,777 --> 00:20:08,265
one outgoing arc which goes to nine but
nine you've already marked as explored. So

276
00:20:08,265 --> 00:20:11,786
it's not going to re-explore nine, it's
going to skip that arc. Since that, that's

277
00:20:11,786 --> 00:20:15,259
3's only outgoing arc, then that for loop completes.

278
00:20:15,259 --> 00:20:19,909
And so three is the first node to finish.
So when we finish with three we increment

279
00:20:19,909 --> 00:20:24,603
T. It started at zero, now it's one. And
we set the finishing time of three to be

280
00:20:24,603 --> 00:20:32,115
one. Just like we said it was in the
example. So now we go back. Now we

281
00:20:32,115 --> 00:20:35,520
backtrack to six. Now we have another
outgoing arc from six to explore, so now

282
00:20:35,520 --> 00:20:39,413
we go to eight. From eight, we have to go
to two. From two, we have to go to five.

283
00:20:39,413 --> 00:20:42,507
From five, we have to go to eight. Eight,
we've already seen, so then we're gonna be

284
00:20:42,507 --> 00:20:46,253
done with five, because that was its only
outgoing arc. So then we increment T. Now

285
00:20:46,253 --> 00:20:53,791
it's two, and the finishing time of five
is going to be two. As promised. So now we

286
00:20:53,791 --> 00:21:00,211
backtrack to two. There's no more outgoing
marks from two. So 2's gonna be third one

287
00:21:00,211 --> 00:21:04,759
that we finish, as promised. Then we
finish with eight. So the finishing time

288
00:21:04,759 --> 00:21:10,631
for eight is gonna be the fourth node to
be done, as promised. Now we backtrack to

289
00:21:10,631 --> 00:21:17,551
six, we're done with six. That's the fifth
node to be completed, as promised. And

290
00:21:17,551 --> 00:21:22,905
finally we got all the way back to where
we started at nine and nine is the sixth

291
00:21:22,905 --> 00:21:30,437
node to be completed, as promised. Now if
we were computing those leaders, all of

292
00:21:30,437 --> 00:21:33,428
these nodes would get the leader nine. But
again the leaders are only relevant for

293
00:21:33,428 --> 00:21:36,646
the second pass, so we're just going to
ignore the leaders as we're doing this

294
00:21:36,646 --> 00:21:39,706
trace. We're just going to keep track of
the finishing times. So now we're not

295
00:21:39,706 --> 00:21:44,532
done, so all we did is we finished with
the DFS that is invoked from the node nine

296
00:21:44,532 --> 00:21:50,596
and we found six of the nodes total in
that depth-first search. So now we return

297
00:21:50,596 --> 00:21:55,132
to the outer for loop and we decrement I,
so it started at nine, we're done with

298
00:21:55,132 --> 00:21:57,833
that, now we go down to eight, we say that
we've already seen eight. Yes, eight's

299
00:21:57,833 --> 00:22:01,793
already explored. So we skip it. We go, we
decrement I down to seven, we say have we

300
00:22:01,793 --> 00:22:06,630
already seen node seven. No we have not.
Okay, seven is not yet explored, so we

301
00:22:06,630 --> 00:22:12,199
invoke DFS now from node seven. Seven has
two outgoing arcs. It can either go to

302
00:22:12,199 --> 00:22:16,474
four or it can go to nine. Let's say it
checks the outgoing arc to nine first. Now

303
00:22:16,474 --> 00:22:20,163
nine we already explored. Granted that was
in an earlier part of the for loop, but

304
00:22:20,163 --> 00:22:23,905
we remember that. We, we're going to
keep track of who got explored on previous

305
00:22:23,905 --> 00:22:26,799
iterations of the for loop. So we don't
bother to re-explore nine. So we skip

306
00:22:26,799 --> 00:22:30,511
that. So now from seven we have to go
to four. From four we have to go to one.

307
00:22:30,511 --> 00:22:34,078
From one we have to go back to seven. 7's
already been explored, so we backtrack,

308
00:22:34,078 --> 00:22:38,172
now we're down to, we're done with one. So
one is the next one we're completed with.

309
00:22:38,172 --> 00:22:43,595
And the finishing time of one is going to
be seven, as promised. We backtrack to

310
00:22:43,595 --> 00:22:47,982
four. There's no more outgoing arcs from
four to explore, so that's going to be the

311
00:22:47,982 --> 00:22:54,439
eighth one to finish as promised. And then
the last one to finish is poor node seven.

312
00:22:54,439 --> 00:23:01,716
It is last. So that would be an example of
how the DFS loop subroutine computes

313
00:23:01,716 --> 00:23:07,048
finishing times on a reversed graph. So
now let's work through the second pass on

314
00:23:07,048 --> 00:23:10,794
the forward version of the graph using the
same example. Now remember the point of

315
00:23:10,794 --> 00:23:14,923
the first pass is to compute a magical
ordering. And the magical ordering is

316
00:23:14,923 --> 00:23:19,833
these finishing times. So now we're going to
throw out the original node name and we're

317
00:23:19,833 --> 00:23:23,804
going to replace the node names in blue by
the finishing times in red. We're also

318
00:23:23,804 --> 00:23:26,672
going to work with the original graph
which means we have to reverse the arcs

319
00:23:26,672 --> 00:23:29,969
back where they were originally. So, those
are the two changes that you're going to

320
00:23:29,969 --> 00:23:33,040
see when I redraw this graph. First of
all, all the arcs will reverse

321
00:23:33,040 --> 00:23:37,394
orientation. Second of all, all the nodes
will change names from their original

322
00:23:37,394 --> 00:23:44,942
ones to the finishing times that we just
computed. So here's our new graph with the

323
00:23:44,942 --> 00:23:48,824
new node names and all the arcs with their
orientation reversed. And now we run DFS

324
00:23:48,824 --> 00:23:52,581
again on this graph. And again we're going
to process the nodes in order from the

325
00:23:52,581 --> 00:23:56,845
highest label nine down to the lowest
label one. Moreover we don't need to

326
00:23:56,845 --> 00:24:00,681
compute finishing times in the second
pass. We only need to do that in the first

327
00:24:00,681 --> 00:24:03,550
pass. In the second pass we have to keep
track of the leaders. And remember the

328
00:24:03,550 --> 00:24:09,710
leader of a vertex is the vertex from
which DFS was called. That first, that

329
00:24:09,710 --> 00:24:12,466
first discovered that node. Alright, so
what's going to happen. Well, in the outer

330
00:24:12,466 --> 00:24:18,009
for loop, we're going to start with I
equal to nine. And we invoke DFS from the

331
00:24:18,009 --> 00:24:25,469
node nine. So that's gonna be the current
leader. Cause that's, where the current

332
00:24:25,469 --> 00:24:29,147
DFS got initiated. Now from nine
there's only one choice. We have to go to

333
00:24:29,147 --> 00:24:32,916
seven. From seven there's only one choice.
We have to go to eight. From eight there's

334
00:24:32,916 --> 00:24:36,043
only one choice. We have to go back to
nine. And then nine's already been seen

335
00:24:36,043 --> 00:24:40,138
so we backtrack. We go back to eight. We
go back to seven. We go back to nine and

336
00:24:40,138 --> 00:24:45,273
that's it. So when we invoke DFS
from node nine the only things that we

337
00:24:45,273 --> 00:24:50,653
encounter are the nodes seven, eight and
nine. And these are all going to be given

338
00:24:50,653 --> 00:24:57,048
the leader vertex nine. You will notice
that this is indeed one of the strongly

339
00:24:57,048 --> 00:25:02,358
connected components of the graph. We just
sort of found it with this invocation of

340
00:25:02,358 --> 00:25:07,976
DFS from the node nine. So now we go back
to the outer for loop. And we say, okay,

341
00:25:07,976 --> 00:25:11,184
let's go to node eight. Have you already
seen eight? Yes. What about seven? Have

342
00:25:11,184 --> 00:25:14,806
you already seen seven? Yes. What about
six, have you already seen six? We have

343
00:25:14,806 --> 00:25:19,441
not, we have not yet discovered six. So we
invoke DFS from node six, we reset

344
00:25:19,441 --> 00:25:26,607
the global source vertex s to six. From
six we can go to nine or we can go to

345
00:25:26,607 --> 00:25:29,634
one. So let's say we explore nine first.
Well we already saw nine in the earlier

346
00:25:29,634 --> 00:25:33,819
iteration of the for loop so we don't
explore it again. So we don't discover

347
00:25:33,819 --> 00:25:38,555
nine now. So we backtrack to six. We go to
one. From one we have to go to five. From

348
00:25:38,555 --> 00:25:42,110
five we have to go to six and then we
start backtracking again. So the only new

349
00:25:42,110 --> 00:25:47,810
nodes that we encounter when we invoke DFS
from the node six are the vertices six,

350
00:25:47,810 --> 00:25:55,362
one, and five. And all of these will have
a leader vertex of six. Because that's

351
00:25:55,362 --> 00:25:58,974
where we called DFS from, when we first
discovered these three nodes. And you'll

352
00:25:58,974 --> 00:26:06,095
notice this is another SCC of this
directed graph. So we invoke DFS again

353
00:26:06,095 --> 00:26:09,444
now from a new node, the new node six, and
what it discovered, the new nodes

354
00:26:09,444 --> 00:26:14,247
discovered, is exactly an SCC of the
graph; nothing more, nothing less. So now

355
00:26:14,247 --> 00:26:18,212
we return to the outer for loop. We go to
net five. Have we already seen five? Yes.

356
00:26:18,212 --> 00:26:23,106
Have we already seen four? No. We haven't
seen four yet. So now we invoke DFS from

357
00:26:23,106 --> 00:26:27,041
four. Again, we could try to explore five,
but we've already seen that before, we're

358
00:26:27,041 --> 00:26:30,123
not going to explore it again. So from
four then, we have to go to two. From two,

359
00:26:30,123 --> 00:26:32,295
we have to go to three. From three, we
have to go back to four, and then after

360
00:26:32,295 --> 00:26:36,863
all the back tracking, we're done. So the
final call to DFS will be from the node

361
00:26:36,863 --> 00:26:44,220
four. And that DFS will discover
precisely, newly discover precisely the

362
00:26:44,220 --> 00:26:49,912
nodes 2,3 and 4. They will all have the
leader vertex four because that was where

363
00:26:49,912 --> 00:26:53,557
this DFS was called from. It's true we
will go back to the for loop and we will

364
00:26:53,557 --> 00:26:56,822
check, have we seen three yet, yes, have
we seen two yet, yes, have we seen one

365
00:26:56,822 --> 00:27:00,982
yet, yes, and then the whole thing
completes. And what we see is that using

366
00:27:00,982 --> 00:27:06,269
the finishing times computed from that
first depth-first search pass, somehow the

367
00:27:06,269 --> 00:27:10,883
strongly connected components of this
graph just showed up and presented

368
00:27:10,883 --> 00:27:15,720
themselves to us, one at a time, on a
silver platter. Every time we invoked DFS,

369
00:27:15,720 --> 00:27:21,297
the nodes we discovered newly were
precisely one of the SCCs. Nothing more,

370
00:27:21,297 --> 00:27:25,437
nothing less. And that's really what's
going on in this algorithm. It turns out

371
00:27:25,437 --> 00:27:29,296
this is true in general. The first pass,
DFS on the reverse graph, computes

372
00:27:29,296 --> 00:27:34,414
finishing times so that if you then
process nodes according to decrease ordering

373
00:27:34,414 --> 00:27:39,454
of finishing times in the second
pass, each invocation to DFS will discover

374
00:27:39,454 --> 00:27:44,381
one new SCC and exactly one SCC. So
they'll just present themselves to you one

375
00:27:44,381 --> 00:27:51,199
per DFS call in that second pass's for
loop. This is, of course, merely an

376
00:27:51,199 --> 00:27:55,530
example. You should not just take a single
example as proof that this algorithm

377
00:27:55,530 --> 00:27:59,445
always works. I will give you a general
argument in the next video. But hopefully

378
00:27:59,445 --> 00:28:03,399
there's at least a plausibility argument.
No longer does this three step algorithm

379
00:28:03,399 --> 00:28:08,850
seem totally insane, and maybe you could
imagine perhaps it works. At least there's

380
00:28:08,850 --> 00:28:12,822
some principles going on where you first
compute the right ordering to process

381
00:28:12,822 --> 00:28:18,313
the nodes, and, and then the second pass
peels off SCCs one at a time like layers

382
00:28:18,313 --> 00:28:22,622
from an onion. One thing that I hope is
pretty clear is that this algorithm

383
00:28:22,622 --> 00:28:30,214
correct or not, is blazingly fast. Pretty
much all you do is two depth-first searches.

384
00:28:30,214 --> 00:28:34,635
And since depth-first search, as we've seen
in the past runs in time linear in the

385
00:28:34,635 --> 00:28:39,773
size of the graph so does Kosaraju's two-pass
algorithm. There are a couple of

386
00:28:39,773 --> 00:28:42,440
subtleties and I encourage you to think
about this so you'll be forced to think

387
00:28:42,440 --> 00:28:47,085
about this in the programming project for
week four. So for example, in the second

388
00:28:47,085 --> 00:28:51,540
pass how do you process the nodes in
decreasing order of finishing time? You

389
00:28:51,540 --> 00:28:54,318
don't want to sort the notes by their
finishing time cause that would take

390
00:28:54,318 --> 00:28:58,705
n log n time, so you need to make sure that
you remember in the first pass that you

391
00:28:58,705 --> 00:29:01,901
sort of remember the nodes in a way that
you can just do a linear scan through them

392
00:29:01,901 --> 00:29:04,961
in the second pass. So there are some
details but if your intuition is that this

393
00:29:04,961 --> 00:29:08,437
is really just double DFS properly
implemented that's pretty much exactly

394
00:29:08,437 --> 00:29:12,454
right. So having spelled out the full
implementation, argued that it's

395
00:29:12,454 --> 00:29:15,806
definitely a linear time algorithm, and
given at least a plausibility argument via

396
00:29:15,806 --> 00:29:21,939
an example, that it might conceivably be
correct, let's now turn to the general argument.