1
00:00:00,707 --> 00:00:03,523
So the goal of this video is to prove the
correctness of Kasaraju two-pass,

2
00:00:03,523 --> 00:00:07,016
depth-first-search based, linear time
algorithm that computes the strongly

3
00:00:07,016 --> 00:00:10,904
connected components of a directed graph.
So I've given you the full specification

4
00:00:10,904 --> 00:00:14,614
of the algorithm. I've also given you a
plausibility argument of why it might

5
00:00:14,614 --> 00:00:18,630
work, in that at least it does something
sensible on an example. Namely, it first

6
00:00:18,630 --> 00:00:22,151
does a pass of depth first search on the
reverse graph. It computes this magical

7
00:00:22,151 --> 00:00:25,539
ordering. And what's so special about this
ordering is then when we do a depth first

8
00:00:25,539 --> 00:00:29,571
search using this ordering on the forward
graph; it seems to do exactly what we

9
00:00:29,571 --> 00:00:33,601
want. Every indication of depth first
search to some new node discovers exactly

10
00:00:33,601 --> 00:00:37,528
the nodes of the strong component and no
extra stuff. Remember that was our first

11
00:00:37,528 --> 00:00:41,514
observation, but that was unclear whether
depth for search would be useful or not

12
00:00:41,514 --> 00:00:45,411
for computing strong components. If you
call depth first search from just the

13
00:00:45,411 --> 00:00:49,148
right place, you're gonna get exactly the
nodes of an SCC and nothing more. If you

14
00:00:49,148 --> 00:00:52,803
call it from the wrong place, you might
get all of the nodes of the graph, and get

15
00:00:52,803 --> 00:00:56,948
no information at all about the structure of the strong components and at least in

16
00:00:56,948 --> 00:01:00,524
this example this first pass with the
finishing time seems to be accomplishing

17
00:01:00,524 --> 00:01:04,392
seems to be leading us to invoking
DFS from exactly the right

18
00:01:04,392 --> 00:01:08,604
places. So remember how this worked in the
example so in the top graph, I have

19
00:01:08,604 --> 00:01:11,774
shown you the graph with the arch
reversed. This is where we first invoked

20
00:01:11,774 --> 00:01:16,686
DFS loop with the loop over the nodes
going from the highest node name nine all

21
00:01:16,686 --> 00:01:20,642
the way down to the node name one. And
here we compute finishing time that's the

22
00:01:20,642 --> 00:01:23,788
bookkeeping that we do in the first pass
so we just keep the running count of how

23
00:01:23,788 --> 00:01:27,003
many nodes we've finished processing. That
is how many we've both explored that node

24
00:01:27,003 --> 00:01:30,900
as well as explore all of the outgoing
arches and so that gives us these numbers

25
00:01:30,900 --> 00:01:33,696
in the red, these finishing times between
one and nine for the various

26
00:01:33,696 --> 00:01:38,789
nodes. Those became the new node names in
the second graph and then we reverse the

27
00:01:38,789 --> 00:01:42,322
arches again and get the original graphs
back and then we saw that every time we

28
00:01:42,322 --> 00:01:47,229
invoked DFS in our second pass we
uncovered exactly the nodes of an SCC. So

29
00:01:47,229 --> 00:01:51,243
when we invoked it from the node 9 we
discovered that 9, 8 and 7 those

30
00:01:51,243 --> 00:01:55,741
have a leader vortex 9. Then when we
next invoked DFS from 6, we discovered

31
00:01:55,741 --> 00:01:59,942
6, 5, and 1, and nothing else. And
then finally we invoked it from 4, and

32
00:01:59,942 --> 00:02:02,582
we discovered 2, 3, and 4, and
nothing else. And those are exactly the

33
00:02:02,582 --> 00:02:08,599
three, SCCs of this graph. So let's now
understand why this works in any directed

34
00:02:08,599 --> 00:02:14,204
graph, not just this, in this one example.
So let's begin with a simple observation

35
00:02:14,204 --> 00:02:18,479
about directed graphs, which is actually
interesting in its own right. The claim is

36
00:02:18,479 --> 00:02:23,287
that every directed graph has two levels
of granularity. If you squint, if you sort

37
00:02:23,287 --> 00:02:27,116
of zoom out, then what you see is a
directed acyclic graph, of course

38
00:02:27,116 --> 00:02:30,156
comprising its strongly connective
components. And if you want, you can zoom

39
00:02:30,156 --> 00:02:34,736
in and focus on the fine grain structure
with one SCC. A little bit more precisely.

40
00:02:34,736 --> 00:02:43,872
The claim is of the strongly connected
components of a directed graph induce in

41
00:02:43,872 --> 00:02:51,398
a natural way an acyclic metagraph. So
what is this metagraph? What are the nodes

42
00:02:51,398 --> 00:02:57,345
and what are the arcs, what are the edges?
Well, the metanodes are just the SCCs, so

43
00:02:57,345 --> 00:02:59,957
we think of every strong connected
component as being a single node in this

44
00:02:59,957 --> 00:03:07,349
metagraph. So call them say 'C1' up to 'Ck'. So what are the arcs in this metagraph?

45
00:03:07,349 --> 00:03:11,124
Well, they're basically just the ones
corresponding to the arcs between SCCs in

46
00:03:11,124 --> 00:03:19,311
the original graph. That is, we include in
the meta graph an arc from the strong

47
00:03:19,311 --> 00:03:24,643
component 'C' to 'C-hat'
if and only if there's an arc from a node

48
00:03:24,643 --> 00:03:31,037
in 'C' to a node in 'C-hat' in the original
graph 'G'. So for example if this is your 'C'

49
00:03:31,037 --> 00:03:38,689
and so the triangle is your 'C-hat' and you
have one or maybe multiple edges going

50
00:03:38,689 --> 00:03:44,533
from 'C' to 'C-hat', then in the corresponding
metagraph your just gonna have a node for

51
00:03:44,533 --> 00:03:55,322
'C', a node for 'C-hat' and the directed
arch from 'C' to 'C-hat'. So if we go back to

52
00:03:55,322 --> 00:03:58,471
some of the directed graphs that we've
used as running examples so we go back to

53
00:03:58,471 --> 00:04:02,499
the beginning of the
previous video and it's look maybe something

54
00:04:02,499 --> 00:04:11,372
like this the corresponding directed acyclic
graph has four nodes and four

55
00:04:11,372 --> 00:04:17,586
arches. And for the running example we used
to illustrate Kosaraju's algorithm with

56
00:04:17,586 --> 00:04:22,628
the three triangles, the corresponding
metagraph would just be a path with three

57
00:04:22,628 --> 00:04:29,988
nodes. So why is this meta-graph
guaranteed to be acyclic? Well, remember

58
00:04:29,988 --> 00:04:33,024
metanodes correspond to strong
components, and in a strong component you

59
00:04:33,024 --> 00:04:37,396
can get from anywhere to anywhere else.
So, if you had a cycle that involved two

60
00:04:37,396 --> 00:04:40,582
different metanodes, that is two
different strong connected components,

61
00:04:40,582 --> 00:04:44,183
remember on a directed cycle you can also
get from anywhere to anywhere else. So if

62
00:04:44,183 --> 00:04:48,284
you had two supposedly distinct SCCs,
that you could get from the one to the

63
00:04:48,284 --> 00:04:52,031
other and vice versa, they would collapse
into a single SCC. You can get from

64
00:04:52,031 --> 00:04:54,155
anywhere to anywhere in one, anywhere from
anywhere in the other one, and you can

65
00:04:54,155 --> 00:04:57,422
also go between them at will, so you can get from

66
00:04:57,453 --> 00:05:02,811
anywhere in this union to anywhere in the
union. So not just in this context of

67
00:05:02,811 --> 00:05:05,742
competing strong components but also just
more generally, this is a useful fact to

68
00:05:05,742 --> 00:05:09,601
know about directed graphs. On the one
hand, they can have very complex structure

69
00:05:09,601 --> 00:05:13,226
within a strong components. You have paths
going from everywhere to everywhere else,

70
00:05:13,226 --> 00:05:16,807
and it may be sort of complicated looking.
But at a higher level, if you abstract out

71
00:05:16,807 --> 00:05:20,805
to the level of SCCs, you are guaranteed
to have this simple DAG, this simple

72
00:05:20,805 --> 00:05:25,799
directed acyclic graph structure. So, to
reinforce these concepts, and also segue

73
00:05:25,799 --> 00:05:30,828
into thinking about Kosaraju's algorithm
in particular, let me ask you a question

74
00:05:30,828 --> 00:05:37,067
about how reversing arcs affects the
strong components of a directed graph. So

75
00:05:37,067 --> 00:05:40,924
the correct answer to this quiz is the
fourth one. The strong components are

76
00:05:40,924 --> 00:05:45,084
exactly the same as they were before, in
fact the relation that we described is

77
00:05:45,084 --> 00:05:48,068
exactly the same as it was before so
therefore the equivalence classes of the

78
00:05:48,068 --> 00:05:51,930
strong components is exactly the same. So
if two nodes were related in

79
00:05:51,930 --> 00:05:55,839
the original graph, that is a path from U
to V and a path from V to U, that's still

80
00:05:55,839 --> 00:05:58,281
true after you reverse all the arcs, you
just use the reversal of the two paths

81
00:05:58,281 --> 00:06:02,676
that you had before. Similarly if the two
nodes weren't related before, for example

82
00:06:02,676 --> 00:06:06,134
because you could not get from U to V,
well that after you reverse everything,

83
00:06:06,134 --> 00:06:10,936
then you can't get from V to U, so again you don't have this relation holding, so

84
00:06:10,936 --> 00:06:14,922
the SCCs are exactly the same in the
forward or the backward graph. So in

85
00:06:14,922 --> 00:06:18,005
particular in Kazarogi's algorithm, the
strong component structure is exactly the

86
00:06:18,005 --> 00:06:23,648
same in the first pass of DFS and in the
second pass of DFS. So now that we understand

87
00:06:23,648 --> 00:06:27,279
how every directed graph has a meta graph with the nodes correspond to a strong

88
00:06:27,279 --> 00:06:31,029
connected components, and you have an arch
from one SCC to another if there's any

89
00:06:31,029 --> 00:06:35,482
arch from any node in that SCC to the
other SCC in the original graph, I'm in a

90
00:06:35,482 --> 00:06:39,952
position to state what's the key lemma.
That drives the correctness of Kosaraju's

91
00:06:39,952 --> 00:06:43,303
two pass algorithm for computing the
strong connected component of a directed graph.

92
00:06:43,303 --> 00:06:47,691
So here's the lemma statement. It
considers two strongly connecting

93
00:06:47,691 --> 00:06:50,724
components that are adjacent, in the sense
that there's an arc from one node in one

94
00:06:50,724 --> 00:06:57,361
of them to one node in the other one. So
let's say we have one SCC - 'C1', with a

95
00:06:57,361 --> 00:07:04,419
node I, and another, SCC 'C2' with a node
J, and that in G, in the graph, there's an

96
00:07:04,419 --> 00:07:10,100
arc directly from I to J. So in this sense
we say that these SCCs are adjacent, with

97
00:07:10,100 --> 00:07:16,235
the second one being in some sense after
the first one. Now let's suppose we've

98
00:07:16,235 --> 00:07:21,097
already run the first pass of the DFS loop
subroutine. And remember that works on the

99
00:07:21,097 --> 00:07:23,309
reverse graph. So we've invoked it on the
reverse graph. We've computed these

100
00:07:23,309 --> 00:07:28,521
finishing times. As usual we'll let f(v)
denote the finishing times computed in

101
00:07:28,521 --> 00:07:33,511
that depth first search subroutine on the
reverse graph. The lemma then asserts the

102
00:07:33,511 --> 00:07:37,713
following. It says first, amongst all the
nodes in 'C1' look at the one with the

103
00:07:37,713 --> 00:07:42,036
largest finishing time. Similarly amongst
all nodes in 'C2' look at the one with the

104
00:07:42,036 --> 00:07:45,810
biggest finishing time. Amongst all of
these the claim is that the biggest

105
00:07:45,810 --> 00:07:51,886
finishing time will be in 'C2' not in 'C1'. So
what I wanna do next is I wanna assume that

106
00:07:51,886 --> 00:07:56,516
this lemma is true temporarily. I wanna
explore the consequences of that

107
00:07:56,516 --> 00:08:01,210
assumption and in particular what I wanna
show you is that if this lemma holds, then

108
00:08:01,210 --> 00:08:07,373
we can complete the proof of correctness
of Kosaraju's two-pass SCC computation

109
00:08:07,373 --> 00:08:11,212
algorithm. Okay, so if the lemma is true
then after... I'll give you the argument

110
00:08:11,212 --> 00:08:14,283
about why we're done. About why we just
peel off the SCC one at

111
00:08:14,283 --> 00:08:18,765
a time with the second pass of depth first
search. Now of course a proof with a hole

112
00:08:18,765 --> 00:08:22,312
in it, isn't a proof. So at the end of the
lecture I'm gonna fill in the hole. That

113
00:08:22,312 --> 00:08:25,251
is, I'm gonna supply a proof of this key
lemma. But for now, as a working

114
00:08:25,251 --> 00:08:30,323
hypothesis, let's assume that it's true.
Let's begin with a corollary, that is a

115
00:08:30,323 --> 00:08:35,542
statement which follows essentially
immediately, from the statement of a lema.

116
00:08:35,542 --> 00:08:39,553
So for the corollary, let's forget about
just trying to find the maximum

117
00:08:39,553 --> 00:08:43,441
maximum finishing time in a single SCC.
Let's think about the maximum finishing

118
00:08:43,441 --> 00:08:47,739
time in the entire graph. Now, why do we
care about the maximum finishing time in

119
00:08:47,739 --> 00:08:51,660
the entire graph? Well, notice that's
exactly where the second pass of DFS

120
00:08:51,660 --> 00:08:55,361
is going to begin. Right, so
it processes nodes in order from largest

121
00:08:55,361 --> 00:08:59,526
finishing time to smallest finishing time.
So equivalently, let's think about the

122
00:08:59,526 --> 00:09:05,155
node at which the second pass of depth
first search is going to begin, i.e., the

123
00:09:05,155 --> 00:09:09,431
node with the maximum finishing time.
Where could it be? Well, the corollary is

124
00:09:09,431 --> 00:09:13,652
that it has to be in what I'm gonna call a
sink, a strongly connected component, that

125
00:09:13,652 --> 00:09:19,445
is a strongly connected component without
any outgoing arcs. So for example

126
00:09:19,445 --> 00:09:23,923
let's go back to the, meta graph of SCCs
for the very first directed graph we

127
00:09:23,923 --> 00:09:28,126
looked at. You recall in the very first
direc ted graph we looked at in when we

128
00:09:28,126 --> 00:09:35,535
started talking about this algorithm there
were four SCCs. So there was a 'C1', a 'C2',

129
00:09:35,535 --> 00:09:44,371
a 'C3', and a 'C4'. And of course within each
of these components, there could be

130
00:09:44,371 --> 00:09:48,439
multiple nodes but they are all strongly
connected to each other. Now, let's use

131
00:09:48,439 --> 00:09:55,986
F1, F2, F3 and F4 to denote the maximum
finishing time in each of these SCCs. So

132
00:09:55,986 --> 00:10:03,048
we have F1, F2, F3 and F4. So, now we
have four different opportunities to apply

133
00:10:03,048 --> 00:10:07,506
this lemma. Right? Those four different
pairs of adjacent SCCs. And so, what do

134
00:10:07,506 --> 00:10:12,662
we find? We find that well, comparing F1 and F2, because C2 comes after C1,

135
00:10:12,662 --> 00:10:17,276
that is there's an arc from C1 to
C2, the max finishing time in C2 has

136
00:10:17,276 --> 00:10:22,121
to be bigger than that in C1. That is F2 is bigger than F1. For the same

137
00:10:22,121 --> 00:10:27,382
reasoning F3 has to be bigger than F1.
Symmetrically we can apply the limit to

138
00:10:27,382 --> 00:10:33,748
the pair C2, C4 and C3, C4 and we get that
F4 has to dominate both of them. Now

139
00:10:33,748 --> 00:10:37,986
notice we actually have no idea whether F2
or F3 is bigger. So that pair we can't

140
00:10:37,986 --> 00:10:41,874
resolver. But we do know these
relationships. Okay F1 is the smallest and

141
00:10:41,874 --> 00:10:48,374
F4 is the smallest [biggest!!]. And you also notice
that C4 is a sink SCC and the sink has no

142
00:10:48,374 --> 00:10:52,579
outgoing arches and you think about it
that's a totally general consequence of

143
00:10:52,579 --> 00:10:56,718
this lema. So in a simple group of
contradiction will go as follows. Consider

144
00:10:56,718 --> 00:11:00,880
this SCC with
the maximum F value. Suppose it was not a

145
00:11:00,880 --> 00:11:05,364
sink SCC that it has an outgoing arch,
follow that outgoing arch to get some

146
00:11:05,364 --> 00:11:11,078
other SCC by the lema the SCC you've got
into has even bigger maximum finishing

147
00:11:11,078 --> 00:11:14,536
time. So that contradicts the fact that
you started in the SCC with a maximum

148
00:11:14,536 --> 00:11:19,862
finishing time. Okay. So just like in this
cartoon, where the unique sink SCC has

149
00:11:19,862 --> 00:11:23,651
to have the largest finishing time, that's
totally general. As another sanity check,

150
00:11:23,651 --> 00:11:27,295
we might return to the nine node graph,
where we actually ran Kasaraja's

151
00:11:27,295 --> 00:11:31,648
algorithm and looking at the ford
version of the graph, which is the one on

152
00:11:31,648 --> 00:11:38,093
the bottom, we see that the maximum
finishing times in the three FCC are 4,6

153
00:11:38,093 --> 00:11:42,178
and 9. And it turns out that the same
as the leader nodes which is not an

154
00:11:42,178 --> 00:11:45,917
accident if you think about it for a
little while and again you'll observe the

155
00:11:45,917 --> 00:11:49,906
maximum finishing time in this graph
namely 9 is indeed in the left most SCC

156
00:11:49,906 --> 00:11:53,953
which is the only SCC with no outgoing
arks. Okay but it's totally general

157
00:11:53,953 --> 00:11:57,290
basically you can keep following arks and
you keep seeing bigger and bigger

158
00:11:57,290 --> 00:12:00,862
finishing times so the biggest one of all
it has to be somewhere where you get stuck

159
00:12:00,862 --> 00:12:04,327
where you can't go forward but there's no
outgoing arks and that's what I'm calling

160
00:12:04,327 --> 00:12:08,608
a sink SCC. Okay. So assuming the lemma is true we know that the corollary

161
00:12:08,608 --> 00:12:12,544
is true. Now using this corollary
let's finish the proof of correctness, of

162
00:12:12,544 --> 00:12:17,795
Kasaraja's algorithm, module over proof
of the key lima. So I'm not going to do

163
00:12:17,795 --> 00:12:22,121
this super rigorously, although everything
I say is correct and can be made, made

164
00:12:22,121 --> 00:12:25,376
rigorous. And if you want a more rigorous
version I'll post some notes on the course

165
00:12:25,376 --> 00:12:31,278
website which you can consult for more
details. So what the previous corollary

166
00:12:31,278 --> 00:12:35,891
accomplished, it allows us to locate, the
node with maximum finishing time. We can

167
00:12:35,891 --> 00:12:40,856
locate it in somewhere in some sink
SCC. Let me remind you about the

168
00:12:40,856 --> 00:12:44,120
discussion we had at the very beginning of
talking about computing strong components.

169
00:12:44,120 --> 00:12:48,593
We're tryna understand depth-first search
would be a useful workhorse for finding

170
00:12:48,593 --> 00:12:53,815
the strong components. And the key
observation was that it depends, where you

171
00:12:53,815 --> 00:12:58,966
begin that depth-first search. So for
example in this, graph with four SCC's

172
00:12:58,966 --> 00:13:04,050
shown in blue on the right. A really bad
place to start. DFS called Depth For

173
00:13:04,050 --> 00:13:10,782
Search would be somewhere in C1. Somewhere
in this source SCC, so this is a bad DFS.

174
00:13:10,782 --> 00:13:15,316
Why is it bad? Well remember what Depth
For Search does; it finds everything

175
00:13:15,316 --> 00:13:20,248
findable from its starting point. And from
C1 you can get to the entire world, you

176
00:13:20,248 --> 00:13:25,227
can get to all the nodes in the entire
graph. So you can discover everything. And

177
00:13:25,227 --> 00:13:28,347
this is totally useless because we wanted
to discover much more fine-grain

178
00:13:28,347 --> 00:13:32,917
structure. We wanted to discover C1, C2,
C3 and C4 individually. So that would be

179
00:13:32,917 --> 00:13:37,601
an disaster if we invoked depth first
search somewhere from C1. Fortunately

180
00:13:37,601 --> 00:13:40,256
that's not what's going to happen, right?
We computed this magical ordering in the

181
00:13:40,256 --> 00:13:46,335
first pass to insure that we look at the
node with the maximum finishing time and,

182
00:13:46,335 --> 00:13:50,944
by the corollary, the maximum finishing
time is going to be somewhere in C4.

183
00:13:50,944 --> 00:13:56,912
That's gonna be a good DFS, in the sense
that, when we start exploring from

184
00:13:56,912 --> 00:14:00,710
anywhere in C4, there's no outgoing arcs.
So, of course, we're gonna find everything

185
00:14:00,710 --> 00:14:03,068
in C4. Everything in C4's strongly
connected to each other. But we can't get

186
00:14:03,068 --> 00:14:06,629
out. We will not have the option of
trespassing on other strong components,

187
00:14:06,629 --> 00:14:13,431
and we're not gonna find'em. So we're only
gonna find C4, nothing more. Now, here's

188
00:14:13,431 --> 00:14:16,340
where I'm gonna be a little informal.
Although, again, everything I'm gonna say

189
00:14:16,340 --> 00:14:20,545
is gonna be correct. So what happens now,
once we've discovered everything in C4?

190
00:14:20,545 --> 00:14:24,479
Well, all the nodes in C4 get marked as
explored, as we're doing depth first

191
00:14:24,479 --> 00:14:27,945
search. And then they're basically dead to
us, right? The rest of our depth first

192
00:14:27,945 --> 00:14:31,586
search loop will never explore them again.
They're already marked as explored. If we

193
00:14:31,586 --> 00:14:35,297
ever see'em, we don't even go there. So
the way to think about that is when we

194
00:14:35,297 --> 00:14:39,337
proceed with the rest of our for loop in
DFS loop it's as if we're starting

195
00:14:39,337 --> 00:14:43,845
afresh. We're doing depth first search
from scratch on a smaller graph, on the

196
00:14:43,845 --> 00:14:48,437
residual graph. The graph G with this
newly discovered strong component 'C<i>'</i>

197
00:14:48,437 --> 00:14:54,543
deleted. So in this example on the right,
all of the nodes in C4 are dead to us and

198
00:14:54,543 --> 00:14:59,655
it's as if we run DFS anew, just on the
graph containing the strong components C1,

199
00:14:59,655 --> 00:15:04,810
C2 and C3. So in particular, where is the
next indication of depth first search

200
00:15:04,810 --> 00:15:10,078
going to come from? It's going to come
from some sink SCC in the residual graph,

201
00:15:10,078 --> 00:15:14,255
right? It's going to start at the node
that remains and that has the largest

202
00:15:14,255 --> 00:15:18,295
finishing time left. So there's some
ambiguity in this picture. Again recall we

203
00:15:18,295 --> 00:15:22,166
don't know whether F2 is bigger or F3 is
bigger. It could be either one. Maybe F2

204
00:15:22,166 --> 00:15:27,075
is the largest remaining finishing time in
which case the next DFS indication's gonna

205
00:15:27,075 --> 00:15:31,287
begin somewhere more from C2. Again, the
only things outgoing from C2 are these

206
00:15:31,287 --> 00:15:34,688
already explored nodes. Their effectively
deleted. We're not gonna go there again.

207
00:15:34,688 --> 00:15:40,695
So this is essentially a sink FCC. We
discover, we newly discover the nodes in

208
00:15:40,695 --> 00:15:43,595
C2 and nothing else. Those are now
effectively deleted. Now, the next

209
00:15:43,595 --> 00:15:49,122
indication of DFS will come from somewhere
in F3, somewhere in C3. That's the only

210
00:15:49,122 --> 00:15:53,803
remaining sink SCC in the residual graph.
So the third call, the DFS will discover

211
00:15:53,803 --> 00:15:57,076
this stuff. And now, of course, we're left
only with C1. And so the final indication

212
00:15:57,076 --> 00:16:04,317
of DFS will emerge from and discover the
nodes in C1. And in this sense because

213
00:16:04,317 --> 00:16:09,048
we've ordered the nodes by finishing times
when DFS was reverse graph, that ordering

214
00:16:09,048 --> 00:16:13,986
has this incredible property that when you
process the nodes in the second pass we'll

215
00:16:13,986 --> 00:16:17,933
just peel off the strongly connected
components one at a time. If you think

216
00:16:17,933 --> 00:16:22,304
about it, it's in reverse topological
order with respect to the directed asypric

217
00:16:22,304 --> 00:16:26,962
graph of the strongly connected
components. So we've constructed a proof

218
00:16:26,962 --> 00:16:30,972
of correctness of Kosaraju's, algorithm
for computing strongly connected

219
00:16:30,972 --> 00:16:34,608
components. But again, there's a hole in
it. So we completed the argument assuming

220
00:16:34,608 --> 00:16:38,961
a statement that we haven't proved. So
let's fill in that last gap in the proof,

221
00:16:38,976 --> 00:16:43,175
and we'll we done. So what we need to do
is prove the key lemma. Let me remind you

222
00:16:43,175 --> 00:16:50,649
what it says. It says if you have two
adjacent SCCs, C1 and C2 and is an arc from a

223
00:16:50,649 --> 00:16:58,151
node in C1, call it 'I' to a node in C2, say
J. Then the max finishing time in C2 is

224
00:16:58,151 --> 00:17:02,275
bigger than the max finishing time in C1. Where, as always, these finishing

225
00:17:02,275 --> 00:17:06,335
times are computed in that first pass of
depth-first search loop in the reversed

226
00:17:06,335 --> 00:17:10,327
graph. All right, now the finishing times
are computed in the reversed graph, so

227
00:17:10,327 --> 00:17:14,289
let's actually reverse all the arcs and
reason about what's happening there. We

228
00:17:14,289 --> 00:17:19,454
still have C1. It still contains the
node I. We still have C2, which still

229
00:17:19,454 --> 00:17:25,732
contains the node J. But now of course the
orientation of the arc has reversed. So

230
00:17:25,732 --> 00:17:31,675
the arc now points from J to I. Recall we
had a quiz which said, asked you to

231
00:17:31,675 --> 00:17:36,105
understand the effect of reversing all
arcs on the SCC's and in particular there

232
00:17:36,105 --> 00:17:39,981
is no effect. So the SCC's in the reverse
graph are exactly the same as in the

233
00:17:39,981 --> 00:17:45,577
forward graph. So now we're going to have
two cases in this proof and the cases

234
00:17:45,577 --> 00:17:51,382
correspond to where we first encounter a
node of C1 and union C2. Now remember,

235
00:17:51,382 --> 00:17:55,327
when we do this DFS loop, this second
pass, because we have this outer four loop

236
00:17:55,327 --> 00:17:59,570
that iterates over all of the nodes we're
guaranteed to explore every single node of

237
00:17:59,570 --> 00:18:03,391
the graph at some point. So in particular
we're gonna have to explore at some point

238
00:18:03,391 --> 00:18:08,168
every node in C1 and C2. What I want you
to do is pause the algorithm. When it

239
00:18:08,168 --> 00:18:14,190
first, for the first time, explores some
node that's in either C1 or C2. There's

240
00:18:14,190 --> 00:18:18,332
going to be two cases, of course, because
that node might be in C1, you might see

241
00:18:18,332 --> 00:18:22,393
that first. Or it might be in C2, you
might see something from C2 first. So our

242
00:18:22,393 --> 00:18:26,815
case one is going to be when the first
node that we see from either one happens

243
00:18:26,815 --> 00:18:32,378
to lie in C1. And the second case is where
the first node V that we see happens to

244
00:18:32,378 --> 00:18:38,981
lie in C2. So clearly exactly one of these
will occur. So let's think about case one.

245
00:18:38,981 --> 00:18:47,826
When we see a node of C1 before we see
any nodes of C2. So in this case where we

246
00:18:47,826 --> 00:18:53,228
encounter a node in C1 before we encounter
any node in C2, the claim is that we're

247
00:18:53,228 --> 00:18:59,086
going to explore everything in C1 before
we ever see anything in C2. Why is that

248
00:18:59,086 --> 00:19:04,714
true? The reason is there cannot be a path
that starts somewhere in C1, for example,

249
00:19:04,714 --> 00:19:10,555
like the vertex V, and reaches C2. This is
where we are using the fact that the

250
00:19:10,555 --> 00:19:15,229
meta-graph on SCC is a cyclic. Right C1 is strong

251
00:19:15,229 --> 00:19:19,385
connected, C2 is strong connected, you can
get from C2 to C1 and, if you can also get

252
00:19:19,385 --> 00:19:24,804
from C1 back to C2 this all collapses into
a single strongly connected component. But

253
00:19:24,804 --> 00:19:28,655
that would be a contraction, we're
assuming C1 and C2 are distinct strongly

254
00:19:28,655 --> 00:19:32,238
connected components, therefore you can't
have paths in both directions. We already

255
00:19:32,238 --> 00:19:36,392
have a path from right to left, via JI, so
there's no path from left to right. That's

256
00:19:36,392 --> 00:19:40,790
why if you originate a depth first search
from somewhere inside C1 like this vertex

257
00:19:40,790 --> 00:19:45,875
V, you would finish exploring all of C1
before you ever are going to see C2,

258
00:19:45,875 --> 00:19:51,058
you're. Only gonna see C2 at some later
point in the outer for loop. So, what's

259
00:19:51,058 --> 00:19:55,767
the consequence that you completely finish
with C1 before you ever see C2? Well it

260
00:19:55,767 --> 00:20:01,356
means every single finishing time in C1 is
going to be smaller than every single

261
00:20:01,356 --> 00:20:05,333
finishing time in C2. So that's even
stronger that what we're claiming, we're

262
00:20:05,333 --> 00:20:08,962
just claiming that the biggest thing in C2
is bigger than the biggest of C1. But

263
00:20:08,962 --> 00:20:12,682
actually finishing times in C2 totally
dominate those in C1, because you finish

264
00:20:12,682 --> 00:20:17,089
C1 before you ever see C2. So let's now
have a look at case one actually in

265
00:20:17,089 --> 00:20:20,727
action. Let's return to the nine-node
graph, the one that we actually ran

266
00:20:20,727 --> 00:20:25,921
Kosaraju's algorithm to completion. So if
we go back to this graph which has the

267
00:20:25,921 --> 00:20:29,844
three connected components, then remember
that the bottom version is the forward

268
00:20:29,844 --> 00:20:33,281
version, the top version is the reversed
version. So if, if you think about the

269
00:20:33,281 --> 00:20:39,470
middle SCC as being c1, pulling the row of
c1 and the left most. Scc playing the role

270
00:20:39,470 --> 00:20:44,608
of C2, then what we have exactly is case
one of the key lemma. So, which was

271
00:20:44,608 --> 00:20:50,641
the first of these six vertices visited
during the DFS loop in the reversed graph?

272
00:20:50,641 --> 00:20:54,785
Well that would just be the node with the
highest name, so the node nine. So this

273
00:20:54,785 --> 00:20:58,499
was the first of these six vertices that
depth first search looked at in the first

274
00:20:58,499 --> 00:21:04,327
pass, that lies in what we're calling C1.
And indeed everything in C1 was discovered

275
00:21:04,327 --> 00:21:09,278
in that pass before anything in C2 and
that's why all of the finishing times in

276
00:21:09,278 --> 00:21:14,477
C2, the 7,8,9 are bigger than
all of the finishing times in C1 -

277
00:21:14,477 --> 00:21:19,594
the 1,5, and 6. So we're good to go in
case two. We've proven sorry, in case one,

278
00:21:19,594 --> 00:21:24,051
we've proven the lemma. When it's the case
that, amongst the vertices in C1

279
00:21:24,051 --> 00:21:28,740
and C2, depth first search in the first pass
sees something from C1 first. So

280
00:21:28,740 --> 00:21:32,803
now, let's look at this other case, this
grey case, which could also happen,

281
00:21:32,803 --> 00:21:36,638
totally possible. Well, the first thing we
see when depth first searching in the

282
00:21:36,638 --> 00:21:42,873
first pass, is something from C2. And here
now is where we truly use the fact that

283
00:21:42,873 --> 00:21:47,384
we're using a depth first search rather
than some other graph search algorithm

284
00:21:47,384 --> 00:21:50,549
like breadth first search. There's a lot
of places in this algorithm you could swap

285
00:21:50,549 --> 00:21:55,049
in breadth first search but in this case
two, you'll see why it's important we're

286
00:21:55,049 --> 00:21:59,514
using depth first search to compute the
finishing times. And what's the key point?

287
00:21:59,514 --> 00:22:04,014
The key point is that, when we invoke
depth first search beginning from this

288
00:22:04,014 --> 00:22:08,505
node V, which is now assuming the line C2.
Remember depth first search will not

289
00:22:08,505 --> 00:22:12,649
complete. We won't be done with V until
we've found everything there is to find

290
00:22:12,649 --> 00:22:16,077
from it, right? So we recursively explore
all of the outgoing arcs. They recursively

291
00:22:16,077 --> 00:22:20,922
explore all of the outgoing arcs, and so
on. It's only when all paths going out of

292
00:22:20,922 --> 00:22:24,572
V have been totally explored and
exhausted, that we finally backtrack all

293
00:22:24,572 --> 00:22:29,926
the way to V, and we consider ourselves
done with it. That is, depth first search.

294
00:22:29,926 --> 00:22:36,562
In the reverse graph initiated at v. Won't
finish until everything findable has been

295
00:22:36,562 --> 00:22:41,473
completely explored. Because there's an
arc from C2 to C1, obviously everything to

296
00:22:41,473 --> 00:22:45,938
C2 is findable from V, that's strongly
connected. We can from C2 to C1 just using

297
00:22:45,938 --> 00:22:50,051
this arc from J to I. C1 being strongly
connected we can then find all of that.

298
00:22:50,051 --> 00:22:53,360
Maybe we can find other strongly connected
components, as well, but for sure

299
00:22:53,360 --> 00:22:57,129
depth-first search starting from V will
find everything in C1 to C2, maybe some

300
00:22:57,129 --> 00:23:01,778
other things. And we won't finish with V
until we finish with everything else,

301
00:23:01,778 --> 00:23:06,807
that's the depth-first search property.
For that reason the finishing time of this

302
00:23:06,807 --> 00:23:11,614
vertex V will be the largest of anything
reachable from it. So in particular it'll

303
00:23:11,614 --> 00:23:14,875
be larger than everything in C two but
more to the point, it'll be larger than

304
00:23:14,875 --> 00:23:18,962
everything in C1 which is what we are
trying to prove. Again let's just see

305
00:23:18,962 --> 00:23:22,064
this quickly in action in the nine node
network on which we traced through

306
00:23:22,064 --> 00:23:26,220
Kosaraju's algorithm. So to show the rule
that case two is playing in this concrete

307
00:23:26,220 --> 00:23:30,441
example let's think of the right most
strongly connected component as being C1.

308
00:23:30,441 --> 00:23:34,572
And let's think of the middle SCC as being C.2. Now

309
00:23:34,572 --> 00:23:38,742
the last time. We called the middle one C1
and the leftmost one C2. Now we're calling

310
00:23:38,742 --> 00:23:42,066
the rightmost one C1 and the middle one
C2. So again, we have to ask the question,

311
00:23:42,066 --> 00:23:47,091
you know, of the six nodes in C1 and in
C2, what is the first one encountered in

312
00:23:47,091 --> 00:23:50,185
the depth first search that we do in the
first pass. And then again, is the node

313
00:23:50,185 --> 00:23:54,113
nine? The, the node which is originally
labeled not. So it's the same node that

314
00:23:54,113 --> 00:23:57,588
was relevant in the previous case, but now
with this relabeling of the components,

315
00:23:57,588 --> 00:24:02,821
nine appears in the strongly connected
component C-2, not in the one labeled C-1.

316
00:24:02,821 --> 00:24:07,351
So that's the reason now we're in case
two, not in case one. And what you'll see

317
00:24:07,351 --> 00:24:13,301
is, what is the finishing time that this
originally labeled nine node gets. It gets

318
00:24:13,301 --> 00:24:18,614
the finishing time six. And you'll notice
six is bigger than any of the other

319
00:24:18,614 --> 00:24:22,523
finishing times of any of the other nodes
in C1 or C2. All, the other five nodes

320
00:24:22,523 --> 00:24:26,518
have the finishing times one through five.
And that's exactly because when we ran

321
00:24:26,518 --> 00:24:30,515
depth first search in the first pass, and
we started it at the node originally

322
00:24:30,515 --> 00:24:35,352
labeled nine, it discovered these other
five nodes and finished exploring them

323
00:24:35,352 --> 00:24:40,897
first before finally back tracking all the
way back to nine, and deeming nine fully

324
00:24:40,897 --> 00:24:44,839
explored. And it was only at that point
that nine got its finishing time after

325
00:24:44,839 --> 00:24:50,371
everything reachable from it had gotten
there lower finishing times. So that

326
00:24:50,371 --> 00:24:54,851
wraps it up. We had two cases depending on
whether in these two adjacent SCC's, the

327
00:24:54,851 --> 00:25:00,349
first vertex encountered was in the C1, or
in C2. Either way it doesn't matter, the

328
00:25:00,349 --> 00:25:04,621
largest finishing time has to be in C2.
Sometimes it's bigger than everything,

329
00:25:04,621 --> 00:25:08,351
sometimes it's just bigger than the
biggest in C-1, but it's all the same to

330
00:25:08,351 --> 00:25:10,721
us. And to re cap how the rest of the
proof goes, we have a corollary based on

331
00:25:10,721 --> 00:25:16,079
this lemma, which is maximum finishing
time have to lie in sink SCC

332
00:25:16,079 --> 00:25:18,877
And that's exactly
where we want our depth first search to

333
00:25:18,877 --> 00:25:23,135
initiate. If you're initiated in a strong
component with no outgoing arcs, you do

334
00:25:23,135 --> 00:25:26,872
DFS. The stuff you find is just the stuff
and that strongly connected component. You

335
00:25:26,872 --> 00:25:31,061
do not have any avenues by which to
trespass on other strong components. So

336
00:25:31,061 --> 00:25:35,067
you find exactly one SCC. In effect, you
can peel that off and recurse on the rest

337
00:25:35,067 --> 00:25:39,197
of the graph. And our slick way of
implementing this recursion is to just do

338
00:25:39,197 --> 00:25:43,854
the single second, DFS pass, where you
just treat the nodes in decreasing order

339
00:25:43,854 --> 00:25:48,436
of finishing times, that in effect,
unveil, unveils all of the SCCs in

340
00:25:48,436 --> 00:25:53,397
reverse topological ordering. So that's
it, Kosaraju's algorithm, and the complete

341
00:25:53,397 --> 00:25:58,093
proof of correctness. A blazingly fast
graph primitive that, in any directed

342
00:25:58,093 --> 00:26:02,866
graph, will tell you its strong
components.