1
00:00:03,620 --> 00:00:10,617
As our final digraph processing algorithm,
we'll take a look at computing strong

2
00:00:10,617 --> 00:00:15,244
components.
So the definition two vertices, v and w,

3
00:00:15,244 --> 00:00:21,758
in a digraph are strongly connected if
there's a directed path from v to w and

4
00:00:21,758 --> 00:00:27,947
another directed path from w to v.
And the thing about strongly connected is

5
00:00:27,947 --> 00:00:34,123
that it's unequivalence relation. That is
each vertex is strongly connected to

6
00:00:34,123 --> 00:00:37,299
itself.
If v is connected to w, strongly, and w is

7
00:00:37,299 --> 00:00:41,059
strongly connected to, to, then w is
strongly connected to v.

8
00:00:41,059 --> 00:00:44,820
That just means that paths, directed paths
connecting them.

9
00:00:45,046 --> 00:00:50,038
And it's also transitive.
If v is connected to w and w to x,, then v

10
00:00:50,038 --> 00:00:55,029
is strongly connected to x.
How to get from v to x? You go from w, v

11
00:00:55,029 --> 00:00:59,189
to w and w to x,
And get from x to v, you go from x to w

12
00:00:59,189 --> 00:01:02,516
and w to v.
So, that means that since it's an

13
00:01:02,516 --> 00:01:08,037
equivalence relation, it divides the
digraph up into components, into sets

14
00:01:08,037 --> 00:01:14,238
called strongly connected components that
have the property that there's directed

15
00:01:14,238 --> 00:01:18,020
paths connecting each pair of vertices in
the set.

16
00:01:18,020 --> 00:01:23,742
So, for example nine and twelve are
strongly connected,

17
00:01:23,742 --> 00:01:28,545
There's a path from twelve to nine,
Another one from nine to twelve and so

18
00:01:28,545 --> 00:01:30,377
forth.
So the,

19
00:01:30,377 --> 00:01:38,450
Our challenge is to compute the strong
components in diagraphs.

20
00:01:38,679 --> 00:01:45,418
And it's worth comparing this to what we
did for connected components in undirected

21
00:01:45,418 --> 00:01:48,328
graphs.
So, this is just a quick review.

22
00:01:48,558 --> 00:01:54,607
In an undirected graph, if there's a path
between two vertices that are connected,

23
00:01:54,607 --> 00:02:00,274
that's an equivalence relation, divides
them up into connective components.

24
00:02:00,504 --> 00:02:07,013
And what we did was, our design pattern is
to build our graph processing algorithm as

25
00:02:07,013 --> 00:02:13,767
a constructor that takes a graph and does
the pre-processing to create a table like

26
00:02:13,767 --> 00:02:20,765
this one which assigns a unique ID to all
the vertices in each given component.

27
00:02:20,765 --> 00:02:26,819
So, that we can have a constant time
client query to check whether two given

28
00:02:26,819 --> 00:02:30,320
vertices are in the same connected
component or not.

29
00:02:30,320 --> 00:02:36,227
So, linear time processing, pre-processing
in the constructor to build the table,

30
00:02:36,227 --> 00:02:41,388
constant time client queries.
And that's as good performance as we can

31
00:02:41,388 --> 00:02:45,875
expect to have for a graph for a graph
processing algorithm.

32
00:02:46,099 --> 00:02:51,110
Remarkably, now, we're able to do the same
thing for strong connectivity.

33
00:02:51,322 --> 00:02:57,339
It's a, it's a much more sophisticated
algorithm but the design pattern and the

34
00:02:57,339 --> 00:03:03,316
bottom line for the client is the same.
There's a constructor, it processes the

35
00:03:03,316 --> 00:03:07,707
graph in linear time.
And it assigns a unique ID to each one of

36
00:03:07,707 --> 00:03:10,863
the strongly connected components in the
graph.

37
00:03:10,863 --> 00:03:16,100
So, once in the component by itself,
Zero, two, three, four, and five six, and

38
00:03:16,100 --> 00:03:21,539
eight, seven, and nine, ten, and twelve,
There's five different strongly connected

39
00:03:21,539 --> 00:03:26,240
components in this graph.
The constructor builds that array and the

40
00:03:26,240 --> 00:03:31,209
client gets to in constant time, know
whether two vertices are strongly

41
00:03:31,209 --> 00:03:35,775
connected or not.
It's quite amazing that we can solve this

42
00:03:35,775 --> 00:03:39,730
problem in this way.
And as I'd mentioned it's got an

43
00:03:39,730 --> 00:03:42,730
interesting history that we'll talk about
in a second.

44
00:03:42,730 --> 00:03:47,306
So, here's an example of strong component
application.

45
00:03:47,552 --> 00:03:53,028
So, the food web graph,
This is a very small subset of that graph.

46
00:03:53,273 --> 00:04:00,710
The vertices are all the, species and an
edge goes from producer to consumer.

47
00:04:00,710 --> 00:04:05,859
So, if animal a eats animal b, there's an
edge from a to b.

48
00:04:06,105 --> 00:04:13,133
And what a strong component corresponds to
in this graph is a kind of a subset of

49
00:04:13,133 --> 00:04:18,942
species that, have a common energy flow.
They naturally eat each other.

50
00:04:19,187 --> 00:04:24,021
And that's extremely important in
ecological studies.

51
00:04:24,267 --> 00:04:31,395
Another example is again, in processing
software big software is made up of

52
00:04:31,395 --> 00:04:35,327
modules.
And the vertices are, are, are, are the

53
00:04:35,327 --> 00:04:39,423
modules.
And edges is if one depends on another.

54
00:04:39,669 --> 00:04:46,797
And so a strong component in this graph is
a subset of mutually interacting modules.

55
00:04:47,043 --> 00:04:53,283
And in a huge program like Internet
Explorer you want to know what the strong

56
00:04:53,283 --> 00:04:58,391
components are so that you can package
them together and maybe improve the

57
00:04:58,391 --> 00:05:01,081
design.
So again, software you know, can, these

58
00:05:01,081 --> 00:05:05,616
graphs can be huge.
And this kind of graph processing can be

59
00:05:05,616 --> 00:05:10,080
extremely important in improving the
design of software.

60
00:05:11,651 --> 00:05:18,464
Now again this algorithm has an
interesting history,, it along with

61
00:05:18,695 --> 00:05:25,233
scheduling and other things that's a core
problem in operations research that was

62
00:05:25,464 --> 00:05:30,695
widely studied, but really not understood
how difficult it was.

63
00:05:30,926 --> 00:05:37,772
In Tarjan's paper in 1972 was a big
surprise that this could be solved in

64
00:05:38,003 --> 00:05:44,080
linear time with depth research.
Now this algorithm had a couple of other

65
00:05:44,080 --> 00:05:49,587
data structures and this I guess,
A diligent student in this class could

66
00:05:49,587 --> 00:05:55,413
understand it with quite a bit of work.
But it really demonstrated the importance

67
00:05:55,413 --> 00:05:58,807
of depth-first search in great graph
processing.

68
00:05:59,006 --> 00:06:04,529
Now, the algorithm that we're going to
talk about today actually was invented in

69
00:06:04,529 --> 00:06:08,122
the 80s' by Kosaraju and, and
independently by Sharir.

70
00:06:08,321 --> 00:06:14,043
The story goes that Kosaraju had to go
lecture about Tarjan's algorithms and he

71
00:06:14,043 --> 00:06:18,169
forgot his notes.
And he had taught it a number of times, he

72
00:06:18,169 --> 00:06:23,957
was trying to figure out what Tarjan's
algorithm did and he developed this other

73
00:06:23,957 --> 00:06:27,138
algorithm that's extremely simple ro
implement.

74
00:06:27,349 --> 00:06:30,810
That's what we're going to look at today,
today.

75
00:06:30,810 --> 00:06:37,516
And actually in the 1990s Gabow and
Mehlhorn and particularly, Melhorn had to

76
00:06:37,516 --> 00:06:44,136
implement this algorithm for a big
software package and he found another

77
00:06:44,136 --> 00:06:49,972
simple linear time algorithm.
Now, so, this story indicates even from

78
00:06:49,972 --> 00:06:57,289
fundamental problems in graph processing
there's algorithms out there still waiting

79
00:06:57,289 --> 00:07:02,690
to be discovered and this, this algorithm
is a good example of that.

80
00:07:02,690 --> 00:07:09,230
So we give the intuition in the code and
you see how the algorithm works fully

81
00:07:09,230 --> 00:07:15,250
convincing yourself or proving why it
works is a bit more of a challenge.

82
00:07:15,250 --> 00:07:19,762
Now, we'll leave that mostly for the book.
But let's describe what it is.

83
00:07:19,992 --> 00:07:23,900
So, the first idea is to think about the
reverse graph.

84
00:07:24,130 --> 00:07:30,261
So if we take a graph and we reverse the
sense of all the edges we're going to get

85
00:07:30,261 --> 00:07:34,859
the same, strong components.
We need edges in both directions.

86
00:07:34,859 --> 00:07:40,300
So, if we switch the directions, we're
still going to have edges in both

87
00:07:40,300 --> 00:07:43,672
directions.
The, the second concept is called the,

88
00:07:43,672 --> 00:07:47,350
the, the kernel DAG.
And what that does, is it kind of, you

89
00:07:47,350 --> 00:07:53,719
think about contracting each strong
component into a single vertex.

90
00:07:54,115 --> 00:08:01,250
And then, just to worry about the edges
that go from one strong component to

91
00:08:01,250 --> 00:08:06,422
another.
So the digraph that you get that way,

92
00:08:06,422 --> 00:08:11,723
turns out to be a cyclic.
If it was cyclic, it would involve, it

93
00:08:11,723 --> 00:08:19,161
would create a bigger strong component, a
couple of strong components into one.

94
00:08:19,418 --> 00:08:25,830
So if we think about that processing that
kernel DAG, that's the edges that go

95
00:08:25,830 --> 00:08:33,354
between strong components and we get a DAG
and we know how to deal with a DAG So, the

96
00:08:33,354 --> 00:08:40,305
idea of the algorithm is to go ahead and
compute the topological order or the

97
00:08:40,305 --> 00:08:46,669
reverse post order in the kernel DAG.
At least put up the edges of the original

98
00:08:46,669 --> 00:08:53,257
digraph in order so that all the edges in
the kernel DAG point from right to left,

99
00:08:53,257 --> 00:08:58,648
That's like a topological sort.
And then we run DFS but instead of

100
00:08:58,648 --> 00:09:04,787
considering the vertices in, in numerical
order for the DFS, we consider them in

101
00:09:04,787 --> 00:09:10,851
that reverse topological order.
So it's extremely easy to implement this

102
00:09:10,851 --> 00:09:16,000
algorithm.
Of course, we're going to use DFS.

103
00:09:16,289 --> 00:09:24,452
So let's look at a demo at how the
algorithm works., okay, so, this is a

104
00:09:24,452 --> 00:09:29,328
digraph, and our goal is to compute the
strong components.

105
00:09:29,328 --> 00:09:33,446
And we're going to do it in two phases,
two DFSs.

106
00:09:33,783 --> 00:09:39,330
One is to compute the reverse post order
and the reverse of the graph.

107
00:09:39,667 --> 00:09:46,139
And the other is to run the DFS but for
the other in which we visit the vertices,

108
00:09:46,139 --> 00:09:51,518
we use that reverse post order that we've
computed. so here goes.

109
00:09:51,518 --> 00:09:55,553
So, first, we'll do the DFS and the
reverse graph.

110
00:09:55,805 --> 00:09:59,141
So that's the graph, that's the reverse
graph.

111
00:09:59,141 --> 00:10:02,377
Remember, these two has the same strong
component.

112
00:10:02,579 --> 00:10:06,692
So, there's our marked array.
We will do the DFS, and reverse post order

113
00:10:06,692 --> 00:10:10,670
means that when we are done with the
vertex, we'll put it out.

114
00:10:11,118 --> 00:10:21,165
So checked six unmarked, checked eight
unmarked checked six, it's marked. so

115
00:10:21,165 --> 00:10:27,594
we're done with a answer, that's the
reverse post order. and again, as before,

116
00:10:27,594 --> 00:10:34,481
put it on a stack and but we'll just list
them in reverse order in this demo.

117
00:10:34,715 --> 00:10:38,782
So, eight is done.
So, six lots of places to go from six.

118
00:10:39,017 --> 00:10:43,479
So, let's check seven.
That's unmarked, so we go to seven.

119
00:10:43,479 --> 00:10:47,366
No place to go from seven, so we're done
with seven.

120
00:10:47,366 --> 00:10:52,624
We put it on the reverse post order list.
We're also done with six.

121
00:10:52,853 --> 00:10:58,568
Cuz four and nine are coming in, in this
example, so I put it on the list.

122
00:10:58,797 --> 00:11:02,836
Next place to go from zero is two, so we
checked two

123
00:11:03,065 --> 00:11:09,618
That's unmarked so we mark it and recurse
and we checked the vertices adjacent to

124
00:11:09,618 --> 00:11:12,210
two.
First, four, that's unmarked.

125
00:11:12,415 --> 00:11:17,356
So, then we recurse and go to four.
We gotta got to, to eleven first and

126
00:11:17,356 --> 00:11:22,639
recurse so now we're at eleven.
And, and from eleven, we go to nine, which

127
00:11:22,639 --> 00:11:26,688
is unmarked.
So we have a pretty long recursive stack

128
00:11:26,688 --> 00:11:30,599
right here.
So, from nine, there's we have to check a

129
00:11:30,599 --> 00:11:35,128
bunch of things, we'll check twelve first
and then visit twelve.

130
00:11:35,128 --> 00:11:40,343
It's unmarked from twelve, we checked
eleven which is marked, nowhere to go.

131
00:11:40,549 --> 00:11:43,980
Then, we checked ten. That's unmarked, so
we go to ten.

132
00:11:44,203 --> 00:11:49,126
Visit ten it's, it's unmarked so we
recurse and then go to nine.

133
00:11:49,350 --> 00:11:53,527
And that's marked and that's everywhere we
get from ten.

134
00:11:53,527 --> 00:11:59,569
So, we're done with ten, so we put it on
the list and return. And now, where done

135
00:11:59,569 --> 00:12:03,150
with twelve, so we put it on the list and
return.

136
00:12:03,150 --> 00:12:07,947
And now, at nine, we have to check seven,
which is marked, and six, which is marked.

137
00:12:07,947 --> 00:12:10,680
And then, we're done so we put it on the
list.

138
00:12:10,879 --> 00:12:14,197
Then, we're done with eleven we put it on
the list.

139
00:12:14,396 --> 00:12:19,042
Then we're finished with four, so we
checked six which is marked, then we

140
00:12:19,042 --> 00:12:22,560
checked five, which is unmarked so we
recurse to five.

141
00:12:22,759 --> 00:12:26,475
From five, we checked three which is
unmarked so we recurse.

142
00:12:26,475 --> 00:12:30,458
Then, we checked four which is mark, we
checked two which is mark.

143
00:12:30,458 --> 00:12:34,373
And then we're done with three, so we put
it on the list.

144
00:12:34,573 --> 00:12:40,413
And then from five we checked zero it's
marked so we're done with five, we put it

145
00:12:40,413 --> 00:12:44,665
on the list.
And that means that we're now done with

146
00:12:44,665 --> 00:12:48,281
four.
And then, we're also done with two, after

147
00:12:48,281 --> 00:12:52,781
checking three.
And then, we go to zero and put it on the

148
00:12:52,781 --> 00:12:55,754
list.
So, that's all the vertices that, you

149
00:12:55,754 --> 00:13:00,817
know, you get through from zero.
So, we look for more vertices and it's one

150
00:13:00,817 --> 00:13:06,040
and it's marked so that's the last one in
the reverse post order.

151
00:13:06,040 --> 00:13:10,219
So, that's a reverse post order of the
reverse graph.

152
00:13:10,540 --> 00:13:16,584
And all we're going to do is take that
order and use that order to check vertices

153
00:13:16,584 --> 00:13:20,626
at the top level in the depth-first search
of our regular graph.

154
00:13:20,815 --> 00:13:24,920
We have to check all the other vertices to
make sure we are done.

155
00:13:24,920 --> 00:13:29,446
So, that's phase one.
So now, we'll just do a DFS in the

156
00:13:29,446 --> 00:13:35,454
original graph, using that order that we
just computed. So, we don't start with

157
00:13:35,454 --> 00:13:41,697
zero, the way we always have, now we start
with one. we visit one, its unmarked and

158
00:13:41,697 --> 00:13:47,939
now all the vertices that we visit during
that DFS are going to be in that same

159
00:13:47,939 --> 00:01:32,591
strong component, that's the theorem that
makes this algorithm work, in this case,

160
00:13:53,713 --> 00:13:59,593
this is only the one, so vertex one is its
own, is in its own strong component with

161
00:13:59,593 --> 00:14:03,118
label zero.
So now, we've got to start, once all done,

162
00:14:03,118 --> 00:14:08,123
so now, we have to start with looking for
another vertex to search from.

163
00:14:08,334 --> 00:14:11,295
In this case, it's zero that's second on
the list.

164
00:14:11,337 --> 00:14:13,762
So that's where we start,
With zero.

165
00:14:13,762 --> 00:14:19,260
And now, all the vertices that we can
reach from zero are going to have strong,

166
00:14:19,260 --> 00:14:23,420
in this graph, are going to have strong
component label one.

167
00:14:23,668 --> 00:14:27,060
So, we do the DFS.
So, first, we get to five.

168
00:14:27,060 --> 00:14:31,445
That's in the same strong component we
checked four.

169
00:14:31,693 --> 00:14:34,671
And it's unmarked.
So, we label it.

170
00:14:34,671 --> 00:14:39,801
We checked three, it's unmarked.
We checked five, which is marked.

171
00:14:39,801 --> 00:14:45,660
We checked two, which is unmarked.
So now we have shown that zero, two,

172
00:14:45,660 --> 00:14:49,670
three, four, and five are all in the same
strong component.

173
00:14:49,881 --> 00:14:55,229
And now we're going to find both vertices
we can get three from two or mark.

174
00:14:55,229 --> 00:14:59,169
So we're done with two then we're done
with three.

175
00:14:59,380 --> 00:15:02,335
Four, we have to check two, which is
marked.

176
00:15:02,335 --> 00:15:08,175
And then, we're done with four and five.
From zero we can check one but that's

177
00:15:08,175 --> 00:15:14,527
already marked so that's not relevant for
this search and then we're done with zero.

178
00:15:14,527 --> 00:15:19,508
And we've established that zero, two,
three, four, and five are all in the same

179
00:15:19,508 --> 00:15:22,419
strong component.
, So that's the second one.

180
00:15:22,640 --> 00:15:29,189
So now, we continue and we check all of
those and they're all marked. so, the next

181
00:15:29,189 --> 00:15:35,002
vertex in the reverse post order of the
reverse graph is eleven so we visit

182
00:15:35,002 --> 00:15:36,547
eleven.
Checked four,

183
00:15:36,547 --> 00:15:39,196
It's already marked.
Checked twelve,

184
00:15:39,196 --> 00:15:44,126
It's not so aj. we mark it, and these are
the third strong component.

185
00:15:44,126 --> 00:15:49,332
They get labeled with two.
Then, we get to nine. from nine, we check

186
00:15:49,332 --> 00:15:54,532
eleven and nowhere to go. Then, we checked
ten, and so that's seems a strong

187
00:15:54,532 --> 00:16:00,441
component. from ten, we checked twelve,
which is marked. We're done with ten, were

188
00:16:00,441 --> 00:16:06,744
done with nine, were done with twelve, and
were done with eleven. so, that's our

189
00:16:06,744 --> 00:16:11,550
second strong or third strong component
nine, ten, eleven and twelve.

190
00:16:11,797 --> 00:16:18,970
So we're done with those and now we go
through the list to find another unmarked

191
00:16:18,970 --> 00:16:21,856
vertex.
Nine is marked, twelve is marked, ten is

192
00:16:21,856 --> 00:16:29,441
marked. we get to six from six nine is
already marked. four is four is already

193
00:16:29,441 --> 00:16:33,069
marked.
Eight is not marked so we go there. from

194
00:16:33,069 --> 00:16:39,457
eight, we get to six and that's it.
Zero is already marked so we can only get

195
00:16:39,457 --> 00:16:44,072
to eight from six at this point, and so
that's a strong component.

196
00:16:44,285 --> 00:16:47,480
And then finally, we finish up by doing
seven.

197
00:16:47,480 --> 00:16:54,704
So the end of the computation gives us the
strong component array, which is a unique

198
00:16:54,704 --> 00:17:01,842
ID for each of the vertices so that two
vertices in the same strong component have

199
00:17:01,842 --> 00:17:06,228
the same ID.
And that supports constant time strong

200
00:17:06,228 --> 00:17:11,748
connectivity checks for clients.
And so, the bottom line is that this

201
00:17:11,979 --> 00:17:18,053
algorithm is a very simple even though it
might be mysterious algorithm for

202
00:17:18,053 --> 00:17:23,359
computing the strong component.
First, run DFS on the reverse graph to

203
00:17:23,359 --> 00:17:28,435
compute the reverse post order.
Then, run DFS on the original graph

204
00:17:28,435 --> 00:17:33,125
considering the vertices in the order
given by the first DFS.

205
00:17:33,356 --> 00:17:39,746
And so these, these diagrams give a
summary of the computation.

206
00:17:39,981 --> 00:17:46,703
I'm not going to spend too much time
explaining why and how it works in this

207
00:17:46,703 --> 00:17:50,923
lecture.
But these kinds of diagrams give the

208
00:17:50,923 --> 00:17:57,880
detail that can give some intuition about
how and why the algorithm works.

209
00:17:58,114 --> 00:18:06,321
The for the proof it requires some
mathematical sophistication and find it in

210
00:18:06,321 --> 00:18:10,951
the book.
But the programming of it is quite simple

211
00:18:10,951 --> 00:18:17,549
and proved that it's efficient is and all
it does is run DFS twice but to really see

212
00:18:17,549 --> 00:18:23,290
the correctness proof you want to look a
the second printing of the textbook. we

213
00:18:23,290 --> 00:18:28,272
got it wrong in the first printing.
And, but look at the implementation so,

214
00:18:28,272 --> 00:18:33,748
this is connected components and in the
undirected graph with DFS that we did last

215
00:18:33,748 --> 00:18:39,354
time and I'm sure many of you thought that
it was one of the simplest algorithms that

216
00:18:39,354 --> 00:18:42,483
we looked at.
We just marked the vertices and do

217
00:18:42,483 --> 00:18:49,383
recursive and that's the end of it.
If you take that code and just had I'll

218
00:18:49,383 --> 00:18:56,420
change the names.
And then just instead of going for the

219
00:18:56,420 --> 00:19:04,753
vertices from zero through v minus one if
you first compute the depth-first order

220
00:19:04,753 --> 00:19:12,599
that you get by running doing that first
search of the reverse graph and then you

221
00:19:12,599 --> 00:19:18,522
iterate through the vertices in that
order, you get an algorithm for strong

222
00:19:18,522 --> 00:19:22,864
connectivity.
It's really remarkable that just changing

223
00:19:22,864 --> 00:19:28,155
that one line will perform this
computation that was thought to be

224
00:19:28,155 --> 00:07:20,012
difficult for,, for many years, and, and
people were learning quite difficult codes

225
00:02:26,670 --> 00:19:39,255
for, for many, many years.
So, that's a quick summary of digraph

226
00:19:39,255 --> 00:19:43,945
processing.
We talked about single source reachability

227
00:19:44,177 --> 00:19:50,288
getting the paths from a vertex to any
vertex that can be reached from that

228
00:19:50,288 --> 00:19:56,012
vertex with a directed path.
We talked about topological sort in graphs

229
00:19:56,012 --> 00:20:02,045
that have no cycles and that uses DFS.
And we talked about computing strong

230
00:20:02,045 --> 00:20:08,079
components in a digraph with two DFSs.
Digraph processing is really h, a

231
00:20:08,079 --> 00:20:14,400
testimony to the ingenuity that's possible
in algorithm and graph processing.