1 00:00:03,620 --> 00:00:10,617 As our final digraph processing algorithm, we'll take a look at computing strong 2 00:00:10,617 --> 00:00:15,244 components. So the definition two vertices, v and w, 3 00:00:15,244 --> 00:00:21,758 in a digraph are strongly connected if there's a directed path from v to w and 4 00:00:21,758 --> 00:00:27,947 another directed path from w to v. And the thing about strongly connected is 5 00:00:27,947 --> 00:00:34,123 that it's unequivalence relation. That is each vertex is strongly connected to 6 00:00:34,123 --> 00:00:37,299 itself. If v is connected to w, strongly, and w is 7 00:00:37,299 --> 00:00:41,059 strongly connected to, to, then w is strongly connected to v. 8 00:00:41,059 --> 00:00:44,820 That just means that paths, directed paths connecting them. 9 00:00:45,046 --> 00:00:50,038 And it's also transitive. If v is connected to w and w to x,, then v 10 00:00:50,038 --> 00:00:55,029 is strongly connected to x. How to get from v to x? You go from w, v 11 00:00:55,029 --> 00:00:59,189 to w and w to x, And get from x to v, you go from x to w 12 00:00:59,189 --> 00:01:02,516 and w to v. So, that means that since it's an 13 00:01:02,516 --> 00:01:08,037 equivalence relation, it divides the digraph up into components, into sets 14 00:01:08,037 --> 00:01:14,238 called strongly connected components that have the property that there's directed 15 00:01:14,238 --> 00:01:18,020 paths connecting each pair of vertices in the set. 16 00:01:18,020 --> 00:01:23,742 So, for example nine and twelve are strongly connected, 17 00:01:23,742 --> 00:01:28,545 There's a path from twelve to nine, Another one from nine to twelve and so 18 00:01:28,545 --> 00:01:30,377 forth. So the, 19 00:01:30,377 --> 00:01:38,450 Our challenge is to compute the strong components in diagraphs. 20 00:01:38,679 --> 00:01:45,418 And it's worth comparing this to what we did for connected components in undirected 21 00:01:45,418 --> 00:01:48,328 graphs. So, this is just a quick review. 22 00:01:48,558 --> 00:01:54,607 In an undirected graph, if there's a path between two vertices that are connected, 23 00:01:54,607 --> 00:02:00,274 that's an equivalence relation, divides them up into connective components. 24 00:02:00,504 --> 00:02:07,013 And what we did was, our design pattern is to build our graph processing algorithm as 25 00:02:07,013 --> 00:02:13,767 a constructor that takes a graph and does the pre-processing to create a table like 26 00:02:13,767 --> 00:02:20,765 this one which assigns a unique ID to all the vertices in each given component. 27 00:02:20,765 --> 00:02:26,819 So, that we can have a constant time client query to check whether two given 28 00:02:26,819 --> 00:02:30,320 vertices are in the same connected component or not. 29 00:02:30,320 --> 00:02:36,227 So, linear time processing, pre-processing in the constructor to build the table, 30 00:02:36,227 --> 00:02:41,388 constant time client queries. And that's as good performance as we can 31 00:02:41,388 --> 00:02:45,875 expect to have for a graph for a graph processing algorithm. 32 00:02:46,099 --> 00:02:51,110 Remarkably, now, we're able to do the same thing for strong connectivity. 33 00:02:51,322 --> 00:02:57,339 It's a, it's a much more sophisticated algorithm but the design pattern and the 34 00:02:57,339 --> 00:03:03,316 bottom line for the client is the same. There's a constructor, it processes the 35 00:03:03,316 --> 00:03:07,707 graph in linear time. And it assigns a unique ID to each one of 36 00:03:07,707 --> 00:03:10,863 the strongly connected components in the graph. 37 00:03:10,863 --> 00:03:16,100 So, once in the component by itself, Zero, two, three, four, and five six, and 38 00:03:16,100 --> 00:03:21,539 eight, seven, and nine, ten, and twelve, There's five different strongly connected 39 00:03:21,539 --> 00:03:26,240 components in this graph. The constructor builds that array and the 40 00:03:26,240 --> 00:03:31,209 client gets to in constant time, know whether two vertices are strongly 41 00:03:31,209 --> 00:03:35,775 connected or not. It's quite amazing that we can solve this 42 00:03:35,775 --> 00:03:39,730 problem in this way. And as I'd mentioned it's got an 43 00:03:39,730 --> 00:03:42,730 interesting history that we'll talk about in a second. 44 00:03:42,730 --> 00:03:47,306 So, here's an example of strong component application. 45 00:03:47,552 --> 00:03:53,028 So, the food web graph, This is a very small subset of that graph. 46 00:03:53,273 --> 00:04:00,710 The vertices are all the, species and an edge goes from producer to consumer. 47 00:04:00,710 --> 00:04:05,859 So, if animal a eats animal b, there's an edge from a to b. 48 00:04:06,105 --> 00:04:13,133 And what a strong component corresponds to in this graph is a kind of a subset of 49 00:04:13,133 --> 00:04:18,942 species that, have a common energy flow. They naturally eat each other. 50 00:04:19,187 --> 00:04:24,021 And that's extremely important in ecological studies. 51 00:04:24,267 --> 00:04:31,395 Another example is again, in processing software big software is made up of 52 00:04:31,395 --> 00:04:35,327 modules. And the vertices are, are, are, are the 53 00:04:35,327 --> 00:04:39,423 modules. And edges is if one depends on another. 54 00:04:39,669 --> 00:04:46,797 And so a strong component in this graph is a subset of mutually interacting modules. 55 00:04:47,043 --> 00:04:53,283 And in a huge program like Internet Explorer you want to know what the strong 56 00:04:53,283 --> 00:04:58,391 components are so that you can package them together and maybe improve the 57 00:04:58,391 --> 00:05:01,081 design. So again, software you know, can, these 58 00:05:01,081 --> 00:05:05,616 graphs can be huge. And this kind of graph processing can be 59 00:05:05,616 --> 00:05:10,080 extremely important in improving the design of software. 60 00:05:11,651 --> 00:05:18,464 Now again this algorithm has an interesting history,, it along with 61 00:05:18,695 --> 00:05:25,233 scheduling and other things that's a core problem in operations research that was 62 00:05:25,464 --> 00:05:30,695 widely studied, but really not understood how difficult it was. 63 00:05:30,926 --> 00:05:37,772 In Tarjan's paper in 1972 was a big surprise that this could be solved in 64 00:05:38,003 --> 00:05:44,080 linear time with depth research. Now this algorithm had a couple of other 65 00:05:44,080 --> 00:05:49,587 data structures and this I guess, A diligent student in this class could 66 00:05:49,587 --> 00:05:55,413 understand it with quite a bit of work. But it really demonstrated the importance 67 00:05:55,413 --> 00:05:58,807 of depth-first search in great graph processing. 68 00:05:59,006 --> 00:06:04,529 Now, the algorithm that we're going to talk about today actually was invented in 69 00:06:04,529 --> 00:06:08,122 the 80s' by Kosaraju and, and independently by Sharir. 70 00:06:08,321 --> 00:06:14,043 The story goes that Kosaraju had to go lecture about Tarjan's algorithms and he 71 00:06:14,043 --> 00:06:18,169 forgot his notes. And he had taught it a number of times, he 72 00:06:18,169 --> 00:06:23,957 was trying to figure out what Tarjan's algorithm did and he developed this other 73 00:06:23,957 --> 00:06:27,138 algorithm that's extremely simple ro implement. 74 00:06:27,349 --> 00:06:30,810 That's what we're going to look at today, today. 75 00:06:30,810 --> 00:06:37,516 And actually in the 1990s Gabow and Mehlhorn and particularly, Melhorn had to 76 00:06:37,516 --> 00:06:44,136 implement this algorithm for a big software package and he found another 77 00:06:44,136 --> 00:06:49,972 simple linear time algorithm. Now, so, this story indicates even from 78 00:06:49,972 --> 00:06:57,289 fundamental problems in graph processing there's algorithms out there still waiting 79 00:06:57,289 --> 00:07:02,690 to be discovered and this, this algorithm is a good example of that. 80 00:07:02,690 --> 00:07:09,230 So we give the intuition in the code and you see how the algorithm works fully 81 00:07:09,230 --> 00:07:15,250 convincing yourself or proving why it works is a bit more of a challenge. 82 00:07:15,250 --> 00:07:19,762 Now, we'll leave that mostly for the book. But let's describe what it is. 83 00:07:19,992 --> 00:07:23,900 So, the first idea is to think about the reverse graph. 84 00:07:24,130 --> 00:07:30,261 So if we take a graph and we reverse the sense of all the edges we're going to get 85 00:07:30,261 --> 00:07:34,859 the same, strong components. We need edges in both directions. 86 00:07:34,859 --> 00:07:40,300 So, if we switch the directions, we're still going to have edges in both 87 00:07:40,300 --> 00:07:43,672 directions. The, the second concept is called the, 88 00:07:43,672 --> 00:07:47,350 the, the kernel DAG. And what that does, is it kind of, you 89 00:07:47,350 --> 00:07:53,719 think about contracting each strong component into a single vertex. 90 00:07:54,115 --> 00:08:01,250 And then, just to worry about the edges that go from one strong component to 91 00:08:01,250 --> 00:08:06,422 another. So the digraph that you get that way, 92 00:08:06,422 --> 00:08:11,723 turns out to be a cyclic. If it was cyclic, it would involve, it 93 00:08:11,723 --> 00:08:19,161 would create a bigger strong component, a couple of strong components into one. 94 00:08:19,418 --> 00:08:25,830 So if we think about that processing that kernel DAG, that's the edges that go 95 00:08:25,830 --> 00:08:33,354 between strong components and we get a DAG and we know how to deal with a DAG So, the 96 00:08:33,354 --> 00:08:40,305 idea of the algorithm is to go ahead and compute the topological order or the 97 00:08:40,305 --> 00:08:46,669 reverse post order in the kernel DAG. At least put up the edges of the original 98 00:08:46,669 --> 00:08:53,257 digraph in order so that all the edges in the kernel DAG point from right to left, 99 00:08:53,257 --> 00:08:58,648 That's like a topological sort. And then we run DFS but instead of 100 00:08:58,648 --> 00:09:04,787 considering the vertices in, in numerical order for the DFS, we consider them in 101 00:09:04,787 --> 00:09:10,851 that reverse topological order. So it's extremely easy to implement this 102 00:09:10,851 --> 00:09:16,000 algorithm. Of course, we're going to use DFS. 103 00:09:16,289 --> 00:09:24,452 So let's look at a demo at how the algorithm works., okay, so, this is a 104 00:09:24,452 --> 00:09:29,328 digraph, and our goal is to compute the strong components. 105 00:09:29,328 --> 00:09:33,446 And we're going to do it in two phases, two DFSs. 106 00:09:33,783 --> 00:09:39,330 One is to compute the reverse post order and the reverse of the graph. 107 00:09:39,667 --> 00:09:46,139 And the other is to run the DFS but for the other in which we visit the vertices, 108 00:09:46,139 --> 00:09:51,518 we use that reverse post order that we've computed. so here goes. 109 00:09:51,518 --> 00:09:55,553 So, first, we'll do the DFS and the reverse graph. 110 00:09:55,805 --> 00:09:59,141 So that's the graph, that's the reverse graph. 111 00:09:59,141 --> 00:10:02,377 Remember, these two has the same strong component. 112 00:10:02,579 --> 00:10:06,692 So, there's our marked array. We will do the DFS, and reverse post order 113 00:10:06,692 --> 00:10:10,670 means that when we are done with the vertex, we'll put it out. 114 00:10:11,118 --> 00:10:21,165 So checked six unmarked, checked eight unmarked checked six, it's marked. so 115 00:10:21,165 --> 00:10:27,594 we're done with a answer, that's the reverse post order. and again, as before, 116 00:10:27,594 --> 00:10:34,481 put it on a stack and but we'll just list them in reverse order in this demo. 117 00:10:34,715 --> 00:10:38,782 So, eight is done. So, six lots of places to go from six. 118 00:10:39,017 --> 00:10:43,479 So, let's check seven. That's unmarked, so we go to seven. 119 00:10:43,479 --> 00:10:47,366 No place to go from seven, so we're done with seven. 120 00:10:47,366 --> 00:10:52,624 We put it on the reverse post order list. We're also done with six. 121 00:10:52,853 --> 00:10:58,568 Cuz four and nine are coming in, in this example, so I put it on the list. 122 00:10:58,797 --> 00:11:02,836 Next place to go from zero is two, so we checked two 123 00:11:03,065 --> 00:11:09,618 That's unmarked so we mark it and recurse and we checked the vertices adjacent to 124 00:11:09,618 --> 00:11:12,210 two. First, four, that's unmarked. 125 00:11:12,415 --> 00:11:17,356 So, then we recurse and go to four. We gotta got to, to eleven first and 126 00:11:17,356 --> 00:11:22,639 recurse so now we're at eleven. And, and from eleven, we go to nine, which 127 00:11:22,639 --> 00:11:26,688 is unmarked. So we have a pretty long recursive stack 128 00:11:26,688 --> 00:11:30,599 right here. So, from nine, there's we have to check a 129 00:11:30,599 --> 00:11:35,128 bunch of things, we'll check twelve first and then visit twelve. 130 00:11:35,128 --> 00:11:40,343 It's unmarked from twelve, we checked eleven which is marked, nowhere to go. 131 00:11:40,549 --> 00:11:43,980 Then, we checked ten. That's unmarked, so we go to ten. 132 00:11:44,203 --> 00:11:49,126 Visit ten it's, it's unmarked so we recurse and then go to nine. 133 00:11:49,350 --> 00:11:53,527 And that's marked and that's everywhere we get from ten. 134 00:11:53,527 --> 00:11:59,569 So, we're done with ten, so we put it on the list and return. And now, where done 135 00:11:59,569 --> 00:12:03,150 with twelve, so we put it on the list and return. 136 00:12:03,150 --> 00:12:07,947 And now, at nine, we have to check seven, which is marked, and six, which is marked. 137 00:12:07,947 --> 00:12:10,680 And then, we're done so we put it on the list. 138 00:12:10,879 --> 00:12:14,197 Then, we're done with eleven we put it on the list. 139 00:12:14,396 --> 00:12:19,042 Then we're finished with four, so we checked six which is marked, then we 140 00:12:19,042 --> 00:12:22,560 checked five, which is unmarked so we recurse to five. 141 00:12:22,759 --> 00:12:26,475 From five, we checked three which is unmarked so we recurse. 142 00:12:26,475 --> 00:12:30,458 Then, we checked four which is mark, we checked two which is mark. 143 00:12:30,458 --> 00:12:34,373 And then we're done with three, so we put it on the list. 144 00:12:34,573 --> 00:12:40,413 And then from five we checked zero it's marked so we're done with five, we put it 145 00:12:40,413 --> 00:12:44,665 on the list. And that means that we're now done with 146 00:12:44,665 --> 00:12:48,281 four. And then, we're also done with two, after 147 00:12:48,281 --> 00:12:52,781 checking three. And then, we go to zero and put it on the 148 00:12:52,781 --> 00:12:55,754 list. So, that's all the vertices that, you 149 00:12:55,754 --> 00:13:00,817 know, you get through from zero. So, we look for more vertices and it's one 150 00:13:00,817 --> 00:13:06,040 and it's marked so that's the last one in the reverse post order. 151 00:13:06,040 --> 00:13:10,219 So, that's a reverse post order of the reverse graph. 152 00:13:10,540 --> 00:13:16,584 And all we're going to do is take that order and use that order to check vertices 153 00:13:16,584 --> 00:13:20,626 at the top level in the depth-first search of our regular graph. 154 00:13:20,815 --> 00:13:24,920 We have to check all the other vertices to make sure we are done. 155 00:13:24,920 --> 00:13:29,446 So, that's phase one. So now, we'll just do a DFS in the 156 00:13:29,446 --> 00:13:35,454 original graph, using that order that we just computed. So, we don't start with 157 00:13:35,454 --> 00:13:41,697 zero, the way we always have, now we start with one. we visit one, its unmarked and 158 00:13:41,697 --> 00:13:47,939 now all the vertices that we visit during that DFS are going to be in that same 159 00:13:47,939 --> 00:01:32,591 strong component, that's the theorem that makes this algorithm work, in this case, 160 00:13:53,713 --> 00:13:59,593 this is only the one, so vertex one is its own, is in its own strong component with 161 00:13:59,593 --> 00:14:03,118 label zero. So now, we've got to start, once all done, 162 00:14:03,118 --> 00:14:08,123 so now, we have to start with looking for another vertex to search from. 163 00:14:08,334 --> 00:14:11,295 In this case, it's zero that's second on the list. 164 00:14:11,337 --> 00:14:13,762 So that's where we start, With zero. 165 00:14:13,762 --> 00:14:19,260 And now, all the vertices that we can reach from zero are going to have strong, 166 00:14:19,260 --> 00:14:23,420 in this graph, are going to have strong component label one. 167 00:14:23,668 --> 00:14:27,060 So, we do the DFS. So, first, we get to five. 168 00:14:27,060 --> 00:14:31,445 That's in the same strong component we checked four. 169 00:14:31,693 --> 00:14:34,671 And it's unmarked. So, we label it. 170 00:14:34,671 --> 00:14:39,801 We checked three, it's unmarked. We checked five, which is marked. 171 00:14:39,801 --> 00:14:45,660 We checked two, which is unmarked. So now we have shown that zero, two, 172 00:14:45,660 --> 00:14:49,670 three, four, and five are all in the same strong component. 173 00:14:49,881 --> 00:14:55,229 And now we're going to find both vertices we can get three from two or mark. 174 00:14:55,229 --> 00:14:59,169 So we're done with two then we're done with three. 175 00:14:59,380 --> 00:15:02,335 Four, we have to check two, which is marked. 176 00:15:02,335 --> 00:15:08,175 And then, we're done with four and five. From zero we can check one but that's 177 00:15:08,175 --> 00:15:14,527 already marked so that's not relevant for this search and then we're done with zero. 178 00:15:14,527 --> 00:15:19,508 And we've established that zero, two, three, four, and five are all in the same 179 00:15:19,508 --> 00:15:22,419 strong component. , So that's the second one. 180 00:15:22,640 --> 00:15:29,189 So now, we continue and we check all of those and they're all marked. so, the next 181 00:15:29,189 --> 00:15:35,002 vertex in the reverse post order of the reverse graph is eleven so we visit 182 00:15:35,002 --> 00:15:36,547 eleven. Checked four, 183 00:15:36,547 --> 00:15:39,196 It's already marked. Checked twelve, 184 00:15:39,196 --> 00:15:44,126 It's not so aj. we mark it, and these are the third strong component. 185 00:15:44,126 --> 00:15:49,332 They get labeled with two. Then, we get to nine. from nine, we check 186 00:15:49,332 --> 00:15:54,532 eleven and nowhere to go. Then, we checked ten, and so that's seems a strong 187 00:15:54,532 --> 00:16:00,441 component. from ten, we checked twelve, which is marked. We're done with ten, were 188 00:16:00,441 --> 00:16:06,744 done with nine, were done with twelve, and were done with eleven. so, that's our 189 00:16:06,744 --> 00:16:11,550 second strong or third strong component nine, ten, eleven and twelve. 190 00:16:11,797 --> 00:16:18,970 So we're done with those and now we go through the list to find another unmarked 191 00:16:18,970 --> 00:16:21,856 vertex. Nine is marked, twelve is marked, ten is 192 00:16:21,856 --> 00:16:29,441 marked. we get to six from six nine is already marked. four is four is already 193 00:16:29,441 --> 00:16:33,069 marked. Eight is not marked so we go there. from 194 00:16:33,069 --> 00:16:39,457 eight, we get to six and that's it. Zero is already marked so we can only get 195 00:16:39,457 --> 00:16:44,072 to eight from six at this point, and so that's a strong component. 196 00:16:44,285 --> 00:16:47,480 And then finally, we finish up by doing seven. 197 00:16:47,480 --> 00:16:54,704 So the end of the computation gives us the strong component array, which is a unique 198 00:16:54,704 --> 00:17:01,842 ID for each of the vertices so that two vertices in the same strong component have 199 00:17:01,842 --> 00:17:06,228 the same ID. And that supports constant time strong 200 00:17:06,228 --> 00:17:11,748 connectivity checks for clients. And so, the bottom line is that this 201 00:17:11,979 --> 00:17:18,053 algorithm is a very simple even though it might be mysterious algorithm for 202 00:17:18,053 --> 00:17:23,359 computing the strong component. First, run DFS on the reverse graph to 203 00:17:23,359 --> 00:17:28,435 compute the reverse post order. Then, run DFS on the original graph 204 00:17:28,435 --> 00:17:33,125 considering the vertices in the order given by the first DFS. 205 00:17:33,356 --> 00:17:39,746 And so these, these diagrams give a summary of the computation. 206 00:17:39,981 --> 00:17:46,703 I'm not going to spend too much time explaining why and how it works in this 207 00:17:46,703 --> 00:17:50,923 lecture. But these kinds of diagrams give the 208 00:17:50,923 --> 00:17:57,880 detail that can give some intuition about how and why the algorithm works. 209 00:17:58,114 --> 00:18:06,321 The for the proof it requires some mathematical sophistication and find it in 210 00:18:06,321 --> 00:18:10,951 the book. But the programming of it is quite simple 211 00:18:10,951 --> 00:18:17,549 and proved that it's efficient is and all it does is run DFS twice but to really see 212 00:18:17,549 --> 00:18:23,290 the correctness proof you want to look a the second printing of the textbook. we 213 00:18:23,290 --> 00:18:28,272 got it wrong in the first printing. And, but look at the implementation so, 214 00:18:28,272 --> 00:18:33,748 this is connected components and in the undirected graph with DFS that we did last 215 00:18:33,748 --> 00:18:39,354 time and I'm sure many of you thought that it was one of the simplest algorithms that 216 00:18:39,354 --> 00:18:42,483 we looked at. We just marked the vertices and do 217 00:18:42,483 --> 00:18:49,383 recursive and that's the end of it. If you take that code and just had I'll 218 00:18:49,383 --> 00:18:56,420 change the names. And then just instead of going for the 219 00:18:56,420 --> 00:19:04,753 vertices from zero through v minus one if you first compute the depth-first order 220 00:19:04,753 --> 00:19:12,599 that you get by running doing that first search of the reverse graph and then you 221 00:19:12,599 --> 00:19:18,522 iterate through the vertices in that order, you get an algorithm for strong 222 00:19:18,522 --> 00:19:22,864 connectivity. It's really remarkable that just changing 223 00:19:22,864 --> 00:19:28,155 that one line will perform this computation that was thought to be 224 00:19:28,155 --> 00:07:20,012 difficult for,, for many years, and, and people were learning quite difficult codes 225 00:02:26,670 --> 00:19:39,255 for, for many, many years. So, that's a quick summary of digraph 226 00:19:39,255 --> 00:19:43,945 processing. We talked about single source reachability 227 00:19:44,177 --> 00:19:50,288 getting the paths from a vertex to any vertex that can be reached from that 228 00:19:50,288 --> 00:19:56,012 vertex with a directed path. We talked about topological sort in graphs 229 00:19:56,012 --> 00:20:02,045 that have no cycles and that uses DFS. And we talked about computing strong 230 00:20:02,045 --> 00:20:08,079 components in a digraph with two DFSs. Digraph processing is really h, a 231 00:20:08,079 --> 00:20:14,400 testimony to the ingenuity that's possible in algorithm and graph processing.