1 00:00:03,740 --> 00:00:08,502 Next we're going to look at a slightly different graph processing application. 2 00:00:08,693 --> 00:00:13,328 But it's a, a, a graph processing algorithm that's useful in many 3 00:00:13,328 --> 00:00:18,535 applications we'll see in a minute. And slightly different than, depth and 4 00:00:18,535 --> 00:00:21,901 breadth first search. It's called computing connected 5 00:00:21,901 --> 00:00:25,688 components. Now I mentioned this a little bit when we 6 00:00:25,688 --> 00:00:31,043 talked about basic definitions. So the idea is that if there's a path 7 00:00:31,043 --> 00:00:34,406 between two vertices we say they're connected. 8 00:00:34,406 --> 00:00:41,262 And what we want to do is reprocess the graph that is, build a data type that can 9 00:00:41,262 --> 00:00:47,084 answer queries of the form, is V connected to W in constant time. 10 00:00:47,084 --> 00:00:53,906 Now, we want to be able to do that for a huge, sparse graph of the type that 11 00:00:53,906 --> 00:00:59,364 appears in practice. So we can't use the, if we could use the 12 00:00:59,364 --> 00:01:05,550 adjacency matrix data structure, maybe we could do that but we can't. 13 00:01:05,550 --> 00:01:11,104 So we're going to build a class that uses our standard representation, that will 14 00:01:11,104 --> 00:01:14,228 enable clients to find connective components. 15 00:01:14,228 --> 00:01:17,561 It's really interesting to think about this one. 16 00:01:17,561 --> 00:01:23,185 We're getting the job done that we could get done if we get a huge, sparse matrix. 17 00:01:23,185 --> 00:01:28,183 But if we have billions of vertices, there's no way we can have billions 18 00:01:28,183 --> 00:01:32,488 squared in the matrix. So we have to find another way to do it. 19 00:01:32,488 --> 00:01:35,960 So here's the data type that we want to implement. 20 00:01:35,960 --> 00:01:42,360 So it's called and it's going to the constructor is going to build the data 21 00:01:42,360 --> 00:01:48,597 structure that finds the connected components in the given graph to be able 22 00:01:48,597 --> 00:01:52,618 to efficiently answer these connectivity queries. 23 00:01:52,618 --> 00:01:57,870 It'll also be able to count the number of connective components. 24 00:01:57,870 --> 00:02:04,352 And it also assigns an identifier from zero to count minus one, that identifies 25 00:02:04,352 --> 00:02:08,373 the connective component that every vertex is in. 26 00:02:08,878 --> 00:02:16,621 Now if you maybe, this sounds a little bit similar to what we did for the union sign 27 00:02:16,621 --> 00:02:20,517 problem. So when you union find problem we're, 28 00:02:20,517 --> 00:02:26,336 we're taking edges and we wanted to have queries that, that, well, well union is 29 00:02:26,336 --> 00:02:32,229 like adding edge to the graph and then find is like are these things connected 30 00:02:32,229 --> 00:02:35,511 yet. Now with union find, we found that we 31 00:02:35,511 --> 00:02:39,166 couldn't quite answer the thing in constant time. 32 00:02:39,166 --> 00:02:45,134 Are members are very slow growing on questions constant practical terms, but 33 00:02:45,134 --> 00:02:49,300 not, not really. So it's less efficient than the algorithm 34 00:02:49,484 --> 00:02:53,710 that we're going to talk about cause it doesn't quite get constant time. 35 00:02:53,710 --> 00:02:58,733 On the other hand in another way it's better than the algorithm we're going to 36 00:02:58,733 --> 00:03:02,347 talk about because you can either mix the unions and fines. 37 00:03:02,347 --> 00:03:06,207 And in this case, because we're working with the graph it's like we're taking all 38 00:03:06,207 --> 00:03:09,270 the unions and then we are handling fine requests. 39 00:03:09,456 --> 00:03:14,551 So anyway what we're going to use to implement this is a depth first search 40 00:03:14,551 --> 00:03:17,844 approach. So we'll do the depth first search, we'll 41 00:03:17,844 --> 00:03:22,380 just keep different data then we did, than when we were findings paths. 42 00:03:22,685 --> 00:03:31,439 So the algorithm is based on the notion that connection is an equivalence 43 00:03:31,439 --> 00:03:36,834 relation. So recall an equivalence relation has 44 00:03:36,834 --> 00:03:42,840 these three properties. Every vertex is connected to itself. 45 00:03:42,840 --> 00:03:46,724 If v is connected to w then w is connected to v. 46 00:03:46,724 --> 00:03:52,065 And if v is connected to w and w to x then v is connected to x. 47 00:03:52,065 --> 00:03:58,376 And those basic operations underlie the algorithm that we're going to talk about. 48 00:03:58,619 --> 00:04:05,578 So the equivalence relation is a, a general mathematical concept that implies, 49 00:04:05,821 --> 00:04:11,809 in graph theory in this case. And as I already mentioned, in the case of 50 00:04:11,809 --> 00:04:16,773 graph, it implies that. The vertices divide up into connected 51 00:04:16,773 --> 00:04:21,121 components which are maximal sets of connected vertices. 52 00:04:21,121 --> 00:04:25,082 So our sample graph has three connected components. 53 00:04:25,082 --> 00:04:31,759 And what we'll do is assign identifiers to each one of the components in that will 54 00:04:31,759 --> 00:04:36,565 for every vertex. Give us an identifier number for that 55 00:04:36,565 --> 00:04:40,198 vertex. And that's the data structure then our 56 00:04:40,198 --> 00:04:46,131 depth for search is going to built, and that immediately gives the constant time 57 00:04:46,131 --> 00:04:49,320 implementation of the connectivity queries. 58 00:04:49,320 --> 00:04:55,303 Given two vertices you look up their ID and if they're equal, they're in the same 59 00:04:55,303 --> 00:04:59,155 connected component, if they're different they're not. 60 00:04:59,361 --> 00:05:05,345 So that's the data structure that we're going to build it's like a unifying tree, 61 00:05:05,345 --> 00:05:08,440 where there's just, all the trees are flat. 62 00:05:09,406 --> 00:05:14,352 So that's where I just said, given connected components, we can answer 63 00:05:14,352 --> 00:05:20,505 queries in constant time. So here's a big graph, a big grid graph 64 00:05:20,801 --> 00:05:29,672 that we use in when we're talking about union find And turns out that this one's 65 00:05:29,672 --> 00:05:36,168 got 63 connected components. And again when you really think about it 66 00:05:36,377 --> 00:05:42,645 it's kind of amazing that we can do this computation in linear time even for a huge 67 00:05:42,645 --> 00:05:46,057 graph. And it's really important to be able to do 68 00:05:46,057 --> 00:05:51,071 so for the very huge graph that we talked about in so many applications. 69 00:05:51,280 --> 00:05:56,921 If you can process all the edges you can also find out the connective components 70 00:05:56,921 --> 00:06:00,821 and be set up to answer connect, connectivity queries. 71 00:06:00,821 --> 00:06:04,930 This is a simple algorithm but really it's ingenious. 72 00:06:04,930 --> 00:06:08,774 Okay. So let's look at the implementation. 73 00:06:08,774 --> 00:06:14,962 So our goal is to petition the vertices into connected components. 74 00:06:14,962 --> 00:06:21,097 So we're going to use DFS in marking. And so what we're going to do is for a 75 00:06:21,097 --> 00:06:25,237 general graph. For every unmarked vertex, we'rere going 76 00:06:25,237 --> 00:06:30,453 to run DFS to find all the vertices that are connected to that one. 77 00:06:30,453 --> 00:06:33,848 They are going to be part of the same component. 78 00:06:34,074 --> 00:06:40,485 So we used the marking to help control the DFS but also to control the connective 79 00:06:40,485 --> 00:06:47,425 components that the vertices that already have been processed and known to be in a 80 00:06:47,425 --> 00:06:52,856 given connective component. So let's look at a demo to understand how 81 00:06:52,856 --> 00:06:57,942 this algorithm works. So again, we're going to use depth for 82 00:06:57,942 --> 00:07:02,742 search and there, summary of depth for search, divisative vertex, we mark it as 83 00:07:02,742 --> 00:07:07,730 visitive and then recursively visit all the unmarked vertices that are adjacent. 84 00:07:08,516 --> 00:07:15,018 So in this case so we'll start we're going to visit all the vertices in the graph in 85 00:07:15,018 --> 00:07:20,859 order to identify the connected component. So we'll start by visiting zero. 86 00:07:21,102 --> 00:07:27,350 To visit zero, we have to check six, two, one, and five so we start by checking six. 87 00:07:27,585 --> 00:07:32,853 We mark it as visited. So that's entry six in the ray and now 88 00:07:32,853 --> 00:07:38,751 we're going to keep this other vertex indexed array which is the ID the 89 00:07:38,751 --> 00:07:44,569 connected component number. So or we're saying by putting a zero and 90 00:07:44,569 --> 00:07:49,916 that entry is that six and zero are in the same connected component. 91 00:07:49,916 --> 00:07:56,599 Every vertex that we encounter during the depth per search from zero we're going to 92 00:07:56,835 --> 00:08:02,661 assign a value of zero. Okay so, so what do we have to do to visit 93 00:08:02,661 --> 00:08:06,778 six. We have to check zero and then we have to 94 00:08:06,778 --> 00:08:14,004 check four so and four is unmarked so we're going to recourse and visit four. 95 00:08:14,004 --> 00:08:18,710 To visit four, we have to check five, five is unmarked. 96 00:08:18,952 --> 00:08:25,082 So we recourse to visit five. And to visit five we have to check three, 97 00:08:25,082 --> 00:08:27,664 four, and zero. Three is unmarked. 98 00:08:27,906 --> 00:08:31,940 This is same depth per search that we did before. 99 00:08:31,940 --> 00:08:37,481 But now we're just keeping track of the connective component number, and we're 100 00:08:37,481 --> 00:08:42,880 assigning every vertex that we encounter with the same idea as zero. 101 00:08:43,906 --> 00:08:49,764 Okay so, now, we have to visit three. To visit three we have to check five and 102 00:08:49,764 --> 00:08:52,451 four. Five was marked and nothing to do. 103 00:08:52,451 --> 00:08:56,448 Four was marked, nothing to do. So we're done with three. 104 00:08:56,655 --> 00:09:01,134 Done with three, we can continue the depth per search from five. 105 00:09:01,134 --> 00:09:04,373 We have to check four and zero. Four was marked. 106 00:09:04,580 --> 00:09:06,854 Zero was marked. So we're done. 107 00:09:07,061 --> 00:09:11,333 And now we could continue with the depth per search of four. 108 00:09:11,333 --> 00:09:14,625 We have to check six- three. Six was marked. 109 00:09:14,625 --> 00:09:18,259 Three was marked and we're done. And we can, 110 00:09:18,259 --> 00:09:23,077 Now six is done. And now we can continue with zero and we 111 00:09:23,077 --> 00:09:26,796 have to check two. Check two it's not marked. 112 00:09:26,796 --> 00:09:32,122 So we mark it and give it a connected component number of zero. 113 00:09:32,122 --> 00:09:37,785 To visit two, all we do is check zero which is marked, so we're done. 114 00:09:37,785 --> 00:09:41,420 And then we do the same thing with one, Unmarked. 115 00:09:41,420 --> 00:09:46,092 So we visit it, I guess assign it zero. And then visit one. 116 00:09:46,301 --> 00:09:49,439 And to do that, we check zero, which is marked. 117 00:09:49,439 --> 00:09:54,321 So we're done with one. And then, to finish zero we have to check 118 00:09:54,321 --> 00:09:57,738 five. And that's, marked so we don't have to do 119 00:09:57,738 --> 00:10:00,110 anything. And we're done with zero. 120 00:10:00,110 --> 00:10:03,936 So, now we're done with the first connected component. 121 00:10:03,936 --> 00:10:10,000 But, we're not done with the whole graph. So what we want to do is, Go look for, so 122 00:10:10,000 --> 00:10:13,898 that's a connected component, corresponding to zero. 123 00:10:14,115 --> 00:10:18,879 Now we want, what we want to do is go look for an unmarked vertex. 124 00:10:19,096 --> 00:10:23,933 So, we started at zero and we check one, two, three, four, five, and six. 125 00:10:24,149 --> 00:10:29,323 And they're all marked. And so the next unmarked vertex that we 126 00:10:29,323 --> 00:10:36,921 find in the graph is seven. So once we return from the depth first 127 00:10:36,921 --> 00:10:42,742 search for zero, we Increment, counter. Which is our, how many connected 128 00:10:42,742 --> 00:10:46,370 components have we seen. And now, we're going to assign that number 129 00:10:46,370 --> 00:10:50,690 to everything that's connected to seven on the depth-first search on seven. 130 00:10:50,690 --> 00:10:55,239 So what do we do to depth per research and sum we check eight. 131 00:10:55,239 --> 00:10:57,819 8s on the mark. And so we go visit it. 132 00:10:58,023 --> 00:11:02,708 So we assign it, connect the component of number of one same as seven. 133 00:11:02,911 --> 00:11:06,850 And mark it. And then go ahead and recourse to visit 134 00:11:06,850 --> 00:11:09,362 eight. We check seven which is marked. 135 00:11:09,362 --> 00:11:12,621 So there is nothing to do. We're done with eight. 136 00:11:12,825 --> 00:11:17,578 And then we're done with seven. So now we're done with all the vertices 137 00:11:17,578 --> 00:11:22,014 that were connected to seven. We increment our counter of number of 138 00:11:22,014 --> 00:11:25,060 components to two, and look for another vertex. 139 00:11:25,060 --> 00:11:30,288 So now we check eight, which we already know is marked, connected to seven. 140 00:11:30,288 --> 00:11:34,657 So now nine is unmarked so we're going to do a DFS from nine. 141 00:11:34,657 --> 00:11:40,745 So everybody connected to nine is going to get assigned a connected component number 142 00:11:40,745 --> 00:11:44,040 of two. So at nine, we have to check eleven and 143 00:11:44,040 --> 00:11:46,285 that was unmarked. So we visit it. 144 00:11:46,476 --> 00:11:50,169 Give it a two. To visit eleven, we have to check nine, 145 00:11:50,169 --> 00:11:52,270 which is marked. So nothing to do. 146 00:11:52,270 --> 00:11:56,791 And twelve, which is unmarked. So we visit it and, give it a number of 147 00:11:56,791 --> 00:11:59,593 two. To visit twelve, we have to check eleven, 148 00:11:59,593 --> 00:12:02,522 which is marked and nine which is also marked. 149 00:12:02,522 --> 00:12:07,426 And then we're done with twelve. And then we're done with eleven. 150 00:12:07,652 --> 00:12:12,330 And then, to finish doing nine, we have to check ten and twelve. 151 00:12:12,565 --> 00:12:17,123 Ten is unmarked. So we mark it and give it a number of two. 152 00:12:17,359 --> 00:12:21,838 To visit ten, we check nine which is marked, so we're done. 153 00:12:21,838 --> 00:12:28,438 And then finally, to finish, the DFS. We check, twelve from nine, and that's 154 00:12:28,438 --> 00:12:31,110 marked. So we're done with nine. 155 00:12:31,110 --> 00:12:36,532 And now we keep looking. And we find that, ten, eleven, twelve, are 156 00:12:36,532 --> 00:12:39,440 all marked. So we've completed the. 157 00:12:39,635 --> 00:12:43,082 Computation. And for every vertex we have a connected 158 00:12:43,082 --> 00:12:46,530 component number. And for any given query we can test 159 00:12:46,530 --> 00:12:51,668 whether their in the same connected component simply by looking up that number 160 00:12:51,668 --> 00:12:57,180 and seeing if it's equal. That's a demo of connected components 161 00:12:57,180 --> 00:13:01,440 computation. Okay, so here's the code for finding 162 00:13:01,440 --> 00:13:07,624 connected components with DFS. Which is, another straightforward DFS 163 00:13:07,624 --> 00:13:11,335 implementation. Just like the other one. 164 00:13:11,335 --> 00:13:16,724 And it just keeps, slightly, different, data structure. 165 00:13:16,989 --> 00:13:25,785 So, The We keep the marked data structure, which is the vertices that we visited. 166 00:13:25,785 --> 00:13:32,410 And then we keep this vertex index array ID which gives the identifier, the 167 00:13:32,410 --> 00:13:37,887 component containing V, I think we call it on the demo. 168 00:13:37,887 --> 00:13:43,276 And then a count of the number of components that we've seen. 169 00:13:43,541 --> 00:13:50,469 So the constructor creates, the, marked array and it creates this idea array. 170 00:13:50,701 --> 00:13:55,719 But now the constructor does more work than a single call on DFS. 171 00:13:55,950 --> 00:14:00,274 What it does is it goes through, this is the constructor. 172 00:14:00,274 --> 00:14:04,520 Goes through every vertex in the array in the graph. 173 00:14:04,755 --> 00:14:10,809 And if it's not marked, it does the DFS. And that DFS will mark a lot of other 174 00:14:10,809 --> 00:14:15,133 vertices. But when it's done that's all of those are 175 00:14:15,133 --> 00:14:21,344 going to get assigned a value of count. And we're going to increment count, then 176 00:14:21,344 --> 00:14:27,791 go and look for another unmarked vertex. Anything that wasn't marked by that first 177 00:14:27,791 --> 00:14:33,844 DFX, DFS, we'll do a DFS from that one, and mark all its vertices with the next 178 00:14:33,844 --> 00:14:38,865 value for the ID. So now let's look at the implementation of 179 00:14:38,865 --> 00:14:44,898 DFS. It's recursive array just like the one 180 00:14:44,898 --> 00:14:48,963 that we did. For, for path finding. 181 00:14:49,244 --> 00:14:53,651 Except all that we do, when we mark a vertex. 182 00:14:53,932 --> 00:14:59,464 We also simply set its ID to the current component name. 183 00:14:59,464 --> 00:15:07,060 So all the vertices that are discovered in the same call of DFS have the same ID. 184 00:15:07,060 --> 00:15:13,070 And to visit a vertex you go through all its adjacent vertices. 185 00:15:13,302 --> 00:15:18,040 And any that are not marked you give a recursive DFS call. 186 00:15:18,040 --> 00:15:22,311 Again, this code is amazingly compact and elegant. 187 00:15:22,544 --> 00:15:29,300 When we're going through the demo step by step maybe you, you can see that the 188 00:15:29,300 --> 00:15:33,726 underlying computation is actually kind of complex. 189 00:15:33,726 --> 00:15:39,007 But, but, recursion and the graph-processing API that we set up 190 00:15:39,240 --> 00:15:43,900 provides a compact and easy to understand implementation. 191 00:15:43,900 --> 00:15:50,176 So that's using DFS to find connector components and then to return the idea of 192 00:15:50,176 --> 00:15:56,088 a given vertex, just look at up in the array and to return then our components 193 00:15:56,088 --> 00:16:02,657 just you can count and then you can build up the needed connectivity API from those. 194 00:16:02,876 --> 00:16:07,110 So that's depth first search to find connected components. 195 00:16:07,110 --> 00:16:14,405 I will just talk briefly about two applications from scientific applications. 196 00:16:14,678 --> 00:16:21,518 So here's an application of sexually transmitted diseases at a high school. 197 00:16:21,518 --> 00:16:27,628 And, simply the vertices are people, blue are men and pink are women. 198 00:16:27,901 --> 00:16:33,979 And you I have an edge between if there was a contact. 199 00:16:33,979 --> 00:16:40,568 And so it's obvious that you're going to be interested in connective components of 200 00:16:40,568 --> 00:16:44,457 this graph. To be able to properly study sexually 201 00:16:44,457 --> 00:16:49,141 transmitted diseases. These individuals had no contact with 202 00:16:49,141 --> 00:16:52,713 these. Then, the, whichever one has the disease, 203 00:16:52,713 --> 00:16:57,635 maybe it won't spread. Or if you add a new edge, then maybe you 204 00:16:57,635 --> 00:17:02,001 have a problem. That's just one example of studying the 205 00:17:02,001 --> 00:17:08,367 spread of disease. Here's another example that we use, for, 206 00:17:08,367 --> 00:17:14,437 for that's similar to the flood fill example. 207 00:17:14,726 --> 00:17:20,790 And this is processing data from a scientific experiment. 208 00:17:21,078 --> 00:17:26,836 And in, in this case this image is comes from a photograph. 209 00:17:26,836 --> 00:17:30,986 And the white things are particles that are moving. 210 00:17:30,986 --> 00:17:35,542 And all we get is a image where it's a grey scale image. 211 00:17:35,542 --> 00:17:41,237 And so, what we'll do to do this processing is, we want to identify the 212 00:17:41,237 --> 00:17:47,340 movement of these particles over time. And the way we do it is build a grid 213 00:17:47,340 --> 00:17:53,849 graph, like the one for the flood fill application and do an edge connecting two 214 00:17:53,849 --> 00:17:56,290 vertices. If they're different, 215 00:17:56,290 --> 00:18:00,290 He said their gray scale values is greater than less than some threshold. 216 00:18:00,290 --> 00:18:06,228 And so then if you do that, and then find the connective components, then you can 217 00:18:06,228 --> 00:18:11,351 identify blobs which correspond to real particles in this simulation. 218 00:18:11,351 --> 00:18:17,364 And they do that every frame in a movie, then you can track moving particles over 219 00:18:17,364 --> 00:18:20,853 time. So these are maybe fairly high resolution 220 00:18:20,853 --> 00:18:24,268 images. These are graphs with lots and lots of 221 00:18:24,268 --> 00:18:30,281 edges, and you need to be able to do this computation quickly in order to do this 222 00:18:30,281 --> 00:18:34,290 scientific experiment. And we'd use this as an example. 223 00:18:34,290 --> 00:18:40,380 Example in our first year programming course it is based on computing connected 224 00:18:40,380 --> 00:18:45,588 components using depth-first search. So that's our third example of a graph 225 00:18:45,588 --> 00:18:47,020 processing algorithm.