1 00:00:01,310 --> 00:00:02,300 So what's the problem? 2 00:00:02,300 --> 00:00:06,086 Well, I did say most of the stuff about graph search really doesn't matter, 3 00:00:06,086 --> 00:00:09,245 undirected or directed, it's pretty much cosmetic changes. 4 00:00:09,245 --> 00:00:11,699 But the big exception is when you're computing connectivity, 5 00:00:11,699 --> 00:00:14,020 when you're computing the pieces of the graph. 6 00:00:14,020 --> 00:00:17,290 So right now, I'm only going to talk about undirected graphs. 7 00:00:17,290 --> 00:00:21,600 The directed case, we can get very efficient algorithms for, but it's quite 8 00:00:21,600 --> 00:00:24,339 a bit harder work, so that's going to be covered in detail in a separate video. 9 00:00:25,600 --> 00:00:28,290 So for now, focus just on an undirected graph, G. 10 00:00:28,290 --> 00:00:32,650 And we're certainly not going to assume that G is connected, and the part of 11 00:00:32,650 --> 00:00:36,880 the point here is to figure out whether or not it's connected, i.e., in one piece. 12 00:00:36,880 --> 00:00:39,580 So maybe the graph looks like this. 13 00:00:39,580 --> 00:00:44,059 So for example maybe the graph has ten vertices and looks like this on the right. 14 00:00:45,870 --> 00:00:49,720 And intuitively, especially given that I've drawn it in such a clean way, 15 00:00:49,720 --> 00:00:52,060 it's clear that this graph has three pieces. 16 00:00:52,060 --> 00:00:54,980 And those are the things that we want to call the connected components. 17 00:00:56,210 --> 00:00:59,780 But we do want to sum up more formal definitions, something which is actually 18 00:00:59,780 --> 00:01:03,040 in math that we could say is true or false about a given graph. 19 00:01:03,040 --> 00:01:06,660 And roughly, we define the connected components of an undirected graph 20 00:01:06,660 --> 00:01:10,300 as the maximal regions that are connected. 21 00:01:10,300 --> 00:01:13,130 In the sense you can get from any vertex in the region from 22 00:01:13,130 --> 00:01:15,610 any other vertex in the region using a path. 23 00:01:15,610 --> 00:01:17,910 So, maximal connected regions in that sense. 24 00:01:17,910 --> 00:01:20,838 Now the slick way to do this is using an equivalence relation. 25 00:01:20,838 --> 00:01:24,214 And I'm going to do this here in part because it's really the right way to think 26 00:01:24,214 --> 00:01:27,720 about the directed graph case, which we'll talk about in some detail later. 27 00:01:28,720 --> 00:01:32,215 So from your [INAUDIBLE] graphs, so this isn't super important, but 28 00:01:32,215 --> 00:01:34,987 let me go ahead and state the formal definition just for 29 00:01:34,987 --> 00:01:37,716 completeness about what is a connecting component. 30 00:01:37,716 --> 00:01:42,250 What do I mean by a maximal region that's mutually connected. 31 00:01:43,770 --> 00:01:48,670 So a good formal definition is as the equivalence classes of the relation on 32 00:01:48,670 --> 00:01:53,860 vertices where we define by u being related to v if and 33 00:01:53,860 --> 00:01:56,300 only if there's a path between u and v in the graph. 34 00:01:57,770 --> 00:02:01,030 So I'll leave it for you to do the simple check that this squiggle is indeed 35 00:02:01,030 --> 00:02:02,140 an equivalence relation. 36 00:02:02,140 --> 00:02:05,170 I'm going to remind you what equivalence relations are. 37 00:02:05,170 --> 00:02:08,360 This is something you generally learn in your first class on proofs or 38 00:02:08,360 --> 00:02:10,420 your first class in discrete math. 39 00:02:10,420 --> 00:02:14,450 So it's just something which may or may not be true about pairs of objects. 40 00:02:14,450 --> 00:02:17,490 In an equivalence relation, you have to satisfy three properties. 41 00:02:17,490 --> 00:02:21,090 So first you have to be reflexive, meaning everything has to be related to itself, 42 00:02:21,090 --> 00:02:26,430 and indeed in a graph there is a path from any node to itself, namely the empty path. 43 00:02:26,430 --> 00:02:29,330 Also a couple of these relations have to be symmetric, meaning that if u and 44 00:02:29,330 --> 00:02:30,800 v are related then v and u are related. 45 00:02:30,800 --> 00:02:34,780 Because this is an undirected graph it's clear that this is symmetric. 46 00:02:34,780 --> 00:02:38,130 If there's a path from u to v in the graph there's also a path from v to u so 47 00:02:38,130 --> 00:02:39,390 no problem there. 48 00:02:39,390 --> 00:02:41,310 Finally equivalence classes have got to be transitive. 49 00:02:41,310 --> 00:02:45,700 So that means if u and v are related and so are v and w and so are u and wo. 50 00:02:45,700 --> 00:02:48,980 That is if u and v have a path, v and w have a path, then so 51 00:02:48,980 --> 00:02:53,710 does u and w and you can prove transtivity just by pasting the two paths together. 52 00:02:53,710 --> 00:02:56,180 And so the upshot is, when you want to say something like 53 00:02:56,180 --> 00:03:00,310 the maximal subset of something where everything is the same. 54 00:03:00,310 --> 00:03:03,608 The right way to make that mathematical is using equivalence relations. 55 00:03:03,608 --> 00:03:06,620 So over in this blue graph we want to say one, three, five, seven, and 56 00:03:06,620 --> 00:03:10,230 nine, which are all the same in the sense that they're mutually connected and so 57 00:03:10,230 --> 00:03:12,370 that's exactly what this relation is making precise. 58 00:03:12,370 --> 00:03:14,598 All five of those nodes are related to each other. 59 00:03:14,598 --> 00:03:15,994 2 and 4 are related to each other. 60 00:03:15,994 --> 00:03:18,920 6, 8, and 10, all pairs of them are related. 61 00:03:18,920 --> 00:03:22,174 So equivalence relations always have equivalence classes, 62 00:03:22,174 --> 00:03:24,057 the maximal mutual related stuff. 63 00:03:24,057 --> 00:03:27,963 And in this graph context, it's exactly these connected components, 64 00:03:27,963 --> 00:03:29,250 exactly what you want. 65 00:03:29,250 --> 00:03:33,624 So what I want to show you is you can use wrapped in an outer of 66 00:03:33,624 --> 00:03:39,040 the vertices to compute, to identify all of the connected components of a graph. 67 00:03:39,040 --> 00:03:42,580 In time linear in the graph in n plus n time. 68 00:03:42,580 --> 00:03:44,770 Now you might be wondering why do you want to do that. 69 00:03:46,850 --> 00:03:49,910 Well there's a lot of reasons, so an obvious one which is relevant for 70 00:03:49,910 --> 00:03:54,940 physical networks is to check if a network has broken into two pieces. 71 00:03:54,940 --> 00:03:58,320 So certainly if you're an Internet service provider, you want to make sure that 72 00:03:58,320 --> 00:04:00,829 from any point in your network, you can reach any other point in the network. 73 00:04:02,040 --> 00:04:05,010 And that boils down to just understanding whether the graph that represents you 74 00:04:05,010 --> 00:04:09,140 network, is a connected graph, that is if it's in one piece, or not in one piece. 75 00:04:09,140 --> 00:04:12,270 So obviously, you can ask this same question about recreational examples. 76 00:04:12,270 --> 00:04:15,200 So if you return to the movie graph, maybe you're wondering, 77 00:04:15,200 --> 00:04:20,800 can you get from every single actor in the IMDB database to Kevin Bacon? 78 00:04:20,800 --> 00:04:24,000 Or are there actors for which you cannot reach Kevin Bacon? 79 00:04:24,000 --> 00:04:25,450 Via a sequence of edges, 80 00:04:25,450 --> 00:04:30,300 a sequences of movies in which two actors have both played a role. 81 00:04:30,300 --> 00:04:33,030 So that's something that boils down to a connectivity computation. 82 00:04:34,520 --> 00:04:37,810 If you have networked data and you want to display it, you want to visualize it and 83 00:04:37,810 --> 00:04:40,440 show it to a group of people so that they can interpret it. 84 00:04:40,440 --> 00:04:43,240 Obviously one thing you want to do is you want to know if there's multiple pieces, 85 00:04:43,240 --> 00:04:45,420 and then you want to display the different pieces separately. 86 00:04:47,130 --> 00:04:50,270 So let me make sure that one probably a little less obvious application of 87 00:04:50,270 --> 00:04:54,790 undirected connectivity is that it gives us a nice quick and dirty heuristic for 88 00:04:54,790 --> 00:04:58,690 doing clustering if you have paralyzed information about objects. 89 00:04:58,690 --> 00:04:59,880 Let me be a little more concrete. 90 00:04:59,880 --> 00:05:02,720 Suppose you have a set of objects that you really care about. 91 00:05:02,720 --> 00:05:06,920 This could be documents, maybe web pages that you crawl, something like that. 92 00:05:06,920 --> 00:05:10,756 It could be a set of images, either your own or drawn from some data base. 93 00:05:10,756 --> 00:05:13,780 Or it could be, for example, a set of genomes. 94 00:05:13,780 --> 00:05:18,710 Suppose further, that you have a pairwise function, which for each pair of objects 95 00:05:18,710 --> 00:05:22,230 tells you whether they are very much like each other or very much different. 96 00:05:22,230 --> 00:05:25,646 So let's suppose that two objects are very similar to each other, 97 00:05:25,646 --> 00:05:28,087 like the two web pages that are almost the same. 98 00:05:28,087 --> 00:05:31,614 Or there are two genomes where you can get from one to the other with a small number 99 00:05:31,614 --> 00:05:32,450 of mutations. 100 00:05:32,450 --> 00:05:34,580 Then, they have a low score. 101 00:05:34,580 --> 00:05:35,490 So low numbers, 102 00:05:35,490 --> 00:05:40,520 close to zero, indicates that the objects are very similar to each other. 103 00:05:40,520 --> 00:05:43,200 High numbers, let's say, they can go up to a thousand or 104 00:05:43,200 --> 00:05:46,060 something, indicate that they are very different objects. 105 00:05:46,060 --> 00:05:49,470 Two webpages that have nothing to do with each other, two genomes for 106 00:05:49,470 --> 00:05:53,910 totally unrelated parts, or two images that seem to be of 107 00:05:53,910 --> 00:05:56,660 completely different people or even completely different objects. 108 00:05:58,540 --> 00:06:02,920 Now here's a graph you can construct using these objects and 109 00:06:02,920 --> 00:06:05,300 the similarity data that you have about them. 110 00:06:05,300 --> 00:06:07,930 So you can have a graph where the nodes are the objects. 111 00:06:07,930 --> 00:06:11,370 Okay, so for each image, for each document, whatever, 112 00:06:11,370 --> 00:06:14,671 you have a single node and then for a given pair of nodes, 113 00:06:14,671 --> 00:06:18,910 you put in an edge if and only if the two objects are very similar. 114 00:06:18,910 --> 00:06:22,070 So for example, you could put in an edge between two objects if and 115 00:06:22,070 --> 00:06:24,830 only if the score is at most ten. 116 00:06:24,830 --> 00:06:28,490 So remember, the more similar two objects are, the closer there score is to zero. 117 00:06:28,490 --> 00:06:31,070 So you're going to get an edge between very similar documents, 118 00:06:31,070 --> 00:06:33,570 very similar genomes, very similar images. 119 00:06:33,570 --> 00:06:37,690 Now in this graph you've constructed, you can find the connected components. 120 00:06:37,690 --> 00:06:41,290 So each of these connected components will be a group of objects, which more or 121 00:06:41,290 --> 00:06:43,110 less are all very similar to each other. 122 00:06:43,110 --> 00:06:47,380 So this would be a cluster of closely related objects in your database. 123 00:06:47,380 --> 00:06:51,420 And you can imagine a lot of reasons why, given a large set of unstructured data, 124 00:06:51,420 --> 00:06:53,480 just a bunch of pictures, a bunch of documents or 125 00:06:53,480 --> 00:06:57,209 whatever, you might want to find clusters of highly related objects. 126 00:06:58,790 --> 00:07:01,104 So we'll probably see more sophisticated heuristics for 127 00:07:01,104 --> 00:07:02,387 clustering in some SQL course. 128 00:07:02,387 --> 00:07:06,332 But already undirected connectivity gives you a super fast linear time quick and 129 00:07:06,332 --> 00:07:09,465 dirty heuristic for identifying clusters of similar objects, 130 00:07:09,465 --> 00:07:11,450 given pairwise data about similarity. 131 00:07:12,960 --> 00:07:14,870 So that's some reasons you might want to do it. 132 00:07:14,870 --> 00:07:17,683 Now let's actually talk about how to compute 133 00:07:17,683 --> 00:07:21,014 the components in the near time using just a simple for 134 00:07:21,014 --> 00:07:24,730 loop and breadth-first search as it's inner work horse. 135 00:07:24,730 --> 00:07:27,930 So here's the code to compute all the connected components 136 00:07:27,930 --> 00:07:30,060 of an undirected graph. 137 00:07:30,060 --> 00:07:32,810 So first we initialize all nodes as being unexplored. 138 00:07:34,100 --> 00:07:36,110 I'm also going to assume that the nodes have names. 139 00:07:36,110 --> 00:07:37,880 Let's say that the names are from 1 to n. 140 00:07:37,880 --> 00:07:42,349 So these names could just be the position in the node array that these nodes occupy. 141 00:07:42,349 --> 00:07:43,827 So this is going to be an outer for loop, 142 00:07:43,827 --> 00:07:47,620 which walks through the nodes in an arbitrary order, let's say from 1 to n. 143 00:07:47,620 --> 00:07:52,030 This outer for loop is to ensure that every single node of the graph will be 144 00:07:52,030 --> 00:07:55,520 inspected for sure at some point in the algorithm. 145 00:07:55,520 --> 00:07:58,550 Now, again, one of maxims is that we should never do redundant work, so 146 00:07:58,550 --> 00:08:01,620 before we start exploring from some node, we check if they've already been there. 147 00:08:03,190 --> 00:08:08,620 And if we haven't seen i before, then we invoke the breath-first search, 148 00:08:08,620 --> 00:08:13,206 a term we were talking about previously in the lecture, in the graph G, 149 00:08:13,206 --> 00:08:14,904 starting from the node I. 150 00:08:14,904 --> 00:08:16,405 So to make sure this is clear, 151 00:08:16,405 --> 00:08:20,320 let's just run this algorithm on this blue graph to the right. 152 00:08:20,320 --> 00:08:24,100 So we start in the outer for loop and we said i equal to 1. 153 00:08:24,100 --> 00:08:25,960 And we say have we explored node number 1 yet. 154 00:08:25,960 --> 00:08:28,440 And of course not, we haven't explored anything yet. 155 00:08:28,440 --> 00:08:31,280 So the first thing we're going to do is we're going to invoke 156 00:08:31,280 --> 00:08:33,870 BFS on node number 1 here. 157 00:08:33,870 --> 00:08:37,620 So now we start running the usual breadth for search subroutine starting 158 00:08:37,620 --> 00:08:42,600 from this node one and so we explore layer one here is going to be nodes 3 and 5. 159 00:08:42,600 --> 00:08:44,200 So we explore them in some order for 160 00:08:44,200 --> 00:08:47,770 example maybe node number 3 is what we explore second. 161 00:08:47,770 --> 00:08:50,260 Then node number five is what we explore third and 162 00:08:51,430 --> 00:08:55,290 then the second layer in this component is going to be the nodes 7 and 9. 163 00:08:55,290 --> 00:09:00,580 So we'll explore them in some order as well and say 7 first followed by 9. 164 00:09:00,580 --> 00:09:03,580 So after this BFS initiated from node number one completes, 165 00:09:03,580 --> 00:09:06,550 of course it will have found everything it could possibly find, 166 00:09:06,550 --> 00:09:09,820 namely the five nodes in the same connected component as node number 1. 167 00:09:09,820 --> 00:09:13,030 And of course, all of the five of these nodes will be marked as explored. 168 00:09:13,030 --> 00:09:17,440 So now we return once that exits we return to the outer for loop we increment I we go 169 00:09:17,440 --> 00:09:22,410 to I equal 2, and we say we already explored node number 2 no we have not. 170 00:09:22,410 --> 00:09:25,640 And so now we invoke BFS again from node number 2. 171 00:09:25,640 --> 00:09:27,390 So that will be the sixth node we explore. 172 00:09:27,390 --> 00:09:28,590 There's not much to do from two, 173 00:09:28,590 --> 00:09:32,940 all we can do is go to 4, so that's the seventh node we explore. 174 00:09:32,940 --> 00:09:36,410 That BFS terminates at finding the nodes in this connected component, 175 00:09:36,410 --> 00:09:37,935 then we go back to the outer for loop. 176 00:09:37,935 --> 00:09:41,360 We increment i to 3, we say we've already seen node number three. 177 00:09:41,360 --> 00:09:43,690 Yes we have, we saw that in the first breadth first search. 178 00:09:43,690 --> 00:09:48,520 So we certainly don't bother to BFS from node 3, then we increment item four. 179 00:09:48,520 --> 00:09:49,410 Have we seen 4? 180 00:09:49,410 --> 00:09:51,380 Yes we have, in the second called to BFS. 181 00:09:51,380 --> 00:09:52,650 Have we seen node 5? 182 00:09:52,650 --> 00:09:54,955 Yes we have, in the first call to BFS. 183 00:09:54,955 --> 00:09:56,377 Have we seen node 6? 184 00:09:56,377 --> 00:09:57,590 No, we have not. 185 00:09:57,590 --> 00:10:01,920 So the final indication of breadth-first search begins from node number 6. 186 00:10:01,920 --> 00:10:04,640 That's going to be the eighth node overall that we see. 187 00:10:04,640 --> 00:10:07,850 And then we're going to see the notes 8 and 10 in some order, so for 188 00:10:07,850 --> 00:10:10,530 example maybe we first explore note number 8. 189 00:10:10,530 --> 00:10:13,140 That's one of the first layer in this component, 190 00:10:13,140 --> 00:10:18,650 and then note number 10 is the other note of the first layer in this component. 191 00:10:18,650 --> 00:10:21,640 So in general, what's going on, well what we know 192 00:10:21,640 --> 00:10:24,770 will happen when we invoked that first search from a given node i. 193 00:10:24,770 --> 00:10:27,860 We're going to discover exactly the nodes in i's connected component. 194 00:10:27,860 --> 00:10:31,410 Right, anything where there's a path from i to that node, we'll find it. 195 00:10:31,410 --> 00:10:34,750 That's the BFS guarantee, that's also the definition of a connected component. 196 00:10:34,750 --> 00:10:38,040 All the other nodes which have a path to i. 197 00:10:39,720 --> 00:10:42,160 Another thing that I hope was clear from the example, but 198 00:10:42,160 --> 00:10:46,920 just to reiterate it, is every search call, 199 00:10:46,920 --> 00:10:50,260 when you explore a node, you remember that through the entire for loop. 200 00:10:50,260 --> 00:10:54,610 So when we invoke breadth-first search from node number 1, we explore nodes 1, 3, 201 00:10:54,610 --> 00:10:59,790 5, 7 and 9, and we keep those marked as explored for the rest of this algorithm. 202 00:10:59,790 --> 00:11:00,680 And that's so 203 00:11:00,680 --> 00:11:03,867 we don't do redundant work when we get to later stages of the for loop. 204 00:11:04,950 --> 00:11:07,400 So as far as what does this algorithm accomplish, well, 205 00:11:07,400 --> 00:11:10,400 it certainly finds every connected component. 206 00:11:10,400 --> 00:11:13,573 There is absolutely no way it can miss a node because this for 207 00:11:13,573 --> 00:11:17,266 loop literally walks through the nodes, all of them, one at a time. 208 00:11:17,266 --> 00:11:18,340 So you're not going to miss a node. 209 00:11:18,340 --> 00:11:21,420 Moreover, we know that as soon as you hit a connected component for 210 00:11:21,420 --> 00:11:24,470 the first time, and you do breadth-first search from that node, 211 00:11:24,470 --> 00:11:25,600 you're going to find the whole thing. 212 00:11:25,600 --> 00:11:28,390 That the breadth-first a search guarantee. 213 00:11:28,390 --> 00:11:32,740 As far as what's the running time, well it's going to be exactly what we want. 214 00:11:32,740 --> 00:11:37,860 It's going to be linear time, which again means proportional to the number of edges 215 00:11:37,860 --> 00:11:39,260 plus the number of vertices. 216 00:11:39,260 --> 00:11:42,318 And again depending on the graph, one of these might be bigger that the other. 217 00:11:42,318 --> 00:11:44,940 So why is it O of m plus n? 218 00:11:44,940 --> 00:11:48,800 Well as far as the nodes, we have to do this initialization there where we mark 219 00:11:48,800 --> 00:11:52,650 them all as unexplored, so that takes constant time per node. 220 00:11:52,650 --> 00:11:57,552 We have just the basic overhead of a for loop, so that's constant time per node. 221 00:11:57,552 --> 00:12:01,897 And then we have this check, constant time per node, so that's O of n. 222 00:12:01,897 --> 00:12:05,035 And then recall we proved that within breath for research, 223 00:12:05,035 --> 00:12:07,050 you do a amount of work proportional. 224 00:12:07,050 --> 00:12:10,070 You do constant time for each node in that connected component. 225 00:12:10,070 --> 00:12:13,078 Now, each of the nodes of the graph is in exactly one of the connected components. 226 00:12:13,078 --> 00:12:14,970 So you'll do constant time for 227 00:12:14,970 --> 00:12:18,220 each node in the BFS in which you discover that node. 228 00:12:18,220 --> 00:12:21,150 So that's again, o of n over all of the connected components. 229 00:12:21,150 --> 00:12:24,420 And as far as the edges, note we don't even bother to look at edges until 230 00:12:24,420 --> 00:12:26,950 we're inside one of these breadth first search calls. 231 00:12:26,950 --> 00:12:30,250 They played no role in the outer for loop or in the pre-processing. 232 00:12:30,250 --> 00:12:33,240 And remember what we proved about an indication of breadth first search. 233 00:12:33,240 --> 00:12:35,690 The running time, you only do constant amount of work 234 00:12:35,690 --> 00:12:38,710 per edge in the connected component that you're exploring. 235 00:12:38,710 --> 00:12:41,530 In the worst case, you look at an edge once from either endpoint and 236 00:12:41,530 --> 00:12:43,602 each of that triggers a constant amount of work. 237 00:12:43,602 --> 00:12:46,060 So when you discover a given connected component, 238 00:12:46,060 --> 00:12:50,190 the edge work is proportional to the number of edges in that kind of component. 239 00:12:50,190 --> 00:12:54,230 Each edge of the graph is only in exactly one of the connect components, so 240 00:12:54,230 --> 00:12:57,270 over this entire for loop, over all of these BFS calls. 241 00:12:57,270 --> 00:13:00,270 For each edge of the graph, you'll only be responsible for 242 00:13:00,270 --> 00:13:01,505 a constant amount of work of the algorithm. 243 00:13:01,505 --> 00:13:06,159 So summarizing because breadth-first search from a given starting node. 244 00:13:06,159 --> 00:13:09,864 That is, works in time proportional to the size of that component, piggybacking on 245 00:13:09,864 --> 00:13:12,639 that sub routine and looping over all of the nodes of the graph. 246 00:13:12,639 --> 00:13:16,166 We find all of the connecting components in time proportional to the amount of 247 00:13:16,166 --> 00:13:17,860 edges and nodes in the entire graph.