So, what's the problem? Well, I did say that for most of this material on graph search it really doesn't matter whether the graph is undirected or directed; the changes are pretty much cosmetic. The big exception is when you're computing connectivity, that is, the pieces of the graph. For now I'm only going to talk about undirected graphs. In the directed case we can again get very efficient algorithms, but it's quite a bit more work, so that's covered in detail in a separate video. So, focus on an undirected graph G. We're certainly not going to assume that G is connected; indeed, part of the point here is to figure out whether or not it's connected, i.e., in one piece. So maybe the graph has ten vertices and looks like the one on the right. Intuitively, especially given how cleanly I've drawn it, it's clear that this graph has three pieces, and those pieces are exactly what we want to call the connected components. But we want a somewhat more formal definition, something mathematical that is either true or false of a given graph. Roughly, we define the connected components of an undirected graph as its maximal connected regions, in the sense that you can get from any vertex in the region to any other vertex in the region using a path.

Now, the slick way to make this precise is with an equivalence relation, and I'm doing it this way in part because it's really the right way to think about the directed case, which we'll discuss in some detail later. This isn't super important, but let me state the formal definition for completeness: what do I mean by a maximal region that is mutually connected? A good formal definition is: the connected components are the equivalence classes of the relation on vertices, call it the squiggle, defined by u being related to v if and only if there is a path between u and v in the graph G. I'll leave it to you to do the simple check that the squiggle is indeed an equivalence relation.

Let me remind you what an equivalence relation is; this is something you generally learn in your first class on proofs or your first class on discrete math. A relation is just something that may or may not hold for a pair of objects, and to be an equivalence relation it has to satisfy three properties. First, it has to be reflexive, meaning everything is related to itself; and indeed, in a graph there is a path from any node to itself, namely the empty path. Second, it has to be symmetric, meaning that if u is related to v, then v is related to u; because the graph is undirected, this is clear: if there's a path from u to v, there's also a path from v to u, so no problem there. Finally, the relation has to be transitive: if u and v are related and v and w are related, then so are u and w. That is, if u and v are joined by a path and v and w are joined by a path, then so are u and w, and you can prove that just by pasting the two paths together. The upshot is that when you want to say something like "the maximal subsets within which everything is mutually connected," the right way to make that mathematical is with equivalence relations.
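For reference, here is the same definition written out symbolically; this is just a restatement of what was said above, not anything new.

\[
u \sim v \iff \text{there is a path between } u \text{ and } v \text{ in } G.
\]

Reflexivity: \(v \sim v\) for every vertex \(v\), via the empty path. Symmetry: if \(u \sim v\) then \(v \sim u\), since a path in an undirected graph can be traversed in either direction. Transitivity: if \(u \sim v\) and \(v \sim w\), then \(u \sim w\), by pasting the two paths together. The connected components of \(G\) are exactly the equivalence classes of \(\sim\).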
So, over in this blue graph, we want to say that one, three, five, seven, and nine are all "the same," in the sense that they are mutually connected, and that's exactly what this relation makes precise: all five of those nodes are related to each other, two and four are related to each other, and six, eight, and ten are pairwise related to each other. Equivalence relations always partition the objects into equivalence classes of maximal, mutually related elements, and in this graph context those classes are exactly the connected components. That's exactly what you want.

What I want to show you is that you can use breadth-first search, wrapped in an outer for loop over all of the vertices, to identify all of the connected components of the graph in time linear in the size of the graph, that is, in O(n + m) time.

Now, you might be wondering why you'd want to do that. Well, there are a lot of reasons. An obvious one, relevant for physical networks, is to check whether a network has broken into two pieces. Certainly if you're an internet service provider, you want to make sure that from any point in your network you can reach any other point, and that boils down to checking whether the graph that represents your network is connected, that is, whether it's in one piece or not. You can ask the same question about recreational examples: returning to the movie graph, maybe you're wondering whether you can get from every single actor in the IMDB database to Kevin Bacon, or whether there are actors from which you cannot reach Kevin Bacon via a sequence of edges, where each edge corresponds to a movie in which two actors have both played a role. That, too, boils down to a connectivity computation. Or, if you have network data that you want to visualize and show to a group of people so they can interpret it, one obvious thing to do is to determine whether there are multiple pieces and then display the pieces separately.

Let me also mention one probably less obvious application of undirected connectivity: it gives a nice quick-and-dirty heuristic for clustering, if you have pairwise information about objects. Let me be a little more concrete. Suppose you have a set of objects that you really care about. This could be a set of documents, maybe web pages that you crawled; a set of images, either your own or drawn from some database; or, for example, a set of genomes. Suppose further that you have a pairwise score function which, for each pair of objects, tells you whether they are very much alike or very different. Let's say that if two objects are very similar to each other, like two web pages that are almost the same, or two genomes where you can get from one to the other with a small number of mutations, then the pair gets a low score; so low numbers, close to zero, indicate that the objects are very similar. High numbers, say going up to a thousand or so, indicate that the objects are very different: two web pages that have nothing to do with each other, two genomes from totally unrelated organisms, or two images that seem to be of completely different people or even completely different objects. Now, here's a graph you can construct using these objects and the similarity data that you have about them.
So, you can build a graph where the nodes are the objects: for each image, for each document, whatever, you have a single node. Then, for a given pair of nodes, you put in an edge if and only if the two objects are very similar; for example, you could put in an edge if and only if their score is at most ten. Remember, the more similar two objects are, the closer their score is to zero, so you'll get an edge between very similar documents, very similar genomes, very similar images. Now, in this graph you've constructed, you can find the connected components. Each connected component will be a group of objects that are, more or less, all very similar to each other; that is, a cluster of closely related objects in your database. And you can imagine lots of reasons why, given a large set of unstructured data, just a bunch of pictures or a bunch of documents, you might want to find clusters of highly related objects. We'll probably see more sophisticated clustering heuristics in some sequel course, but already undirected connectivity gives you a super-fast, linear-time, quick-and-dirty heuristic for identifying clusters of similar objects given pairwise similarity data (there's a small sketch of this heuristic at the end of this section).

So, those are some reasons you might want to do it. Now let's talk about how to actually compute the connected components in linear time, using just a simple for loop with breadth-first search as its inner workhorse. Here's the code to compute all the connected components of an undirected graph. First, we initialize all nodes as unexplored. I'm also going to assume that the nodes have names, say from 1 to n; these names could just be each node's position in the node array. Then there's an outer for loop that walks through the nodes in an arbitrary order, let's say from 1 to n. This outer for loop ensures that every single node of the graph is inspected at some point in the algorithm. And, remembering our maxim that we should never do redundant work, before we start exploring from some node i we check whether we've already been there. If we haven't seen i before, we invoke the breadth-first search subroutine we discussed previously in the lecture, in the graph G, starting from the node i.

To make sure this is clear, let's run the algorithm on the blue graph to the right. We start the outer for loop with i equal to one and ask: have we explored node number one yet? Of course not, we haven't explored anything yet. So the first thing we do is invoke BFS from node number one, running the usual breadth-first search subroutine starting from that node. Layer one is going to be nodes three and five, and we explore them in some order; say node three is the second node we explore and node five is the third. The second layer in this component is the nodes seven and nine, and we explore them in some order as well, say seven first, followed by nine. After this BFS initiated from node one completes, it will of course have found everything it could possibly find, namely the five nodes in the same connected component as node one, and all five of those nodes will be marked as explored. Once that call exits, we return to the outer for loop.
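Before continuing the trace, here is a minimal sketch of the procedure just described, assuming the graph is given as an adjacency list (a dict mapping each node to a list of its neighbors); this is one illustrative way to write it, not the only one.

```python
from collections import deque

def connected_components(adj):
    """Label the connected components of an undirected graph.

    adj: dict mapping each node to a list of its neighbors.
    Returns a dict mapping each node to a component id (0, 1, 2, ...).
    Runs in O(n + m) time: the outer loop does constant work per node,
    and each BFS does constant work per node and per edge of its component.
    """
    explored = {v: False for v in adj}    # mark every node unexplored
    component = {}
    num_components = 0

    for i in adj:                         # outer loop over all nodes
        if explored[i]:                   # skip nodes already discovered
            continue
        # BFS from i discovers exactly i's connected component.
        explored[i] = True
        queue = deque([i])
        while queue:
            v = queue.popleft()
            component[v] = num_components
            for w in adj[v]:
                if not explored[w]:
                    explored[w] = True
                    queue.append(w)
        num_components += 1

    return component
```

On the blue example graph, this would report three components: {1, 3, 5, 7, 9}, {2, 4}, and {6, 8, 10}.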
Back in the trace: we return to the outer for loop and increment i to two. Have we already explored node number two? No, we have not. So we invoke BFS again, this time from node number two; that's the sixth node we explore. There's not much to do from two; all we can do is go to node number four, which is the seventh node we explore. That BFS terminates, having found the nodes in that connected component. Back in the outer for loop, we increment i to three: have we already seen node three? Yes, we saw it in the first breadth-first search, so we certainly don't bother to run BFS from it. We increment i to four: seen it, in the second call to BFS. Node five: seen it, in the first call. Node six: no, we have not seen it. So the final invocation of breadth-first search begins from node number six; that's the eighth node overall that we see. Then we see nodes eight and ten in some order, say node eight first; together they make up the first layer of this component.

In general, what's going on? We know what happens when we invoke breadth-first search from a given node i: we discover exactly the nodes in i's connected component. Any node that has a path to i, we'll find; that's the BFS guarantee, and that's also the definition of i's connected component, namely all the nodes that have a path to i. Another thing I hope is clear from the example, but just to reiterate: once a breadth-first search call marks a node as explored, it stays marked for the rest of the entire for loop. So when we invoke breadth-first search from node one and explore nodes one, three, five, seven, and nine, those stay marked as explored for the rest of the algorithm, and that's why we never do redundant work in the later stages of the for loop.

So, what does this algorithm accomplish? Well, it certainly finds every connected component. There is absolutely no way it can miss a node, because the for loop literally walks through all of the nodes, one at a time. Moreover, we know that as soon as we hit a connected component for the first time and run breadth-first search from a node in it, we find the whole component; that's the breadth-first search guarantee.

As for the running time, it's exactly what we want: linear time, which again means proportional to the number of edges plus the number of vertices (and depending on the graph, either one might be bigger than the other). So why is it O(n + m)? As far as the nodes go: we do the initialization where we mark them all unexplored, which takes constant time per node; we have the basic overhead of the for loop, constant time per node; and we have the explored check, constant time per node. That's O(n). And recall we proved that within breadth-first search you do a constant amount of work for each node of the connected component being explored.
Now, each node of the graph is in exactly one connected component, so you do constant work for each node in the one BFS call in which that node is discovered; summed over all of the connected components, that's again O(n). As far as the edges go, we don't even look at an edge until we're inside one of these breadth-first search calls; edges play no role in the outer for loop or in the preprocessing. And remember what we proved about a single invocation of breadth-first search: you do only a constant amount of work per edge of the connected component you're exploring. In the worst case you look at each edge once from either endpoint, and each look triggers a constant amount of work, so when you discover a given connected component, the edge work is proportional to the number of edges in that component. Since each edge of the graph is in exactly one connected component, over the entire for loop, over all of these BFS calls, each edge is responsible for only a constant amount of work. Summarizing: because breadth-first search from a given starting node runs in time proportional to the size of that node's component, by piggybacking on that subroutine and looping over all of the nodes of the graph, we find all of the connected components in time proportional to the number of edges and nodes in the entire graph, that is, O(n + m).
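Finally, since the quick-and-dirty clustering heuristic came up earlier, here is a small sketch of it under some assumptions: the objects and the pairwise score function are supplied by the caller, the threshold of ten is just the illustrative value from the discussion above, and the function and parameter names are mine, not part of the lecture.

```python
from collections import deque

def cluster_by_threshold(objects, score, threshold=10):
    """Quick-and-dirty clustering via undirected connectivity.

    objects:   a list of hashable object ids (documents, images, genomes, ...).
    score:     a function score(a, b) returning a nonnegative number,
               where values near 0 mean "very similar".
    threshold: put an edge between a and b iff score(a, b) <= threshold.
    Returns a list of clusters (sets of objects), namely the connected
    components of the thresholded similarity graph.
    """
    # Build the similarity graph: edge iff the pair is very similar.
    adj = {x: [] for x in objects}
    for i, a in enumerate(objects):
        for b in objects[i + 1:]:
            if score(a, b) <= threshold:
                adj[a].append(b)
                adj[b].append(a)

    # Find connected components with BFS, exactly as in the sketch above.
    explored = {x: False for x in objects}
    clusters = []
    for s in objects:
        if explored[s]:
            continue
        explored[s] = True
        queue, cluster = deque([s]), {s}
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if not explored[w]:
                    explored[w] = True
                    cluster.add(w)
                    queue.append(w)
        clusters.append(cluster)
    return clusters
```

Note that only the connected-components step is linear in the size of the graph; building the graph in the first place examines all pairs of objects, which is inherent to starting from pairwise similarity data.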