So let's talk about the absolutely fundamental problem of searching a graph, and the very related problem of finding paths through graphs. So why would one be interested in searching a graph, or figuring out if there's a path from point A to point B? Well, there's many, many reasons. I'm going to give you a highly non-exhaustive list on this slide. So let me begin with a very obvious and literal example, which is if you have a physical network, then often you want to make sure that the network is fully connected, in the sense that you can get from any starting point to any other point. So, for example, think back to the phone network. It would've been a disaster if callers from California could only reach callers in Nevada, but not their family members in Utah. So obviously a minimal condition for functionality in something like a phone network is that you can get from any one place to any other place, and similarly for road networks within a given country, and so on. It can also be fun to think about other, non-physical networks and ask if they're connected. So one network that's fun to play around with is the movie network. So this is the graph where the nodes correspond to actors and actresses, and you have an edge between two nodes if they played a role in a common movie. So this is going to be an undirected graph, where the edges correspond to, not necessarily co-starring, but both of the actors appearing at some point in the same movie. Versions of this movie network you should be able to find publicly available on the web, and there's lots of fun questions you can ask about the movie network. Like, for example, what's the minimum number of hops from one actor to another actor, where a hop here is an edge, that is, a movie that two people both played a role in? Perhaps the most famous statistic that's been thought about in the movie network is the Bacon Number.
So this refers to the fairly ubiquitous actor Kevin Bacon, and the Bacon Number of an actor is defined as the minimum number of hops you need in this movie graph to get to Kevin Bacon. So, for example, you could ask about Jon Hamm, also known as Don Draper from "Mad Men". And you could ask, how many edges do you need on a path through the movie graph to get to Kevin Bacon? And it turns out that the answer is 2 edges. You need one intermediate point, namely Colin Firth. And that's because Colin Firth and Kevin Bacon both starred in Atom Egoyan's movie "Where the Truth Lies", and Jon Hamm and Colin Firth were both in the movie "A Single Man". So that would give Jon Hamm a Bacon Number of 2. So, these are the kinds of questions you're gonna ask about connectivity. Not just in physical networks, like telephone and telecommunication networks, but also in logical networks capturing pairwise relationships between objects in general. So the Bacon Number is fundamentally not just about any path, but actually shortest paths, the minimum number of edges you need to traverse to get from one actor to Kevin Bacon. And shortest paths also have a very practical use that you might make of yourself, in driving directions. So when you use a website or a phone app and you ask for the best way to get from where you are now to, say, some restaurant where you're gonna have dinner, obviously you're trying to find some kind of path through a network, through a graph, and indeed often you want the shortest path, perhaps in mileage or perhaps in anticipated travel time. Now, I realize that when you are thinking about paths in graphs, it's natural to focus on very literal paths in quite literal physical networks, things like routes through a road network or paths through the internet and so on. But you should really think more abstractly of a path as just a sequence of decisions, taking you from some initial state to some final state.
And it's this abstract mentality that makes graph search so ubiquitous, for example in artificial intelligence, where you want to formulate a plan of how to get from an initial state to some goal state. So, to give a simple recreational example, you can imagine trying to understand how to compute automatically a way to fill in a Sudoku puzzle so that you solve the puzzle correctly. So you might ask, you know, what is the graph that we're talking about when we wanna solve a Sudoku puzzle? Well, this is gonna be a directed graph, where the nodes correspond to partially completed puzzles. So, for example, at one node of this extremely large graph, perhaps 40 out of the 81 cells are filled in with some kind of number. And now, again, remember a path is supposed to correspond to a sequence of decisions. So, what are the actions that you take in solving Sudoku? Well, you fill in a number into a square. So, an edge, which here is going to be directed, is going to move you from one partially completed puzzle to another, where one previously empty square gets filled in with one number. And then of course the path is what you're interested in computing, what you're searching for when you search this graph. You begin with the initial state of the Sudoku puzzle and you want to reach some goal state where the Sudoku puzzle is completely filled in without any violations of the rules of Sudoku. And it's easy to imagine millions of other situations where you wanna formulate some kind of plan like this. For example, if you have a robotic hand and you wanna grasp some object, you need to think about exactly how to approach the object with this robotic hand, so that you can grab it without, for example, first knocking it over. And you can think of millions of other examples.
Another thing which turns out to be closely related to graph search, as we'll see, and which has many applications in its own right, is that of computing connectivity information about graphs, in particular the connected components. So these, especially for undirected graphs, correspond to the pieces of the graph. We'll talk about these topics in their own right, and I'll give you applications for them later. So for undirected graphs, I'll briefly mention an easy clustering heuristic you can derive out of computing connected components. For directed graphs, where the very definition of connected components is a bit more subtle, I'll show you applications to understanding the structure of the web. So these are a few of the reasons why it's important for you to understand how to efficiently search graphs. It's a fundamental and widely applicable graph primitive. And I'm happy to report that in this section of the course, for pretty much any questions we wanna answer about graph search, computing connected components, and so on, there's gonna be really fast algorithms to do it. So this will be the part of the course where there's lots of what I call for-free primitives: processing steps, subroutines you can run without even thinking about it. All of the algorithms we're gonna discuss in the next several lectures are gonna run in linear time, and they're gonna have quite reasonable constants. So, they're really barely slower than reading the input. So, if you have a graph and you're trying to reason about it and make sense of it, you should in some sense feel free to apply any of these subroutines we're gonna discuss, to try and glean some more information about what it looks like and how you might use the network data. There's a lot of different approaches to systematically searching a graph. So, there's many methods. In this class we're gonna focus on two very important ones, namely breadth first search and depth first search.
But all of the graph search methods share some things in common. So, in this slide let me just tell you the high order bits of really any graph search algorithm. So graph search subroutines generally are passed as input a starting vertex from which the search originates, often called the source vertex. And your goal then is to find everything findable from the search vertex, and obviously you're not gonna find anything that is not findable. By findable, I mean there's a path from the starting point to this other node. So any other node to which you can get along a path from the starting point, you should discover. So, for example, if you're given an undirected graph that has three different pieces, like this one I'm drawing on the right, and perhaps S is this leftmost node here, then the findable vertices starting from S, i.e. the ones which you can reach via a path from S, are clearly precisely these four vertices. So, you would want graph search to automatically and efficiently discover these four vertices if you started at S. You can also think about a directed version of the exact same graph, where I'm gonna direct the edges like so. So now the definition of the findable nodes is a little bit different. We're only expecting to follow arcs forward, along the forward direction. So we should only expect at best to find all of the nodes that you can reach by following a succession of forward arcs, that is, any node that there's a path to from S. So in this case, these three nodes would be the ones we'd be hoping to find. This blue node to the right we would no longer expect to find, because the only way to get there from S is by going backward along arcs, and that's not what we're going to be thinking about in our graph searches. So we want to find everything findable, i.e. everything that we can get to along paths, and we want to do it efficiently. Efficiently means we don't want to explore anything twice.
Right, so the graph has m arcs, or m edges, and n nodes, or n vertices, and we wanna look at each piece of the graph only once, or at most a small constant number of times. So we're looking for a running time which is linear in the size of the graph, that is, big O of m plus n. Now, when we were talking about representing graphs, I said that in many applications it's natural to focus on connected graphs, in which case m is gonna dominate n; you're gonna have at least as many edges as nodes, essentially. But connectivity is the classic case where you might have the number of edges being much smaller than the number of nodes. There might be many pieces, and the whole point of what you're trying to do is discover them. So, for this sequence of lectures where we talk about graph search and connectivity, we will usually write m plus n; we'll think that either one can be bigger or smaller than the other. So let me now give you a generic approach to graph search. It's gonna be under-specified; there'll be many different ways to instantiate it. Two particular instantiations will give us breadth first search and depth first search, but here is just a general systematic method for finding everything findable without exploring anything more than once. So, motivated by the second goal, the fact that we don't want to explore anything twice, with each vertex we're gonna remember whether or not we explored it before. So we just need one Boolean per node, and we will initialize it by having everything unexplored except S, our starting point, which we'll have start off as explored. And it's useful to think of the explored nodes at any point in time as being, in some sense, territory conquered by the algorithm. And then there's going to be a frontier in between the conquered and unconquered territory. And the goal of the generic algorithm is that at each step we supplement the conquered territory by one new node, assuming that there is one adjacent to the territory we've already conquered.
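To make that concrete, here is one way the generic procedure might be sketched in Python. The adjacency-list representation, the function name, and the naive scan for a crossing edge are my own illustrative assumptions, not code from the lecture; this version finds everything findable, but it is deliberately not a linear-time implementation.

```python
# A sketch of the generic graph search procedure, assuming the graph is
# given as a dict mapping each vertex to a list of its neighbors (for a
# directed graph, its out-neighbors). Illustrative only.
def generic_search(graph, s):
    explored = {s}  # initially, only the source vertex is explored
    while True:
        # Find all edges crossing the frontier: one endpoint explored,
        # the other endpoint not yet explored.
        crossing = [(u, v) for u in explored
                    for v in graph[u] if v not in explored]
        if not crossing:
            return explored  # nothing findable remains; search stops
        u, v = crossing[0]   # under-specified: any crossing edge will do
        explored.add(v)      # suck v into the conquered territory
```

Note that rescanning for a crossing edge on every iteration is wasteful; breadth first search and depth first search avoid exactly this, using a queue or a stack to find the next vertex cheaply.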
So for example, in this top example with the undirected network, initially the only thing we've explored is the starting point S. So that's sort of our home base; that's all that we have conquered so far. And then in our main while loop, which we iterate as many times as we can until we don't have any edges meeting the following criterion, we look for an edge with one endpoint that we've already explored, one endpoint inside the conquered territory, and the other endpoint outside. So this is how we can, in one hop, supplement the nodes we've seen by one new one. If we can't find such an edge, then this is where the search stops. If we can find such an edge, well, then we suck the unexplored endpoint, call it v, into the conquered territory. We think of it as being explored. And we return to the main while loop. So, for example, in this example on the right, we start with the only explored node being S. Now, there's actually two edges that cross the frontier, in the sense that one of the endpoints is explored, namely S, and the other endpoint is some unexplored vertex. Right? These are the two edges on the left, going to the two vertices adjacent to S. So, in this algorithm, we pick either one; it's under-specified which one we pick. Maybe we pick the top one. And then all of a sudden this second top vertex is also explored, so the conquered territory is the union of the two, and so now we have a new frontier. So now, again, we have two edges that cross from the explored nodes to the unexplored nodes. These are the edges that are in some sense going from northwest to southeast. Again, we pick one of them. It's not clear how; the algorithm doesn't tell us, we just pick any of them. So, maybe for example we pick this rightmost edge crossing the frontier. Now the rightmost vertex of these four is explored, so our conquered territory is the top three vertices. And now, again, we have two edges crossing the frontier.
These are the two edges that are incident to the bottom node; we pick one of them, not clear which one, maybe this one. And now the bottom node is also explored. And now there are no edges crossing the frontier. So there are no edges that have one endpoint explored and the other endpoint unexplored. So these will be the four vertices, as one would hope, that the search explores starting from S. More generally, the claim is that this generic graph search algorithm does what we wanted: it finds everything findable from the starting point, and moreover it doesn't explore anything twice. I think it's fairly clear that it doesn't explore anything twice. Right? As soon as you look at a node for the first time, you suck it into the conquered territory, never to look at it again. Similarly for the edges: once both endpoints of an edge are explored, that edge never crosses the frontier again. When we discuss breadth first search and depth first search, I'll be more precise about the running time and exactly what I mean by not exploring something twice. So, at this level of generality, I just wanna focus on the first point: that any way you instantiate this generic search algorithm, it's going to find everything findable. So, what do I really mean by that? The formal claim is that at the termination of this algorithm, the nodes that we've marked explored are precisely the ones that can be reached via a path from S. That's the sense in which the algorithm explores everything that could potentially be findable from the starting point S. And one thing I wanna mention is that this claim, and the proof I'm going to give of it, holds whether G is an undirected graph or a directed graph. In fact, almost all of the things that I'm gonna say about graph search, and I'm talking about breadth first search and depth first search, work in essentially the same way, either in undirected graphs or directed graphs, the obvious difference being that in an undirected graph you can traverse an edge in either direction.
In a directed graph, we're only supposed to traverse it in the forward direction, from the tail to the head. The one big difference between undirected and directed graphs is when we do connectivity computations, and I'll remind you when we get to that point which one we're talking about. Okay? But for the most part, when we just talk about basic graph search, it works essentially the same way whether it's undirected or directed. So keep that in mind. Alright, so why is this true? Why are the nodes that get explored precisely the nodes to which there's a path from S? Well, one direction is easy, which is that you can't find anything which is not findable. That is, if you wind up exploring a node, the only reason that can happen is because you traversed a sequence of edges that got you there, and that sequence of edges obviously defines a path from S to that node. If you really want to be pedantic about this forward direction, that explored nodes have to have paths from S, then you can just do an easy induction, and I'll leave this for you to check, if you want, in the privacy of your own home. So the important direction of this claim is really the opposite. Why is it that no matter how we instantiate this generic graph search procedure, it's impossible for us to miss anything? That's the crucial point: we don't miss anything that we could, in principle, find via a path. We're gonna proceed by contradiction. So, what does that mean? We're going to assume that the statement we want to prove is not true. Which means that it's possible that G has a path from S to V and yet, somehow, our algorithm misses it, doesn't mark V as explored. Alright, that's the thing we're really hoping doesn't happen. So let's suppose it does happen and then derive a contradiction. So suppose G does have a path from S to some vertex V; call the path P. I'm gonna draw the picture for an undirected graph, but the situation would be the same in the directed case.
So there's a bunch of hops, there's a bunch of edges, and then eventually this path ends at V. Now the bad situation, the situation from which we want to derive a contradiction, is that V is unexplored at the end of this algorithm. So let's take stock of what we know. S for sure is explored, right? We initialized this search procedure so that S is marked as explored. V, by hypothesis in this proof by contradiction, is unexplored. So S is explored, V is unexplored. So now imagine we, just in our heads as a thought experiment, traverse this path P. We start at S and we know it's explored. We go to the next vertex; it may or may not have been explored, we're not sure. We go to the third vertex; again, who knows, might be explored, might be unexplored, and so on. But by the time we get to V, we know it's unexplored. So we start at S, which has been explored, and we get to V, which is unexplored. So at some point along this path P there's some hop at which we move from an explored vertex to an unexplored vertex. There has to be a switch at some point, cuz at the end of the day, at the end of the path, we're at an unexplored node. So consider the first edge at which we switch from being at an explored node to being at an unexplored node, and there must be one. I'm going to call the endpoints of this purported edge U and W, where U is the explored one and W is the unexplored one. Now, for all we know, U could be the same as S, that's a possibility, or for all we know, W could be the same as V. That's also a possibility. In the picture, I'll draw it as if this edge (U, W) was somewhere in the middle of the path, but again, it may be at one of the ends. That's totally fine. But now, in this case, there's something I need you to explain to me. How is it possible that, on the one hand, our algorithm terminated, and on the other hand, there's this edge (U, W), where U has been explored and W has not been explored? That, my friends, is impossible. Our generic search algorithm does not give up.
It does not terminate unless there are no edges where the one endpoint is explored and the other endpoint is unexplored. As long as there's such an edge, it's gonna suck that unexplored vertex into the conquered territory; it's gonna keep going. So the upshot is, there's no way our algorithm terminated with this picture, with there being an edge (U, W), U explored, W unexplored. So, that's the contradiction. This contradicts the assumption that our algorithm terminated with V unexplored. So that is a general approach to graph search, and I hope that gives you the flavor of how this is going to work. But now there's two particular instantiations of this generic method that are really important and have their own suites of applications. So we're gonna focus on breadth-first search and depth-first search. We'll cover them in detail in the next couple of videos; I wanna give you the highlights to conclude this video. Now let me just make sure it's clear where the ambiguity in our generic method is, why we can have different instantiations of it that potentially have different properties and different applications. The question is, at a given iteration of this while loop, what have you got? You've got your nodes that you've already explored, so that includes S plus probably some other stuff, and then you've got your nodes that are unexplored, and then you have your crossing edges. Right? So, these are edges with one endpoint on each side. And for an undirected graph, there's no orientation to worry about; these crossing edges just have one endpoint on the explored side and one endpoint on the unexplored side. In the directed case, you focus on edges where the tail of the edge is on the explored side and the head of the edge is on the unexplored side. So, they go from the explored side to the unexplored side. And the question is, in general, in an iteration of this while loop there's gonna be many such crossing edges.
There are many different unexplored nodes we could go to next, and different strategies for picking the unexplored node to explore next lead us to different graph search algorithms with different properties. So the first specific search strategy we're gonna study is breadth first search, colloquially known as BFS. So let me tell you the high-level idea and applications of breadth first search. So, the goal is going to be to explore the nodes in what I call layers. So, the starting point S will be in its own layer, Layer 0. The neighbors of S will constitute Layer 1, and then Layer 2 will be the nodes that are neighbors of Layer 1 but that are not already in Layer 0 or Layer 1, and so on. So layer i plus 1 is the stuff next to layer i that you haven't already seen yet. You can think of this as a fairly cautious and tentative exploration of the graph. And it's gonna turn out that there's a close correspondence between these layers and shortest path distances. So if you wanna know the minimum number of hops, the minimum number of edges you need in a path to get from point A to point B in a graph, the way we wanted to know the fewest number of edges in the movie graph necessary to connect Jon Hamm to Kevin Bacon, that corresponds directly to these layers. So if a node is in layer i, then you need i edges to get from S to that node in the graph. Once we discuss breadth-first search, we'll also discuss how to compute the connected components, the different pieces, of an undirected graph. It turns out this isn't that special to breadth-first search; you can use any number of graph search strategies to compute connected components in undirected graphs. But I'll show you how to do it using a simple looped version of breadth-first search. And we'll be able to do this stuff in the linear time that we want, meeting the very ambitious goal of getting linear time.
To get the linear time implementation, you do wanna use the right data structure, but it's a simple data structure, something you've probably seen in the past, namely a queue: something that's first-in, first-out. So, the second search strategy that's super important to know is depth first search, also known as DFS to its friends. Depth first search has a rather different feel than breadth first search. It's a much more aggressive search, where you immediately try to plunge as deep as you can. It's very much in the spirit of how you might explore a maze, where you go as deeply as you can, only backtracking when absolutely necessary. Depth first search will also have its own set of applications. It's not, for example, very useful for computing shortest path information, but especially in directed graphs it's going to do some remarkable things for us. So, in directed acyclic graphs, that is, directed graphs with no directed cycles, it will give us what's called a topological ordering. So it'll sequence the nodes in a linear ordering from first to last, so that all of the arcs of the directed graph go forward. So this is useful, for example, if you have a number of tasks that need to get completed with certain precedence constraints. Like, for example, you have to take all of the classes in your undergraduate major, and there are certain prerequisites; a topological ordering will give you an order in which to take them, respecting all of the prerequisites. And finally, whereas for undirected graphs it doesn't really matter whether you use BFS or DFS to compute the connected components, in directed graphs, where even defining connected components is a little tricky, it turns out depth first search is exactly what you want. That's how you're going to get a linear time implementation for computing the right notion of connected components in the directed graph case. Time-wise, both of these are superb strategies for exploring a graph.
They're both linear time with very good constants. So for depth-first search, again, we're gonna get O of m plus n time in a graph with m edges and n vertices. You do wanna use a different data structure, reflecting the different search strategy. So here, because you're exploring aggressively, as soon as you get to a node you start exploring its neighbors, you want a last-in, first-out data structure, also known as a stack. Depth first search also admits a very elegant recursive formulation, and in that formulation you don't even need to maintain a stack data structure explicitly; the stack is implicitly taken care of by the recursion. So that concludes this overview of graph search: what it is, what our goals are, what kinds of applications it has, and two of the most common search strategies. The next couple of videos are going to explore these search strategies, as well as a couple of these applications, in greater depth.
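To illustrate the recursive formulation, here is a hypothetical sketch that combines it with the topological ordering application mentioned earlier. The adjacency-list input and function names are my own assumptions, not code from the course; the idea is that on a directed acyclic graph, reversing the order in which the recursive DFS calls finish yields an ordering in which every arc goes forward.

```python
# A sketch of recursive depth-first search: no explicit stack, the
# recursion's call stack is the implicit last-in, first-out structure.
# On a directed acyclic graph, each vertex finishes only after
# everything reachable from it, so reversing the finishing order gives
# a topological ordering.
def topological_order(graph):
    visited = set()
    finished = []      # vertices in order of DFS completion

    def dfs(u):
        visited.add(u)
        for v in graph[u]:         # follow arcs in the forward direction
            if v not in visited:
                dfs(v)             # plunge as deep as possible first
        finished.append(u)         # u finishes after all it can reach

    for u in graph:                # loop over all vertices so that every
        if u not in visited:       # piece of the graph gets explored
            dfs(u)
    return finished[::-1]          # reverse finishing order
```

For example, with arcs pointing from each course to the courses it is a prerequisite for, the returned ordering lists the courses in a sequence you could legally take them. One caveat of the recursive formulation: the recursion depth can be as large as the longest path, so for very deep graphs an explicit stack may be needed in practice.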