So let's talk about the absolutely fundamental problem of searching a graph, and the very related problem of finding paths through graphs. So why would one be interested in searching a graph, or in figuring out if there's a path from point A to point B? Well, there are many, many reasons; I'm going to give you a highly non-exhaustive list on this slide. So let me begin with a very obvious and literal example, which is if you have a physical network. Then often you want to make sure that the network is fully connected, in the sense that you can get from any starting point to any other point. So, for example, think back to the phone network. It would've been a disaster if callers from California could only reach callers in Nevada, but not their family members in Utah. So obviously, a minimal condition for functionality of something like a phone network is that you can get from any one place to any other place; similarly for road networks within a given country, and so on. It can also be fun to think about other, non-physical networks and ask if they're connected. So, one network that's fun to play around with is the movie network. This is the graph where the nodes correspond to actors and actresses, and you have an edge between two nodes if they played a role in a common movie. So this is going to be an undirected graph, where the edges correspond to, not necessarily co-starring, but both actors appearing at least at some point in the same movie. Versions of this movie network you should be able to find publicly available on the web, and there are lots of fun questions you can ask about it, like, for example, the minimum number of hops, or edges, needed to get from one actor to another, where a hop here is a movie in which two people both played a role. So perhaps the most famous statistic that's been thought about with the movie network is the Bacon number.
So this refers to the fairly ubiquitous actor Kevin Bacon, and the Bacon number of an actor is defined as the minimum number of hops you need in this movie graph to get to Kevin Bacon. So, for example, you can ask about Jon Hamm, also known as Don Draper from Mad Men. You could ask how many edges you need on a path through the movie graph to get to Kevin Bacon, and it turns out that the answer is two edges. You need one intermediate point, namely Colin Firth. That's because Colin Firth and Kevin Bacon both starred in the Atom Egoyan movie Where the Truth Lies, and Jon Hamm and Colin Firth were both in the movie A Single Man. So that gives Jon Hamm a Bacon number of two. These are the kinds of questions you can ask about connectivity, not just in physical networks, like telephone and telecommunication networks, but also in logical networks about pairwise relationships between objects more generally. Now, the Bacon number is fundamentally not about just any path, but actually about shortest paths: the minimum number of edges you need to traverse to get from one actor to Kevin Bacon. And shortest paths also have a very practical use that you might use yourself in driving directions. When you use a website or a phone app and ask for the best way to get from where you are now to, say, some restaurant where you're going to have dinner, obviously you're trying to find some kind of path through a network, through a graph. And indeed, often you want the shortest path, perhaps in mileage or perhaps in anticipated travel time. Now, I realize that when you're thinking about paths and graphs, it's natural to focus on very literal paths through quite literal, physical networks, things like routes through a road network, or paths through the internet, and so on.
But you should really think of a path more abstractly, as just a sequence of decisions taking you from some initial state to some final state. And it's this abstract mentality that makes graph search so ubiquitous in fields like artificial intelligence, where you want to formulate a plan for how to get from an initial state to some goal state. So, to give a simple recreational example, you can imagine trying to compute automatically a way to fill in a Sudoku puzzle, so that you solve the puzzle correctly. So you might ask, you know, what is the graph that we're talking about when we want to solve a Sudoku puzzle? Well, this is going to be a directed graph where the nodes correspond to partially completed puzzles. So, for example, at one node of this extremely large graph, perhaps 40 out of the 81 cells are filled in with some kind of number. And now, again, remember a path is supposed to correspond to a sequence of decisions. So what are the actions that you take in solving Sudoku? Well, you fill in a number in a square. So an edge, which here is going to be directed, is going to move you from one partially completed puzzle to another, where one previously empty square gets filled in with one number. And then, of course, there's the path that you're interested in computing, what you search for when you search this graph: you begin with the initial state of the Sudoku puzzle, and you want to reach some goal state where the Sudoku puzzle is completely filled in without any violations of the rules of Sudoku. And of course it's easy to imagine millions of other situations where you want to formulate some kind of plan like this. Like, for example, if you have a robotic hand and you just want to grasp some object, you need to think about exactly how to approach the object with this robotic hand, so that you can grab it without, for example, first knocking it over.
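To make the Sudoku graph concrete, here's a minimal Python sketch under my own assumptions about the representation: a board is a tuple of 81 entries, with 0 marking an empty cell, and the function name `successors` is hypothetical. It generates the out-neighbors of a node, namely every puzzle reachable by one directed edge, i.e., by legally filling in a single empty square.

```python
def successors(board):
    """Yield every board reachable from `board` by filling in one
    empty cell with a digit that doesn't violate the Sudoku rules.
    `board` is a tuple of 81 entries; 0 marks an empty cell."""
    for i, cell in enumerate(board):
        if cell != 0:
            continue                       # only empty cells can be filled
        row, col = divmod(i, 9)
        for digit in range(1, 10):
            # the move must not clash with its row, column, or 3x3 box
            clash = any(
                board[j] == digit
                for j in range(81)
                if j != i and (j // 9 == row
                               or j % 9 == col
                               or (j // 9 // 3, j % 9 // 3) == (row // 3, col // 3)))
            if not clash:
                yield board[:i] + (digit,) + board[i + 1:]
```

Searching this graph from the initial puzzle for a node with no remaining zeros is then exactly the planning problem described above.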
And you can think of millions of other examples. Another problem, which turns out to be closely related to graph search, as we'll see, and which has many applications in its own right, is that of computing connectivity information about graphs, in particular the connected components. This, especially for undirected graphs, corresponds to the pieces of the graph. We'll talk about these topics in their own right, and I'll give you applications for them later. For undirected graphs, I'll briefly mention an easy clustering heuristic you can derive from computing connected components. For directed graphs, where the very definition of connected components is a bit more subtle, I'll show you applications to understanding the structure of the web. So these are a few of the reasons why it's important for you to understand how to efficiently search graphs. It's a fundamental and widely applicable graph primitive. And I'm happy to report that, in this section of the course, for pretty much any question we want to answer about graph search, computing connected components, and so on, there's going to be a really fast algorithm to do it. So this will be a part of the course where there are lots of what I call "for-free" primitives: processing steps, subroutines, that you can run without even thinking about it. All of the algorithms we're going to discuss in the next several lectures run in linear time, and they're going to have quite reasonable constants. So they're barely slower than reading the input. So if you have a graph and you're trying to reason about it, trying to make sense of it, you should in some sense feel free to apply any of the subroutines we're going to discuss, to try and glean some more information about what it looks like, how you might use the network data. There are a lot of different approaches to systematically searching a graph, many methods. In this class, we're going to focus on two very important ones.
Namely, breadth-first search and depth-first search. But all of the graph search methods share some things in common, so on this slide, let me just tell you the high-order bits of really any graph search algorithm. Graph search subroutines are generally passed as input a starting vertex from which the search originates; that's often called a source vertex. And your goal, then, is to find everything findable from this starting vertex. Obviously, you're not going to find anything that's not findable. What do I mean by findable? I mean there's a path from the starting point to this other node. So for any other node to which you can get along a path from the starting point, you should discover it. So, for example, if you're given an undirected graph that has three different pieces, like this one I'm drawing on the right, and perhaps S is this leftmost node here, then the findable vertices starting from S, the ones you can reach along a path from S, are clearly precisely these four vertices. So you would want graph search to automatically, and efficiently, discover these four vertices if you started at S. You could also think about a directed version of the exact same graph, where we direct the edges like so. Now the definition of findable nodes is a little bit different. We're only expecting to follow arcs in the forward direction, so we should only expect, at best, to find all of the nodes you can reach by following a succession of forward arcs, that is, any node that there's a path to from S. So, in this case, these three nodes would be the ones we'd be hoping to find. This blue node to the right we would no longer expect to find, because the only way to get there from S is by going backwards along arcs, and that's not what we're going to be thinking about in our graph searches. So we want to find everything findable, i.e., everything we can get to along paths. But we want to do it efficiently.
And efficiently means we don't explore anything twice, right? So the graph has M arcs, M edges, and N nodes, N vertices. And really, we want to look at each piece of the graph only once, or a small constant number of times. So we're looking for a running time which is linear in the size of the graph, that is, big O of M plus N. Now, when we were talking about representing graphs, I said that in many applications it's natural to focus on connected graphs, in which case M is going to dominate N; you're going to have at least as many edges as nodes, essentially. But connectivity is the classic case where you might have the number of edges being much smaller than the number of nodes. There might be many pieces, and the whole point of what you're trying to do is discover them. So for this sequence of lectures, when we talk about graph search and connectivity, we will usually write M plus N; we'll think that either one could be bigger or smaller than the other. So let me now give you a generic approach to graph search. It's going to be underspecified; there will be many different ways to instantiate it. Two particular instantiations will get us breadth-first search and depth-first search. But here is just a general systematic method for finding everything findable without exploring anything more than once. So, motivated by the second goal, the fact that we don't want to explore anything twice, with each node, with each vertex, we're going to remember whether or not we've explored it before. So we just need one Boolean per node, and we'll initialize it by having everything unexplored, except S, our starting point, which we'll have start off as explored. And it's useful to think of the explored nodes thus far as being, in some sense, territory conquered by the algorithm. And then there's going to be a frontier in between the conquered and unconquered territory. And the goal of the generic algorithm is that, at each step, we supplement the conquered territory by one new node.
Assuming, that is, that there is one adjacent to the territory we've already conquered. So, for example, in this top example with the undirected network, initially the only thing we've explored is the starting point S. That's sort of our home base; that's all that we have conquered so far. And then, in our main while loop, which we iterate as many times as we can, until no edge meets the following criterion, we look for an edge with one endpoint that we've already explored, one endpoint inside the conquered territory, and the other endpoint outside. This is how we can, in one hop, supplement the nodes we've seen by one new one. If we can't find such an edge, then this is where the search stops. If we can find such an edge, well, then we suck V into the conquered territory; we think of it as explored. And we return to the while loop. So, for example, in this example on the right, we start with the only explored node being S. Now, there are actually two edges that cross the frontier, in the sense that one of the endpoints is explored, namely one of the endpoints is S, and the other one is some other vertex, right? There are these two edges to the two vertices adjacent to S. So, in this algorithm, we pick either one; it's unspecified which one we pick. But maybe we pick the top one. And so then, all of a sudden, this second vertex is now also explored. So the conquered territory is the union of the two, and now we have a new frontier. So now, again, we have two edges that cross from the explored nodes to the unexplored nodes, the edges that are, in some sense, going from northwest to southeast. Again we pick one of them; the algorithm doesn't tell us which, we just pick any of them. So maybe, for example, we pick this rightmost edge crossing the frontier. Now the rightmost vertex of these four is explored, so our conquered territory is the top three vertices.
Now again, we have two edges crossing the frontier, the two edges that are incident to the bottom node. We pick one of them, not clear which one, maybe this one. And now the bottom node is also explored, and now there are no edges crossing the frontier. So there are no edges which, on the one hand, have one endpoint explored and, on the other hand, one endpoint unexplored. So these will be the four vertices, as one would hope, that the search explores, starting from S. More generally, the claim is that this generic graph search algorithm does what we wanted: it finds everything findable from the starting point, and moreover, it doesn't explore anything twice. I think it's fairly clear that it doesn't explore anything twice, right? As soon as you look at a node for the first time, you suck it into the conquered territory, never to look at it again. Similarly for edges. When we explore breadth-first and depth-first search, we'll be more precise about the running time and exactly what I mean by not exploring something twice. So at this level of generality, I just want to focus on the first point: that, any way you instantiate this generic search algorithm, it's going to find everything findable. So what do I really mean by that? The formal claim is that, at the termination of this algorithm, the nodes that we've marked explored are precisely the ones that can be reached via a path from S. That's the sense in which the algorithm explores everything that could potentially be findable from the starting point S. And one thing I want to mention is that this claim, and the proof we're going to give of it, holds whether you use an undirected graph or a directed graph. In fact, almost all of the things that I'm going to say about graph search, and in particular breadth-first search and depth-first search, work in essentially the same way for either undirected graphs or directed graphs.
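For concreteness, here is a minimal Python sketch of the generic algorithm just described, under my own assumptions about the input format (an edge list of an undirected graph; the function name `generic_search` is hypothetical). It deliberately leaves the choice of crossing edge unpinned, absorbing whichever such edge it happens to find first.

```python
def generic_search(edges, s):
    """Generic graph search: starting from s, repeatedly find ANY edge
    with exactly one endpoint explored, and absorb the other endpoint.
    `edges` is a list of (u, v) pairs of an undirected graph."""
    explored = {s}
    while True:
        # look for an edge crossing the frontier: exactly one endpoint explored
        crossing = next(((u, v) for u, v in edges
                         if (u in explored) != (v in explored)), None)
        if crossing is None:
            return explored              # nothing findable remains
        u, v = crossing
        explored.add(v if u in explored else u)   # conquer the new vertex
```

As written, each iteration rescans the whole edge list, so this naive version is far from linear time; breadth-first and depth-first search are precisely the refinements, with the right data structures, that make it fast.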
The obvious difference being that, in an undirected graph, you can traverse an edge in either direction, while in a directed graph, we're only supposed to traverse it in the forward direction, from the tail to the head. The one big difference between undirected and directed graphs comes when we do connectivity computations, and I'll remind you, when we get to that point, which one we're talking about. But for the most part, we're just talking about basic graph search, which works in essentially the same way whether the graph is undirected or directed, so keep that in mind. All right, so why is this true? Why are the nodes that get explored precisely the nodes to which there is a path from S? Well, one direction is easy, which is that you can't find anything that is not findable. That is, if you wind up exploring a node V, the only reason that can happen is because you traversed a sequence of edges that got you there, and that sequence of edges obviously defines a path from S to V. If you really want to be pedantic about this forward direction, that explored nodes have to have paths from S, you can do an easy induction, and I'll leave this for you to check, if you want, in the privacy of your own home. So the important direction of this claim is really the opposite. Why is it that, no matter how we instantiate this generic graph search procedure, it's impossible for us to miss anything? That's the crucial point: we don't miss anything that we could in principle find via a path. And we're going to proceed by contradiction. So what does that mean? We're going to assume that the statement we want to prove is not true. Which means it's possible that G has a path from S to V, and yet, somehow, our algorithm misses it, doesn't mark it as explored. That's the thing we're really hoping doesn't happen. So let's suppose it does happen and then derive a contradiction. So let's suppose G does have a path from S to some vertex V; call the path P.
I'm going to draw the picture for an undirected graph, but the situation would be the same in the directed case. So there's a bunch of hops, a bunch of edges, and then eventually this path ends at V. Now, the bad situation, the situation from which we want to derive a contradiction, is that V is unexplored at the end of this algorithm. So let's take stock of what we know. S, for sure, is explored, right? We initialized the search procedure so that S is marked as explored. V, by hypothesis in this proof by contradiction, is unexplored. So S is explored, V is unexplored. So now imagine that, just in our heads, as a thought experiment, we traverse this path P. We start at S, and we know it's explored. We go to the next vertex; it may or may not have been explored, we're not sure. We go to the third vertex; again, who knows, it might be explored, it might be unexplored, and so on. But by the time we get to V, we know it's unexplored. So we start at S, which is explored, and we end at V, which is unexplored. So at some point there's some hop along this path P at which we move from an explored vertex to an unexplored vertex. There has to be a switch at some point, because at the end of the day, at the end of the path, we're at an unexplored node. So consider the first edge, and there must be one, at which we switch from an explored node to an unexplored node. I'm going to call the endpoints of this purported edge U and W, where U is the explored one and W is the unexplored one. Now, for all we know, U could be exactly the same as S; that's a possibility. Or, for all we know, W could be the same as V; that's also a possibility. In the picture, I'll draw it as if this edge (U, W) were somewhere in the middle of the path, but again, it may be at one of the ends; that's totally fine. But now, in this case, there's something I need you to explain to me.
How is it possible that, on the one hand, our algorithm terminated, and on the other hand, there's this edge (U, W), where U has been explored and W has not been explored? That, my friends, is impossible. Our generic search algorithm does not give up; it does not terminate unless there are no edges where one endpoint is explored and the other endpoint is unexplored. As long as there's such an edge, it's going to suck the unexplored vertex into the conquered territory and keep going. So the upshot is, there's no way our algorithm terminated with this picture, with there being an edge (U, W), U explored, W unexplored. So that's the contradiction: this contradicts the assumption that our algorithm terminated with V unexplored. So that is a general approach to graph search, and that, I hope, gives you the flavor of how this is going to work. But now there are two particular instantiations of this generic method that are really important and have their own suites of applications, so we're going to focus on breadth-first search and depth-first search. We'll cover them in detail in the next couple of videos; I want to give you the highlights to conclude this video. Now, let me just make sure it's clear where the ambiguity in our generic method is, why we can have different instantiations of it that potentially have different properties and different applications. The question is, in a given iteration of this while loop, what have you got? You've got your nodes that you've already explored, so that includes S plus probably some other stuff. Then you've got your nodes that are unexplored. And then you have your crossing edges, right? They're edges with one endpoint on each side. For an undirected graph, there's no orientation to worry about: these crossing edges just have one endpoint on the explored side and one endpoint on the unexplored side.
In the directed case, you focus on edges where the tail of the edge is on the explored side and the head of the edge is on the unexplored side, so they go from the explored side to the unexplored side. And the point is that, in a general iteration of this while loop, there are going to be many such crossing edges; there are many different unexplored nodes we could go to next. Different strategies for picking the unexplored node to explore next lead to different graph search algorithms with different properties. So the first specific search strategy we're going to study is breadth-first search, colloquially known as BFS. Let me tell you some of the high-level ideas and applications of breadth-first search. The goal is going to be to explore the nodes in what I call layers. So the starting point S will be in its own layer, layer zero. The neighbors of S will constitute layer one. And then layer two will be the nodes that are neighbors of layer one but that are not already in layer zero or layer one, and so on. So layer i plus one is the stuff next to layer i that you haven't already seen. You can think of this as a fairly cautious and tentative exploration of the graph. And it's going to turn out that there's a close correspondence between these layers and shortest-path distances. So if you want to know the minimum number of hops, the minimum number of edges you need in a path to get from point A to point B in a graph, the way we wanted to know the fewest number of edges in the movie graph necessary to connect Jon Hamm to Kevin Bacon, that corresponds directly to these layers. So if a node is in layer i, then you need i edges to get from S to that node in the graph. Once we discuss breadth-first search, we'll also discuss how to compute the connected components, the different pieces, of an undirected graph. It turns out this isn't special to breadth-first search; you can use any number of graph search strategies to compute connected components in undirected graphs.
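The layer-by-layer exploration just described can be sketched in a few lines of Python; the function name `bfs_layers` and the adjacency-list input format are my own assumptions, not anything fixed by the lecture.

```python
from collections import deque

def bfs_layers(adj, s):
    """Breadth-first search from s, recording each reachable node's
    layer, i.e., its shortest-path distance (number of edges) from s.
    `adj` maps each node to a list of its neighbors."""
    layer = {s: 0}                  # s sits alone in layer zero
    queue = deque([s])              # first-in first-out: explore layer by layer
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in layer:      # seen for the first time
                layer[v] = layer[u] + 1
                queue.append(v)
    return layer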
But I'll show you how to do it using a simple looped version of breadth-first search. And we'll be able to do this stuff in the linear time that we want, meeting that very ambitious goal. To get the linear-time implementation, you do want to use the right data structure, but it's a simple one, something you've probably seen in the past, namely a queue, something that's first-in first-out. So the second search strategy that's super important to know is depth-first search, also known as DFS to its friends. Depth-first search has a rather different feel than breadth-first search. It's a much more aggressive search, where you immediately try to plunge as deeply as you can. It's very much in the spirit of how you might explore a maze, where you go as deeply as you can, only backtracking when absolutely necessary. Depth-first search will also have its own set of applications. It's not, for example, very useful for computing shortest-path information. But, especially in directed graphs, it's going to do some remarkable things for us. So in a directed acyclic graph, a directed graph with no directed cycles, it will give us what's called a topological ordering. That is, it will sequence the nodes in a linear ordering, from first to last, so that all of the arcs of the directed graph go forward. This is useful, for example, if you have a number of tasks that have to get completed subject to certain precedence constraints. Like, for example, you have to take all of the classes in your undergraduate major, and there are certain prerequisites; a topological ordering will give you an order in which to take them that respects all of the prerequisites. And finally, whereas for undirected graphs it doesn't really matter whether you use BFS or DFS to compute connected components, in directed graphs, where even defining connected components is a little tricky, it turns out depth-first search is exactly what you want.
That's where you're going to get a linear-time implementation for computing the right notion of connected components in the directed graph case. Time-wise, both of these are superb strategies for exploring a graph. They're both linear time with very good constants. So for depth-first search, again, we're going to get O of M plus N time in a graph with M edges and N vertices. You do want to use a different data structure, reflecting the different search strategy. Here, because you're exploring aggressively, as soon as you get to a node you immediately start exploring its neighbors, you want a last-in first-out data structure, also known as a stack. Depth-first search also admits a very elegant recursive formulation, and in that formulation, you don't even need to maintain a stack data structure explicitly; the stack is implicitly taken care of by the recursion. So that concludes this overview of graph search: what it is, what our goals are, what kinds of applications it has, and two of the most common strategies. The next couple of videos are going to explore these search strategies, as well as a couple of these applications, in greater depth.
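To close, here's a minimal sketch of the recursive formulation of depth-first search, applied to the topological-ordering application mentioned above. The function name `topological_order` and the adjacency-list input are my own assumptions, and this is the standard append-on-backtrack-then-reverse idea, not necessarily the exact formulation the later videos will use.

```python
def topological_order(adj):
    """Recursive DFS on a directed acyclic graph, producing an ordering
    of the vertices in which every arc goes from an earlier position to
    a later one. `adj` maps each vertex to its outgoing neighbors."""
    explored = set()
    order = []                       # vertices in reverse topological order

    def dfs(u):                      # the recursion is our implicit stack
        explored.add(u)
        for v in adj[u]:
            if v not in explored:
                dfs(v)
        order.append(u)              # u is placed after everything it reaches

    for u in adj:                    # restart DFS until all vertices are seen
        if u not in explored:
            dfs(u)
    order.reverse()
    return order
```

One caveat of the recursive formulation in practice: on very deep graphs, the recursion can exceed the language's stack limit, at which point you'd fall back to an explicit stack.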