1 00:00:00,000 --> 00:00:04,200 So let's talk about the absolutely fundamental problem of searching a graph 2 00:00:04,200 --> 00:00:07,486 and the very related problem of finding paths through graphs. 3 00:00:07,486 --> 00:00:11,130 So why would one be interested in searching a graph or figuring out if 4 00:00:11,130 --> 00:00:14,929 there's a path from point A to point B? Well there's many, many reasons, I'm 5 00:00:14,929 --> 00:00:18,060 going to give you a highly non-exhaustive list on this slide. 6 00:00:18,060 --> 00:00:21,962 So let me begin with a very sort of obvious and literal example, which is if 7 00:00:21,962 --> 00:00:25,932 you have a physical network. Then often you want to make sure that the network is 8 00:00:25,932 --> 00:00:30,009 fully connected, in the sense that you can get from any starting point to any 9 00:00:30,009 --> 00:00:32,710 other point. So, for example, think back to the phone 10 00:00:32,710 --> 00:00:35,198 network. It would've been a disaster if callers 11 00:00:35,198 --> 00:00:39,434 from California could only reach callers in Nevada, but not their family members 12 00:00:39,434 --> 00:00:41,817 in Utah. So obviously, a minimal condition for 13 00:00:41,817 --> 00:00:46,106 functionality of something like a phone network is that you can get from any one 14 00:00:46,106 --> 00:00:50,342 place to any other place, similarly for road networks within a given country, and 15 00:00:50,342 --> 00:00:54,118 so on. It can also be fun to think about other, 16 00:00:54,118 --> 00:00:57,024 non-physical networks and ask if they're connected. 17 00:00:57,024 --> 00:01:00,841 So, one network that's fun to play around with is the movie network. 18 00:01:00,841 --> 00:01:05,170 So this is the graph where the nodes correspond to actors and actresses and 19 00:01:05,170 --> 00:01:09,671 you have an edge between the two nodes if they played a role in a common movie. 20 00:01:09,671 --> 00:01:14,057 So this is going to be an undirected graph where the edges correspond to, not 21 00:01:14,057 --> 00:01:18,444 necessarily co-starring, but both the actors appearing at least at some point 22 00:01:18,444 --> 00:01:21,373 in the same movie. So versions of this movie network you 23 00:01:21,373 --> 00:01:25,231 should be able to find publicly available on the web and there's lots of fun 24 00:01:25,231 --> 00:01:28,938 questions you can ask about the movie network, like for example, what's the 25 00:01:28,938 --> 00:01:32,895 minimum number of hops, where a hop here again is a movie where two people both 26 00:01:32,895 --> 00:01:36,903 played a role in, the minimum number of hops or edges you can get from one actor 27 00:01:36,903 --> 00:01:39,608 to another actor. So perhaps the most famous statistic 28 00:01:39,608 --> 00:01:43,065 that's been thought about with the movie network is the Bacon number. 29 00:01:43,065 --> 00:01:47,022 So this refers to the fairly ubiquitous actor Kevin Bacon and the question, the 30 00:01:47,022 --> 00:01:51,011 Bacon number of an actor is defined as the minimum number of hops you need in 31 00:01:51,011 --> 00:01:57,340 this movie graph, to get to Kevin Bacon. So for example, you can ask About Jon 32 00:01:57,340 --> 00:02:00,660 Hamm, also known as Don Draper from Mad Men. 33 00:02:00,660 --> 00:02:06,152 And you could ask how many edges do you need on a path through the movie graph to 34 00:02:06,152 --> 00:02:11,443 get to Kevin Bacon and it turns out that the answer is one, excuse me, two edges. 35 00:02:11,443 --> 00:02:14,860 You need one intermediate point, namely Colin Firth. 36 00:02:14,860 --> 00:02:19,405 And, that's became, that's because that's because Colin Firth and Kevin Bacon both 37 00:02:19,405 --> 00:02:23,789 starred in the Atom Egoyan movie, Where The Truth Lies, and John Hamm and Colin 38 00:02:23,789 --> 00:02:26,062 Firth were both in the movie, A Single Man. 39 00:02:26,062 --> 00:02:28,822 So, that would give John Hamm a Bacon number of two. 40 00:02:28,822 --> 00:02:33,260 So, these are the kinds of questions you can ask about connectivity not just in 41 00:02:33,260 --> 00:02:37,372 physical networks, like telephone and telecommunication networks, but also 42 00:02:37,372 --> 00:02:40,998 logical networks about pairwise relationships between objects more 43 00:02:40,998 --> 00:02:43,989 generally. So the Bacon number is fundamentally not 44 00:02:43,989 --> 00:02:46,529 just about any path, but actually shortest paths. 45 00:02:46,529 --> 00:02:50,920 The minimum number of edges you need to traverse to get from one actor to Kevin 46 00:02:50,920 --> 00:02:53,354 Bacon. And shortest paths are also have a very 47 00:02:53,354 --> 00:02:56,900 practical use that you might use yourself in driving directions. 48 00:02:56,900 --> 00:03:00,667 So when you use a website or a phone app and you ask for the best way to get from 49 00:03:00,667 --> 00:03:03,975 where you are now to say, some restaurant where you're going to have dinner. 50 00:03:03,975 --> 00:03:07,559 Obviously, you're trying to find some kind of path through a network through a 51 00:03:07,559 --> 00:03:09,397 graph. And indeed, often you want the, the 52 00:03:09,397 --> 00:03:11,557 shortest path. Perhaps in mileage or perhaps in 53 00:03:11,557 --> 00:03:15,073 anticipated travel time. Now I realize that when you're thinking 54 00:03:15,073 --> 00:03:19,841 about paths and graphs, it's natural to focus on, sort of, very literal paths and 55 00:03:19,841 --> 00:03:24,427 quite literal, physical networks, things like roots through a road network or 56 00:03:24,427 --> 00:03:28,651 paths through the internet and so on. But, you should really think more 57 00:03:28,651 --> 00:03:32,936 abstractedly as a path, as just a sequence of decisions, taking you from 58 00:03:32,936 --> 00:03:37,567 some initial state, to some final state. And it's this abstract mentality which is 59 00:03:37,567 --> 00:03:41,689 what makes graph search so ubiquitous, it feels like artificial intelligence. 60 00:03:41,689 --> 00:03:46,245 Where you want to formulate a plan of how to get from an initial state to some goal 61 00:03:46,245 --> 00:03:48,360 state. So, to give a simple recreational 62 00:03:48,360 --> 00:03:52,753 example, you can imagine just trying to understand how to compute automatically a 63 00:03:52,753 --> 00:03:56,658 way to fill in a Sudoku puzzle. So that you get to, so that you solve the 64 00:03:56,658 --> 00:04:00,862 puzzle correctly. So you might ask you know what is the 65 00:04:00,862 --> 00:04:05,034 graph that we're talking about when we want to solve a Sudoku puzzle. 66 00:04:05,034 --> 00:04:09,497 Well this is going to be a directed graph where here the nodes correspond to 67 00:04:09,497 --> 00:04:13,026 partially completed puzzles. So for example, at one node of this 68 00:04:13,026 --> 00:04:17,494 extremely large graph perhaps 40 out of the 81 cells are filled in with some kind 69 00:04:17,494 --> 00:04:19,837 of number. And now, again remember a path is 70 00:04:19,837 --> 00:04:24,142 supposed to correspond to a sequence of decisions, so what are the actions that 71 00:04:24,142 --> 00:04:27,847 you take in solving Suduko while you fill in a number into a square. 72 00:04:27,847 --> 00:04:31,987 So an edge which here is going to be directed is going to move you from one 73 00:04:31,987 --> 00:04:36,183 partially completed puzzle to another, where one previously empty square gets 74 00:04:36,183 --> 00:04:39,476 filled in with one number. And of course, then the path that you're 75 00:04:39,476 --> 00:04:43,275 interested in computing, or what you're searching for or what you search for when 76 00:04:43,275 --> 00:04:47,027 you search this graph., You begin with the initial state of the Sudoku puzzle 77 00:04:47,027 --> 00:04:50,733 and you want to reach some goal state where the Sudoku puzzle is completely 78 00:04:50,733 --> 00:04:53,360 filled in without any violations of the rules of Sudoku. 79 00:04:53,360 --> 00:04:56,905 And of course it's easy to imagine millions of other situations where you 80 00:04:56,905 --> 00:04:58,930 want to formulate some kind of plan like this. 81 00:04:58,930 --> 00:05:02,337 Like, for example, if you have a robotic hand and you just want to grasp some 82 00:05:02,337 --> 00:05:05,836 object, you need to think about exactly how to approach, the object with this 83 00:05:05,836 --> 00:05:09,289 robotic hand, so that you can grab it without, for example, first knocking it 84 00:05:09,289 --> 00:05:11,315 over. And you can think of millions of other 85 00:05:11,315 --> 00:05:13,427 examples. Another thing which turns out to be 86 00:05:13,427 --> 00:05:17,341 closely related to graph search as we'll see and has many applications in it's own 87 00:05:17,341 --> 00:05:20,406 right is that of computing connectivity information about graphs. 88 00:05:20,406 --> 00:05:22,387 So in particular the connected components. 89 00:05:22,387 --> 00:05:26,160 So this, especially for undirected graphs corresponds to the pieces of the graph. 90 00:05:26,160 --> 00:05:29,932 We'll talk about these topics in their own right and I'll give you applications 91 00:05:29,932 --> 00:05:32,655 for them later. So for undirected graphs I'll briefly 92 00:05:32,655 --> 00:05:36,791 mention an easy clustering heristic you can derive out of computing connected 93 00:05:36,791 --> 00:05:39,231 components. For directed graphs where the very 94 00:05:39,231 --> 00:05:42,943 definition of computing components is a bit more subtle I'll show you 95 00:05:42,943 --> 00:05:45,860 applications to understanding the structure of the web. 96 00:05:45,860 --> 00:05:49,985 So these are a few of the reasons why it's important for you to understand how 97 00:05:49,985 --> 00:05:53,170 to efficiently search graphs. It's a, a fundamental and widely 98 00:05:53,170 --> 00:05:56,262 applicable graph primitive. And I'm happy to report that in this 99 00:05:56,262 --> 00:05:59,671 section of the course pretty much anything any questions we want to answer 100 00:05:59,671 --> 00:06:02,444 about graph search, computing connecting components, and so on. 101 00:06:02,444 --> 00:06:04,808 There's going to be really fast algorithms to do it. 102 00:06:04,808 --> 00:06:08,263 So this will be a part of the course where there's lots of what I call four 103 00:06:08,263 --> 00:06:11,627 free primitives processing steps serpentine's that you can run without 104 00:06:11,627 --> 00:06:14,445 even thinking about it. All of these algorithms we're going to 105 00:06:14,445 --> 00:06:18,037 discuss in the next several lectures are going to run in linear time and there 106 00:06:18,037 --> 00:06:21,719 going to have quite reasonable constants. So there barely slower than reading the 107 00:06:21,719 --> 00:06:23,901 input. So if you have a graph and your trying to 108 00:06:23,901 --> 00:06:26,220 reason about it, your trying to make sense about it. 109 00:06:26,220 --> 00:06:30,040 You should in some sense feel free to apply any of these subroutines we're 110 00:06:30,040 --> 00:06:33,809 going to discuss to try and glean some more information about what they look 111 00:06:33,809 --> 00:06:37,909 like, how you might use the network data. There's a lot of different approaches to 112 00:06:37,909 --> 00:06:40,563 systematically searching a graph. So there's many methods. 113 00:06:40,563 --> 00:06:43,310 In this class, we're going to focus on two very important ones. 114 00:06:43,310 --> 00:06:45,685 Mainly breadth verse search and depth verse search. 115 00:06:45,685 --> 00:06:48,666 But all of the graph search methods share some things in common. 116 00:06:48,666 --> 00:06:52,298 So on this slide, let me just tell you the high order bits of, really, any graph 117 00:06:52,298 --> 00:06:55,306 search algorithm. So grass search subroutines generally are 118 00:06:55,306 --> 00:06:58,843 passed as input, a starting vertex from which the search originates. 119 00:06:58,843 --> 00:07:02,485 So, that's often called a source vertex. And your goal, then, is to find 120 00:07:02,485 --> 00:07:06,707 everything findable from this search vertex And obviously, you're not going to 121 00:07:06,707 --> 00:07:09,557 find anything that you can't find, that's not findable. 122 00:07:09,557 --> 00:07:13,094 What do I mean by findable? I mean, there's a path from the starting 123 00:07:13,094 --> 00:07:16,472 point to this other node. So for any other node to which you can 124 00:07:16,472 --> 00:07:19,850 get along a path from the starting point, you should discover it. 125 00:07:19,850 --> 00:07:24,731 So for example, if you're given an undricted graph that has three different 126 00:07:24,731 --> 00:07:29,550 pieces like this one I'm drawing on the right and perhaps S, is this left most 127 00:07:29,550 --> 00:07:32,702 node here. Then the findable vertices starting from 128 00:07:32,702 --> 00:07:37,212 S are either the ones to which you can reach from a path to S is clearly 129 00:07:37,212 --> 00:07:41,167 precisely these four vertices. So you would want graph search to 130 00:07:41,167 --> 00:07:45,987 automatically discover and efficiently discover these four vertices to if you 131 00:07:45,987 --> 00:07:49,138 started at S. You could also think about a directed 132 00:07:49,138 --> 00:07:53,340 version of the exact same graph, wherein the direct vertices like so. 133 00:07:53,340 --> 00:07:57,076 So now the definition of definable nodes is a little bit different. 134 00:07:57,076 --> 00:08:01,426 We're only expecting to follow arcs forward along the forward direction so we 135 00:08:01,426 --> 00:08:06,111 should only expect at best to find all of the nodes that you can reach by following 136 00:08:06,111 --> 00:08:10,908 a succession of forward arcs that is any, any node that there's a path to from S so 137 00:08:10,908 --> 00:08:15,481 in this case these three nodes would be the ones we'd be hoping to find this blue 138 00:08:15,481 --> 00:08:19,720 node to the right we would no longer expect to find because the only way to 139 00:08:19,720 --> 00:08:23,958 get there from S is by going backwards along arcs and that's not what we're 140 00:08:23,958 --> 00:08:27,220 going to be thinking about in our graph surges. 141 00:08:27,220 --> 00:08:31,291 So we want to find everything findable, i.e., that we can get to along paths. 142 00:08:31,291 --> 00:08:35,586 And, but we want to do it efficiently. And efficiently means we don't explore 143 00:08:35,586 --> 00:08:36,757 anything twice, right? 144 00:08:36,757 --> 00:08:39,992 So the graph has M arcs, M edges, and N nodes or N vertices. 145 00:08:39,992 --> 00:08:44,342 And really, we want to just look at, each piece of the graph only once, or a small 146 00:08:44,342 --> 00:08:47,968 constant number of times. So we're looking for running time, which 147 00:08:47,968 --> 00:08:50,868 is linear in the size of the graph. That is big O of MN. 148 00:08:50,868 --> 00:08:53,313 plus N. Now, we were talking about representing 149 00:08:53,313 --> 00:08:56,960 graphs, I said that in many applications it's natural to focus on connected 150 00:08:56,960 --> 00:08:59,196 graphs, in which case M is going to dominate N. 151 00:08:59,196 --> 00:09:02,502 You're going to have at least as many edges as nodes essentially. 152 00:09:02,502 --> 00:09:06,295 But connectivity is the classic case where you might have the number of edges 153 00:09:06,295 --> 00:09:08,434 being much smaller than the number of nodes. 154 00:09:08,434 --> 00:09:12,226 There might be many pieces and the whole point of what you're trying to do is 155 00:09:12,226 --> 00:09:14,900 discover them. So for this sequence of lectures when we 156 00:09:14,900 --> 00:09:18,109 talk about graph search and connectivity, we will usually write Mn. 157 00:09:18,109 --> 00:09:20,005 plus N. We'll think that either one could be 158 00:09:20,005 --> 00:09:23,281 bigger or smaller than the other. So let me know give you a generic 159 00:09:23,281 --> 00:09:25,985 approach to graph search. It's going to be under specified. 160 00:09:25,985 --> 00:09:28,541 There will be many different ways to instantiate it. 161 00:09:28,541 --> 00:09:32,375 Two particular instantiations will get us breath first search and depth first 162 00:09:32,375 --> 00:09:34,588 search. But here is just a general systematic 163 00:09:34,588 --> 00:09:38,324 method for finding everything findable without exploring anything more than 164 00:09:38,324 --> 00:09:40,734 once. So motivated by the second goal, the fact 165 00:09:40,734 --> 00:09:44,693 that we don't want to explore anything twice, with each node, with each vertex, 166 00:09:44,693 --> 00:09:47,913 we're going to remember whether or not we've explored it before. 167 00:09:47,913 --> 00:09:52,400 So we just need one Boolean per node and we'll initialize it by having everything 168 00:09:52,400 --> 00:09:56,360 unexplored, except S our starting point, we'll have it start off as explored. 169 00:09:56,360 --> 00:09:59,951 And it's useful to think of the nodes thus far as being, in some sense, 170 00:09:59,951 --> 00:10:03,902 territory conquered by the algorithm. And then there's going to be a frontier 171 00:10:03,902 --> 00:10:06,519 in between the conquered and unconquered territory. 172 00:10:06,519 --> 00:10:10,317 And the goal of the generic algorithm is that, each step, we supplement the 173 00:10:10,317 --> 00:10:14,165 conquered territory by one new node. Assuming that there is one adjacent to 174 00:10:14,165 --> 00:10:18,167 the territory we've already conquered. So for example, in this top example with 175 00:10:18,167 --> 00:10:21,400 the undirected network. Initially, the only thing we've explored 176 00:10:21,400 --> 00:10:24,274 is the starting point S. So that's sort of our home base, 177 00:10:24,274 --> 00:10:28,020 that's all that we have conquered so far. And then, in our main while loop. 178 00:10:28,020 --> 00:10:33,074 Which we iterate as many times as we can, until we don't have any edges meeting the 179 00:10:33,074 --> 00:10:35,450 following criterion. We look for a edge. 180 00:10:35,450 --> 00:10:39,899 With one endpoint that we've already explored one endpoint inside the conquer 181 00:10:39,899 --> 00:10:42,522 territory and then the other endpoint outside. 182 00:10:42,522 --> 00:10:46,800 So this is how we can in one hop supplement the number of nodes we've seen 183 00:10:46,800 --> 00:10:49,937 by one new one. If we can't find such an edge then this 184 00:10:49,937 --> 00:10:53,759 is where the search stops. If we can find such an edge well then we 185 00:10:53,759 --> 00:10:57,410 suck V into the conquer territory we think of it being explored. 186 00:10:57,410 --> 00:11:01,032 And we return to the while loop. So for example, on this example on the 187 00:11:01,032 --> 00:11:03,750 right, we start with the only explored node being S. 188 00:11:03,750 --> 00:11:06,680 Now, there's actually two edges that cross the frontier. 189 00:11:06,680 --> 00:11:09,077 In the sense one of the N points is explored. 190 00:11:09,077 --> 00:11:12,754 Namely one of the points is S, and the other one is some other vertex, 191 00:11:12,754 --> 00:11:15,258 right? There's this there's this two ver, two 192 00:11:15,258 --> 00:11:17,655 edges to the left two vertices adjacent to S. 193 00:11:17,655 --> 00:11:21,651 So, in this algorithm we pick either one. It's unspecified which one we pick. 194 00:11:21,651 --> 00:11:25,604 But maybe we pick the top one. And so then all of a sudden this seconds 195 00:11:25,604 --> 00:11:29,797 tab vertex is now also explored. So the conquer territory is the union of 196 00:11:29,797 --> 00:11:34,450 them, and so now we have a new frontier. So now again we have two edges that cross 197 00:11:34,450 --> 00:11:38,241 from the explored nodes. The unexplored nodes is at the edges that 198 00:11:38,241 --> 00:11:41,400 are in some sense going from Northwest to Southeast. 199 00:11:41,400 --> 00:11:45,248 Again we pick one of them, it's not clear how the algorithm doesn't 200 00:11:45,248 --> 00:11:47,144 tell us. We just pick any of them. 201 00:11:47,144 --> 00:11:52,023 So maybe for example, we pick this right most edge crossing the front here. Now, 202 00:11:52,023 --> 00:11:57,030 the right most edge of these four, right most vertex of these four is explored so 203 00:11:57,030 --> 00:11:59,872 our conquer territory is the top three vertices. 204 00:11:59,872 --> 00:12:04,258 Now again, we have two edges crossing the frontier, the two edges that are incident 205 00:12:04,258 --> 00:12:07,414 to the bottom node. We pick one of them not clear which one, 206 00:12:07,414 --> 00:12:09,928 maybe this one. And now, the bottom node is also 207 00:12:09,928 --> 00:12:13,030 explored, and now there are no edges crossing the frontier. 208 00:12:13,030 --> 00:12:17,042 So there are no edges who, on the one hand, have one endpoint being explored 209 00:12:17,042 --> 00:12:21,214 and the other endpoint being unexplored. So these will be the four vertices, as 210 00:12:21,214 --> 00:12:24,370 one would hope, that the search will explore, started from S. 211 00:12:24,370 --> 00:12:28,764 Well generally, the claim is that this generic graph search algorithm does what 212 00:12:28,764 --> 00:12:31,490 we wanted. It finds everything findable from the 213 00:12:31,490 --> 00:12:34,940 starting point and moreover, it doesn't explore anything twice. 214 00:12:34,940 --> 00:12:37,844 I think it's fairly clear that it doesn't explore anything twice, 215 00:12:37,844 --> 00:12:39,810 right? As soon as you look at a node for the 216 00:12:39,810 --> 00:12:43,027 first time, you suck it into the conquered territory, never to look at it 217 00:12:43,027 --> 00:12:44,904 again. Similarly, as soon as you look at, at 218 00:12:44,904 --> 00:12:47,719 edge, you suck them in. but when we explore breadth and depth, 219 00:12:47,719 --> 00:12:51,338 first search will be more precise about the running time and exactly what I mean 220 00:12:51,338 --> 00:12:54,690 by you don't explore something twice. So at this level of generality, I just 221 00:12:54,690 --> 00:12:57,951 want to focus on the first point, that anyway you [INAUDIBLE] this search 222 00:12:57,951 --> 00:13:00,275 algorithm it's going to find everything findable. 223 00:13:00,275 --> 00:13:03,618 So what do I really mean by that? The formal claim is that at the 224 00:13:03,618 --> 00:13:08,290 termination of this algorithm. The nodes that we've marked explored are precisely 225 00:13:08,290 --> 00:13:11,034 the ones that can be reached via a path from S. 226 00:13:11,034 --> 00:13:15,180 That's the sense in which the algorithm explores everything that could 227 00:13:15,180 --> 00:13:18,100 potentially be findable from the starting point S. 228 00:13:18,100 --> 00:13:21,381 And one thing I want to mention is that this claim and the proof we're going to 229 00:13:21,381 --> 00:13:24,750 give of it, it holds whether or not you use an undirected graph or a directed 230 00:13:24,750 --> 00:13:26,675 graph. In fact, almost all of the things that 231 00:13:26,675 --> 00:13:30,001 I'm going to say about graph search and, in particular, breadth-first search and 232 00:13:30,001 --> 00:13:33,632 depth-first search work in essentially the same way, either undirected graphs or 233 00:13:33,632 --> 00:13:35,864 directed graphs. The obvious difference being in an 234 00:13:35,864 --> 00:13:38,664 undirected graph, you can traverse an engine in either direction. 235 00:13:38,664 --> 00:13:42,296 In a directed graph, we're only supposed to traverse it in the forward direction, 236 00:13:42,296 --> 00:13:45,390 from the tail to the head. The one big difference between undirected 237 00:13:45,390 --> 00:13:47,981 and directed graphs is when we do connectivity computations. 238 00:13:47,981 --> 00:13:51,263 And I'll remind you when we get to that point, which one we're talking about. 239 00:13:51,263 --> 00:13:54,294 Okay, but for the most part. We just talked about basic graph search, 240 00:13:54,294 --> 00:13:57,790 where it's essentially the same way, whether it's undirected or directed, so 241 00:13:57,790 --> 00:14:00,812 keep that in mind. All right, so why is this true? 242 00:14:00,812 --> 00:14:06,414 Why are the nodes that get explored precisely the nodes for which there is a 243 00:14:06,414 --> 00:14:10,090 path to them from S? Well, one direction is easy. 244 00:14:10,090 --> 00:14:13,212 Which is, you can't find anything which is not findable. 245 00:14:13,212 --> 00:14:17,470 That is, if you wind up exploring a node, the only reason that can happen is 246 00:14:17,470 --> 00:14:22,125 because you traversed a sequence of edges that got you there, and that sequence of 247 00:14:22,125 --> 00:14:24,566 edges obviously defines a path from S to V. 248 00:14:24,566 --> 00:14:29,221 If you really want to be pedantic about the forward direction that explored nodes 249 00:14:29,221 --> 00:14:33,081 have to have paths from S then you can just do an easy induction. 250 00:14:33,081 --> 00:14:37,396 And I'll leave this for you to check, if you want, in the privacy of your own 251 00:14:37,396 --> 00:14:40,114 home. So the important direction of this claim 252 00:14:40,114 --> 00:14:43,380 is really the opposite. Why is it that no matter how we 253 00:14:43,380 --> 00:14:48,071 instantiate this generic graph search procedure it's impossible for us to miss 254 00:14:48,071 --> 00:14:50,922 anything. That's the crucial point we don't miss 255 00:14:50,922 --> 00:14:54,010 anything that we couldn't principal find via a path. 256 00:14:54,010 --> 00:14:56,360 But we're going to perceive by contradiction. 257 00:14:56,360 --> 00:15:00,256 So what does that mean? We're going to assume that the statement 258 00:15:00,256 --> 00:15:02,813 that we want to prove is true, is not true. 259 00:15:02,813 --> 00:15:06,527 Which means that, it's possible that G has a path from S to V. 260 00:15:06,527 --> 00:15:08,962 And yet, somehow, our algorithm misses it. 261 00:15:08,962 --> 00:15:13,224 Doesn't mark it as explored. That's the thing we really hoping doesn't 262 00:15:13,224 --> 00:15:16,146 happen. So let's suppose it does happen and then 263 00:15:16,146 --> 00:15:20,043 derive a contradiction. So let's suppose G does have a path from 264 00:15:20,043 --> 00:15:22,152 S to some vertex V, call the path P. 265 00:15:22,152 --> 00:15:26,675 I'm going to draw the picture from undirected graph, but the situation would 266 00:15:26,675 --> 00:15:30,997 be same in, in the directive case. So there's a bunch of hops there's a 267 00:15:30,997 --> 00:15:34,019 bunch of edges and ยทย then eventually this path ends at V. 268 00:15:34,019 --> 00:15:38,916 Now, the bad situation the situation from which we wanted to rather contradiction 269 00:15:38,916 --> 00:15:42,120 is that V is unexplored at the end of this algorithm. 270 00:15:42,120 --> 00:15:47,184 So let's take stock of what we know, S for sure is explored right, we 271 00:15:47,184 --> 00:15:52,025 initialized the search procedure so that S is marked as explored. 272 00:15:52,025 --> 00:15:56,568 V by hypothesis in this proof by contradiction is unexplored. 273 00:15:56,568 --> 00:16:02,079 So S is explored V is unexplored. So now, I imagine we just in our heads as 274 00:16:02,079 --> 00:16:05,840 a thought experiment we traversed as path peak. 275 00:16:05,840 --> 00:16:10,136 We started S and we know it's explored we go to the next vertex it may or not have 276 00:16:10,136 --> 00:16:14,278 been explored, we're not sure. We go to the third vertex again who knows it might 277 00:16:14,278 --> 00:16:16,607 be explored it might be unexplored and so on. 278 00:16:16,607 --> 00:16:19,247 But by the time we get to V we know its unexplored. 279 00:16:19,247 --> 00:16:22,819 So we start at S it's been explored we get to V it's been unexplored. 280 00:16:22,819 --> 00:16:26,147 So at some point, there's some hop along this path P from 281 00:16:26,147 --> 00:16:29,940 which we move from an explored vertex to an unexplored vertex. 282 00:16:29,940 --> 00:16:34,834 There has to be a switch, at some point because at the end of the day, at the end 283 00:16:34,834 --> 00:16:39,545 of the path we're at an unexplored node. So consider the first edge, and there 284 00:16:39,545 --> 00:16:42,458 must be one. That we switch from an explored node to 285 00:16:42,458 --> 00:16:45,725 being an unexplored node. So I'm going to call the endpoints of 286 00:16:45,725 --> 00:16:50,082 this purported edge U and W, where U is the explored one and W is the unexplored 287 00:16:50,082 --> 00:16:52,423 one. Now for all we know U could be exactly 288 00:16:52,423 --> 00:16:54,983 the same as S, that's a possibility, possibility. 289 00:16:54,983 --> 00:16:57,379 Or, for all we know, W could be the same as V, 290 00:16:57,379 --> 00:17:00,918 that's also a possibility. In the picture I'll draw it as if this 291 00:17:00,918 --> 00:17:03,587 edge ux was somewhere in the middle of this path. 292 00:17:03,587 --> 00:17:06,800 But again, it may be at one of the ends, that's totally fine. 293 00:17:06,800 --> 00:17:11,280 But now in this case, there's something I need you to explain to me. 294 00:17:11,280 --> 00:17:16,706 How is it possible that, on the one hand, our algorithm terminated, and on the 295 00:17:16,706 --> 00:17:22,348 other hand, there's this edge U comma X, where U has been explored and X has not 296 00:17:22,348 --> 00:17:25,490 been explored. That, my friends, is impossible. 297 00:17:25,490 --> 00:17:28,049 Our generic search algorithm does not give up. 298 00:17:28,049 --> 00:17:32,666 It does not terminate unless there are no edges where the one endpoint is explored 299 00:17:32,666 --> 00:17:37,061 and the other endpoint is unexplored. As long as there's such an edge it has is 300 00:17:37,061 --> 00:17:41,623 going to suck in unexplored vertex into the conquered territory and it's going to 301 00:17:41,623 --> 00:17:44,238 keep going. So, the, the upshot is there's no way 302 00:17:44,238 --> 00:17:48,466 that our algorithm terminated with this picture, with there being an edge UX, 303 00:17:48,466 --> 00:17:51,359 U explored, X unexplored. So that's the contradiction. 304 00:17:51,359 --> 00:17:57,040 This contradicts the fact that our algorithm terminated with V unexplored. 305 00:17:57,040 --> 00:17:59,445 So that is a general approach to graph search. 306 00:17:59,445 --> 00:18:02,843 So that, I hope, gives you the flavor of how this is going to work. 307 00:18:02,843 --> 00:18:06,451 But now there's two particular instantiations of this generic method 308 00:18:06,451 --> 00:18:10,007 that are really important and have their own suites of applications. 309 00:18:10,007 --> 00:18:13,719 So we're going to focus on breadth-first search and depth-first search. 310 00:18:13,719 --> 00:18:16,647 We'll cover them in detail in the next couple of videos. 311 00:18:16,647 --> 00:18:19,680 I want to give you the highlights to conclude this video. 312 00:18:19,680 --> 00:18:23,238 Now let me just make sure it's clear where the ambiguity in our generic method 313 00:18:23,238 --> 00:18:25,264 is. Why we can have different substantiations 314 00:18:25,264 --> 00:18:28,687 of it that potentially have different properties and different applications. 315 00:18:28,687 --> 00:18:31,930 The question is, given the iteration of this wile loop, what have you got? 316 00:18:31,930 --> 00:18:35,173 You got your nodes that you've already explored, so that includes S plus 317 00:18:35,173 --> 00:18:38,146 probably some other stuff. And then you've got your nodes that are 318 00:18:38,146 --> 00:18:40,623 unexplored and then you have your crossing edges, right? 319 00:18:40,623 --> 00:18:42,650 So they're edges with one point in each side. 320 00:18:42,650 --> 00:18:46,142 And for an undirected graph there's no orientation to worry about. 321 00:18:46,142 --> 00:18:50,165 These crossing edges just have one endpoint on the explored side and one 322 00:18:50,165 --> 00:18:53,975 endpoint on the unexplored side. In the directed case you focus on edges 323 00:18:53,975 --> 00:18:58,156 where the tail of the edges's on the explored side and the head of the edges's 324 00:18:58,156 --> 00:19:01,543 on the unexplored side. So they go from the explored side to the 325 00:19:01,543 --> 00:19:04,189 unexplored side. And the question is in general an 326 00:19:04,189 --> 00:19:07,788 iteration this while loop there's going to be many such crossing edges. 327 00:19:07,788 --> 00:19:11,069 There are many different unexplored nodes we could go to next. 328 00:19:11,069 --> 00:19:14,880 A different strategy is for picking the unexplored node to explore next, 329 00:19:14,880 --> 00:19:19,550 leads us to different graph search algorithms with different properties. 330 00:19:19,550 --> 00:19:24,330 So the first specific search strategy we're going to study is breadth-first 331 00:19:24,330 --> 00:19:29,242 search, colloquially known as BFS. So let me tell you some of the high level 332 00:19:29,242 --> 00:19:32,254 idea and applications of breadth-first search. 333 00:19:32,254 --> 00:19:36,838 So the goal is going to be to explore the nodes in what I call layers. 334 00:19:36,838 --> 00:19:40,833 So the starting point S will be in it's own layer, layer zero. 335 00:19:40,833 --> 00:19:43,780 The neighbors of S will constitute layer one. 336 00:19:43,780 --> 00:19:48,948 And then layer two will be the nodes that are neighbors of layer one but that are 337 00:19:48,948 --> 00:19:52,036 not already in layer zero or layer one and so on. 338 00:19:52,036 --> 00:19:56,447 So layer I9 plus one is the stuff next to layer I that you haven't already seen 339 00:19:56,447 --> 00:19:58,524 yet. You can think of this as a fairly 340 00:19:58,524 --> 00:20:00,993 cautious and tentative exploration of the graph. 341 00:20:00,993 --> 00:20:05,179 And it's going to turn out that there is a close correspondence between these 342 00:20:05,179 --> 00:20:09,205 layers and shortest path distances. So if you want to know the minimum number 343 00:20:09,205 --> 00:20:13,606 of hops, the minimum of edges you need in a path to get from point A to point B in 344 00:20:13,606 --> 00:20:17,899 a graph. The way we wanted to know the fewest number of edges in the movie graph 345 00:20:17,899 --> 00:20:22,247 necessary to connect a Jon Ham to Kevin Bacon, that corresponds directly to these 346 00:20:22,247 --> 00:20:24,823 layers. So if a node is in layer I, then you need 347 00:20:24,823 --> 00:20:28,590 I edges to get from S to I in the graph. Once we discuss breadth-first search 348 00:20:28,590 --> 00:20:32,109 we'll also discuss how to compute the connected components, or the different 349 00:20:32,109 --> 00:20:35,258 pieces, of an undirected graph. Turns out this isn't that special the 350 00:20:35,258 --> 00:20:38,591 breadth-first search, you can use any number of graph search strategies to 351 00:20:38,591 --> 00:20:41,925 compute connected components in undirected graphs. But I'll show you how 352 00:20:41,925 --> 00:20:44,750 to do it using a simple looped version of breadth-first search. 353 00:20:44,750 --> 00:20:48,643 And we'll be able to do this stuff in the linear time that we want, the very 354 00:20:48,643 --> 00:20:52,537 ambitious goal of getting linear time. To get the linear time implementation, 355 00:20:52,537 --> 00:20:56,225 you do want to use the right data structure; but it's a simple, simple data 356 00:20:56,225 --> 00:20:59,760 structure, something you've probably seen in the past, namely a, a cube. 357 00:20:59,760 --> 00:21:02,014 So it's something that's first in first out. 358 00:21:02,014 --> 00:21:06,420 So the second structural strategy that's super important to know is depth-first 359 00:21:06,420 --> 00:21:10,633 search, also known as DFS to its friends. Depth-first search has a rather different 360 00:21:10,633 --> 00:21:14,341 feel than breadth-first search. It's a much more aggressive search, where 361 00:21:14,341 --> 00:21:16,967 you immediately try to plunge as deeply as you can. 362 00:21:16,967 --> 00:21:20,109 It's very much in the spirit of how you might explore a maze. 363 00:21:20,109 --> 00:21:24,126 Where you go as deeply as you can, only backtracking when absolutely necessary. 364 00:21:24,126 --> 00:21:27,319 Depth-first search will also have its own set of applications. 365 00:21:27,319 --> 00:21:31,079 It's not for example, very useful for computing shortest path information. 366 00:21:31,079 --> 00:21:35,096 But especially, in directed graphs, it's going to do some remarkable things for 367 00:21:35,096 --> 00:21:36,944 us. So in directed acyclic graph. 368 00:21:36,944 --> 00:21:39,288 So a directed graph with no directed cycles. 369 00:21:39,288 --> 00:21:42,217 It will give us what's called the topological ordering. 370 00:21:42,217 --> 00:21:46,199 So it will sequence the nodes in a linear ordering from the first to the last. 371 00:21:46,199 --> 00:21:48,880 So that all of the arc's of the directed graphs go forward, 372 00:21:48,880 --> 00:21:51,607 so this is useful. Fo example, if you have a number of tasks 373 00:21:51,607 --> 00:21:54,470 that have to get completed with certain precedence constraints. 374 00:21:54,470 --> 00:21:58,150 Like for example, you have to take all of the classes in your undergraduate major 375 00:21:58,150 --> 00:22:01,786 and there are certain prerequisites topalogical ordering will give you a way 376 00:22:01,786 --> 00:22:04,240 in which to do it respecting all of the prerequisites. 377 00:22:04,240 --> 00:22:07,756 And finally, where, for undirected graphs, it doesn't really matter whether 378 00:22:07,756 --> 00:22:10,213 you use BFS or DFS to compute connected components. 379 00:22:10,213 --> 00:22:14,020 In directed graphs, where even defining connected components is a little tricky. 380 00:22:14,020 --> 00:22:16,910 it turns out depth-first search is exactly what you want. 381 00:22:16,910 --> 00:22:20,716 That's what you're going to get a linear time implementation for computing the 382 00:22:20,716 --> 00:22:23,800 right notion of connected components in the directed graph case. 383 00:22:23,800 --> 00:22:28,447 Time wise, both of these are superb strategies for exploring a graph. 384 00:22:28,447 --> 00:22:31,864 They're both linear time with very good constants. 385 00:22:31,864 --> 00:22:36,238 So depth-first search again. We're going to get ordle, O of M plus N 386 00:22:36,238 --> 00:22:38,606 time. In a graph with M edges and N vertices. 387 00:22:38,606 --> 00:22:42,711 you do want to use a different data structure reflecting the different search 388 00:22:42,711 --> 00:22:44,893 strategy. So here, because you're exploring 389 00:22:44,893 --> 00:22:48,947 aggressively as soon as you get to a node, you immediately start exploring its 390 00:22:48,947 --> 00:22:51,233 neighbors. You want a lasting first out data 391 00:22:51,233 --> 00:22:54,870 structure, also known as a stack. Depth-first search also admits a very 392 00:22:54,870 --> 00:22:58,871 elegant recursive formulation and in that formulation, you don't even need to 393 00:22:58,871 --> 00:23:01,106 maintain a stack data structure. Explicitly, 394 00:23:01,106 --> 00:23:04,740 the stack is implicitly taken care of in the recursion. 395 00:23:04,740 --> 00:23:08,972 So that concludes this overview of graph search both what it is? What our goals 396 00:23:08,972 --> 00:23:13,150 are? What kind of applications they have adn two of the most common strategies. 397 00:23:13,150 --> 00:23:17,221 The next couple videos are going to explore these search strategies as well 398 00:23:17,221 --> 00:23:19,954 as a couple of these applications in greater depth.