1 00:00:00,000 --> 00:00:04,004 So let's talk about the absolutely fundamental problem of searching a graph, 2 00:00:04,004 --> 00:00:08,020 and the very related problem of finding paths through graphs. So why would one be 3 00:00:08,020 --> 00:00:12,030 interested in searching a graph, or figuring out if there's a path from point 4 00:00:12,030 --> 00:00:15,051 A to point B? Well there's many, many reasons. I'm going to give you a highly 5 00:00:15,051 --> 00:00:20,055 non-exhaustive list on this slide. >>So let me begin with a very sorta obvious and 6 00:00:20,055 --> 00:00:23,099 literal example, which is if you have a physical network, then often you want to 7 00:00:23,099 --> 00:00:27,074 make sure that the network is fully connected in the sense that you can get 8 00:00:27,074 --> 00:00:31,084 from any starting point to any other point. So, for example, think back to the 9 00:00:31,084 --> 00:00:36,066 phone network. It would've been a disaster if callers from California could only 10 00:00:36,066 --> 00:00:40,075 reach callers in Nevada, but not their family members in Utah. So obviously a 11 00:00:40,075 --> 00:00:45,003 minimal condition for functionality in something like a phone network is that you 12 00:00:45,003 --> 00:00:48,021 can get from any one place to any other place, similarly for road networks within 13 00:00:48,021 --> 00:00:54,086 a given country, and so on. It can also be fun to think about other non-physical 14 00:00:54,086 --> 00:00:59,073 networks and ask if they're connected. So one network that's fun to play around with 15 00:00:59,073 --> 00:01:04,054 is the movie network. So this is the graph where the nodes correspond to actors and 16 00:01:04,054 --> 00:01:09,023 actresses, and you have an edge between two nodes, if they played a role in a common 17 00:01:09,023 --> 00:01:12,064 movie. So this is going to be an undirected graph, where the edges 18 00:01:12,064 --> 00:01:16,008 correspond to, not necessarily co-starring, but both the actors appearing 19 00:01:16,008 --> 00:01:21,075 at least some point in the same movie. So versions of this movie network you should be able to 20 00:01:21,075 --> 00:01:25,056 find publicly available on the web, and there's lots of fun questions you can ask about 21 00:01:25,056 --> 00:01:29,029 the movie network. Like, for example, what's the minimum number of hops, where a 22 00:01:29,029 --> 00:01:33,091 hop here again is the movie that two people both played a role in? The minimum number of hops or 23 00:01:33,091 --> 00:01:38,064 edges from one actor to another actor, so perhaps the most famous statistic that's 24 00:01:38,064 --> 00:01:43,050 been thought about with the movie is the Bacon Number. So this refers to the fairly 25 00:01:43,050 --> 00:01:47,068 ubiquitous actor Kevin Bacon, and the question the Bacon Number of an actor is 26 00:01:47,068 --> 00:01:54,015 defined as the minimum number of hops you need in this movie graph to get to Kevin Bacon. 27 00:01:54,015 --> 00:02:00,031 So, for example, you could ask about Jon Hamm, also known as Don Draper from "Mad Men". 28 00:02:00,031 --> 00:02:05,080 And you could ask how many edges do you need on a path through the movie graph 29 00:02:05,080 --> 00:02:10,096 to get to Kevin Bacon? And it turns out that the answer is 1, excuse me, 2 30 00:02:10,096 --> 00:02:15,081 edges. You need one intermediate point, namely Colin Firth. And that's became, 31 00:02:15,081 --> 00:02:20,052 that's because Colin Firth and Kevin Bacon both starred in Atom Egoyan's movie, "Where 32 00:02:20,052 --> 00:02:27,054 "the Truth Lies", and Jon Hamm and Colin Firth were both in the movie "A Single Man". 33 00:02:27,054 --> 00:02:30,016 So that would give Jon Hamm a Bacon Number of 2. So, these are the kind of 34 00:02:30,016 --> 00:02:34,005 questions you're gonna ask about connectivity. Not just physical networks, 35 00:02:34,005 --> 00:02:38,036 like telephone and telecommunication networks, but also logical networks about 36 00:02:38,036 --> 00:02:42,080 parallel relationships between objects in general. So the Bacon Number is 37 00:02:42,080 --> 00:02:46,071 fundamentally not just about any path, but actually shortest paths, the minimum 38 00:02:46,071 --> 00:02:51,010 number of edges you need to traverse to get from one actor to Kevin Bacon. And 39 00:02:51,010 --> 00:02:54,048 shortest paths are also have a very practical use, that you might use 40 00:02:54,048 --> 00:02:58,092 yourself in the driving directions. So when you use a website or a phone app and 41 00:02:58,092 --> 00:03:02,078 you ask for the best way to get from where you are now to say some restaurant where 42 00:03:02,078 --> 00:03:05,091 you're gonna have dinner, obviously you're trying to find some kind of path through a 43 00:03:05,091 --> 00:03:09,079 network through a graph, and indeed often you want the, the shortest path, perhaps 44 00:03:09,079 --> 00:03:14,034 in mileage or perhaps in anticipated travel time. Now I realize that when you 45 00:03:14,034 --> 00:03:17,052 are thinking about paths and graphs, it's natural to focus on sort of very literal 46 00:03:17,052 --> 00:03:22,045 paths and quite literal physical networks. Things like routes through a road network or 47 00:03:22,045 --> 00:03:27,024 paths through the internet and so on. You should really think more abstractly as a 48 00:03:27,024 --> 00:03:33,070 path as just a sequence of decisions, taking you from some initial state to some 49 00:03:33,070 --> 00:03:38,060 final state. And it's this abstract mentality which is what makes graph search 50 00:03:38,060 --> 00:03:42,070 so ubiquitous, it feels like artificial intelligence, where you want to formulate 51 00:03:42,070 --> 00:03:47,042 a plan of how to get from an initial state to some goal state. So, to give a simple 52 00:03:47,042 --> 00:03:51,057 recreational example, you can imagine just trying to understand how to compute 53 00:03:51,057 --> 00:03:55,093 automatically a way to fill in a Sudoku puzzle so that you get to, so that you 54 00:03:55,093 --> 00:04:00,096 solve the puzzle correctly. So you might ask, you know, what is the graph that 55 00:04:00,096 --> 00:04:05,001 we're talking about, when we wanna solve a Sudoku puzzle. Well this is gonna be a 56 00:04:05,001 --> 00:04:09,086 directed graph, where here the nodes correspond to partially completed puzzles. 57 00:04:09,086 --> 00:04:15,012 So, for example, at one node of this extremely large graph, perhaps 40 out of 58 00:04:15,012 --> 00:04:20,024 the 81 cells are filled in with some kind of number, and now, again, remember a path 59 00:04:20,024 --> 00:04:24,012 is supposed to correspond to a sequence of decisions. So, what are the actions that 60 00:04:24,012 --> 00:04:28,038 you take in solving Sudoku? Well, you fill in a number into a square. So, an edge 61 00:04:28,038 --> 00:04:32,048 which here is going to be directed, is going to move you from one partially 62 00:04:32,048 --> 00:04:36,004 completed puzzle to another, where one previously empty square gets filled in 63 00:04:36,004 --> 00:04:40,073 with one number. And of course then the path is that you're interested in computing, 64 00:04:40,073 --> 00:04:43,090 or what your searching for when you search this graph. You begin with the initial 65 00:04:43,090 --> 00:04:49,033 state of the Sudoku puzzle and you want to reach some goal state where the Sudoku 66 00:04:49,033 --> 00:04:53,063 puzzle is completely filled in without any violations of the rules of Sudoku. And of 67 00:04:53,063 --> 00:04:57,043 course it's easy to imagine millions of other situations where you wanna formulate 68 00:04:57,043 --> 00:05:01,030 some kind of plan like this, for example if you have a robotic hand and you 69 00:05:01,030 --> 00:05:04,097 wanna grasp some object, you need to think about exactly how to approach the object 70 00:05:04,097 --> 00:05:09,005 with this robotic hand, so that you can grab it without, for example, first knocking 71 00:05:09,005 --> 00:05:12,057 it over, and you can think of millions of other examples. Another thing which turns 72 00:05:12,057 --> 00:05:16,027 out to be closely related to graph search, as we'll see, it has many applications in 73 00:05:16,027 --> 00:05:20,053 its own right, is that of computing connectivity information about graphs, in 74 00:05:20,053 --> 00:05:23,092 particular the connected components. So this, especially for undirected graphs, 75 00:05:23,092 --> 00:05:27,079 corresponds to the pieces of the graph. We'll talk about these topics in their own 76 00:05:27,079 --> 00:05:31,098 right, and I'll give you applications for them later. So for undirected graphs I'll 77 00:05:31,098 --> 00:05:36,020 briefly mention an easy clustering heuristic you can derive out of computing 78 00:05:36,020 --> 00:05:40,047 connected components. For directed graphs where the very definition of computing 79 00:05:40,047 --> 00:05:44,070 components is a bit more subtle, I'll show you applications to understanding the 80 00:05:44,070 --> 00:05:49,005 structure of the web. So these are a few of the reasons why it's important for you 81 00:05:49,005 --> 00:05:53,009 understand how to efficiently search graphs. It's a, a fundamental and widely 82 00:05:53,009 --> 00:05:56,089 applicable graph primitive. And I'm happy to report that in this section of the 83 00:05:56,089 --> 00:06:00,007 course, pretty much anything, any questions we wanna answer about graph 84 00:06:00,007 --> 00:06:03,057 search, computing connected components, and so on, there's gonna be really fast 85 00:06:03,057 --> 00:06:07,025 algorithms to do it. So, this will be the part of the course where there's lots of 86 00:06:07,025 --> 00:06:10,075 what I call for free primitives, processing steps, subroutines you can run 87 00:06:10,075 --> 00:06:14,030 without even thinking about it. All of these algorithms we're gonna discuss in 88 00:06:14,030 --> 00:06:17,085 the next several lectures, are gonna run in linear time, and they're gonna have 89 00:06:17,085 --> 00:06:21,030 quite reasonable constants. So, they're really barely slower than reading the 90 00:06:21,030 --> 00:06:24,080 input. So, if you have a graph and you're trying to reason about it and you're 91 00:06:24,080 --> 00:06:28,067 trying to make sense about it, you should in some sense feel free to apply any of 92 00:06:28,067 --> 00:06:32,080 these subroutines we're gonna discuss to try and glean some more information about 93 00:06:32,080 --> 00:06:36,074 what they look like, how you might use the network data. There's a lot of 94 00:06:36,074 --> 00:06:40,013 different approaches to systematically searching a graph. So, there's many 95 00:06:40,013 --> 00:06:43,095 methods. In this class we're gonna focus on two very important ones, mainly breadth 96 00:06:43,095 --> 00:06:47,081 first search and depth first search. But all of the graph search methods share some 97 00:06:47,081 --> 00:06:51,044 things in common. So, in this slide let me just tell you the high order bits of 98 00:06:51,044 --> 00:06:55,000 really any graph search algorithm. So graph search subroutines 99 00:06:55,000 --> 00:06:59,005 generally are passed as input a starting search vertex from which the search 100 00:06:59,005 --> 00:07:03,021 originates. So that's often called source vertex. And your goal then is to find 101 00:07:03,021 --> 00:07:07,043 everything findable from the search vertex and obviously you're not gonna find 102 00:07:07,043 --> 00:07:11,075 anything that you can't find that is not findable. What I mean by findable, I mean, 103 00:07:11,075 --> 00:07:16,023 there's a path from the starting point to this other node. So any other node to which you 104 00:07:16,023 --> 00:07:20,024 can get along on a path from the starting point you should discover. So, for 105 00:07:20,024 --> 00:07:25,016 example, if you're given an undirected graph that has three different pieces, 106 00:07:25,016 --> 00:07:29,089 like this one I'm drawing on the right, then perhaps S is this left most node 107 00:07:29,089 --> 00:07:34,074 here, then the findable vertices starting from S, i.e. the ones which you can 108 00:07:34,074 --> 00:07:39,016 reach from a path to S, is clearly precisely these four vertices. So, you 109 00:07:39,016 --> 00:07:44,020 would want graph search to automatically discover and efficiently discover these 110 00:07:44,020 --> 00:07:49,049 four vertices if you started at S. You can also think about a directed version of the 111 00:07:49,049 --> 00:07:53,090 exact same graph, where I'm gonna direct the vertices like so. So now the 112 00:07:53,090 --> 00:07:58,037 definition of the findable nodes is a little bit different. We're only expecting 113 00:07:58,037 --> 00:08:02,096 to follow arcs forward, along the forward direction. So we should only expect at best 114 00:08:02,096 --> 00:08:07,005 to find all of the nodes that you can reach, by following a succession of 115 00:08:07,005 --> 00:08:11,047 forward arcs, that is, any node that there's a path to from S. So in this case, 116 00:08:11,047 --> 00:08:16,000 these three nodes would be the ones we'd be hoping to find. This blue node to the 117 00:08:16,000 --> 00:08:20,053 right, we would no longer expect to find, because the only way to get there from S, 118 00:08:20,053 --> 00:08:24,095 is by going backward along arcs. And that's not what we're going to be thinking 119 00:08:24,095 --> 00:08:30,009 about in our graph searches. So we want to find everything findable, i.e. that we can get 120 00:08:30,009 --> 00:08:34,043 to along paths, and we want to do it efficiently. Efficiently means we don't 121 00:08:34,043 --> 00:08:38,099 want to explore anything twice. Right, so the graph has m arcs, m edges and n nodes 122 00:08:38,099 --> 00:08:43,049 or n vertices and really we wanna just look at either each piece of the graph only 123 00:08:43,049 --> 00:08:48,004 once for a small cost number of times. So looking for running time which is linear 124 00:08:48,004 --> 00:08:52,042 on the size of the graph that is big O of m plus n. Now when we were talking about 125 00:08:52,042 --> 00:08:56,022 representing graphs, I said that in many applications, it's natural to focus on 126 00:08:56,022 --> 00:09:00,032 connected graphs, in which case M is gonna dominate N, and you're gonna have at least 127 00:09:00,032 --> 00:09:04,031 as many edges as nodes, essentially. But connectivity is the classic case where you 128 00:09:04,031 --> 00:09:08,011 might have the number of edges of being much smaller than the number of nodes. 129 00:09:08,011 --> 00:09:11,086 There might be many pieces of the whole point of what you're trying to do is 130 00:09:11,086 --> 00:09:15,076 discover them. So, for this sequence of lectures where we talk about graph search 131 00:09:15,076 --> 00:09:19,095 and connectivity, we will usually write M plus N. We'll think that either one can be 132 00:09:19,095 --> 00:09:23,081 bigger or smaller than the other. So let me now give you a generic approach to 133 00:09:23,081 --> 00:09:26,083 graph search. It's gonna be under-specified, there'll be many 134 00:09:26,083 --> 00:09:30,064 different ways to instantiate it. Two particular instantiations will give us 135 00:09:30,064 --> 00:09:34,021 breadth first search and depth first search but here is just a general 136 00:09:34,021 --> 00:09:38,028 systematic method to finding everything findable without exploring anything more 137 00:09:38,028 --> 00:09:41,094 than once. So motivated by the second goal, the fact that we don't want to 138 00:09:41,094 --> 00:09:45,076 explore anything twice, with each node, with each vertex, we're gonna remember 139 00:09:45,076 --> 00:09:49,098 whether or not we explored it before. So we just need one Boolean per node and we 140 00:09:49,098 --> 00:09:53,094 will initialize it by having everything unexplored except S, our starting point 141 00:09:53,094 --> 00:09:58,059 we'll have it start off as explored. And it's useful to think of the nodes thus far 142 00:09:58,059 --> 00:10:02,088 as being in some sense territory conquered by the algorithm. And then there's going 143 00:10:02,088 --> 00:10:07,022 to be a frontier in between the conquered and unconquered territory. And the goal of 144 00:10:07,022 --> 00:10:11,052 the generic outcome is that each step we supplement the conquered territory by one 145 00:10:11,052 --> 00:10:15,050 new node, assuming that there is one adjacent to the territory you've already 146 00:10:15,050 --> 00:10:19,027 conquered. So for example in this top example with the undirected network, 147 00:10:19,027 --> 00:10:23,052 initially the only thing we've explored is the starting point S. So that's sort of 148 00:10:23,052 --> 00:10:27,045 our home base. That's all that we have conquered so far. And then in our main 149 00:10:27,045 --> 00:10:32,025 while loop, which we iterate as many times as we can until we don't have any edges 150 00:10:32,025 --> 00:10:37,007 meeting the following criterion, we look for an edge with one end point that we've 151 00:10:37,007 --> 00:10:41,082 already explored. One end point inside the conquered territory and then the other end 152 00:10:41,082 --> 00:10:46,024 point outside. So this is how we can in one hop supplement the number of nodes 153 00:10:46,024 --> 00:10:50,065 we've seen by one new one. If we can't find such an edge then this is where the 154 00:10:50,065 --> 00:10:55,012 search stops. If we can find such an edge, well then we suck V into the conquered 155 00:10:55,012 --> 00:10:59,042 territory. We think of it being explored. And we return to the main while loop. So, 156 00:10:59,042 --> 00:11:03,076 for example, in this example on the right, we start with the only explored node being 157 00:11:03,076 --> 00:11:07,073 S. Now, there's actually two edges that cross the frontier, in the sense one of 158 00:11:07,073 --> 00:11:11,086 the endpoints is explored, namely one of the endpoints is S, and the other one is 159 00:11:11,086 --> 00:11:16,000 some other vertex. Right? There's this there's these two vert, two edges to the 160 00:11:16,000 --> 00:11:19,097 left, two vertices adjacent to S. So, in this algorithm we pick either one. It's 161 00:11:19,097 --> 00:11:24,015 un, under-specified which one we pick. Maybe we pick the top one. And then all of 162 00:11:24,015 --> 00:11:28,096 the sudden, this second top vertex is also explored so the conquered territory is a 163 00:11:28,096 --> 00:11:33,042 union of them, and so now we have a new frontier. So now again we have two edges 164 00:11:33,042 --> 00:11:38,029 that cross from the explored nodes to the unexplored nodes. These are the edges that 165 00:11:38,029 --> 00:11:42,082 are in some sense going from northwest to southeast. Again, we pick one of them. 166 00:11:42,082 --> 00:11:47,028 It's not clear how. The algorithm doesn't tell us, we just pick any of them. So, 167 00:11:47,028 --> 00:11:52,015 maybe for example we pick this right most edge crossing the frontier. Now the 168 00:11:52,015 --> 00:11:57,005 right most edge of these-- right most vertex of these four is explored so our 169 00:11:57,005 --> 00:12:01,050 conquered territory is the top three vertices. And now again we have two edges 170 00:12:01,050 --> 00:12:05,078 crossing the frontier. The two edges that are incident to the bottom node, we pick 171 00:12:05,078 --> 00:12:09,095 one of them, not clear which one, maybe this one. And now the bottom node is also 172 00:12:09,095 --> 00:12:13,086 explored. And now there are no edges crossing the frontier. So there are no 173 00:12:13,086 --> 00:12:17,076 edges who, on the one hand, have one end-point being explored, and the other 174 00:12:17,076 --> 00:12:21,098 end-point being unexplored. So these will be the four vertices, as one would hope, 175 00:12:21,098 --> 00:12:26,045 that the search will explore started from S. Well generally the claim is that this 176 00:12:26,045 --> 00:12:31,015 generic graph search algorithm does what we wanted. It finds everything findable from 177 00:12:31,015 --> 00:12:35,055 the starting point and moreover it doesn't explore anything twice. I think that's 178 00:12:35,055 --> 00:12:39,003 fairly clear that it doesn't explore anything twice. Right? As soon as you look 179 00:12:39,003 --> 00:12:42,078 at a node for the first time, you suck it into the conquered territory never to look 180 00:12:42,078 --> 00:12:46,057 at it again. Similarly as soon as you look at an edge, you suck them in. But when we 181 00:12:46,057 --> 00:12:50,005 explore breadth and depth first search, I'll be more precise about the running 182 00:12:50,005 --> 00:12:53,075 time and exactly what I mean by you don't explore something twice. So, at this level 183 00:12:53,075 --> 00:12:57,037 of generality, I just wanna focus on the first point, that any way you instantiate 184 00:12:57,037 --> 00:13:01,012 this search algorithm, it's going to find everything findable. So, what do I really 185 00:13:01,012 --> 00:13:05,049 mean by that? The formal claim is that at the termination of this algorithm, the 186 00:13:05,049 --> 00:13:10,014 nodes that we've marked exp-, explored, are precisely the ones that can be reached 187 00:13:10,014 --> 00:13:14,062 via a path from S. That's the sense in which the algorithm explores everything 188 00:13:14,062 --> 00:13:18,083 that could potentially be findable from the starting point S. And one thing I 189 00:13:18,083 --> 00:13:22,025 wanna mention is that this claim and the proof I'm going to give of it, it holds 190 00:13:22,025 --> 00:13:25,085 whether or not G is an undirected graph or a directed graph. In fact, almost all 191 00:13:25,085 --> 00:13:29,010 of the things that I'm gonna say about graph search, and I'm talking about 192 00:13:29,010 --> 00:13:32,048 breadth first search and depth first search, work in essentially the same way, 193 00:13:32,048 --> 00:13:36,009 either in undirected graphs or directed graphs. The obvious difference being in an 194 00:13:36,009 --> 00:13:39,069 undirected graph you can traverse an edge in either direction. In a directed graph, 195 00:13:39,069 --> 00:13:43,015 we're only supposed to traverse it in a forward direction from the tail to the 196 00:13:43,015 --> 00:13:46,070 head. The one big difference between undirected and directed graphs is when 197 00:13:46,070 --> 00:13:50,018 we connectivity computations and I'll remind you when we get to that point which 198 00:13:50,018 --> 00:13:53,084 one we're talking about. Okay? But for the most part, when we just talk about basic 199 00:13:53,084 --> 00:13:57,063 graph search it works essentially the same way whether it's undirected or directed. 200 00:13:57,063 --> 00:14:02,058 So keep that in mind. Alright, so why is this true? Why are the nodes that get 201 00:14:02,058 --> 00:14:08,055 explored precisely the nodes for which there's a path to them from S? Well, one 202 00:14:08,055 --> 00:14:13,035 direction is easy. Which is, you can't find anything which is not findable, that 203 00:14:13,035 --> 00:14:17,088 is, if you wind up exploring a node, the only reason that can happen is because you 204 00:14:17,088 --> 00:14:22,002 traversed a sequence of edges that got you there. And that sequence of edges 205 00:14:22,002 --> 00:14:26,050 obviously defines a path from S to V. If you really want to be pedantic about the 206 00:14:26,050 --> 00:14:31,021 forward direction that explored nodes have to have paths from S. Then you can 207 00:14:31,021 --> 00:14:36,005 just do an easy induction. And I'll leave this for you to check, if you want, in the 208 00:14:36,005 --> 00:14:40,083 privacy of your own home. So the important direction of this claim is really the 209 00:14:40,083 --> 00:14:45,055 opposite. Why is it that no matter how we instantiate this generic graph search 210 00:14:45,055 --> 00:14:50,021 procedure, it's impossible for us to miss anything. That's the crucial point, we 211 00:14:50,021 --> 00:14:54,057 don't miss anything that we could, in principle, find via a path. But we're 212 00:14:54,057 --> 00:14:59,009 gonna proceed by contradiction. So, what does that mean, we're going to assume 213 00:14:59,009 --> 00:15:03,090 that, the statement that we want to prove is true, is not true. Which means that, 214 00:15:03,090 --> 00:15:08,071 it's possible that, G has a path from s to v and yet, somehow our algorithm misses 215 00:15:08,071 --> 00:15:13,023 it, doesn't mark it as explored. Alright, that's the thing we're really hoping 216 00:15:13,023 --> 00:15:18,010 doesn't happen. So let's suppose it does happen and then derive a contradiction. So 217 00:15:18,010 --> 00:15:22,093 suppose G does have a path from s to some vertex v. Call the path P. I'm gonna draw 218 00:15:22,093 --> 00:15:27,066 the picture for an undirected graph but the situation would be same in the, in the 219 00:15:27,066 --> 00:15:32,053 directed case. So there's a bunch of hops, there's a bunch of edges and then 220 00:15:32,053 --> 00:15:37,023 eventually this path ends at v. Now the bad situation, the situation from which we 221 00:15:37,023 --> 00:15:41,052 want to derive a contradiction is that V is unexplored at the end of this 222 00:15:41,052 --> 00:15:46,098 algorithm. So let's take stock of what we know. S for sure is explored, right. We 223 00:15:46,098 --> 00:15:52,087 initialized this search procedure so that S is marked as explored. V by hypothesis 224 00:15:52,087 --> 00:15:57,067 in this proof by contradiction is unexplored. So S is explored, V is 225 00:15:57,067 --> 00:16:03,020 unexplored. So now imagine we, just in our heads as a thought experiment which 226 00:16:03,020 --> 00:16:09,003 traverse this path P. We start at S and we know it's explored. We go the next vertex, 227 00:16:09,003 --> 00:16:13,017 it may or may not have been explored, we're not sure. We go to the third vertex, 228 00:16:13,017 --> 00:16:17,037 again who knows. Might be explored, might be unexplored and so on, but by time we 229 00:16:17,037 --> 00:16:21,062 get to V, we know it's unexplored. So we start at S, it's been explored, we get to V 230 00:16:21,062 --> 00:16:26,010 it's been unexplored. So at some point there's some hop, along this path P, from 231 00:16:26,010 --> 00:16:31,015 which we move from an explored vertex, to an unexplored vertex. There has to be a 232 00:16:31,015 --> 00:16:36,013 switch, at some point, cuz the end of the day at the end of the path we're at an 233 00:16:36,013 --> 00:16:41,006 unexplored node. So consider the first edge, and there must be one that we switch 234 00:16:41,006 --> 00:16:45,058 from being at an explore node to being at an unexplored node. So, I'm going to call 235 00:16:45,058 --> 00:16:50,021 the end points of this purported edge U and W. Where U is the explored one and W is the 236 00:16:50,021 --> 00:16:54,035 unexplored one. Now, for all we know U could be the same as S, that's a 237 00:16:54,035 --> 00:16:58,065 possibility, or for all we know, W could be same as V. That's also a possibility. 238 00:16:58,065 --> 00:17:02,095 In the picture, I'll draw it as if this edge UX was somewhere in the middle of 239 00:17:02,095 --> 00:17:07,049 this path. But, again it may be at one of the ends. That's totally fine. But now in 240 00:17:07,049 --> 00:17:13,015 this case, there's something I need you to explain to me. How is it possible that, on 241 00:17:13,015 --> 00:17:18,086 the one hand, our algorithm terminated. And on the other hand, there's this edge U 242 00:17:18,086 --> 00:17:24,044 comma X. Where U has been explored and X has not been explored. That, my friends, 243 00:17:24,044 --> 00:17:28,070 is impossible. Our generic search algorithm does not give up. It does not 244 00:17:28,070 --> 00:17:33,020 terminate, unless there are no edges where the one end point is explored and the 245 00:17:33,020 --> 00:17:37,059 other end point is unexplored. As long as there's such an edge, it has, is gonna 246 00:17:37,059 --> 00:17:41,087 suck in that unexplored vertex into the conquered territory, it's gonna keep 247 00:17:41,087 --> 00:17:46,025 going. So the upshot is there's no way that our algorithm terminated with this 248 00:17:46,025 --> 00:17:50,048 picture. With there being an edge U X, U explored, X unexplored. So, that's the 249 00:17:50,048 --> 00:17:55,053 contradiction. This contradicts the fact that our algorithm terminated with V 250 00:17:55,053 --> 00:18:00,031 unexplored. So that is a general approach to graph search. So that I hope 251 00:18:00,031 --> 00:18:04,056 gives you the flavor of how this is going to work. But now there's two particular 252 00:18:04,056 --> 00:18:08,091 instantiations of this generic method that are really important and have their own 253 00:18:08,091 --> 00:18:12,069 suites of applications. So we're gonna focus on breadth-first search and 254 00:18:12,069 --> 00:18:16,083 depth-first search. We'll cover them in detail in the next couple of videos. I 255 00:18:16,083 --> 00:18:21,000 wanna give you the highlights to conclude this video. Now let me just make sure it's 256 00:18:21,000 --> 00:18:24,046 clear where the ambiguity in our generic method is. Why we can have different 257 00:18:24,046 --> 00:18:27,095 instantiations of it that potentially have different properties and different 258 00:18:27,095 --> 00:18:31,045 applications. The question is at a given iteration of this while loop, what do you 259 00:18:31,045 --> 00:18:34,099 got? You've got your nodes that you've already explored, so that includes S plus 260 00:18:34,099 --> 00:18:38,062 probably some other stuff, and then you've got your nodes that are unexplored, and 261 00:18:38,062 --> 00:18:41,075 then you have your crossing edges. Right? So, there are edges with one 262 00:18:41,075 --> 00:18:45,078 point in each side. And for an undirected graph, there's no orientation to worry 263 00:18:45,078 --> 00:18:49,092 about. These crossing edges just have one endpoint on the explored side, one 264 00:18:49,092 --> 00:18:54,005 endpoint on the unexplored side. In the directed case, you focus on edges where 265 00:18:54,005 --> 00:18:58,008 the tail of the edge is on the explored side and the head of the edge is on the 266 00:18:58,008 --> 00:19:02,016 unexplored side. So, they go from the explored side to the unexplored side. And 267 00:19:02,016 --> 00:19:06,019 the question is, in general, in an iteration of this while loop there's gonna 268 00:19:06,019 --> 00:19:10,037 be many such crossing edges. There are many different unexplored nodes we could 269 00:19:10,037 --> 00:19:14,056 go to next, and different strategies for picking the unexplored node to explore 270 00:19:14,056 --> 00:19:19,070 next leads us to different graph search algorithms with different properties. So 271 00:19:19,070 --> 00:19:24,080 the first specific search strategy we're gonna study is breadth first search, 272 00:19:24,080 --> 00:19:29,082 colloquially known as BFS. So let me tell you sort of the high level idea and 273 00:19:29,082 --> 00:19:34,098 applications of bread first search. So, the goal is going to be to explore the 274 00:19:34,098 --> 00:19:40,008 nodes in what I call, layers. So, the starting point S will be in its own layer, 275 00:19:40,008 --> 00:19:44,091 Layer-0. The neighbors of S will constitute Layer-1, and then Layer-2 will 276 00:19:44,091 --> 00:19:50,044 be the nodes that are neighbors of Layer-1 but that are not already in layer zero or 277 00:19:50,044 --> 00:19:55,008 layer one, and so on. So layer i plus one, is the stuff next to layer i that you 278 00:19:55,008 --> 00:19:59,067 haven't already seen yet. You can think of this as a fairly cautious and tentative 279 00:19:59,067 --> 00:20:03,038 exploration of the graph. And it's gonna turn out that there's a close 280 00:20:03,038 --> 00:20:07,057 correspondence between these layers and shortest path distances. So if you wanna 281 00:20:07,057 --> 00:20:11,076 know the minimal number of hops, the minimal number of edges you need in a path 282 00:20:11,076 --> 00:20:15,095 to get from point A to point B in a graph. The way we wanted to know the fewest 283 00:20:15,095 --> 00:20:20,008 number of edges in the movie graph necessary to connect to John Hamm to Kevin 284 00:20:20,008 --> 00:20:24,033 Bacon. That corresponds directly to these layers. So if a node is in layer i, then 285 00:20:24,033 --> 00:20:28,054 you need i edges to get from S to i in the graph. Once we discuss breadth-first search, 286 00:20:28,054 --> 00:20:32,010 we'll also discuss how to compute the connected components, or the different 287 00:20:32,010 --> 00:20:35,084 pieces, of an undirected graph. Turns out this isn't that special to breadth-first 288 00:20:35,084 --> 00:20:38,093 search, you can use any number of graph search strategies to compute 289 00:20:38,093 --> 00:20:42,067 connected components in undirected graphs. But I'll show you how to do it using a 290 00:20:42,067 --> 00:20:46,054 simple looped version of breadth-first search. And we'll be able to do this stuff 291 00:20:46,054 --> 00:20:50,084 in the linear time that we want. The very ambitious goal of getting linear time. To 292 00:20:50,084 --> 00:20:55,004 get the linear time implementation, you do wanna use the right data structure, but 293 00:20:55,004 --> 00:20:58,093 it's a simple, simple data structure, something probably you've seen in the 294 00:20:58,093 --> 00:21:02,097 past. Namely a queue. So, something's that first in and first out. So, the second 295 00:21:02,097 --> 00:21:07,032 search strategy that's super important to know is depth first search, also known as 296 00:21:07,032 --> 00:21:11,052 DFS to its friends. Depth first search has a rather different feel than breadth 297 00:21:11,052 --> 00:21:15,086 first search. It's a much more aggressive search where you immediately try to plunge 298 00:21:15,086 --> 00:21:20,000 as deep as you can. It's very much in the spirit of how you might explore a maze, 299 00:21:20,000 --> 00:21:24,003 where you go as deeply as you can only backtracking when absolutely necessary. 300 00:21:24,003 --> 00:21:27,090 Depth first search will also have its own set of applications. It's not, for 301 00:21:27,090 --> 00:21:31,077 example, very useful for computing shortest path information, but especially 302 00:21:31,077 --> 00:21:36,012 in directed graphs it's going to do some remarkable things for us. So, in directed 303 00:21:36,012 --> 00:21:40,030 acyclic graphs, so a directed graph with no directed cycles it will give us 304 00:21:40,030 --> 00:21:44,042 what's called the topological ordering. So it'll sequence the nodes in a linear 305 00:21:44,042 --> 00:21:48,028 ordering from the first to the last, so that all of the arcs of the directed graph 306 00:21:48,028 --> 00:21:52,015 go forward. So this is useful for example if you have a number of tasks that need to 307 00:21:52,015 --> 00:21:55,078 get completed with certain precedence constraints. Like for example you have to 308 00:21:55,078 --> 00:21:59,018 take all of the classes in your undergraduate major, and there was certain 309 00:21:59,018 --> 00:22:02,068 prerequisites, topological ordering will give you a way in which to do it, 310 00:22:02,068 --> 00:22:06,086 respecting all of the prerequisites. And finally where for undirected graphs it doesn't really 311 00:22:06,086 --> 00:22:11,025 matter whether you use BFS or DFS to connect the components, in the directed graphs where 312 00:22:11,025 --> 00:22:15,053 even defining connected components is a little tricky it turns out depth first 313 00:22:15,053 --> 00:22:19,098 search is exactly what you want. That's what you're going to get a linear time 314 00:22:19,098 --> 00:22:25,024 implementation for computing the right notion of connected components in the directed graph case. Time-wise, 315 00:22:25,024 --> 00:22:30,030 both of these are superb strategies for exploring a graph. They're both linear 316 00:22:30,030 --> 00:22:35,003 time with very good constants. So depth-first search again, we're gonna 317 00:22:35,003 --> 00:22:39,079 get O of M plus N time in a graph with M edges and N vertices. You do wanna use a 318 00:22:39,079 --> 00:22:43,069 different data structure reflecting the different search strategy. So, here 319 00:22:43,069 --> 00:22:48,007 because you're exploring aggressively, as soon as you get to a node you'll meet and 320 00:22:48,007 --> 00:22:52,013 you start exploring its neighbors, you wanna last-in first-out data structure, 321 00:22:52,013 --> 00:22:56,019 also known as a stack. Depth first search also admits a very elegant recursive 322 00:22:56,019 --> 00:23:00,046 formulation, and in that formulation, you don't even need to maintain a stack data 323 00:23:00,046 --> 00:23:04,073 structure explicitly, the stack is implicitly taken care of in the recursion. 324 00:23:04,073 --> 00:23:08,096 So that concludes this overview of graph search. Both what it is, what our goals 325 00:23:08,096 --> 00:23:13,035 are, what kind of applications they have and two of the most common strategies. The 326 00:23:13,035 --> 00:23:17,048 next couple videos are going to explore these search strategies, as well as a 327 00:23:17,048 --> 00:23:19,095 couple of these applications in greater depth.