1 00:00:00,000 --> 00:00:04,036 So in this lecture we're going to drill down into our first specific, search 2 00:00:04,036 --> 00:00:08,183 strategy for graphs and also explore some applications. Namely, breadth first 3 00:00:08,183 --> 00:00:13,224 search. So let me remind you the intuition and applications of breath first search. 4 00:00:13,224 --> 00:00:19,418 The plan is to systematically explore the nodes of this graph beginning 5 00:00:19,418 --> 00:00:25,144 with the given starting vertex in layers. So let's think about the following example 6 00:00:25,144 --> 00:00:31,557 graph. Where S is the starting point for our breadth first search. So to start 7 00:00:31,557 --> 00:00:37,002 vertex S will constitute the first layer. So we'll call that L zero. And then 8 00:00:37,002 --> 00:00:41,389 the neighbors of S are going to be the first layer. And so those are the vertices 9 00:00:41,389 --> 00:00:48,623 that we explore just after S. So those are L one. Now the second layer is going 10 00:00:48,623 --> 00:00:53,517 to be the vertices that are neighboring vertices of L one but are not 11 00:00:53,517 --> 00:00:57,713 themselves in L one or for that matter L zero. So that's going to be 12 00:00:57,713 --> 00:01:04,531 C and D. That's going to be the second layer. Now you'll notice for example S is 13 00:01:04,531 --> 00:01:10,572 itself a neighbor of these nodes in layer one, but we've already counted that in a 14 00:01:10,572 --> 00:01:13,283 previous layer so we don't count S toward L two. And then finally the 15 00:01:13,283 --> 00:01:18,335 neighbors of L two, which are not already put in some layer is E. That will 16 00:01:18,335 --> 00:01:24,016 be layer three. Again notice C and D are neighbors of each other but, they've 17 00:01:24,016 --> 00:01:27,840 already been classified in layer two. So, that's where they belong, not in layer 18 00:01:27,840 --> 00:01:31,362 three. So that's the high level picture of breadth first search you should have. We'll 19 00:01:31,362 --> 00:01:34,264 talk about how to actually precisely implement it on the next slide. 20 00:01:34,264 --> 00:01:38,378 Again just a couple other things that you can do with breadth first search which we'll 21 00:01:38,378 --> 00:01:42,836 explore in this video is computing shortest paths. So it turns out shortest path 22 00:01:42,836 --> 00:01:47,843 distances correspond precisely to these layers. So, for example if you had that as 23 00:01:47,843 --> 00:01:52,152 S, you had that as the Kevin Bacon node in the movie graph, then Jon Hamm would pop 24 00:01:52,152 --> 00:01:57,473 up in the second layer from the breadth first search from Kevin Bacon. I'm also 25 00:01:57,473 --> 00:02:00,094 going to show you how to compute the connected components of an undirected 26 00:02:00,094 --> 00:02:03,907 graph. That is to compute its pieces. We'll do that in linear time. And for this 27 00:02:03,907 --> 00:02:07,148 entire sequence of videos on graph primitives, we will be satisfied with 28 00:02:07,148 --> 00:02:12,480 nothing less than the holy grail of linear time. And again, remember in a graph you 29 00:02:12,480 --> 00:02:16,299 have two different size parameters, the number of edges M and the number of nodes 30 00:02:16,299 --> 00:02:20,314 N. For these videos I'm not going to assume any relationship between M and N. 31 00:02:20,314 --> 00:02:25,384 Either one could be bigger. So linear time's gonna mean O of M plus N. So let's 32 00:02:25,384 --> 00:02:29,490 talk about how you'd actually implement breadth first search in linear time. So 33 00:02:29,490 --> 00:02:36,431 the sub routine is given as input both the graph G. I'm gonna explain this as if 34 00:02:36,431 --> 00:02:39,558 it's undirected, but this entire procedure will work in exactly the same way for a 35 00:02:39,558 --> 00:02:43,046 directed graph. Again, obviously in an undirected graph you can traverse an edge 36 00:02:43,046 --> 00:02:46,320 in either direction. For a directed graph, you have to be careful only to traverse 37 00:02:46,320 --> 00:02:50,583 arcs in the intended direction from the tail to the head, that is traverse them 38 00:02:50,583 --> 00:02:54,037 forward. So as we discussed when we talked about just generic strategies for graph 39 00:02:54,037 --> 00:02:57,333 search, we don't want to explore anything twice, that would certainly be 40 00:02:57,333 --> 00:03:00,866 inefficient. So we're going to keep a boolean at each node, marking whether 41 00:03:00,866 --> 00:03:04,792 we've already explored it or not. And by default, I'm just, we're just going to 42 00:03:04,792 --> 00:03:07,773 assume that nodes are unexplored. They're only explored if we explicitly mark them 43 00:03:07,773 --> 00:03:13,241 as such. So we're going to initialize the search with the starting vertex S. So we 44 00:03:13,241 --> 00:03:16,739 mark S as explored and then we're gonna put that in what I was previously calling 45 00:03:16,739 --> 00:03:20,531 conquered territory the nodes we have already started to explore. So to get 46 00:03:20,531 --> 00:03:25,132 linear time we are gonna have to manage those in a slightly non naive but, but 47 00:03:25,132 --> 00:03:29,452 pretty straightforward way namely via a queue, which is a first in first out data 48 00:03:29,452 --> 00:03:33,130 structure that I assume you have seen. If you have never seen a queue before, please 49 00:03:33,130 --> 00:03:37,540 look it up in a programming textbook or on the web. Basically a queue is just 50 00:03:37,540 --> 00:03:41,512 something where you can add stuff to the back in constant time and you can take 51 00:03:41,512 --> 00:03:45,404 stuff from the front in constant time. You can implement these, for example, using a 52 00:03:45,404 --> 00:03:50,167 doubly linked list. Now recall that in the general systematic approach to 53 00:03:50,167 --> 00:03:54,533 graph search, the trick was to, in each iteration of some while loop, to add one 54 00:03:54,533 --> 00:03:59,741 new vertex to the conquered territory. To identify one unexplored node that is now 55 00:03:59,741 --> 00:04:05,939 going to be explored. So that while loop's gonna translate into one in which we just check 56 00:04:05,939 --> 00:04:10,586 if the queue is non-empty. So we're assuming that the queue data structure 57 00:04:10,586 --> 00:04:15,806 supports that query in constant time which is easy to implement. And if the queue is 58 00:04:15,806 --> 00:04:19,936 not empty we remove a node from it. And because it's a queue, removing nodes from 59 00:04:19,936 --> 00:04:24,559 the front is what you can do in constant time. So call the node that you get out of 60 00:04:24,559 --> 00:04:29,644 the queue V. So, now we're going to look at V's neighbors, vertices with which it 61 00:04:29,644 --> 00:04:32,467 shares edges, and we're gonna see if any of them have not already been explored. 62 00:04:32,467 --> 00:04:37,293 So, if W's something we haven't seen before, that's unexplored, that means it's 63 00:04:37,293 --> 00:04:40,691 in the unconquered territory, which is great. So, we have a new victim. We can 64 00:04:40,691 --> 00:04:44,831 mark W as explored. We can put it in our queue and we've advanced the frontier and 65 00:04:44,831 --> 00:04:49,601 now we have one more explored node than we did previously. And again, a queue by 66 00:04:49,601 --> 00:04:53,932 construction, it supports adding constant time additions at the end of the queue, so 67 00:04:53,932 --> 00:04:58,139 it's where we put W. So, let's see how this code actually executes in this same graph 68 00:04:58,139 --> 00:05:02,380 that we were looking at in the previous slide. And what I'm gonna do is I'm gonna 69 00:05:02,380 --> 00:05:08,523 number the nodes in the order in which they are explored. So, obviously 70 00:05:08,523 --> 00:05:14,182 the first node to get explored is S. That's where the queue starts. So now, when we 71 00:05:14,182 --> 00:05:17,174 follow the code, what happens? Well in the first iteration of the while loop we ask 72 00:05:17,174 --> 00:05:21,686 is the queue empty? No it's not, because S is in it. So we remove in this case the only 73 00:05:21,686 --> 00:05:28,570 node of the queue. It's S. And then we iterate over the edges incident to S. Now there are two 74 00:05:28,570 --> 00:05:32,688 of them. There's the edge between S and A and there's the edge between S and B. And 75 00:05:32,688 --> 00:05:36,457 again this is still a little under specified. In the sense that the algorithm 76 00:05:36,457 --> 00:05:39,618 doesn't tell us which of those two edges we should look at. Turns out it doesn't 77 00:05:39,618 --> 00:05:43,015 matter. Each of those is a valid execution of breadth first search. But 78 00:05:43,015 --> 00:05:47,234 for concreteness, let's suppose that of the two possible edges, we look at the 79 00:05:47,234 --> 00:05:52,319 edge S comma A. So, then we ask, has A already been explored? No, it hasn't. This 80 00:05:52,319 --> 00:05:55,322 is the first time we've seen it, so we say, oh goody. This is sort of new grist 81 00:05:55,322 --> 00:06:01,510 for the mill. So, we can add A to the queue at the end and we mark W as, sorry 82 00:06:01,510 --> 00:06:08,316 mark A as explored. So, A is gonna be the second vertex that we mark. So, after 83 00:06:08,316 --> 00:06:11,883 marking A as explored and adding it to the queue, so now we go back to the for loop, 84 00:06:11,883 --> 00:06:15,798 and so now we move on to the second edge. It's into S, that's the edge between S and 85 00:06:15,798 --> 00:06:19,882 B. So, we ask, have we already explored B? Nope, this is the first time we've seen 86 00:06:19,882 --> 00:06:23,177 it. So, now we have the same thing with B. So, B gets marked as explored and gets 87 00:06:23,177 --> 00:06:29,368 added to the queue at the end. So the queue at this juncture has first a record 88 00:06:29,368 --> 00:06:34,321 for A, cause that was the first one we put in it after we took S out. And then B 89 00:06:34,321 --> 00:06:39,530 follows A in the queue. Again, depending on the execution this could go either way. But 90 00:06:39,530 --> 00:06:43,659 for concreteness, I've done it so that A got added before B. So at this point, this 91 00:06:43,659 --> 00:06:47,090 is what the queue looks like. So now we go back up to the while loop, we say is 92 00:06:47,090 --> 00:06:50,510 the queue empty? Certainly not. There's actually two elements. Now we remove the 93 00:06:50,510 --> 00:06:55,326 first node from queue, in this case, that's the node A that was the one we put in 94 00:06:55,326 --> 00:06:59,195 before the node B. And so now we say, well, let's look at all the edges incident 95 00:06:59,195 --> 00:07:02,581 to A. And in this case A has two two incident edges. It has one that it 96 00:07:02,581 --> 00:07:06,699 shares with S and it has one that it shares with C. And so, if we look at the 97 00:07:06,699 --> 00:07:11,399 edge between A and S, then we'd be asking an if statement. Has S already been 98 00:07:11,399 --> 00:07:14,842 explored? Yes it has, that's where we started. So, there's no reason to do 99 00:07:14,842 --> 00:07:18,644 anything with S. That's already been taken out of the queue. So, in this for 100 00:07:18,644 --> 00:07:23,189 loop for A, there's two iterations. One involves the edge with S, and that one we 101 00:07:23,189 --> 00:07:27,487 completely ignore. But then there's the other edge that A shares with C, and C we 102 00:07:27,487 --> 00:07:32,077 haven't seen yet. So, at that part of the for loop, we say ahah. C is a new thing, 103 00:07:32,077 --> 00:07:36,250 new node we can mark as explored and put in the queue. So, that's gonna be our 104 00:07:36,250 --> 00:07:41,425 number four. So now how has the queue changed. Well, we got rid of A. And 105 00:07:41,425 --> 00:07:47,478 so now B is in the front and we added C at the end. And so now the same thing 106 00:07:47,478 --> 00:07:51,392 happens. We go back to the while loop, the queue is not empty, we take off the first 107 00:07:51,392 --> 00:07:56,162 vertex, in this case that's gonna be B. B has three incident edges, it has one 108 00:07:56,162 --> 00:08:00,831 incident S but that's irrelevant, we've already seen S. It has one incident to C, 109 00:08:00,831 --> 00:08:03,441 that's also irrelevant, that's also irrelevant, because we've already seen C. 110 00:08:03,441 --> 00:08:06,805 True, we just saw it very recently, but we've already seen it. But the edge 111 00:08:06,805 --> 00:08:11,271 between B and D is new, and so that means we can take the node D, mark it as explored 112 00:08:11,271 --> 00:08:17,459 and add it to the queue. So D is going to be the fifth one that we see. And now the 113 00:08:17,459 --> 00:08:23,761 queue has the element C followed by D. So now we go back to the while loop and we take C off 114 00:08:23,761 --> 00:08:30,421 of the queue. It again has four now edges. The one with A is irrelevant, we've already 115 00:08:30,421 --> 00:08:33,312 seen A. The one with B is irrelevant, we've already seen B. The one with D is 116 00:08:33,312 --> 00:08:36,834 irrelevant, we've already seen D. But we haven't seen E yet. So, when we get to the 117 00:08:36,834 --> 00:08:42,200 part of the for loop, or the edge between C and E, we say, aha, E is new. So E will 118 00:08:42,200 --> 00:08:47,206 be the sixth and final vertex to be marked as explored. And that will get added at 119 00:08:47,206 --> 00:08:51,554 the end of the queue. So then in the final two iterations of the while loop 120 00:08:51,554 --> 00:08:54,479 the D is going to be removed, we'll iterate through its three edges, none of 121 00:08:54,479 --> 00:08:58,258 those will be relevant because we've seen the three endpoints. And then we'll go 122 00:08:58,258 --> 00:09:01,330 back to the while loop and we'll get rid of the E. E is irrelevant cause it has two 123 00:09:01,330 --> 00:09:04,131 edges we've already seen the other endpoints. Now we go back to the while loop. 124 00:09:04,131 --> 00:09:09,700 The queue is empty. And we stop. That is breadth-first search. And to see how this 125 00:09:09,700 --> 00:09:14,751 simulates the notion of the layers that we were discussing in the previous slide 126 00:09:14,751 --> 00:09:22,559 notice that the nodes are numbered according to the layer that they're in, so 127 00:09:22,559 --> 00:09:30,018 S was layer zero. And then the two nodes that S caused to get added to the queue, the A 128 00:09:30,018 --> 00:09:37,071 and the B, are number two and three, and the edges of layer three are precisely the 129 00:09:37,071 --> 00:09:40,604 ones, sorry the edges of layer two are precisely the ones that got added to the 130 00:09:40,604 --> 00:09:45,858 queue, while we were processing the nodes from layer one. That is, C and D are 131 00:09:45,858 --> 00:09:50,324 precisely the nodes that got added to the queue while we were processing A and B. 132 00:09:50,324 --> 00:09:58,109 So, this is level zero, level one, and level two. E is the only node that got 133 00:09:58,109 --> 00:10:01,777 added to the queue while we were processing level, layer two. The vertices 134 00:10:01,777 --> 00:10:09,000 C and D. So E will be the third layer. So, in that sense, by using a first in first 135 00:10:09,000 --> 00:10:14,827 out data structure, this queue, we do wind up kinda processing the nodes according to 136 00:10:14,827 --> 00:10:19,079 the layers that we discussed earlier. So, the claim that breadth first search is a 137 00:10:19,079 --> 00:10:22,949 good way to explore a graph, in the sense that it meets the two high level goals 138 00:10:22,949 --> 00:10:27,305 that we delineated in the previous video. First of all it finds everything findable, 139 00:10:27,305 --> 00:10:31,387 and obviously nothing else, and second of all, it does it without redundancy. It 140 00:10:31,387 --> 00:10:34,739 does it without exploring anything twice, which is the key to its linear time 141 00:10:34,739 --> 00:10:40,443 implementation. So a little bit more formally, claim number one. At the end of 142 00:10:40,443 --> 00:10:46,157 the algorithm, the vertices that we've explored are precisely the ones such that 143 00:10:46,157 --> 00:10:51,355 there was a path from S to that vertex. Again this claim is equally valid, whether 144 00:10:51,355 --> 00:10:54,786 you're running BFS in an undirected graph or a directed graph. Of course in an 145 00:10:54,786 --> 00:10:58,669 undirected graph, meaning an undirected path from S to V, whereas a directed graph 146 00:10:58,669 --> 00:11:02,819 in a directed path from S to V. That means a path where every arc in the path gets 147 00:11:02,819 --> 00:11:06,857 traversed in the forward direction. So, why is this true? Well, this is true, we 148 00:11:06,857 --> 00:11:11,245 basically proved this more generally for any graph search strategy of a certain 149 00:11:11,245 --> 00:11:17,702 form of which breadth first search is one. If it's hard for you to see the right way 150 00:11:17,702 --> 00:11:20,279 to interpret breadth-first search as a special case of our generic search 151 00:11:20,279 --> 00:11:23,507 algorithm, you can also just look at our proof for the generic search algorithm and 152 00:11:23,507 --> 00:11:27,749 copy it down for breadth-first search. So it's clear that you're only gonna, 153 00:11:27,749 --> 00:11:31,056 again, the forward direction of this claim is clear. If you actually 154 00:11:31,056 --> 00:11:35,297 find something, if something's marked as explored, it's only because you found a 155 00:11:35,297 --> 00:11:38,672 sequence of edges that led you there. So the only way you mark something as 156 00:11:38,672 --> 00:11:42,812 explored is if there's a path from S to V. Conversely, to prove that anything with an 157 00:11:42,812 --> 00:11:46,940 S to V, for with a path from V will be found, you can proceed by contradiction: 158 00:11:46,940 --> 00:11:50,788 you can look at the part of the path from S to V that, that BFS does successfully 159 00:11:50,788 --> 00:11:55,581 explore, and then you gotta ask, why didn't it go one more hop? It never 160 00:11:55,581 --> 00:11:59,180 would've terminated before reaching all the way to V. So, you can also just copy 161 00:11:59,180 --> 00:12:02,956 that same proof that we had for the generic search strategy in the previous 162 00:12:02,956 --> 00:12:06,043 video. Okay? So, again, the upshot. Breadth first search finds everything 163 00:12:06,043 --> 00:12:09,204 you'd wanna find. Okay? So, it only traverses paths, so you're not gonna find 164 00:12:09,204 --> 00:12:12,388 anything where there isn't a path to it. But it never misses out. Okay? Anything 165 00:12:12,388 --> 00:12:17,226 where there's a path, BFS, guaranteed to find it. No problem. Claim number two is 166 00:12:17,226 --> 00:12:22,367 that the running time is exactly what we want and I am gonna state it in a form 167 00:12:22,367 --> 00:12:26,079 that will be useful later when we talk about connected components. So the running 168 00:12:26,079 --> 00:12:30,208 time of the main while loop, ignoring any kind of pre processing or initialization is 169 00:12:30,208 --> 00:12:37,486 proportional to what I am gonna call NS and MS which is the number of nodes that 170 00:12:37,486 --> 00:12:43,133 can be reached from S and number of edges that can be reached from S. And the reason 171 00:12:43,133 --> 00:12:47,094 for this claim it just becomes clear if you inspect the code which we'll do in a 172 00:12:47,094 --> 00:12:51,605 second. So let's return to the code and just tally up all the work that gets done. 173 00:12:51,605 --> 00:12:55,869 So I'm gonna ignore this initialization. I'm just gonna focus on the main while 174 00:12:55,869 --> 00:13:00,740 loop. So we can summarize the total work done in this while loop as follows. First 175 00:13:00,740 --> 00:13:05,600 we just think about the vertices so in this search we're only gonna deal, ever 176 00:13:05,600 --> 00:13:09,695 deal with the vertices that are findable from S, so that's NS. And what do we do 177 00:13:09,695 --> 00:13:13,724 for the given node, well we insert it into the queue and we delete it from the 178 00:13:13,724 --> 00:13:16,998 queue. Alright? So we're never gonna deal with a single node more than once. So 179 00:13:16,998 --> 00:13:20,237 that's constant time overhead per vertex that we ever see, so that's the proportion 180 00:13:20,237 --> 00:13:24,614 of the NS part. Now, a given edge, we might look at it twice. So, for an edge V 181 00:13:24,614 --> 00:13:28,866 W, we might consider it once when we first look at the vertex V, and we might 182 00:13:28,866 --> 00:13:32,602 consider it again when we look at the vertex W. Each time we look at an edge we 183 00:13:32,602 --> 00:13:37,967 do constant work. So that means we're only gonna do constant work per edge. Okay. So 184 00:13:37,967 --> 00:13:41,680 we look at each vertex at most once. We look at each edge findable from S at most 185 00:13:41,680 --> 00:13:45,786 twice. We do constant time, constant work when we look at something. So the overall 186 00:13:45,786 --> 00:13:50,657 running time is going to be proportional to the number of vertices findable from S 187 00:13:50,657 --> 00:13:55,090 plus the number of edges findable from S. So, that's really cool. We have a linear 188 00:13:55,090 --> 00:13:59,781 time of implementation of a really nice graph search strategy. Moreover we just need 189 00:13:59,781 --> 00:14:03,921 very basic data structures, a queue, to make it run fast with small constants. But 190 00:14:03,921 --> 00:14:08,511 it gets even better. We can use breadth first search as a work horse for some 191 00:14:08,511 --> 00:14:17,083 interesting applications. So, that's what we'll talk about in the rest of this video.