So let's talk about the absolutely fundamental problem of searching a graph, and the very related problem of finding paths through graphs. So why would one be interested in searching a graph, or in figuring out if there's a path from point A to point B? Well, there are many, many reasons; I'm going to give you a highly non-exhaustive list on this slide. So let me begin with a very obvious and literal example, which is if you have a physical network. Then often you want to make sure that the network is fully connected, in the sense that you can get from any starting point to any other point. So, for example, think back to the phone network. It would've been a disaster if callers from California could only reach callers in Nevada, but not their family members in Utah. So obviously, a minimal condition for functionality of something like a phone network is that you can get from any one place to any other place; similarly for road networks within a given country, and so on. It can also be fun to think about other, non-physical networks and ask if they're connected. So, one network that's fun to play around with is the movie network. This is the graph where the nodes correspond to actors and actresses, and you have an edge between two nodes if they played a role in a common movie. So this is going to be an undirected graph, where the edges correspond to, not necessarily co-starring, but both actors appearing at least at some point in the same movie. Versions of this movie network you should be able to find publicly available on the web, and there are lots of fun questions you can ask about it, like, for example, the minimum number of hops, or edges, needed to get from one actor to another, where a hop here is a movie in which two people both played a role. So perhaps the most famous statistic that's been thought about with the movie network is the Bacon number.
So this refers to the fairly ubiquitous actor Kevin Bacon, and the Bacon number of an actor is defined as the minimum number of hops you need in this movie graph to get to Kevin Bacon. So, for example, you can ask about Jon Hamm, also known as Don Draper from Mad Men. You could ask how many edges you need on a path through the movie graph to get to Kevin Bacon, and it turns out that the answer is two edges. You need one intermediate point, namely Colin Firth. That's because Colin Firth and Kevin Bacon both starred in the Atom Egoyan movie Where the Truth Lies, and Jon Hamm and Colin Firth were both in the movie A Single Man. So that gives Jon Hamm a Bacon number of two. These are the kinds of questions you can ask about connectivity, not just in physical networks, like telephone and telecommunication networks, but also in logical networks about pairwise relationships between objects more generally. Now, the Bacon number is fundamentally not about just any path, but actually about shortest paths: the minimum number of edges you need to traverse to get from one actor to Kevin Bacon. And shortest paths also have a very practical use that you might use yourself in driving directions. When you use a website or a phone app and ask for the best way to get from where you are now to, say, some restaurant where you're going to have dinner, obviously you're trying to find some kind of path through a network, through a graph. And indeed, often you want the shortest path, perhaps in mileage or perhaps in anticipated travel time. Now, I realize that when you're thinking about paths and graphs, it's natural to focus on very literal paths through quite literal, physical networks, things like routes through a road network, or paths through the internet, and so on.
But you should really think of a path more abstractly, as just a sequence of decisions taking you from some initial state to some final state. And it's this abstract mentality that makes graph search so ubiquitous in fields like artificial intelligence, where you want to formulate a plan for how to get from an initial state to some goal state. So, to give a simple recreational example, you can imagine trying to compute automatically a way to fill in a Sudoku puzzle, so that you solve the puzzle correctly. So you might ask, you know, what is the graph that we're talking about when we want to solve a Sudoku puzzle? Well, this is going to be a directed graph where the nodes correspond to partially completed puzzles. So, for example, at one node of this extremely large graph, perhaps 40 out of the 81 cells are filled in with some kind of number. And now, again, remember a path is supposed to correspond to a sequence of decisions. So what are the actions that you take in solving Sudoku? Well, you fill in a number in a square. So an edge, which here is going to be directed, is going to move you from one partially completed puzzle to another, where one previously empty square gets filled in with one number. And then, of course, there's the path that you're interested in computing, what you search for when you search this graph: you begin with the initial state of the Sudoku puzzle, and you want to reach some goal state where the Sudoku puzzle is completely filled in without any violations of the rules of Sudoku. And of course it's easy to imagine millions of other situations where you want to formulate some kind of plan like this. Like, for example, if you have a robotic hand and you just want to grasp some object, you need to think about exactly how to approach the object with this robotic hand, so that you can grab it without, for example, first knocking it over.
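To make the Sudoku graph concrete, here's a minimal Python sketch under my own assumptions about the representation: a board is a tuple of 81 entries, with 0 marking an empty cell, and the function name `successors` is hypothetical. It generates the out-neighbors of a node, namely every puzzle reachable by one directed edge, i.e., by legally filling in a single empty square.

```python
def successors(board):
    """Yield every board reachable from `board` by filling in one
    empty cell with a digit that doesn't violate the Sudoku rules.
    `board` is a tuple of 81 entries; 0 marks an empty cell."""
    for i, cell in enumerate(board):
        if cell != 0:
            continue                       # only empty cells can be filled
        row, col = divmod(i, 9)
        for digit in range(1, 10):
            # the move must not clash with its row, column, or 3x3 box
            clash = any(
                board[j] == digit
                for j in range(81)
                if j != i and (j // 9 == row
                               or j % 9 == col
                               or (j // 9 // 3, j % 9 // 3) == (row // 3, col // 3)))
            if not clash:
                yield board[:i] + (digit,) + board[i + 1:]
```

Searching this graph from the initial puzzle for a node with no remaining zeros is then exactly the planning problem described above.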
And you can think of millions of other examples. Another problem, which turns out to be closely related to graph search, as we'll see, and which has many applications in its own right, is that of computing connectivity information about graphs, in particular the connected components. This, especially for undirected graphs, corresponds to the pieces of the graph. We'll talk about these topics in their own right, and I'll give you applications for them later. For undirected graphs, I'll briefly mention an easy clustering heuristic you can derive from computing connected components. For directed graphs, where the very definition of connected components is a bit more subtle, I'll show you applications to understanding the structure of the web. So these are a few of the reasons why it's important for you to understand how to efficiently search graphs. It's a fundamental and widely applicable graph primitive. And I'm happy to report that, in this section of the course, for pretty much any question we want to answer about graph search, computing connected components, and so on, there's going to be a really fast algorithm to do it. So this will be a part of the course where there are lots of what I call "for-free" primitives: processing steps, subroutines, that you can run without even thinking about it. All of the algorithms we're going to discuss in the next several lectures run in linear time, and they're going to have quite reasonable constants. So they're barely slower than reading the input. So if you have a graph and you're trying to reason about it, trying to make sense of it, you should in some sense feel free to apply any of the subroutines we're going to discuss, to try and glean some more information about what it looks like, how you might use the network data. There are a lot of different approaches to systematically searching a graph, many methods. In this class, we're going to focus on two very important ones.
Namely, breadth-first search and depth-first search. But all of the graph search methods share some things in common, so on this slide, let me just tell you the high-order bits of really any graph search algorithm. Graph search subroutines are generally passed as input a starting vertex from which the search originates; that's often called a source vertex. And your goal, then, is to find everything findable from this starting vertex. Obviously, you're not going to find anything that's not findable. What do I mean by findable? I mean there's a path from the starting point to this other node. So for any other node to which you can get along a path from the starting point, you should discover it. So, for example, if you're given an undirected graph that has three different pieces, like this one I'm drawing on the right, and perhaps S is this leftmost node here, then the findable vertices starting from S, the ones you can reach along a path from S, are clearly precisely these four vertices. So you would want graph search to automatically, and efficiently, discover these four vertices if you started at S. You could also think about a directed version of the exact same graph, where we direct the edges like so. Now the definition of findable nodes is a little bit different. We're only expecting to follow arcs in the forward direction, so we should only expect, at best, to find all of the nodes you can reach by following a succession of forward arcs, that is, any node that there's a path to from S. So, in this case, these three nodes would be the ones we'd be hoping to find. This blue node to the right we would no longer expect to find, because the only way to get there from S is by going backwards along arcs, and that's not what we're going to be thinking about in our graph searches. So we want to find everything findable, i.e., everything we can get to along paths. But we want to do it efficiently.
And efficiently means we don't explore anything twice, right? So the graph has M arcs, M edges, and N nodes, N vertices. And really, we want to look at each piece of the graph only once, or a small constant number of times. So we're looking for a running time which is linear in the size of the graph, that is, big O of M plus N. Now, when we were talking about representing graphs, I said that in many applications it's natural to focus on connected graphs, in which case M is going to dominate N; you're going to have at least as many edges as nodes, essentially. But connectivity is the classic case where you might have the number of edges being much smaller than the number of nodes. There might be many pieces, and the whole point of what you're trying to do is discover them. So for this sequence of lectures, when we talk about graph search and connectivity, we will usually write M plus N; we'll think that either one could be bigger or smaller than the other. So let me now give you a generic approach to graph search. It's going to be underspecified; there will be many different ways to instantiate it. Two particular instantiations will get us breadth-first search and depth-first search. But here is just a general systematic method for finding everything findable without exploring anything more than once. So, motivated by the second goal, the fact that we don't want to explore anything twice, with each node, with each vertex, we're going to remember whether or not we've explored it before. So we just need one Boolean per node, and we'll initialize it by having everything unexplored, except S, our starting point, which we'll have start off as explored. And it's useful to think of the explored nodes thus far as being, in some sense, territory conquered by the algorithm. And then there's going to be a frontier in between the conquered and unconquered territory. And the goal of the generic algorithm is that, at each step, we supplement the conquered territory by one new node.
Assuming, that is, that there is one adjacent to the territory we've already conquered. So, for example, in this top example with the undirected network, initially the only thing we've explored is the starting point S. That's sort of our home base; that's all that we have conquered so far. And then, in our main while loop, which we iterate as many times as we can, until no edge meets the following criterion, we look for an edge with one endpoint that we've already explored, one endpoint inside the conquered territory, and the other endpoint outside. This is how we can, in one hop, supplement the nodes we've seen by one new one. If we can't find such an edge, then this is where the search stops. If we can find such an edge, well, then we suck V into the conquered territory; we think of it as explored. And we return to the while loop. So, for example, in this example on the right, we start with the only explored node being S. Now, there are actually two edges that cross the frontier, in the sense that one of the endpoints is explored, namely one of the endpoints is S, and the other one is some other vertex, right? There are these two edges to the two vertices adjacent to S. So, in this algorithm, we pick either one; it's unspecified which one we pick. But maybe we pick the top one. And so then, all of a sudden, this second vertex is now also explored. So the conquered territory is the union of the two, and now we have a new frontier. So now, again, we have two edges that cross from the explored nodes to the unexplored nodes, the edges that are, in some sense, going from northwest to southeast. Again we pick one of them; the algorithm doesn't tell us which, we just pick any of them. So maybe, for example, we pick this rightmost edge crossing the frontier. Now the rightmost vertex of these four is explored, so our conquered territory is the top three vertices.
Now again, we have two edges crossing the frontier, the two edges that are incident to the bottom node. We pick one of them, not clear which one, maybe this one. And now the bottom node is also explored, and now there are no edges crossing the frontier. So there are no edges which, on the one hand, have one endpoint explored and, on the other hand, one endpoint unexplored. So these will be the four vertices, as one would hope, that the search explores, starting from S. More generally, the claim is that this generic graph search algorithm does what we wanted: it finds everything findable from the starting point, and moreover, it doesn't explore anything twice. I think it's fairly clear that it doesn't explore anything twice, right? As soon as you look at a node for the first time, you suck it into the conquered territory, never to look at it again. Similarly for edges. When we explore breadth-first and depth-first search, we'll be more precise about the running time and exactly what I mean by not exploring something twice. So at this level of generality, I just want to focus on the first point: that, any way you instantiate this generic search algorithm, it's going to find everything findable. So what do I really mean by that? The formal claim is that, at the termination of this algorithm, the nodes that we've marked explored are precisely the ones that can be reached via a path from S. That's the sense in which the algorithm explores everything that could potentially be findable from the starting point S. And one thing I want to mention is that this claim, and the proof we're going to give of it, holds whether you use an undirected graph or a directed graph. In fact, almost all of the things that I'm going to say about graph search, and in particular breadth-first search and depth-first search, work in essentially the same way for either undirected graphs or directed graphs.
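For concreteness, here is a minimal Python sketch of the generic algorithm just described, under my own assumptions about the input format (an edge list of an undirected graph; the function name `generic_search` is hypothetical). It deliberately leaves the choice of crossing edge unpinned, absorbing whichever such edge it happens to find first.

```python
def generic_search(edges, s):
    """Generic graph search: starting from s, repeatedly find ANY edge
    with exactly one endpoint explored, and absorb the other endpoint.
    `edges` is a list of (u, v) pairs of an undirected graph."""
    explored = {s}
    while True:
        # look for an edge crossing the frontier: exactly one endpoint explored
        crossing = next(((u, v) for u, v in edges
                         if (u in explored) != (v in explored)), None)
        if crossing is None:
            return explored              # nothing findable remains
        u, v = crossing
        explored.add(v if u in explored else u)   # conquer the new vertex
```

As written, each iteration rescans the whole edge list, so this naive version is far from linear time; breadth-first and depth-first search are precisely the refinements, with the right data structures, that make it fast.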
The obvious difference being that, in an undirected graph, you can traverse an edge in either direction, while in a directed graph, we're only supposed to traverse it in the forward direction, from the tail to the head. The one big difference between undirected and directed graphs comes when we do connectivity computations, and I'll remind you, when we get to that point, which one we're talking about. But for the most part, we're just talking about basic graph search, which works in essentially the same way whether the graph is undirected or directed, so keep that in mind. All right, so why is this true? Why are the nodes that get explored precisely the nodes to which there is a path from S? Well, one direction is easy, which is that you can't find anything that is not findable. That is, if you wind up exploring a node V, the only reason that can happen is because you traversed a sequence of edges that got you there, and that sequence of edges obviously defines a path from S to V. If you really want to be pedantic about this forward direction, that explored nodes have to have paths from S, you can do an easy induction, and I'll leave this for you to check, if you want, in the privacy of your own home. So the important direction of this claim is really the opposite. Why is it that, no matter how we instantiate this generic graph search procedure, it's impossible for us to miss anything? That's the crucial point: we don't miss anything that we could in principle find via a path. And we're going to proceed by contradiction. So what does that mean? We're going to assume that the statement we want to prove is not true. Which means it's possible that G has a path from S to V, and yet, somehow, our algorithm misses it, doesn't mark it as explored. That's the thing we're really hoping doesn't happen. So let's suppose it does happen and then derive a contradiction. So let's suppose G does have a path from S to some vertex V; call the path P.
I'm going to draw the picture for an undirected graph, but the situation would be the same in the directed case. So there's a bunch of hops, a bunch of edges, and then eventually this path ends at V. Now, the bad situation, the situation from which we want to derive a contradiction, is that V is unexplored at the end of this algorithm. So let's take stock of what we know. S, for sure, is explored, right? We initialized the search procedure so that S is marked as explored. V, by hypothesis in this proof by contradiction, is unexplored. So S is explored, V is unexplored. So now imagine that, just in our heads, as a thought experiment, we traverse this path P. We start at S, and we know it's explored. We go to the next vertex; it may or may not have been explored, we're not sure. We go to the third vertex; again, who knows, it might be explored, it might be unexplored, and so on. But by the time we get to V, we know it's unexplored. So we start at S, which is explored, and we end at V, which is unexplored. So at some point there's some hop along this path P at which we move from an explored vertex to an unexplored vertex. There has to be a switch at some point, because at the end of the day, at the end of the path, we're at an unexplored node. So consider the first edge, and there must be one, at which we switch from an explored node to an unexplored node. I'm going to call the endpoints of this purported edge U and W, where U is the explored one and W is the unexplored one. Now, for all we know, U could be exactly the same as S; that's a possibility. Or, for all we know, W could be the same as V; that's also a possibility. In the picture, I'll draw it as if this edge (U, W) were somewhere in the middle of the path, but again, it may be at one of the ends; that's totally fine. But now, in this case, there's something I need you to explain to me.
How is it possible that, on the one hand, our algorithm terminated, and on the other hand, there's this edge (U, W), where U has been explored and W has not been explored? That, my friends, is impossible. Our generic search algorithm does not give up; it does not terminate unless there are no edges where one endpoint is explored and the other endpoint is unexplored. As long as there's such an edge, it's going to suck the unexplored vertex into the conquered territory and keep going. So the upshot is, there's no way our algorithm terminated with this picture, with there being an edge (U, W), U explored, W unexplored. So that's the contradiction: this contradicts the assumption that our algorithm terminated with V unexplored. So that is a general approach to graph search, and that, I hope, gives you the flavor of how this is going to work. But now there are two particular instantiations of this generic method that are really important and have their own suites of applications, so we're going to focus on breadth-first search and depth-first search. We'll cover them in detail in the next couple of videos; I want to give you the highlights to conclude this video. Now, let me just make sure it's clear where the ambiguity in our generic method is, why we can have different instantiations of it that potentially have different properties and different applications. The question is, in a given iteration of this while loop, what have you got? You've got your nodes that you've already explored, so that includes S plus probably some other stuff. Then you've got your nodes that are unexplored. And then you have your crossing edges, right? They're edges with one endpoint on each side. For an undirected graph, there's no orientation to worry about: these crossing edges just have one endpoint on the explored side and one endpoint on the unexplored side.
In the directed case, you focus on edges where the tail of the edge is on the explored side and the head of the edge is on the unexplored side, so they go from the explored side to the unexplored side. And the point is that, in a general iteration of this while loop, there are going to be many such crossing edges; there are many different unexplored nodes we could go to next. Different strategies for picking the unexplored node to explore next lead to different graph search algorithms with different properties. So the first specific search strategy we're going to study is breadth-first search, colloquially known as BFS. Let me tell you some of the high-level ideas and applications of breadth-first search. The goal is going to be to explore the nodes in what I call layers. So the starting point S will be in its own layer, layer zero. The neighbors of S will constitute layer one. And then layer two will be the nodes that are neighbors of layer one but that are not already in layer zero or layer one, and so on. So layer i plus one is the stuff next to layer i that you haven't already seen. You can think of this as a fairly cautious and tentative exploration of the graph. And it's going to turn out that there's a close correspondence between these layers and shortest-path distances. So if you want to know the minimum number of hops, the minimum number of edges you need in a path to get from point A to point B in a graph, the way we wanted to know the fewest number of edges in the movie graph necessary to connect Jon Hamm to Kevin Bacon, that corresponds directly to these layers. So if a node is in layer i, then you need i edges to get from S to that node in the graph. Once we discuss breadth-first search, we'll also discuss how to compute the connected components, the different pieces, of an undirected graph. It turns out this isn't special to breadth-first search; you can use any number of graph search strategies to compute connected components in undirected graphs.
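The layer-by-layer exploration just described can be sketched in a few lines of Python; the function name `bfs_layers` and the adjacency-list input format are my own assumptions, not anything fixed by the lecture.

```python
from collections import deque

def bfs_layers(adj, s):
    """Breadth-first search from s, recording each reachable node's
    layer, i.e., its shortest-path distance (number of edges) from s.
    `adj` maps each node to a list of its neighbors."""
    layer = {s: 0}                  # s sits alone in layer zero
    queue = deque([s])              # first-in first-out: explore layer by layer
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in layer:      # seen for the first time
                layer[v] = layer[u] + 1
                queue.append(v)
    return layer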
But I'll show you how to do it using a simple looped version of breadth-first search. And we'll be able to do this stuff in the linear time that we want, meeting that very ambitious goal. To get the linear-time implementation, you do want to use the right data structure, but it's a simple one, something you've probably seen in the past, namely a queue, something that's first-in first-out. So the second search strategy that's super important to know is depth-first search, also known as DFS to its friends. Depth-first search has a rather different feel than breadth-first search. It's a much more aggressive search, where you immediately try to plunge as deeply as you can. It's very much in the spirit of how you might explore a maze, where you go as deeply as you can, only backtracking when absolutely necessary. Depth-first search will also have its own set of applications. It's not, for example, very useful for computing shortest-path information. But, especially in directed graphs, it's going to do some remarkable things for us. So in a directed acyclic graph, a directed graph with no directed cycles, it will give us what's called a topological ordering. That is, it will sequence the nodes in a linear ordering, from first to last, so that all of the arcs of the directed graph go forward. This is useful, for example, if you have a number of tasks that have to get completed subject to certain precedence constraints. Like, for example, you have to take all of the classes in your undergraduate major, and there are certain prerequisites; a topological ordering will give you an order in which to take them that respects all of the prerequisites. And finally, whereas for undirected graphs it doesn't really matter whether you use BFS or DFS to compute connected components, in directed graphs, where even defining connected components is a little tricky, it turns out depth-first search is exactly what you want.
That's where you're going to get a linear-time implementation for computing the right notion of connected components in the directed graph case. Time-wise, both of these are superb strategies for exploring a graph. They're both linear time with very good constants. So for depth-first search, again, we're going to get O of M plus N time in a graph with M edges and N vertices. You do want to use a different data structure, reflecting the different search strategy. Here, because you're exploring aggressively, as soon as you get to a node you immediately start exploring its neighbors, you want a last-in first-out data structure, also known as a stack. Depth-first search also admits a very elegant recursive formulation, and in that formulation, you don't even need to maintain a stack data structure explicitly; the stack is implicitly taken care of by the recursion. So that concludes this overview of graph search: what it is, what our goals are, what kinds of applications it has, and two of the most common strategies. The next couple of videos are going to explore these search strategies, as well as a couple of these applications, in greater depth.
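To close, here's a minimal sketch of the recursive formulation of depth-first search, applied to the topological-ordering application mentioned above. The function name `topological_order` and the adjacency-list input are my own assumptions, and this is the standard append-on-backtrack-then-reverse idea, not necessarily the exact formulation the later videos will use.

```python
def topological_order(adj):
    """Recursive DFS on a directed acyclic graph, producing an ordering
    of the vertices in which every arc goes from an earlier position to
    a later one. `adj` maps each vertex to its outgoing neighbors."""
    explored = set()
    order = []                       # vertices in reverse topological order

    def dfs(u):                      # the recursion is our implicit stack
        explored.add(u)
        for v in adj[u]:
            if v not in explored:
                dfs(v)
        order.append(u)              # u is placed after everything it reaches

    for u in adj:                    # restart DFS until all vertices are seen
        if u not in explored:
            dfs(u)
    order.reverse()
    return order
```

One caveat of the recursive formulation in practice: on very deep graphs, the recursion can exceed the language's stack limit, at which point you'd fall back to an explicit stack.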