1
00:00:00,000 --> 00:00:04,200
So let's talk about the absolutely 
fundamental problem of searching a graph 

2
00:00:04,200 --> 00:00:07,486
and the very related problem of finding 
paths through graphs. 

3
00:00:07,486 --> 00:00:11,130
So why would one be interested in 
searching a graph or figuring out if 

4
00:00:11,130 --> 00:00:14,929
there's a path from point A to point B? 
Well there's many, many reasons, I'm 

5
00:00:14,929 --> 00:00:18,060
going to give you a highly non-exhaustive 
list on this slide. 

6
00:00:18,060 --> 00:00:21,962
So let me begin with a very sort of 
obvious and literal example, which is if 

7
00:00:21,962 --> 00:00:25,932
you have a physical network. Then often 
you want to make sure that the network is 

8
00:00:25,932 --> 00:00:30,009
fully connected, in the sense that you 
can get from any starting point to any 

9
00:00:30,009 --> 00:00:32,710
other point. 
So, for example, think back to the phone 

10
00:00:32,710 --> 00:00:35,198
network. 
It would've been a disaster if callers 

11
00:00:35,198 --> 00:00:39,434
from California could only reach callers 
in Nevada, but not their family members 

12
00:00:39,434 --> 00:00:41,817
in Utah. 
So obviously, a minimal condition for 

13
00:00:41,817 --> 00:00:46,106
functionality of something like a phone 
network is that you can get from any one 

14
00:00:46,106 --> 00:00:50,342
place to any other place, similarly for 
road networks within a given country, and 

15
00:00:50,342 --> 00:00:54,118
so on. 
It can also be fun to think about other, 

16
00:00:54,118 --> 00:00:57,024
non-physical networks and ask if they're 
connected. 

17
00:00:57,024 --> 00:01:00,841
So, one network that's fun to play around 
with is the movie network. 

18
00:01:00,841 --> 00:01:05,170
So this is the graph where the nodes 
correspond to actors and actresses and 

19
00:01:05,170 --> 00:01:09,671
you have an edge between the two nodes if 
they played a role in a common movie. 

20
00:01:09,671 --> 00:01:14,057
So this is going to be an undirected 
graph where the edges correspond to, not 

21
00:01:14,057 --> 00:01:18,444
necessarily co-starring, but both the 
actors appearing at least at some point 

22
00:01:18,444 --> 00:01:21,373
in the same movie. 
So versions of this movie network you 

23
00:01:21,373 --> 00:01:25,231
should be able to find publicly available 
on the web and there's lots of fun 

24
00:01:25,231 --> 00:01:28,938
questions you can ask about the movie 
network, like for example, what's the 

25
00:01:28,938 --> 00:01:32,895
minimum number of hops, where a hop here 
again is a movie where two people both 

26
00:01:32,895 --> 00:01:36,903
played a role in, the minimum number of 
hops or edges you can get from one actor 

27
00:01:36,903 --> 00:01:39,608
to another actor. 
So perhaps the most famous statistic 

28
00:01:39,608 --> 00:01:43,065
that's been thought about with the movie 
network is the Bacon number. 

29
00:01:43,065 --> 00:01:47,022
So this refers to the fairly ubiquitous 
actor Kevin Bacon and the question, the 

30
00:01:47,022 --> 00:01:51,011
Bacon number of an actor is defined as 
the minimum number of hops you need in 

31
00:01:51,011 --> 00:01:57,340
this movie graph, to get to Kevin Bacon. 
So for example, you can ask About Jon 

32
00:01:57,340 --> 00:02:00,660
Hamm, also known as Don Draper from Mad 
Men. 

33
00:02:00,660 --> 00:02:06,152
And you could ask how many edges do you 
need on a path through the movie graph to 

34
00:02:06,152 --> 00:02:11,443
get to Kevin Bacon and it turns out that 
the answer is one, excuse me, two edges. 

35
00:02:11,443 --> 00:02:14,860
You need one intermediate point, namely 
Colin Firth. 

36
00:02:14,860 --> 00:02:19,405
And, that's became, that's because that's 
because Colin Firth and Kevin Bacon both 

37
00:02:19,405 --> 00:02:23,789
starred in the Atom Egoyan movie, Where 
The Truth Lies, and John Hamm and Colin 

38
00:02:23,789 --> 00:02:26,062
Firth were both in the movie, A Single 
Man. 

39
00:02:26,062 --> 00:02:28,822
So, that would give John Hamm a Bacon 
number of two. 

40
00:02:28,822 --> 00:02:33,260
So, these are the kinds of questions you 
can ask about connectivity not just in 

41
00:02:33,260 --> 00:02:37,372
physical networks, like telephone and 
telecommunication networks, but also 

42
00:02:37,372 --> 00:02:40,998
logical networks about pairwise 
relationships between objects more 

43
00:02:40,998 --> 00:02:43,989
generally. 
So the Bacon number is fundamentally not 

44
00:02:43,989 --> 00:02:46,529
just about any path, 
but actually shortest paths. 

45
00:02:46,529 --> 00:02:50,920
The minimum number of edges you need to 
traverse to get from one actor to Kevin 

46
00:02:50,920 --> 00:02:53,354
Bacon. 
And shortest paths are also have a very 

47
00:02:53,354 --> 00:02:56,900
practical use that you might use yourself 
in driving directions. 

48
00:02:56,900 --> 00:03:00,667
So when you use a website or a phone app 
and you ask for the best way to get from 

49
00:03:00,667 --> 00:03:03,975
where you are now to say, some restaurant 
where you're going to have dinner. 

50
00:03:03,975 --> 00:03:07,559
Obviously, you're trying to find some 
kind of path through a network through a 

51
00:03:07,559 --> 00:03:09,397
graph. 
And indeed, often you want the, the 

52
00:03:09,397 --> 00:03:11,557
shortest path. 
Perhaps in mileage or perhaps in 

53
00:03:11,557 --> 00:03:15,073
anticipated travel time. 
Now I realize that when you're thinking 

54
00:03:15,073 --> 00:03:19,841
about paths and graphs, it's natural to 
focus on, sort of, very literal paths and 

55
00:03:19,841 --> 00:03:24,427
quite literal, physical networks, things 
like roots through a road network or 

56
00:03:24,427 --> 00:03:28,651
paths through the internet and so on. 
But, you should really think more 

57
00:03:28,651 --> 00:03:32,936
abstractedly as a path, as just a 
sequence of decisions, taking you from 

58
00:03:32,936 --> 00:03:37,567
some initial state, to some final state. 
And it's this abstract mentality which is 

59
00:03:37,567 --> 00:03:41,689
what makes graph search so ubiquitous, it 
feels like artificial intelligence. 

60
00:03:41,689 --> 00:03:46,245
Where you want to formulate a plan of how 
to get from an initial state to some goal 

61
00:03:46,245 --> 00:03:48,360
state. 
So, to give a simple recreational 

62
00:03:48,360 --> 00:03:52,753
example, you can imagine just trying to 
understand how to compute automatically a 

63
00:03:52,753 --> 00:03:56,658
way to fill in a Sudoku puzzle. 
So that you get to, so that you solve the 

64
00:03:56,658 --> 00:04:00,862
puzzle correctly. 
So you might ask you know what is the 

65
00:04:00,862 --> 00:04:05,034
graph that we're talking about when we 
want to solve a Sudoku puzzle. 

66
00:04:05,034 --> 00:04:09,497
Well this is going to be a directed graph 
where here the nodes correspond to 

67
00:04:09,497 --> 00:04:13,026
partially completed puzzles. 
So for example, at one node of this 

68
00:04:13,026 --> 00:04:17,494
extremely large graph perhaps 40 out of 
the 81 cells are filled in with some kind 

69
00:04:17,494 --> 00:04:19,837
of number. 
And now, again remember a path is 

70
00:04:19,837 --> 00:04:24,142
supposed to correspond to a sequence of 
decisions, so what are the actions that 

71
00:04:24,142 --> 00:04:27,847
you take in solving Suduko while you fill 
in a number into a square. 

72
00:04:27,847 --> 00:04:31,987
So an edge which here is going to be 
directed is going to move you from one 

73
00:04:31,987 --> 00:04:36,183
partially completed puzzle to another, 
where one previously empty square gets 

74
00:04:36,183 --> 00:04:39,476
filled in with one number. 
And of course, then the path that you're 

75
00:04:39,476 --> 00:04:43,275
interested in computing, or what you're 
searching for or what you search for when 

76
00:04:43,275 --> 00:04:47,027
you search this graph., You begin with 
the initial state of the Sudoku puzzle 

77
00:04:47,027 --> 00:04:50,733
and you want to reach some goal state 
where the Sudoku puzzle is completely 

78
00:04:50,733 --> 00:04:53,360
filled in without any violations of the 
rules of Sudoku. 

79
00:04:53,360 --> 00:04:56,905
And of course it's easy to imagine 
millions of other situations where you 

80
00:04:56,905 --> 00:04:58,930
want to formulate some kind of plan like 
this. 

81
00:04:58,930 --> 00:05:02,337
Like, for example, if you have a robotic 
hand and you just want to grasp some 

82
00:05:02,337 --> 00:05:05,836
object, you need to think about exactly 
how to approach, the object with this 

83
00:05:05,836 --> 00:05:09,289
robotic hand, so that you can grab it 
without, for example, first knocking it 

84
00:05:09,289 --> 00:05:11,315
over. 
And you can think of millions of other 

85
00:05:11,315 --> 00:05:13,427
examples. 
Another thing which turns out to be 

86
00:05:13,427 --> 00:05:17,341
closely related to graph search as we'll 
see and has many applications in it's own 

87
00:05:17,341 --> 00:05:20,406
right is that of computing connectivity 
information about graphs. 

88
00:05:20,406 --> 00:05:22,387
So in particular the connected 
components. 

89
00:05:22,387 --> 00:05:26,160
So this, especially for undirected graphs 
corresponds to the pieces of the graph. 

90
00:05:26,160 --> 00:05:29,932
We'll talk about these topics in their 
own right and I'll give you applications 

91
00:05:29,932 --> 00:05:32,655
for them later. 
So for undirected graphs I'll briefly 

92
00:05:32,655 --> 00:05:36,791
mention an easy clustering heristic you 
can derive out of computing connected 

93
00:05:36,791 --> 00:05:39,231
components. 
For directed graphs where the very 

94
00:05:39,231 --> 00:05:42,943
definition of computing components is a 
bit more subtle I'll show you 

95
00:05:42,943 --> 00:05:45,860
applications to understanding the 
structure of the web. 

96
00:05:45,860 --> 00:05:49,985
So these are a few of the reasons why 
it's important for you to understand how 

97
00:05:49,985 --> 00:05:53,170
to efficiently search graphs. 
It's a, a fundamental and widely 

98
00:05:53,170 --> 00:05:56,262
applicable graph primitive. 
And I'm happy to report that in this 

99
00:05:56,262 --> 00:05:59,671
section of the course pretty much 
anything any questions we want to answer 

100
00:05:59,671 --> 00:06:02,444
about graph search, computing connecting 
components, and so on. 

101
00:06:02,444 --> 00:06:04,808
There's going to be really fast 
algorithms to do it. 

102
00:06:04,808 --> 00:06:08,263
So this will be a part of the course 
where there's lots of what I call four 

103
00:06:08,263 --> 00:06:11,627
free primitives processing steps 
serpentine's that you can run without 

104
00:06:11,627 --> 00:06:14,445
even thinking about it. 
All of these algorithms we're going to 

105
00:06:14,445 --> 00:06:18,037
discuss in the next several lectures are 
going to run in linear time and there 

106
00:06:18,037 --> 00:06:21,719
going to have quite reasonable constants. 
So there barely slower than reading the 

107
00:06:21,719 --> 00:06:23,901
input. 
So if you have a graph and your trying to 

108
00:06:23,901 --> 00:06:26,220
reason about it, your trying to make 
sense about it. 

109
00:06:26,220 --> 00:06:30,040
You should in some sense feel free to 
apply any of these subroutines we're 

110
00:06:30,040 --> 00:06:33,809
going to discuss to try and glean some 
more information about what they look 

111
00:06:33,809 --> 00:06:37,909
like, how you might use the network data. 
There's a lot of different approaches to 

112
00:06:37,909 --> 00:06:40,563
systematically searching a graph. 
So there's many methods. 

113
00:06:40,563 --> 00:06:43,310
In this class, we're going to focus on 
two very important ones. 

114
00:06:43,310 --> 00:06:45,685
Mainly breadth verse search and depth 
verse search. 

115
00:06:45,685 --> 00:06:48,666
But all of the graph search methods share 
some things in common. 

116
00:06:48,666 --> 00:06:52,298
So on this slide, let me just tell you 
the high order bits of, really, any graph 

117
00:06:52,298 --> 00:06:55,306
search algorithm. 
So grass search subroutines generally are 

118
00:06:55,306 --> 00:06:58,843
passed as input, a starting vertex from 
which the search originates. 

119
00:06:58,843 --> 00:07:02,485
So, that's often called a source vertex. 
And your goal, then, is to find 

120
00:07:02,485 --> 00:07:06,707
everything findable from this search 
vertex And obviously, you're not going to 

121
00:07:06,707 --> 00:07:09,557
find anything that you can't find, that's 
not findable. 

122
00:07:09,557 --> 00:07:13,094
What do I mean by findable? 
I mean, there's a path from the starting 

123
00:07:13,094 --> 00:07:16,472
point to this other node. 
So for any other node to which you can 

124
00:07:16,472 --> 00:07:19,850
get along a path from the starting point, 
you should discover it. 

125
00:07:19,850 --> 00:07:24,731
So for example, if you're given an 
undricted graph that has three different 

126
00:07:24,731 --> 00:07:29,550
pieces like this one I'm drawing on the 
right and perhaps S, is this left most 

127
00:07:29,550 --> 00:07:32,702
node here. 
Then the findable vertices starting from 

128
00:07:32,702 --> 00:07:37,212
S are either the ones to which you can 
reach from a path to S is clearly 

129
00:07:37,212 --> 00:07:41,167
precisely these four vertices. 
So you would want graph search to 

130
00:07:41,167 --> 00:07:45,987
automatically discover and efficiently 
discover these four vertices to if you 

131
00:07:45,987 --> 00:07:49,138
started at S. 
You could also think about a directed 

132
00:07:49,138 --> 00:07:53,340
version of the exact same graph, wherein 
the direct vertices like so. 

133
00:07:53,340 --> 00:07:57,076
So now the definition of definable nodes 
is a little bit different. 

134
00:07:57,076 --> 00:08:01,426
We're only expecting to follow arcs 
forward along the forward direction so we 

135
00:08:01,426 --> 00:08:06,111
should only expect at best to find all of 
the nodes that you can reach by following 

136
00:08:06,111 --> 00:08:10,908
a succession of forward arcs that is any, 
any node that there's a path to from S so 

137
00:08:10,908 --> 00:08:15,481
in this case these three nodes would be 
the ones we'd be hoping to find this blue 

138
00:08:15,481 --> 00:08:19,720
node to the right we would no longer 
expect to find because the only way to 

139
00:08:19,720 --> 00:08:23,958
get there from S is by going backwards 
along arcs and that's not what we're 

140
00:08:23,958 --> 00:08:27,220
going to be thinking about in our graph 
surges. 

141
00:08:27,220 --> 00:08:31,291
So we want to find everything findable, 
i.e., that we can get to along paths. 

142
00:08:31,291 --> 00:08:35,586
And, but we want to do it efficiently. 
And efficiently means we don't explore 

143
00:08:35,586 --> 00:08:36,757
anything twice, 
right? 

144
00:08:36,757 --> 00:08:39,992
So the graph has M arcs, M edges, and N 
nodes or N vertices. 

145
00:08:39,992 --> 00:08:44,342
And really, we want to just look at, each 
piece of the graph only once, or a small 

146
00:08:44,342 --> 00:08:47,968
constant number of times. 
So we're looking for running time, which 

147
00:08:47,968 --> 00:08:50,868
is linear in the size of the graph. 
That is big O of MN. 

148
00:08:50,868 --> 00:08:53,313
plus N. 
Now, we were talking about representing 

149
00:08:53,313 --> 00:08:56,960
graphs, I said that in many applications 
it's natural to focus on connected 

150
00:08:56,960 --> 00:08:59,196
graphs, in which case M is going to 
dominate N. 

151
00:08:59,196 --> 00:09:02,502
You're going to have at least as many 
edges as nodes essentially. 

152
00:09:02,502 --> 00:09:06,295
But connectivity is the classic case 
where you might have the number of edges 

153
00:09:06,295 --> 00:09:08,434
being much smaller than the number of 
nodes. 

154
00:09:08,434 --> 00:09:12,226
There might be many pieces and the whole 
point of what you're trying to do is 

155
00:09:12,226 --> 00:09:14,900
discover them. 
So for this sequence of lectures when we 

156
00:09:14,900 --> 00:09:18,109
talk about graph search and connectivity, 
we will usually write Mn. 

157
00:09:18,109 --> 00:09:20,005
plus N. 
We'll think that either one could be 

158
00:09:20,005 --> 00:09:23,281
bigger or smaller than the other. 
So let me know give you a generic 

159
00:09:23,281 --> 00:09:25,985
approach to graph search. 
It's going to be under specified. 

160
00:09:25,985 --> 00:09:28,541
There will be many different ways to 
instantiate it. 

161
00:09:28,541 --> 00:09:32,375
Two particular instantiations will get us 
breath first search and depth first 

162
00:09:32,375 --> 00:09:34,588
search. 
But here is just a general systematic 

163
00:09:34,588 --> 00:09:38,324
method for finding everything findable 
without exploring anything more than 

164
00:09:38,324 --> 00:09:40,734
once. 
So motivated by the second goal, the fact 

165
00:09:40,734 --> 00:09:44,693
that we don't want to explore anything 
twice, with each node, with each vertex, 

166
00:09:44,693 --> 00:09:47,913
we're going to remember whether or not 
we've explored it before. 

167
00:09:47,913 --> 00:09:52,400
So we just need one Boolean per node and 
we'll initialize it by having everything 

168
00:09:52,400 --> 00:09:56,360
unexplored, except S our starting point, 
we'll have it start off as explored. 

169
00:09:56,360 --> 00:09:59,951
And it's useful to think of the nodes 
thus far as being, in some sense, 

170
00:09:59,951 --> 00:10:03,902
territory conquered by the algorithm. 
And then there's going to be a frontier 

171
00:10:03,902 --> 00:10:06,519
in between the conquered and unconquered 
territory. 

172
00:10:06,519 --> 00:10:10,317
And the goal of the generic algorithm is 
that, each step, we supplement the 

173
00:10:10,317 --> 00:10:14,165
conquered territory by one new node. 
Assuming that there is one adjacent to 

174
00:10:14,165 --> 00:10:18,167
the territory we've already conquered. 
So for example, in this top example with 

175
00:10:18,167 --> 00:10:21,400
the undirected network. 
Initially, the only thing we've explored 

176
00:10:21,400 --> 00:10:24,274
is the starting point S. 
So that's sort of our home base, 

177
00:10:24,274 --> 00:10:28,020
that's all that we have conquered so far. 
And then, in our main while loop. 

178
00:10:28,020 --> 00:10:33,074
Which we iterate as many times as we can, 
until we don't have any edges meeting the 

179
00:10:33,074 --> 00:10:35,450
following criterion. 
We look for a edge. 

180
00:10:35,450 --> 00:10:39,899
With one endpoint that we've already 
explored one endpoint inside the conquer 

181
00:10:39,899 --> 00:10:42,522
territory and then the other endpoint 
outside. 

182
00:10:42,522 --> 00:10:46,800
So this is how we can in one hop 
supplement the number of nodes we've seen 

183
00:10:46,800 --> 00:10:49,937
by one new one. 
If we can't find such an edge then this 

184
00:10:49,937 --> 00:10:53,759
is where the search stops. 
If we can find such an edge well then we 

185
00:10:53,759 --> 00:10:57,410
suck V into the conquer territory we 
think of it being explored. 

186
00:10:57,410 --> 00:11:01,032
And we return to the while loop. 
So for example, on this example on the 

187
00:11:01,032 --> 00:11:03,750
right, we start with the only explored 
node being S. 

188
00:11:03,750 --> 00:11:06,680
Now, there's actually two edges that 
cross the frontier. 

189
00:11:06,680 --> 00:11:09,077
In the sense one of the N points is 
explored. 

190
00:11:09,077 --> 00:11:12,754
Namely one of the points is S, and the 
other one is some other vertex, 

191
00:11:12,754 --> 00:11:15,258
right? 
There's this there's this two ver, two 

192
00:11:15,258 --> 00:11:17,655
edges to the left two vertices adjacent 
to S. 

193
00:11:17,655 --> 00:11:21,651
So, in this algorithm we pick either one. 
It's unspecified which one we pick. 

194
00:11:21,651 --> 00:11:25,604
But maybe we pick the top one. 
And so then all of a sudden this seconds 

195
00:11:25,604 --> 00:11:29,797
tab vertex is now also explored. 
So the conquer territory is the union of 

196
00:11:29,797 --> 00:11:34,450
them, and so now we have a new frontier. 
So now again we have two edges that cross 

197
00:11:34,450 --> 00:11:38,241
from the explored nodes. 
The unexplored nodes is at the edges that 

198
00:11:38,241 --> 00:11:41,400
are in some sense going from Northwest to 
Southeast. 

199
00:11:41,400 --> 00:11:45,248
Again we pick one of them, 
it's not clear how the algorithm doesn't 

200
00:11:45,248 --> 00:11:47,144
tell us. 
We just pick any of them. 

201
00:11:47,144 --> 00:11:52,023
So maybe for example, we pick this right 
most edge crossing the front here. Now, 

202
00:11:52,023 --> 00:11:57,030
the right most edge of these four, right 
most vertex of these four is explored so 

203
00:11:57,030 --> 00:11:59,872
our conquer territory is the top three 
vertices. 

204
00:11:59,872 --> 00:12:04,258
Now again, we have two edges crossing the 
frontier, the two edges that are incident 

205
00:12:04,258 --> 00:12:07,414
to the bottom node. 
We pick one of them not clear which one, 

206
00:12:07,414 --> 00:12:09,928
maybe this one. 
And now, the bottom node is also 

207
00:12:09,928 --> 00:12:13,030
explored, and now there are no edges 
crossing the frontier. 

208
00:12:13,030 --> 00:12:17,042
So there are no edges who, on the one 
hand, have one endpoint being explored 

209
00:12:17,042 --> 00:12:21,214
and the other endpoint being unexplored. 
So these will be the four vertices, as 

210
00:12:21,214 --> 00:12:24,370
one would hope, that the search will 
explore, started from S. 

211
00:12:24,370 --> 00:12:28,764
Well generally, the claim is that this 
generic graph search algorithm does what 

212
00:12:28,764 --> 00:12:31,490
we wanted. 
It finds everything findable from the 

213
00:12:31,490 --> 00:12:34,940
starting point and moreover, it doesn't 
explore anything twice. 

214
00:12:34,940 --> 00:12:37,844
I think it's fairly clear that it doesn't 
explore anything twice, 

215
00:12:37,844 --> 00:12:39,810
right? 
As soon as you look at a node for the 

216
00:12:39,810 --> 00:12:43,027
first time, you suck it into the 
conquered territory, never to look at it 

217
00:12:43,027 --> 00:12:44,904
again. 
Similarly, as soon as you look at, at 

218
00:12:44,904 --> 00:12:47,719
edge, you suck them in. 
but when we explore breadth and depth, 

219
00:12:47,719 --> 00:12:51,338
first search will be more precise about 
the running time and exactly what I mean 

220
00:12:51,338 --> 00:12:54,690
by you don't explore something twice. 
So at this level of generality, I just 

221
00:12:54,690 --> 00:12:57,951
want to focus on the first point, that 
anyway you [INAUDIBLE] this search 

222
00:12:57,951 --> 00:13:00,275
algorithm it's going to find everything 
findable. 

223
00:13:00,275 --> 00:13:03,618
So what do I really mean by that? 
The formal claim is that at the 

224
00:13:03,618 --> 00:13:08,290
termination of this algorithm. The nodes 
that we've marked explored are precisely 

225
00:13:08,290 --> 00:13:11,034
the ones that can be reached via a path 
from S. 

226
00:13:11,034 --> 00:13:15,180
That's the sense in which the algorithm 
explores everything that could 

227
00:13:15,180 --> 00:13:18,100
potentially be findable from the starting 
point S. 

228
00:13:18,100 --> 00:13:21,381
And one thing I want to mention is that 
this claim and the proof we're going to 

229
00:13:21,381 --> 00:13:24,750
give of it, it holds whether or not you 
use an undirected graph or a directed 

230
00:13:24,750 --> 00:13:26,675
graph. 
In fact, almost all of the things that 

231
00:13:26,675 --> 00:13:30,001
I'm going to say about graph search and, 
in particular, breadth-first search and 

232
00:13:30,001 --> 00:13:33,632
depth-first search work in essentially 
the same way, either undirected graphs or 

233
00:13:33,632 --> 00:13:35,864
directed graphs. 
The obvious difference being in an 

234
00:13:35,864 --> 00:13:38,664
undirected graph, you can traverse an 
engine in either direction. 

235
00:13:38,664 --> 00:13:42,296
In a directed graph, we're only supposed 
to traverse it in the forward direction, 

236
00:13:42,296 --> 00:13:45,390
from the tail to the head. 
The one big difference between undirected 

237
00:13:45,390 --> 00:13:47,981
and directed graphs is when we do 
connectivity computations. 

238
00:13:47,981 --> 00:13:51,263
And I'll remind you when we get to that 
point, which one we're talking about. 

239
00:13:51,263 --> 00:13:54,294
Okay, but for the most part. 
We just talked about basic graph search, 

240
00:13:54,294 --> 00:13:57,790
where it's essentially the same way, 
whether it's undirected or directed, so 

241
00:13:57,790 --> 00:14:00,812
keep that in mind. 
All right, so why is this true? 

242
00:14:00,812 --> 00:14:06,414
Why are the nodes that get explored 
precisely the nodes for which there is a 

243
00:14:06,414 --> 00:14:10,090
path to them from S? 
Well, one direction is easy. 

244
00:14:10,090 --> 00:14:13,212
Which is, you can't find anything which 
is not findable. 

245
00:14:13,212 --> 00:14:17,470
That is, if you wind up exploring a node, 
the only reason that can happen is 

246
00:14:17,470 --> 00:14:22,125
because you traversed a sequence of edges 
that got you there, and that sequence of 

247
00:14:22,125 --> 00:14:24,566
edges obviously defines a path from S to 
V. 

248
00:14:24,566 --> 00:14:29,221
If you really want to be pedantic about 
the forward direction that explored nodes 

249
00:14:29,221 --> 00:14:33,081
have to have paths from S then you can 
just do an easy induction. 

250
00:14:33,081 --> 00:14:37,396
And I'll leave this for you to check, if 
you want, in the privacy of your own 

251
00:14:37,396 --> 00:14:40,114
home. 
So the important direction of this claim 

252
00:14:40,114 --> 00:14:43,380
is really the opposite. 
Why is it that no matter how we 

253
00:14:43,380 --> 00:14:48,071
instantiate this generic graph search 
procedure it's impossible for us to miss 

254
00:14:48,071 --> 00:14:50,922
anything. 
That's the crucial point we don't miss 

255
00:14:50,922 --> 00:14:54,010
anything that we couldn't principal find 
via a path. 

256
00:14:54,010 --> 00:14:56,360
But we're going to perceive by 
contradiction. 

257
00:14:56,360 --> 00:15:00,256
So what does that mean? 
We're going to assume that the statement 

258
00:15:00,256 --> 00:15:02,813
that we want to prove is true, is not 
true. 

259
00:15:02,813 --> 00:15:06,527
Which means that, it's possible that G 
has a path from S to V. 

260
00:15:06,527 --> 00:15:08,962
And yet, somehow, our algorithm misses 
it. 

261
00:15:08,962 --> 00:15:13,224
Doesn't mark it as explored. 
That's the thing we really hoping doesn't 

262
00:15:13,224 --> 00:15:16,146
happen. 
So let's suppose it does happen and then 

263
00:15:16,146 --> 00:15:20,043
derive a contradiction. 
So let's suppose G does have a path from 

264
00:15:20,043 --> 00:15:22,152
S to some vertex V, 
call the path P. 

265
00:15:22,152 --> 00:15:26,675
I'm going to draw the picture from 
undirected graph, but the situation would 

266
00:15:26,675 --> 00:15:30,997
be same in, in the directive case. 
So there's a bunch of hops there's a 

267
00:15:30,997 --> 00:15:34,019
bunch of edges and · then eventually this 
path ends at V. 

268
00:15:34,019 --> 00:15:38,916
Now, the bad situation the situation from 
which we wanted to rather contradiction 

269
00:15:38,916 --> 00:15:42,120
is that V is unexplored at the end of 
this algorithm. 

270
00:15:42,120 --> 00:15:47,184
So let's take stock of what we know, S 
for sure is explored right, we 

271
00:15:47,184 --> 00:15:52,025
initialized the search procedure so that 
S is marked as explored. 

272
00:15:52,025 --> 00:15:56,568
V by hypothesis in this proof by 
contradiction is unexplored. 

273
00:15:56,568 --> 00:16:02,079
So S is explored V is unexplored. 
So now, I imagine we just in our heads as 

274
00:16:02,079 --> 00:16:05,840
a thought experiment we traversed as path 
peak. 

275
00:16:05,840 --> 00:16:10,136
We started S and we know it's explored we 
go to the next vertex it may or not have 

276
00:16:10,136 --> 00:16:14,278
been explored, we're not sure. We go to 
the third vertex again who knows it might 

277
00:16:14,278 --> 00:16:16,607
be explored it might be unexplored and so 
on. 

278
00:16:16,607 --> 00:16:19,247
But by the time we get to V we know its 
unexplored. 

279
00:16:19,247 --> 00:16:22,819
So we start at S it's been explored we 
get to V it's been unexplored. 

280
00:16:22,819 --> 00:16:26,147
So at some point, 
there's some hop along this path P from 

281
00:16:26,147 --> 00:16:29,940
which we move from an explored vertex to 
an unexplored vertex. 

282
00:16:29,940 --> 00:16:34,834
There has to be a switch, at some point 
because at the end of the day, at the end 

283
00:16:34,834 --> 00:16:39,545
of the path we're at an unexplored node. 
So consider the first edge, and there 

284
00:16:39,545 --> 00:16:42,458
must be one. 
That we switch from an explored node to 

285
00:16:42,458 --> 00:16:45,725
being an unexplored node. 
So I'm going to call the endpoints of 

286
00:16:45,725 --> 00:16:50,082
this purported edge U and W, where U is 
the explored one and W is the unexplored 

287
00:16:50,082 --> 00:16:52,423
one. 
Now for all we know U could be exactly 

288
00:16:52,423 --> 00:16:54,983
the same as S, 
that's a possibility, possibility. 

289
00:16:54,983 --> 00:16:57,379
Or, for all we know, W could be the same 
as V, 

290
00:16:57,379 --> 00:17:00,918
that's also a possibility. 
In the picture I'll draw it as if this 

291
00:17:00,918 --> 00:17:03,587
edge ux was somewhere in the middle of 
this path. 

292
00:17:03,587 --> 00:17:06,800
But again, it may be at one of the ends, 
that's totally fine. 

293
00:17:06,800 --> 00:17:11,280
But now in this case, there's something I 
need you to explain to me. 

294
00:17:11,280 --> 00:17:16,706
How is it possible that, on the one hand, 
our algorithm terminated, and on the 

295
00:17:16,706 --> 00:17:22,348
other hand, there's this edge U comma X, 
where U has been explored and X has not 

296
00:17:22,348 --> 00:17:25,490
been explored. 
That, my friends, is impossible. 

297
00:17:25,490 --> 00:17:28,049
Our generic search algorithm does not 
give up. 

298
00:17:28,049 --> 00:17:32,666
It does not terminate unless there are no 
edges where the one endpoint is explored 

299
00:17:32,666 --> 00:17:37,061
and the other endpoint is unexplored. 
As long as there's such an edge it has is 

300
00:17:37,061 --> 00:17:41,623
going to suck in unexplored vertex into 
the conquered territory and it's going to 

301
00:17:41,623 --> 00:17:44,238
keep going. 
So, the, the upshot is there's no way 

302
00:17:44,238 --> 00:17:48,466
that our algorithm terminated with this 
picture, with there being an edge UX, 

303
00:17:48,466 --> 00:17:51,359
U explored, X unexplored. 
So that's the contradiction. 

304
00:17:51,359 --> 00:17:57,040
This contradicts the fact that our 
algorithm terminated with V unexplored. 

305
00:17:57,040 --> 00:17:59,445
So that is a general approach to graph 
search. 

306
00:17:59,445 --> 00:18:02,843
So that, I hope, gives you the flavor of 
how this is going to work. 

307
00:18:02,843 --> 00:18:06,451
But now there's two particular 
instantiations of this generic method 

308
00:18:06,451 --> 00:18:10,007
that are really important and have their 
own suites of applications. 

309
00:18:10,007 --> 00:18:13,719
So we're going to focus on breadth-first 
search and depth-first search. 

310
00:18:13,719 --> 00:18:16,647
We'll cover them in detail in the next 
couple of videos. 

311
00:18:16,647 --> 00:18:19,680
I want to give you the highlights to 
conclude this video. 

312
00:18:19,680 --> 00:18:23,238
Now let me just make sure it's clear 
where the ambiguity in our generic method 

313
00:18:23,238 --> 00:18:25,264
is. 
Why we can have different substantiations 

314
00:18:25,264 --> 00:18:28,687
of it that potentially have different 
properties and different applications. 

315
00:18:28,687 --> 00:18:31,930
The question is, given the iteration of 
this wile loop, what have you got? 

316
00:18:31,930 --> 00:18:35,173
You got your nodes that you've already 
explored, so that includes S plus 

317
00:18:35,173 --> 00:18:38,146
probably some other stuff. 
And then you've got your nodes that are 

318
00:18:38,146 --> 00:18:40,623
unexplored and then you have your 
crossing edges, right? 

319
00:18:40,623 --> 00:18:42,650
So they're edges with one point in each 
side. 

320
00:18:42,650 --> 00:18:46,142
And for an undirected graph there's no 
orientation to worry about. 

321
00:18:46,142 --> 00:18:50,165
These crossing edges just have one 
endpoint on the explored side and one 

322
00:18:50,165 --> 00:18:53,975
endpoint on the unexplored side. 
In the directed case you focus on edges 

323
00:18:53,975 --> 00:18:58,156
where the tail of the edges's on the 
explored side and the head of the edges's 

324
00:18:58,156 --> 00:19:01,543
on the unexplored side. 
So they go from the explored side to the 

325
00:19:01,543 --> 00:19:04,189
unexplored side. 
And the question is in general an 

326
00:19:04,189 --> 00:19:07,788
iteration this while loop there's going 
to be many such crossing edges. 

327
00:19:07,788 --> 00:19:11,069
There are many different unexplored nodes 
we could go to next. 

328
00:19:11,069 --> 00:19:14,880
A different strategy is for picking the 
unexplored node to explore next, 

329
00:19:14,880 --> 00:19:19,550
leads us to different graph search 
algorithms with different properties. 

330
00:19:19,550 --> 00:19:24,330
So the first specific search strategy 
we're going to study is breadth-first 

331
00:19:24,330 --> 00:19:29,242
search, colloquially known as BFS. 
So let me tell you some of the high level 

332
00:19:29,242 --> 00:19:32,254
idea and applications of breadth-first 
search. 

333
00:19:32,254 --> 00:19:36,838
So the goal is going to be to explore the 
nodes in what I call layers. 

334
00:19:36,838 --> 00:19:40,833
So the starting point S will be in it's 
own layer, layer zero. 

335
00:19:40,833 --> 00:19:43,780
The neighbors of S will constitute layer 
one. 

336
00:19:43,780 --> 00:19:48,948
And then layer two will be the nodes that 
are neighbors of layer one but that are 

337
00:19:48,948 --> 00:19:52,036
not already in layer zero or layer one 
and so on. 

338
00:19:52,036 --> 00:19:56,447
So layer I9 plus one is the stuff next to 
layer I that you haven't already seen 

339
00:19:56,447 --> 00:19:58,524
yet. 
You can think of this as a fairly 

340
00:19:58,524 --> 00:20:00,993
cautious and tentative exploration of the 
graph. 

341
00:20:00,993 --> 00:20:05,179
And it's going to turn out that there is 
a close correspondence between these 

342
00:20:05,179 --> 00:20:09,205
layers and shortest path distances. 
So if you want to know the minimum number 

343
00:20:09,205 --> 00:20:13,606
of hops, the minimum of edges you need in 
a path to get from point A to point B in 

344
00:20:13,606 --> 00:20:17,899
a graph. The way we wanted to know the 
fewest number of edges in the movie graph 

345
00:20:17,899 --> 00:20:22,247
necessary to connect a Jon Ham to Kevin 
Bacon, that corresponds directly to these 

346
00:20:22,247 --> 00:20:24,823
layers. 
So if a node is in layer I, then you need 

347
00:20:24,823 --> 00:20:28,590
I edges to get from S to I in the graph. 
Once we discuss breadth-first search 

348
00:20:28,590 --> 00:20:32,109
we'll also discuss how to compute the 
connected components, or the different 

349
00:20:32,109 --> 00:20:35,258
pieces, of an undirected graph. 
Turns out this isn't that special the 

350
00:20:35,258 --> 00:20:38,591
breadth-first search, you can use any 
number of graph search strategies to 

351
00:20:38,591 --> 00:20:41,925
compute connected components in 
undirected graphs. But I'll show you how 

352
00:20:41,925 --> 00:20:44,750
to do it using a simple looped version of 
breadth-first search. 

353
00:20:44,750 --> 00:20:48,643
And we'll be able to do this stuff in the 
linear time that we want, the very 

354
00:20:48,643 --> 00:20:52,537
ambitious goal of getting linear time. 
To get the linear time implementation, 

355
00:20:52,537 --> 00:20:56,225
you do want to use the right data 
structure; but it's a simple, simple data 

356
00:20:56,225 --> 00:20:59,760
structure, something you've probably seen 
in the past, namely a, a cube. 

357
00:20:59,760 --> 00:21:02,014
So it's something that's first in first 
out. 

358
00:21:02,014 --> 00:21:06,420
So the second structural strategy that's 
super important to know is depth-first 

359
00:21:06,420 --> 00:21:10,633
search, also known as DFS to its friends. 
Depth-first search has a rather different 

360
00:21:10,633 --> 00:21:14,341
feel than breadth-first search. 
It's a much more aggressive search, where 

361
00:21:14,341 --> 00:21:16,967
you immediately try to plunge as deeply 
as you can. 

362
00:21:16,967 --> 00:21:20,109
It's very much in the spirit of how you 
might explore a maze. 

363
00:21:20,109 --> 00:21:24,126
Where you go as deeply as you can, only 
backtracking when absolutely necessary. 

364
00:21:24,126 --> 00:21:27,319
Depth-first search will also have its own 
set of applications. 

365
00:21:27,319 --> 00:21:31,079
It's not for example, very useful for 
computing shortest path information. 

366
00:21:31,079 --> 00:21:35,096
But especially, in directed graphs, it's 
going to do some remarkable things for 

367
00:21:35,096 --> 00:21:36,944
us. 
So in directed acyclic graph. 

368
00:21:36,944 --> 00:21:39,288
So a directed graph with no directed 
cycles. 

369
00:21:39,288 --> 00:21:42,217
It will give us what's called the 
topological ordering. 

370
00:21:42,217 --> 00:21:46,199
So it will sequence the nodes in a linear 
ordering from the first to the last. 

371
00:21:46,199 --> 00:21:48,880
So that all of the arc's of the directed 
graphs go forward, 

372
00:21:48,880 --> 00:21:51,607
so this is useful. 
Fo example, if you have a number of tasks 

373
00:21:51,607 --> 00:21:54,470
that have to get completed with certain 
precedence constraints. 

374
00:21:54,470 --> 00:21:58,150
Like for example, you have to take all of 
the classes in your undergraduate major 

375
00:21:58,150 --> 00:22:01,786
and there are certain prerequisites 
topalogical ordering will give you a way 

376
00:22:01,786 --> 00:22:04,240
in which to do it respecting all of the 
prerequisites. 

377
00:22:04,240 --> 00:22:07,756
And finally, where, for undirected 
graphs, it doesn't really matter whether 

378
00:22:07,756 --> 00:22:10,213
you use BFS or DFS to compute connected 
components. 

379
00:22:10,213 --> 00:22:14,020
In directed graphs, where even defining 
connected components is a little tricky. 

380
00:22:14,020 --> 00:22:16,910
it turns out depth-first search is 
exactly what you want. 

381
00:22:16,910 --> 00:22:20,716
That's what you're going to get a linear 
time implementation for computing the 

382
00:22:20,716 --> 00:22:23,800
right notion of connected components in 
the directed graph case. 

383
00:22:23,800 --> 00:22:28,447
Time wise, both of these are superb 
strategies for exploring a graph. 

384
00:22:28,447 --> 00:22:31,864
They're both linear time with very good 
constants. 

385
00:22:31,864 --> 00:22:36,238
So depth-first search again. 
We're going to get ordle, O of M plus N 

386
00:22:36,238 --> 00:22:38,606
time. 
In a graph with M edges and N vertices. 

387
00:22:38,606 --> 00:22:42,711
you do want to use a different data 
structure reflecting the different search 

388
00:22:42,711 --> 00:22:44,893
strategy. 
So here, because you're exploring 

389
00:22:44,893 --> 00:22:48,947
aggressively as soon as you get to a 
node, you immediately start exploring its 

390
00:22:48,947 --> 00:22:51,233
neighbors. 
You want a lasting first out data 

391
00:22:51,233 --> 00:22:54,870
structure, also known as a stack. 
Depth-first search also admits a very 

392
00:22:54,870 --> 00:22:58,871
elegant recursive formulation and in that 
formulation, you don't even need to 

393
00:22:58,871 --> 00:23:01,106
maintain a stack data structure. 
Explicitly, 

394
00:23:01,106 --> 00:23:04,740
the stack is implicitly taken care of in 
the recursion. 

395
00:23:04,740 --> 00:23:08,972
So that concludes this overview of graph 
search both what it is? What our goals 

396
00:23:08,972 --> 00:23:13,150
are? What kind of applications they have 
adn two of the most common strategies. 

397
00:23:13,150 --> 00:23:17,221
The next couple videos are going to 
explore these search strategies as well 

398
00:23:17,221 --> 00:23:19,954
as a couple of these applications in 
greater depth.