1
00:00:00,985 --> 00:00:05,841
So, what's the problem? Well, so I did say most of
this stuff about graph search it really

2
00:00:05,841 --> 00:00:09,385
doesn't matter undirected or directed.
It's pretty much cosmetic changes, but the

3
00:00:09,385 --> 00:00:12,737
big exception is when you're computing
connectivity. When you're computing the

4
00:00:12,737 --> 00:00:16,675
pieces of the graph. So, right now I'm
only gonna talk about undirected graphs.

5
00:00:16,675 --> 00:00:20,928
In the directed case, we can again get a
very efficient algorithms for it, but it's

6
00:00:20,928 --> 00:00:23,853
quite a bit harder work. So, that's gonna
be covered in detail in a separate video.

7
00:00:23,853 --> 00:00:30,389
So, for now focus just on a undirected
graph, G. And we're certainly not going to

8
00:00:30,389 --> 00:00:34,191
assume that G is connected. Indeed part of
the point here is to figure out whether or

9
00:00:34,191 --> 00:00:37,960
not it's connected, i.e. in one
piece. So, maybe the graph looks like

10
00:00:37,960 --> 00:00:43,821
this. So, for example maybe the graph has
ten vertices and looks like this, on the

11
00:00:43,821 --> 00:00:48,794
right. And intuitively, especially given
that I've drawn it in such a clean way,

12
00:00:48,794 --> 00:00:53,316
it's clear that this graph has three
pieces, and those are the things that we

13
00:00:53,316 --> 00:00:57,400
wanna call the connected components. But
we do want a somewhat more formal

14
00:00:57,400 --> 00:01:01,203
definition. Something which is actually
you know, in math, that we can say is true

15
00:01:01,203 --> 00:01:05,478
of false about a given graph. And roughly
we define the connected components of an

16
00:01:05,478 --> 00:01:10,990
undirected graph as the maximal regions
that are connected. In the sense you can

17
00:01:10,990 --> 00:01:15,400
get from any vertex in the region from any
other vertex in the region using a path.

18
00:01:15,400 --> 00:01:19,202
So maximal connected regions in that
sense. Now the slick way to do this is

19
00:01:19,202 --> 00:01:23,199
using an equivalence relation. And I'm
gonna this here in part because it's

20
00:01:23,199 --> 00:01:26,634
really the right way to think about the
directed graph case, which we'll talk

21
00:01:26,634 --> 00:01:32,427
about in some detail later. So for undirected
graphs. So this isn't super important but

22
00:01:32,427 --> 00:01:35,555
let me go ahead and state the formal
definition just for completeness about

23
00:01:35,555 --> 00:01:41,911
what is a connected component. What do I
mean by a maximal region that's mutually

24
00:01:41,911 --> 00:01:48,054
connected. So a good formal definition is
as the equivalence classes of the relation

25
00:01:48,054 --> 00:01:54,793
on vertices where which we define by U being
related to V if and only if there's a path

26
00:01:54,793 --> 00:01:59,180
between U and V in the graph G. So
I'll leave it for you to do the simple

27
00:01:59,180 --> 00:02:03,950
check that the squiggle is indeed an
equivalence relation. I'm gonna remind you

28
00:02:03,950 --> 00:02:07,027
what equivalence relations are. This is
something you generally learn in your

29
00:02:07,027 --> 00:02:11,308
first class on proofs or your first class
on discrete math. So it's just something

30
00:02:11,308 --> 00:02:15,132
which may or may not be true about pairs
of objects. And to be an equivalence

31
00:02:15,132 --> 00:02:18,216
relation, you have to satisfy three
properties. So, first you have to be

32
00:02:18,216 --> 00:02:22,762
reflexive, meaning A, everything has to be
related to itself, and indeed, in a graph,

33
00:02:22,762 --> 00:02:26,980
there is a path from any node to itself,
namely the empty path. So, also

34
00:02:26,980 --> 00:02:29,309
equivalence relations have to be
symmetric, meaning if U and V are

35
00:02:29,309 --> 00:02:33,145
related then V and U are related. Because
this is an undirected graph, it's clear

36
00:02:33,145 --> 00:02:37,110
that this is symmetric. If there's a path
from U to V in the graph, there's also a

37
00:02:37,110 --> 00:02:40,722
path from V to U, so no problem there.
Finally equivalence class has got to be

38
00:02:40,722 --> 00:02:45,175
transitive. So that means if U and V are
related and so are V and W, then so are U

39
00:02:45,175 --> 00:02:49,687
and W. That is if U and V have a path, V
and W have a path, then so does U and W.

40
00:02:49,687 --> 00:02:53,690
And you can prove transitivity just by
pasting the two paths together. And so the

41
00:02:53,690 --> 00:02:58,469
upshot is when you want to say something
like the maximal subset of something where

42
00:02:58,469 --> 00:03:02,485
everything is the same, the right way to
make that mathematical is using

43
00:03:02,485 --> 00:03:05,950
equivalence relations. So over in this
blue graph, we want to say one, three,

44
00:03:05,950 --> 00:03:09,437
five, seven and nine, are sort of all the
same in the sense that they are mutually

45
00:03:09,437 --> 00:03:12,688
connected and so that's exactly what
this relation is making precise. All five

46
00:03:12,688 --> 00:03:15,906
of those nodes are related to each other.
Two and four are related to each other.

47
00:03:15,906 --> 00:03:19,629
Six, eight, and ten all pairs of them are
related to each other. So the equivalence,

48
00:03:19,629 --> 00:03:22,948
so equivalence relations always have
equivalence classes of the maximal mutually

49
00:03:22,948 --> 00:03:26,496
related stuff, and in this graph context
that's exactly this connected component.

50
00:03:26,496 --> 00:03:31,941
That's exactly what you want. So, what I
wanna show you is that you can use breadth

51
00:03:31,941 --> 00:03:36,256
first search wrapped in an outer for loop
over all of the vertices to compute, to

52
00:03:36,256 --> 00:03:41,026
identify, all the connected components of
the graph in time linear in the graph. In

53
00:03:41,026 --> 00:03:44,547
O of N plus M time. Now you might be
wondering, you know, why do you wanna do

54
00:03:44,547 --> 00:03:49,261
that. Well there's a lot of reasons. So,
an obvious one which is relevant for

55
00:03:49,261 --> 00:03:55,381
physical networks is to check if a network
has broken into two pieces. So, certainly

56
00:03:55,381 --> 00:03:59,431
if you're an internet service provider,
you wanna make sure that from any point in your

57
00:03:59,431 --> 00:04:03,031
network you can reach any other point in
the network. And that boils down to just

58
00:04:03,031 --> 00:04:05,854
understanding whether the graph that
represents your network is a connected

59
00:04:05,854 --> 00:04:09,556
graph, that is, if it's in one piece or if
it's not in one piece. So obviously you

60
00:04:09,556 --> 00:04:12,908
can ask this same question about
recreational examples so if you return to

61
00:04:12,908 --> 00:04:17,925
the movie graph, maybe you're wondering,
can you get from every single actor, in

62
00:04:17,925 --> 00:04:22,920
the IMDB database, to Kevin Bacon, or are
there actors for which you cannot reach

63
00:04:22,920 --> 00:04:27,837
Kevin Bacon, via a sequence of edges, the
sequence of movies in which two actors

64
00:04:27,837 --> 00:04:31,470
have both played a role. So that's
something that boils down to a

65
00:04:31,470 --> 00:04:36,848
connectivity computation. If you have
network data and you want to display it,

66
00:04:36,848 --> 00:04:39,886
you want to visualize it and show it to a
group of people so that they can interpret

67
00:04:39,886 --> 00:04:42,641
it. Obviously one thing you wanna do is
you wanna know if there's multiple pieces

68
00:04:42,641 --> 00:04:47,962
and then you want to display the pieces
separately. So let me mention one

69
00:04:47,962 --> 00:04:51,912
probably a little less obvious application of
undirected connectivity, which is it gives

70
00:04:51,912 --> 00:04:56,298
a nice quick and dirty heuristic for doing
clustering, if you have pairwise

71
00:04:56,298 --> 00:05:00,214
information about objects. So let me be a
little, let me be a little more concrete.

72
00:05:00,214 --> 00:05:03,285
Suppose you have a set of objects that you
really care about. So this could be a set

73
00:05:03,285 --> 00:05:07,211
of documents, maybe web pages that you
crawled, something like that. It could

74
00:05:07,211 --> 00:05:11,205
be a set of images, either your own or
drawn from some database, or it could be

75
00:05:11,205 --> 00:05:16,864
for example a set of genomes. Suppose
further that you have a pairwise function.

76
00:05:16,864 --> 00:05:20,554
Okay? Which for each pair of objects tells
you whether they are very much like each

77
00:05:20,554 --> 00:05:24,075
other or very much different. And so let's
suppose that if two objects are very

78
00:05:24,075 --> 00:05:28,563
similar to each other, like they're two web
pages that are almost the same, or they're two

79
00:05:28,563 --> 00:05:31,443
genomes where you can get from one to the
other with a small number of mutations,

80
00:05:31,443 --> 00:05:38,835
then they have a low score. So low numbers
close to zero indicate that the objects

81
00:05:38,835 --> 00:05:42,615
are very similar to each other. High
numbers, let's say they could go up to

82
00:05:42,615 --> 00:05:46,092
even a thousand or something, indicate
that they are very different objects. Two

83
00:05:46,092 --> 00:05:49,568
web pages that have nothing to do with
each other. Two genomes for totally

84
00:05:49,568 --> 00:05:55,170
unrelated parts or two images that seem to
be of completely different people or even

85
00:05:55,170 --> 00:06:01,953
completely different objects. Now here's a
graph you can construct using these

86
00:06:01,953 --> 00:06:06,454
objects and the similarity data that you
have about them. So you can have a graph

87
00:06:06,454 --> 00:06:10,245
where the nodes are the objects. So for
each pixel, for each image, for each

88
00:06:10,245 --> 00:06:13,766
document, whatever, you have a single
node. And then for a given pair of nodes,

89
00:06:13,766 --> 00:06:18,738
you put in an edge if, and only if, the
two objects are very similar. So, for

90
00:06:18,738 --> 00:06:23,295
example, you could put in an edge between
two objects if, and only if, the score is

91
00:06:23,295 --> 00:06:27,614
at most ten. So remember, the more similar
two objects are, the closer their score is

92
00:06:27,614 --> 00:06:31,012
to zero. So you're gonna get an edge
between very similar documents, very

93
00:06:31,012 --> 00:06:35,332
similar genomes, very similar images. Now
in this graph you've constructed, you can

94
00:06:35,332 --> 00:06:40,282
find the connected components. So, each of
these connected components will be a group of

95
00:06:40,282 --> 00:06:43,781
objects, which more or less are all very
similar to each other. So, this will be a

96
00:06:43,781 --> 00:06:47,290
cluster of closely related objects in your
database. And you can imagine a lot of

97
00:06:47,290 --> 00:06:52,264
reasons why, given a large set of
unstructured data, just a bunch of

98
00:06:52,264 --> 00:06:55,289
pictures, a bunch of documents, or
whatever, you might wanna find clusters of

99
00:06:55,289 --> 00:07:00,419
highly related objects. So we'll probably
see more sophisticated heuristics for

100
00:07:00,419 --> 00:07:03,693
clustering in some sequel course. But
already undirected connectivity gives

101
00:07:03,693 --> 00:07:08,733
you a super fast, linear time, quick and dirty heuristic
for identifying clusters of similar

102
00:07:08,733 --> 00:07:13,683
objects given pairwise data about
similarity. So that's some reasons you

103
00:07:13,683 --> 00:07:17,305
might wanna do it. Now let's actually talk
about how to compute the connected

104
00:07:17,305 --> 00:07:21,794
components in linear time using just a
simple for loop and breadth-first search

105
00:07:21,794 --> 00:07:27,486
as its inner work horse. So here's the
code to compute all the connected

106
00:07:27,486 --> 00:07:31,762
components of an undirected graph. So
first we initialize all nodes as being

107
00:07:31,762 --> 00:07:36,443
unexplored. I'm also going to assume that
the nodes have names. Let's say their

108
00:07:36,443 --> 00:07:40,514
names are from one to N. So, these names
could just be the position in the node

109
00:07:40,514 --> 00:07:44,497
array that these nodes occupy. So there's
gonna be an outer for loop, which walks

110
00:07:44,497 --> 00:07:47,939
through the nodes in an arbitrary order,
let's say from one to N. This outer for

111
00:07:47,939 --> 00:07:53,294
loop is to ensure that every single node
of the graph will be inspected for sure at

112
00:07:53,294 --> 00:07:57,299
some point in the algorithm. And again,
one of our maxims is that we should never

113
00:07:57,299 --> 00:08:00,799
do redundant work, so before we start
exploring from some node, we check if

114
00:08:00,799 --> 00:08:06,108
we've already been there. And if we
haven't seen I before, then we invoke the

115
00:08:06,108 --> 00:08:09,325
breadth-first search subroutine we were
talking about previously in the lecture in

116
00:08:09,325 --> 00:08:16,728
the graph G starting from the node I. So
to make sure this is clear, let's just run

117
00:08:16,728 --> 00:08:23,061
this algorithm on this blue graph to the
right. So we start in the outer for loop

118
00:08:23,061 --> 00:08:26,728
and we set I equal to one, and we say have
we explored node number one yet? And of

119
00:08:26,728 --> 00:08:29,800
course not, we haven't explored anything
yet. So the first thing we're gonna do is

120
00:08:29,800 --> 00:08:34,278
we're gonna, we're gonna invoke BFS on
node number one here. So now we start

121
00:08:34,278 --> 00:08:38,698
running the usual breadth first search
subroutine starting from this node one.

122
00:08:38,698 --> 00:08:42,771
And so we explore, you know, layer one
here is gonna be nodes three and five and

123
00:08:42,771 --> 00:08:46,573
so we explore them in some order. For
example, maybe, node number three is what

124
00:08:46,573 --> 00:08:50,556
we explore second and then node number
five is what we explore, third. And then

125
00:08:50,556 --> 00:08:55,146
the second layer in this component is
going to be the nodes seven and nine. So,

126
00:08:55,146 --> 00:08:59,870
we'll explore them in some order as well.
Let's say seven first, followed by nine.

127
00:08:59,870 --> 00:09:04,239
So, after this BFS initiated from node number one
completes of course it will have found

128
00:09:04,239 --> 00:09:08,245
everything that it could possibly find,
namely the five nodes in the same

129
00:09:08,245 --> 00:09:11,294
connected component as node number one. And
of course all of the five of these nodes

130
00:09:11,294 --> 00:09:15,793
will be marked as explored. So, now we
return once that exits. We return to the

131
00:09:15,793 --> 00:09:19,844
other for loop we increment I, we go to I
equals two, where we'll say, oh, have we already

132
00:09:19,844 --> 00:09:24,005
explored node number two? No we have
not. And so now we invoke BFS again from

133
00:09:24,005 --> 00:09:28,224
node number two. So that'll be the sixth
node we explore. There's not much to do

134
00:09:28,224 --> 00:09:31,588
from two. All we can do is go to node
number four. So, that's the seventh node

135
00:09:31,588 --> 00:09:36,167
we explore. That BFS terminates finding
the nodes in this connected component.

136
00:09:36,167 --> 00:09:40,115
Then we go back to the outer for loop. We
increment i to three. We say ah, have we already

137
00:09:40,115 --> 00:09:43,461
seen node number three? Yes we have. We
saw that in the first breadth first

138
00:09:43,461 --> 00:09:47,046
search. So we certainly don't bother to
BFS from this node three. Then we

139
00:09:47,046 --> 00:09:51,130
increment i to four, have we seen four?
Yes we have, in the second call to BFS.

140
00:09:51,130 --> 00:09:56,002
Have we seen node five? Yes we have, in
the first call to BFS. Have we seen node

141
00:09:56,002 --> 00:09:59,670
six? No, we have not. So the final
invocation of breadth-first search begins

142
00:09:59,670 --> 00:10:04,832
from node number six. That's gonna be the
eighth node overall that we see. And then

143
00:10:04,832 --> 00:10:08,567
we're gonna see the nodes eight and ten in
some order. So for example, maybe we first

144
00:10:08,567 --> 00:10:13,044
explore node number eight. That's a N,
that's one of the first layer in this

145
00:10:13,044 --> 00:10:17,028
component, and then node number ten is the
other node of the first layer in this

146
00:10:17,028 --> 00:10:21,336
component. So, in general what's going on,
well, what we know about, what we know

147
00:10:21,336 --> 00:10:24,418
what will happen when we invoke breadth
first search from a given node I, we're

148
00:10:24,418 --> 00:10:28,525
going to discover exactly the nodes in I's
connected component. Right? Anything where

149
00:10:28,525 --> 00:10:32,361
there's a path from I to that node, we'll
find it. That's the BFS guarantee.

150
00:10:32,361 --> 00:10:36,051
That's also the definition of a connected
component. All the other nodes, which have

151
00:10:36,051 --> 00:10:42,251
a path to, to I. Another thing that I hope
is clear from the example, but you know,

152
00:10:42,251 --> 00:10:47,754
just to reiterate it, is every breadth
first search call when you explore a node,

153
00:10:47,754 --> 00:10:51,523
you remember that through the entire for
loop. Right? So when we invoke breadth

154
00:10:51,523 --> 00:10:54,796
first search from node number one, we
explore nodes one, three, five, seven, and

155
00:10:54,796 --> 00:10:59,206
nine, and we keep those marked as explored
for the rest of this, for the rest of this

156
00:10:59,206 --> 00:11:02,221
algorithm. Right? So and that's so that we
don't do redundant work when we get to

157
00:11:02,221 --> 00:11:06,451
later stages of the for loop. So, as far
as what does this algorithm accomplish?

158
00:11:06,451 --> 00:11:11,075
Well, it certainly finds every connected
component. Alright, there is absolutely no

159
00:11:11,075 --> 00:11:15,226
way it can miss a node because this
for loop literally walks through the

160
00:11:15,226 --> 00:11:18,725
nodes, all of them, one at a time. And so
you're not gonna miss a node. Moreover, we

161
00:11:18,725 --> 00:11:21,617
know that, you know, as soon as you f-,
hit a connected component for the first

162
00:11:21,617 --> 00:11:25,182
time, and you do breadth first search from
that node, you're gonna find the whole

163
00:11:25,182 --> 00:11:29,030
thing. That's the breadth first search
guarantee. As far as what's the running

164
00:11:29,030 --> 00:11:33,203
time? Well it's going to be exactly what
we want. It's going to be linear time

165
00:11:33,203 --> 00:11:38,443
which again means proportional to the
number of edges plus the number of

166
00:11:38,443 --> 00:11:41,507
vertices. And again, depending on the
graph, one of these might be bigger than

167
00:11:41,507 --> 00:11:46,434
the other. So why is it O of M plus N?
Well as far as the nodes we have to do

168
00:11:46,434 --> 00:11:50,595
this initialization there where we marked
them all as unexplored so that takes

169
00:11:50,595 --> 00:11:55,301
constant time per node. We have just the
basic overhead of a for loop so that's

170
00:11:55,301 --> 00:12:00,374
constant time per node and then we have
this check. Constant time per node so

171
00:12:00,374 --> 00:12:05,178
that's O of N. And then recall we proved
that within breadth first search you do

172
00:12:05,178 --> 00:12:08,677
amount of work proportional, you do
constant time for each node in that

173
00:12:08,677 --> 00:12:11,703
connected component. Now each of the nodes
of the graph is in exactly one of the

174
00:12:11,703 --> 00:12:16,788
connected components, so you'll do constant
time for each node in the BFS in which

175
00:12:16,788 --> 00:12:22,610
you discover that node. So, that's again O
of N over all of the connected components. And as far as the edges, no, we

176
00:12:22,610 --> 00:12:26,266
don't even bother to look at edges until
we're inside one of these breadth first

177
00:12:26,266 --> 00:12:29,517
search calls. They play no role in the
outer for loop or in the preprocessing,

178
00:12:29,517 --> 00:12:32,701
and remember what we proved about an
invocation of breadth first search. The

179
00:12:32,701 --> 00:12:37,347
running time, you only do constant amount
of work per edge in the connected

180
00:12:37,347 --> 00:12:40,678
component that you're exploring. In the
worst case you look at the edge once from

181
00:12:40,678 --> 00:12:43,963
either end point and each of that triggers
a constant amount of work. So when you

182
00:12:43,963 --> 00:12:48,642
discover a given connected component the
edge work is proportional to the number of edges in

183
00:12:48,642 --> 00:12:52,760
that connected component. Each edge of the graph is
only, is in exactly one of the connected

184
00:12:52,760 --> 00:12:56,900
components. So over this entire for loop
over all of these BFS calls for each edge

185
00:12:56,900 --> 00:13:01,186
of the graph you'll only be responsible
for a constant amount of work to the

186
00:13:01,186 --> 00:13:05,439
algorithm. So summarizing, because
breadth first search from a given

187
00:13:05,439 --> 00:13:08,566
starting node does, works in time
proportional to the size of that

188
00:13:08,566 --> 00:13:11,896
component. Piggy backing on that sub
routine, and looping over all of the nodes

189
00:13:11,896 --> 00:13:15,901
of the graph we find all of the connected 
components, in time proportional to the

190
00:13:15,901 --> 00:13:18,561
number of edges and nodes in the entire
graph.