Okay, so this video's not about any particular graph problem, not about any particular graph algorithm. It's just, sort of, the preliminaries we need to discuss algorithms on graphs. How do we measure their size? How do we represent them, and so on. Remember what a graph is: it really has two ingredients. First of all, there's the set of objects we're talking about. Those might be called vertices; synonymously, we might call them nodes. We represent pairwise relationships using edges. These can be undirected, in which case they're unordered pairs, or an edge can be directed from one vertex to another, in which case they're ordered pairs and we have a directed graph.

Now, when we talk about, say, the size of a graph, or the running time of an algorithm that operates on a graph, we need to think about what we mean by input size. In particular, for a graph there are really two different parameters that control how big it is, unlike an array. For arrays, we just had a single number, the length. For graphs, we have the number of vertices and we have the number of edges. Usually we'll use the notation n for the number of vertices and m for the number of edges. So the next quiz will ask you to think about how the number of edges m can depend on the number of vertices n. In particular, I want you to think about, in this quiz, an undirected graph. It has n vertices. There are no parallel edges, so for a given pair of vertices there's either zero or one edge between them. Moreover, let's assume that the graph is connected, so I don't want you to think about graphs that have zero edges. Now, I haven't formally defined yet what it means for a graph to be connected, but I hope you get the idea: it means it's in one piece, you can't break it into two parts that have no edges crossing between them. So, for such a graph, no parallel edges, in one piece, n vertices, think about what is the minimum number of edges it could possibly have, and what is the maximum number of edges, as a function of n, that it could possibly have.

All right, so the correct option is the first one. The fewest number of edges that a connected undirected graph can have is n minus 1, and the maximum number of edges that an undirected graph with no parallel edges can have is n times n minus 1 over 2, better known as n choose 2. So why does it need at least n minus 1 edges if it's going to be in one piece? Well, think about adding the edges of the graph one at a time.
Initially, you just have a graph with zero edges: the graph is in n different pieces, n isolated vertices, with no edges at all. Now, each time you add one edge, the best you can do is take two of the existing pieces and fuse them into one. So the maximum decrease in the number of different pieces of the graph is a decrease of 1 each time you add an edge. So from a starting point of n different pieces, you've got to get down to 1 piece, and that requires the addition of at least n minus 1 edges. You can also convince yourself of this bound by drawing a few pictures and noticing that trees achieve it exactly; for example, here is a 4-vertex tree that has 3 edges, so this is a case where m is indeed n minus 1. Now, for the upper bound, why can't you have more than n choose 2 edges? Well, it's clear that the largest number of edges is achieved by the complete graph, where every single pair of vertices has an edge between them. Again, there are no parallel edges and edges are unordered, so there are at most n choose 2 possibilities of where to put an edge. So again, if n equals 4, here would be an example with the maximum possible number of edges, namely 6.

So, now that I've got you thinking about how the number of edges can vary with the number of vertices, let me talk about the distinction between sparse and dense graphs. It's important to distinguish between these two concepts because some data structures and algorithms are better suited for sparse graphs, and other data structures and algorithms are better suited for dense graphs. So, to make this precise, let me just put down this very common notation: n is the number of vertices of the graph under discussion, and m is the number of edges. This is quite standard notation. Please get used to it and use it yourself. If you reverse these, you will confuse a lot of people who have familiarity with graph algorithms and data structures. Now, one thing we learned from the previous quiz is the following: in most applications, not all applications, but most, m is at least linear in n. Remember, in the quiz we saw that m is at least n minus 1 if you want the graph to be connected, and it's also big O of n squared. This is under the assumption that there are no parallel edges. Now, there are cases where we want to allow parallel edges; in fact, we'll do that in the contraction algorithm for the minimum cut problem.
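To make the quiz answer concrete, here is a minimal sketch in Python; the helper names are my own, purely for illustration. It checks the two bounds for n equals 4: a path, which is one kind of tree, hits the lower bound of n minus 1 edges, and the complete graph hits n choose 2.

```python
def min_edges_connected(n):
    # A connected graph on n vertices needs at least n - 1 edges:
    # each added edge can fuse at most two pieces into one.
    return n - 1

def max_edges_simple(n):
    # With no parallel edges and unordered (undirected) pairs, there are
    # at most "n choose 2" places to put an edge.
    return n * (n - 1) // 2

n = 4
tree_edges = [(1, 2), (2, 3), (3, 4)]                   # a 4-vertex tree: 3 edges
complete_edges = [(i, j) for i in range(1, n + 1)
                         for j in range(i + 1, n + 1)]  # every unordered pair: 6 edges
assert len(tree_edges) == min_edges_connected(n)        # 3 == n - 1
assert len(complete_edges) == max_edges_simple(n)       # 6 == n choose 2
```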
There are also cases where we want to allow the number of edges to drop so low that the graph breaks into multiple pieces, for example when we talk about connected components. But more often than not, we're thinking about a connected graph with no parallel edges, and then we can pin down the number of edges m to be somewhere between linear in the number of nodes, that is, linear in n, and quadratic in n. Now, I'm not going to give you a super formal definition of what a sparse or a dense graph is, and people are a little loose with this terminology in practice. But basically, sparse means you're closer to the lower bound, closer to linear, and dense means you're closer to the upper bound, closer to quadratic. Now, I know this leaves ambiguity when the number of edges is something like n to the 3/2; usually in that case you'd think of it as a dense graph. So usually anything which is more than n times logarithmic factors, you'd think of as a dense graph. But again, people are a little bit sloppy with this when they talk about graphs.

Next, I want to discuss two representations of graphs. We're mostly going to be using the second one in this course, but the first one, the adjacency matrix, I do want to mention briefly, just on this slide. This is the very natural idea where you represent the edges of a graph using a matrix. Let me describe it first for undirected graphs. The matrix is going to be denoted by capital A, and it's a square n by n matrix, where n is the number of vertices of the graph. The semantics are that the ij-th entry of the matrix is 1 if and only if there's an edge between the vertices i and j in the graph. I'm assuming here that the vertices are named 1, 2, 3, 4, et cetera, all the way up to n. It's easy to add bells and whistles to the adjacency matrix to accommodate parallel edges, to accommodate edge weights, or to accommodate directed arcs, that is, directed edges. If you wanted to have parallel edges, you could just have Aij denote the number of edges between i and j. If edges have different weights, you could just have Aij be the weight of the ij edge. And for a directed graph, you could use plus ones and minus ones: if the arc is directed from i to j, you'd set Aij to be plus 1, and if the arc is directed from j to i, you'd set Aij to minus 1.
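As a quick sketch of that convention in Python (the function name, the list-of-lists layout, and the 1-based vertex labels are just my illustration), an adjacency matrix for an undirected graph could be built like this, with the weighted and directed variants indicated in the comments.

```python
def adjacency_matrix(n, edges):
    """Undirected adjacency matrix: A[i][j] = 1 iff there is an edge {i, j}.
    Vertices are named 1..n, so row and column 0 are simply unused."""
    A = [[0] * (n + 1) for _ in range(n + 1)]
    for (i, j) in edges:
        A[i][j] = 1
        A[j][i] = 1            # undirected, so the matrix is symmetric
    return A

# Variants mentioned above, as comments only:
#   parallel edges: A[i][j] counts the number of edges between i and j
#   edge weights:   A[i][j] stores the weight of the edge {i, j}
#   directed arcs:  A[i][j] = +1 and A[j][i] = -1 for an arc from i to j

A = adjacency_matrix(4, [(1, 2), (2, 3), (3, 4), (1, 4)])   # a 4-cycle
print(A[1][2], A[2][1], A[1][3])                            # -> 1 1 0
```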
There are many metrics by which you can evaluate a data structure or a representation; two important ones I want to discuss here. First of all, there's the amount of resources it requires, and in this context, that's the amount of space the data structure needs. The second thing is which operations the data structure supports. So let's just begin with space requirements. What are they for the adjacency matrix? All right, so the answer, at least with the straightforward way of storing a matrix, is n squared, and this is independent of the number of edges. You could try to beat this down for sparse graphs using sparse matrix tricks, but for the basic idea of actually representing an n by n matrix, you've got n squared entries, and you've got to store one bit in each, recording whether or not the edge is there. So that's going to give you n squared space. The constants are, of course, very small, because you're just storing one bit per entry, but nonetheless this is quadratic in the number of vertices. Now, that's going to be fine if you have a dense graph; if the number of edges is as high as n squared, then you're not really wasting anything in this representation. But in a sparse graph, if m is much closer to linear, then this is a super wasteful representation.

Let's talk about the adjacency list representation; this is the dominant one we'll be using in this class. It has several ingredients. First, you keep track of both the vertices and the edges as independent entities, so you're going to have an array, or a list, of each. Then we want these two arrays to cross-reference each other in the obvious way: given a vertex, you want to know which edges it's involved in, and given an edge, you want to know what its endpoints are. So, first and most simply, each edge is going to have two pointers, one for each of its two endpoints. In a directed graph, of course, it would also keep track of which endpoint is the head and which is the tail. Next, each vertex is going to point to all of the edges of which it's a member. In an undirected graph, it's clear what I mean by that. In a directed graph, you could do it in a couple of ways. Generally, you'd have each vertex keep track of all of the edges for which it is the tail, that is, all of the edges you can follow one hop out from that vertex. If you wanted to, you could also have a second array, at the expense of more storage, where the vertex also keeps track of the edges pointing to it, the edges for which it's the head.
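Here is a minimal sketch of those ingredients in Python; the dictionary-of-lists layout and the names `endpoints` and `incident` are my own choices for illustration, not anything prescribed by the lecture.

```python
def adjacency_list(n, edge_pairs):
    """The four ingredients above, for an undirected graph.
    Vertices are named 1..n; each edge is identified by its index in edge_pairs."""
    vertices = list(range(1, n + 1))
    edges = list(range(len(edge_pairs)))            # one id per edge
    endpoints = {e: edge_pairs[e] for e in edges}   # edge -> its two endpoints
    incident = {v: [] for v in vertices}            # vertex -> edges it belongs to
    for e, (u, v) in endpoints.items():
        incident[u].append(e)
        incident[v].append(e)
    return vertices, edges, endpoints, incident

# The 4-vertex tree from before: edges {1,2}, {2,3}, {3,4}.
V, E, endpoints, incident = adjacency_list(4, [(1, 2), (2, 3), (3, 4)])
print(incident[2])    # -> [0, 1]   (vertex 2 lies on edges {1,2} and {2,3})
print(endpoints[1])   # -> (2, 3)
```

For a directed graph, you would instead keep, for each vertex, the list of edges for which it is the tail, and optionally a second list of the edges for which it is the head.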
So, let me ask you the same question I asked for the adjacency matrix: what is the space requirement of an adjacency list, as a function of the number of edges m and the number of vertices n of the graph? The correct answer to this question is the third option, theta of m plus n, which we're going to think of as linear space in the size of the graph. This quiz is a little tricky, so let's explain the answer by returning to the slide with the ingredients of adjacency lists and computing the space for each of the four ingredients separately.

Most of them are straightforward. For example, consider the first ingredient. This is just an array, or a list, of the n vertices, and we need only constant space per vertex to keep track of its existence, so this is going to be theta of n, linear in the number of vertices. Similarly, for the m edges, we just need linear space in the number of edges to remember their existence, so that's going to be theta of m. Now, each edge has to keep track of both of its endpoints. That's two pointers, but two is a constant, so for each of the m edges we need only constant space to keep track of its endpoints; that's going to give us another theta of m, constant per edge. Now, about this fourth category you might be feeling kind of nervous, because a vertex could, in principle, have edges involving all n minus 1 of the other vertices, so the number of pointers at a single vertex could be theta of n. And since you have n vertices, that could be theta of n squared overall; certainly in something like a complete graph you really would have that many pointers. But the point is that in sparse graphs, n squared is way overkill compared to the space actually needed by this fourth set of pointers. In fact, if you think about it, for each pointer in the fourth category, a vertex pointing to a given edge, there is a pointer in the third category pointing in the opposite direction, from that edge back to that vertex. So there's actually a one-to-one correspondence between pointers in the third category and pointers in the fourth category. Since the third category uses space theta of m, so do all of the pointers in the fourth category. So, adding up over the four ingredients, we have one theta of n and three theta of m's, and that's going to give us, overall, theta of m plus n. If you prefer, another way you could think about this would be theta of the max of m and n; these are the same up to a constant factor.
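Continuing the hypothetical sketch above (reusing its `V`, `E`, `endpoints`, and `incident` variables), you can see the one-to-one correspondence directly: the vertex-to-edge pointers and the edge-to-endpoint pointers both number exactly 2m, so the total bookkeeping is theta of m plus n.

```python
n, m = len(V), len(E)
edge_to_vertex = sum(len(pair) for pair in endpoints.values())  # two per edge: 2m
vertex_to_edge = sum(len(es) for es in incident.values())       # one per incidence: also 2m
assert edge_to_vertex == vertex_to_edge == 2 * m                 # the one-to-one correspondence
total_items = n + m + edge_to_vertex + vertex_to_edge            # Theta(m + n); here 4 + 3 + 6 + 6 = 19
```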
Now, as we discussed on a previous slide, often m is going to be bigger than n, but I wanted to do a generic analysis here which applies even if the graph is not connected, even if it's in multiple pieces. So the space of the adjacency list is, within a constant factor, the same as the number of ingredients in the graph, the number of vertices plus the number of edges. In that sense, it's exactly what you want.

Now, being confronted with these two graph representations that I've shown you, I'm sure you're asking: which one should you remember? Which one should you use? And the answer, as it so often is, is that it depends. It depends on two things: the density of your graph, that is, how m compares to n, and the kinds of operations you want to support. Given what we're covering in this class, and also the motivating applications I have in mind, I can give you a basically clean answer to this question for the purposes of these five weeks, which is that we're going to be focusing on adjacency lists. The reason we're going to focus on adjacency lists in this class is both of these reasons: both because of the operations we want to support and because of the graph density in the motivating applications. First of all, most of the graph primitives, not all, but most, will be dealing with graph search, and adjacency lists are perfect for doing graph search. You get to a node, you follow an outgoing arc, you go to another node, you follow an outgoing arc, and so on. So adjacency lists are the perfect thing for doing graph search. Adjacency matrices are definitely good for certain kinds of graph operations, but those aren't things we're really going to be covering in this class. So that's reason one. Reason two is that a lot of the motivation for graph primitives these days comes from massive, massive networks. I mentioned earlier how the web can be fruitfully thought of as a directed graph, where the vertices are individual web pages and directed arcs correspond to hyperlinks, going from the page containing the hyperlink to the page the hyperlink points to. Now, it's hard to get an exact measurement of the web graph, but a conservative lower bound on the number of vertices is something like 10 billion, so that's 10 to the 10. That's pushing the limits of what computers can do, but it's within the limits; if you work hard, you can actually operate on graphs with 10 to the 10 nodes.
Now, suppose we used an adjacency matrix representation. If n is 10 to the 10, then n squared is going to be like 10 to the 20, and now we're getting close to the estimated number of atoms in the known universe. So that is clearly not feasible now, and it's not going to be feasible ever; the adjacency matrix representation is totally out for huge sparse graphs like the web graph. Adjacency lists? Well, the average degree in the web is thought to be something like 10, so the number of edges is only going to be something like 10 to the 11, and the space of the adjacency list representation will be proportional to that. Again, that's really pushing what we can do with current technology, but it is within the limits, so using that representation we can do non-trivial computations on graphs, even at the scale of the web graph.
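As a back-of-the-envelope check of that comparison (the figures of 10 billion vertices and average degree 10 are the rough estimates quoted above, not measurements):

```python
n = 10**10              # rough lower bound on the number of web pages (vertices)
avg_degree = 10         # rough average number of hyperlinks per page
m = n * avg_degree      # about 10**11 edges

matrix_entries = n * n  # adjacency matrix: 10**20 entries, hopeless
list_items = n + 3 * m  # adjacency list: a few times 10**11 items, hard but feasible
print(f"{matrix_entries:.0e} entries vs {list_items:.0e} items")   # -> 1e+20 vs 3e+11
```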