Okay, so this video isn't about any particular graph problem or any particular graph algorithm; it's just the preliminaries we need to discuss algorithms on graphs: how do we measure their size, how do we represent them, and so on.

Remember what a graph is? It really has two ingredients. First of all, there's the set of objects we're talking about. Those might be called vertices; synonymously, we might call them nodes. We represent pairwise relationships using edges. These can be undirected, in which case they're unordered pairs, or an edge can be directed from one vertex to another, in which case they're ordered pairs and we have a directed graph.

Now, when we talk about, say, the size of a graph or the running time of an algorithm that operates on a graph, we need to think about what we mean by input size. In particular, for a graph there are really two different parameters that control how big it is, unlike an array. For arrays, we just had a single number, the length. For graphs, we have the number of vertices and the number of edges. Usually, we'll use the notation n for the number of vertices and m for the number of edges.

So the next quiz asks you to think about how the number of edges m can depend on the number of vertices n. In particular, for this quiz, I want you to think about an undirected graph. It has n vertices and no parallel edges, okay? So for a given pair of vertices, there's either zero or one edge between them. Moreover, let's assume that the graph is connected. So I don't want you to think about graphs that have zero edges. Now, I haven't formally defined what it means for a graph to be connected yet, but I hope you get the idea: it means the graph is in one piece, you can't break it into two parts that have no edges crossing between them. So, for such a graph, no parallel edges, in one piece, n vertices, think about what is the minimum number of edges it could possibly have, and what is the maximum number of edges, as a function of n, that it could possibly have.

All right, so the correct option is the first one.
The fewest number of edges a connected undirected graph can have is n - 1, and the maximum number of edges an undirected graph with no parallel edges can have is n(n - 1)/2, better known as n choose 2.

So why does it need at least n - 1 edges if it's going to be in one piece? Well, think about adding the edges one at a time. Initially, you just have a graph with zero edges, so the graph has n different pieces: n isolated vertices and no edges at all. Now, each time you add one edge, the best you can do is take two of the existing pieces and fuse them into one. So the number of different pieces of the graph can decrease by at most one each time you add an edge. From a starting point of n different pieces, you've got to get down to one piece, so that requires the addition of at least n - 1 edges. You can also convince yourself of this by drawing a few pictures and noticing that trees achieve this bound exactly. For example, here is a four-vertex tree that has three edges, so this is a case where m is indeed n - 1.

Now, for the upper bound, why can't you have more than n choose 2? Well, it's clear that the largest number of edges you can have is in the complete graph, where every single pair of vertices has an edge between them. Again, there are no parallel edges and edges are unordered, so there are at most n choose 2 possible places to put an edge. So again, if n equals 4, here is an example with the maximum possible number of edges, six.

So now that I've got you thinking about how the number of edges can vary with the number of vertices, let me talk about the distinction between sparse and dense graphs. It's important to distinguish between these two concepts because some data structures and algorithms are better suited for sparse graphs, and other data structures and algorithms are better suited for dense graphs. To make this precise, let me put down some very common notation: n is the number of vertices of the graph under discussion, and m is the number of edges. This is quite standard notation; please get used to it and use it yourself. If you reverse these, you will confuse a lot of people who are familiar with graph algorithms and data structures.
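To make the quiz bounds concrete, here is a small sketch (my own, not from the lecture) that computes both quantities for a given n; the function name edge_bounds is just an illustration.

```python
from math import comb

def edge_bounds(n):
    """Min and max edge counts for a connected, simple (no parallel edges),
    undirected graph on n vertices."""
    min_edges = n - 1             # a spanning tree achieves this
    max_edges = n * (n - 1) // 2  # the complete graph: n choose 2
    return min_edges, max_edges

print(edge_bounds(4))   # (3, 6), matching the four-vertex examples above
print(comb(4, 2))       # 6, the same maximum via "n choose 2"
```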
Now, one thing we learned from the previous quiz is the following. In most applications, not all applications, but most, m is at least linear in n. Remember from the quiz that m is at least n - 1 if you want the graph to be connected, and m is also O(n squared), under the assumption that there are no parallel edges. Now, there are cases where we want to allow parallel edges; in fact, we'll do that in the contraction algorithm for the minimum cut problem. There are also cases where we want to allow the number of edges to drop so low that the graph breaks into multiple pieces, for example when we talk about connected components. But more often than not, we're thinking about a connected graph with no parallel edges, and then we can pin down the number of edges m to be somewhere between linear in n and quadratic in n.

Now, I'm not going to give you a super formal definition of what a sparse or a dense graph is, and people are a little loose with this terminology in practice. But basically, sparse means you're closer to the lower bound, closer to linear, and dense means you're closer to the upper bound, closer to quadratic. I know this leaves some ambiguity when the number of edges is something like n to the three halves; usually, in that case, you'd think of it as a dense graph. So usually anything more than n times some logarithmic factor, you'd think of as a dense graph. But again, people are a little bit sloppy with this when they talk about graphs.

Next, I want to discuss two representations of graphs. We're mostly going to be using the second one in this course, but the first one, the adjacency matrix, I do want to mention briefly, just on this slide. This is a super natural idea where you represent the edges of a graph using a matrix. Let me describe it first for undirected graphs. The matrix is denoted by capital A, and it's a square n by n matrix, where n is the number of vertices of the graph. The semantics are that the (i, j) entry of the matrix is 1 if and only if there's an edge between the vertices i and j in the graph. I'm assuming here that the vertices are named 1, 2, 3, 4, and so on, all the way up to n.
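As a concrete illustration, here is a minimal sketch (my own, not code from the course) of building this 0/1 adjacency matrix from an edge list, using 0-based vertex labels instead of the 1-to-n naming above.

```python
def adjacency_matrix(n, edges):
    """Build the n-by-n 0/1 adjacency matrix of an undirected graph.

    n     -- number of vertices, labeled 0 .. n-1
    edges -- iterable of pairs (i, j), no parallel edges assumed
    """
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1   # undirected: the matrix is symmetric
    return A

# The four-vertex tree from the quiz discussion: 3 edges, so m = n - 1.
print(adjacency_matrix(4, [(0, 1), (1, 2), (1, 3)]))
```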
It's easy to add bells and whistles to the adjacency matrix to accommodate parallel edges, to accommodate edge weights, or to accommodate directed arcs, that is, directed edges. If you wanted to have parallel edges, you could just have the (i, j) entry denote the number of edges between i and j. If edges have different weights, you can just have the (i, j) entry be the weight of the edge between i and j. And for a directed graph, you can use plus ones and minus ones: if the arc is directed from i to j, you'd set the (i, j) entry to +1; if the arc is directed from j to i, you'd set the (i, j) entry to -1.

There are many metrics by which you can evaluate a data structure or a representation; two important ones I want to discuss here. First of all, the amount of resources it requires, and in this context that's the amount of space the data structure needs. The second is, what operations does the data structure support? So let's begin with space requirements. What are they for the adjacency matrix? All right, so the answer, at least with the straightforward way of storing a matrix, is Theta(n squared), and this is independent of the number of edges. You could try to beat this down for sparse graphs using sparse matrix tricks, but for the basic idea of actually representing an n by n matrix, you've got n squared entries, and you've got to store one bit in each, whether the edge is there or not. So that's going to give you Theta(n squared) space. The constants are, of course, very small, because you're just storing one bit per entry, but nonetheless this is quadratic in the number of vertices. Now, that's going to be fine if you have a dense graph; if the number of edges is as high as n squared, then you're not really wasting anything in this representation. But for a sparse graph, if m is much closer to linear, then this is a super wasteful representation.
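Picking up the bells-and-whistles point from a moment ago, here is one way the plus-one/minus-one directed variant might look (again my own sketch, with the same 0-based labels as before; combining it with weights would be an analogous tweak).

```python
def directed_matrix(n, arcs):
    """The +1/-1 convention for directed graphs: +1 in entry (i, j) if
    there is an arc i -> j, and -1 in entry (j, i) for that same arc."""
    A = [[0] * n for _ in range(n)]
    for i, j in arcs:     # arc directed from i to j
        A[i][j] = +1
        A[j][i] = -1
    return A

# For weighted graphs, the analogous tweak is to store the weight of the
# (i, j) edge in entry (i, j) instead of a 0/1 indicator.
print(directed_matrix(3, [(0, 1), (1, 2)]))
```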
Let's talk about the adjacency list representation. This is the dominant one we'll be using in this class. It has several ingredients. First, you keep track of both the vertices and the edges as independent entities, so you're going to have an array or a list of each. Then we want these two arrays to cross-reference each other in the obvious way: given a vertex, you want to know which edges it's involved in, and given an edge, you want to know what its endpoints are. So, most simply, each edge is going to have two pointers, one for each of its two endpoints; in a directed graph, of course, we'd also keep track of which endpoint is the head and which is the tail. And each vertex is going to point to all of the edges of which it's a member. In an undirected graph, it's clear what I mean by that. In a directed graph, you could do it in a couple of ways. Generally, you'd have a vertex keep track of all of the edges for which it is the tail, that is, all of the edges you can follow one hop out from that vertex. If you wanted to, at the expense of more storage, you could also have a second array where the vertex keeps track of the edges pointing to it, the edges for which it's the head.

So let me ask you the same question I did for the adjacency matrix: what is the space required by an adjacency list, as a function of the number of edges m and the number of vertices n of the graph?

The correct answer to this question is the third option: Theta(m + n), which we're going to think of as linear space in the size of the graph. This quiz is a little tricky, so to explain the answer, let me return to the slide with the ingredients of adjacency lists, and let's compute the space for each of these four ingredients separately. Most of them are straightforward. Consider the first ingredient: this is just an array or a list of the n vertices, and we need only constant space per vertex to keep track of its existence, so this is going to be Theta(n), linear in the number of vertices. Similarly, for the m edges, we just need linear space in the number of edges to remember their existence, so that's Theta(m). Next, each edge has to keep track of both of its endpoints; that's two pointers, but two is a constant, so for each of the m edges we have constant space to keep track of its endpoints, and that gives us another Theta(m), a constant per edge. Now, for the fourth ingredient you might be feeling kind of nervous, because a vertex in principle could have edges involving all n - 1 of the other vertices, so the number of pointers at a single vertex could be Theta(n).
And you have n vertices, so in principle that could be as much as Theta(n squared), and certainly in something like a complete graph you really would have that much. But the point is that in sparse graphs, Theta(n squared) is way overkill compared to the space actually needed by this fourth set of pointers. Actually, if you think about it, for each pointer in the fourth category, a vertex pointing to a given edge, there's a pointer in the third category pointing in the opposite direction, from that edge back to that vertex. So there's actually a one-to-one correspondence between pointers in the third category and pointers in the fourth category. Since the third category takes space Theta(m), so do all of the pointers in the fourth category.

So, adding up over the four ingredients, we have one Theta(n) and three Theta(m)'s, and that gives us Theta(m + n) overall. If you'd prefer, another way to think about this is Theta(max(m, n)); these are the same up to a constant factor. Now, as we discussed on a previous slide, often m is going to be bigger than n, but I wanted to do a generic analysis here, which applies even if the graph is not connected, even if it's in multiple pieces. So the space of an adjacency list is, within a constant factor, the same as the size of the graph, the number of vertices plus the number of edges. In that sense, that's exactly what you want.
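To make these four ingredients concrete, here is a minimal adjacency-list sketch in the same spirit (my own illustration, not code from the course): each edge record stores its two endpoints, and each vertex keeps a list of the edges it belongs to, so the total space is proportional to n + m.

```python
def adjacency_list(n, edges):
    """Adjacency-list representation of an undirected graph.

    Returns (vertex_to_edges, edge_endpoints): edge_endpoints[e] is the pair
    of endpoints of edge e, and vertex_to_edges[v] lists the indices of the
    edges incident to vertex v.  Total space is Theta(n + m).
    """
    edge_endpoints = list(edges)                 # Theta(m): one record per edge
    vertex_to_edges = [[] for _ in range(n)]     # Theta(n): one list per vertex
    for e, (u, v) in enumerate(edge_endpoints):
        vertex_to_edges[u].append(e)             # one pointer per vertex-edge
        vertex_to_edges[v].append(e)             # incidence: Theta(m) in total
    return vertex_to_edges, edge_endpoints

# The four-vertex tree again: vertex 1 is incident to all three edges.
incidence, endpoints = adjacency_list(4, [(0, 1), (1, 2), (1, 3)])
print(incidence)    # [[0], [0, 1, 2], [1], [2]]
print(endpoints)    # [(0, 1), (1, 2), (1, 3)]
```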
Now, being confronted with these two graph representations that I've shown you, I'm sure you're asking: which one should you remember, which one should you use? And the answer, as it so often is, is that it depends. It depends on two things: the density of your graph, that is, how m compares to n, and what kind of operations you want to support. Now, given what we're covering in this class, and also the motivating applications I have in mind, I can give you a basically clean answer to this question for the purposes of these five weeks, which is that we're going to be focusing on adjacency lists. The reason we're going to focus on adjacency lists in this class is both of these reasons: because of the operations we want, and because of the graph density in the motivating applications.

So, first of all, most of the graph primitives, not all, but most, will be dealing with graph search, and adjacency lists are perfect for doing graph search. You're at a node, you follow an outgoing arc to another node, you follow another outgoing arc, and so on. So adjacency lists are the perfect thing for graph search. Adjacency matrices are definitely good for certain kinds of graph operations, but they're not things we're really going to be covering in this class. So that's reason one.

Reason two is that a lot of the motivation for graph primitives these days comes from massive networks. I mentioned earlier how the web can be fruitfully thought of as a directed graph, where the vertices are individual web pages and directed arcs correspond to hyperlinks, going from the page containing the hyperlink to the page the hyperlink points to. Now, it's hard to get an exact measurement of the web graph, but a conservative lower bound on the number of vertices is something like 10 billion, that is, 10 to the 10. That's pushing the limits of what computers can do, but it's within the limits: if you work hard, you can actually operate on graphs with 10 to the 10 nodes. Now, suppose we used an adjacency matrix representation. If n is 10 to the 10, then n squared is going to be something like 10 to the 20, an absolutely enormous number. So that is clearly not feasible now, and it's not going to be feasible ever. The adjacency matrix representation is totally out for huge sparse graphs like the web graph. Adjacency lists, on the other hand: the average degree in the web is thought to be something like 10, so the number of edges is only going to be something like 10 to the 11, and the adjacency list representation will be proportional to that. Again, that's really pushing what we can do with current technology, but it is within the limits. So using that representation, we can do nontrivial computations on graphs, even at the scale of the web graph.
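As a rough back-of-envelope check of these scale arguments (my own numbers and assumptions: one bit per matrix entry, and on the order of one 8-byte machine word per vertex, edge, and incidence pointer in the adjacency list):

```python
n = 10**10          # conservative vertex count for the web graph
m = 10 * n          # assuming average degree around 10, so m ~ 10^11

# Adjacency matrix: n^2 entries at one bit each.
matrix_bytes = n * n / 8
print(f"matrix: ~{matrix_bytes:.1e} bytes")   # ~1.2e+19 bytes -- hopeless

# Adjacency list: a constant number of 8-byte words per vertex and per edge
# (say one per vertex, a few per edge for endpoints and incidence pointers).
list_bytes = 8 * (n + 3 * m)
print(f"list:   ~{list_bytes:.1e} bytes")     # ~2.5e+12 bytes -- a few terabytes
```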