1 00:00:00,000 --> 00:00:07,801 . As we'll see, in little bit later in the 2 00:00:07,801 --> 00:00:13,177 lecture the way that we're going to set up our graph processing algorithms, is to 3 00:00:13,177 --> 00:00:19,018 develop an API to cover our representation of the graph and provide simple set of 4 00:00:19,018 --> 00:00:22,271 methods for clients to call to process the graph. 5 00:00:22,271 --> 00:00:28,991 So, let's take a look at that in detail. So, the idea is we have to represent, a, 6 00:00:29,281 --> 00:00:34,214 graph within the computer. One of the first things, to remember is 7 00:00:34,214 --> 00:00:40,380 that, you can, draw a graph, that maybe, that provides some kind of intuition, 8 00:00:40,597 --> 00:00:45,748 about the structure, an, but you could have two drawings that represent, 9 00:00:45,748 --> 00:00:49,520 represent the same graph that look pretty different. 10 00:00:49,725 --> 00:00:55,414 And so one of the things to remember in any graph representation is, it can give 11 00:00:55,414 --> 00:00:59,459 you some intuition. But that intuition may be misleading. 12 00:00:59,664 --> 00:01:04,600 And we'll just remember that as we look at different representations. 13 00:01:04,600 --> 00:01:08,450 So first thing is how to represent the vertices. 14 00:01:08,688 --> 00:01:15,120 So what we're going to do in this lecture is just use integers between zero and V 15 00:01:15,120 --> 00:01:18,932 minus one. The reason we do that is it allows us, to 16 00:01:18,932 --> 00:01:23,776 use vertex indexed arrays, within the graph representation. 17 00:01:23,776 --> 00:01:29,494 And we understand, from earlier algorithms, lectures, that we can use a 18 00:01:29,494 --> 00:01:34,674 symbol table to convert names to integers, with a symbol table. 19 00:01:34,913 --> 00:01:40,104 And so, we'll, leave that part as a symbol table application. 20 00:01:40,104 --> 00:01:46,173 And, just work with graphs with vertex names between zero and V minus one where v 21 00:01:46,173 --> 00:01:51,692 is the number of vertices. Now we have to remember when we're doing 22 00:01:51,692 --> 00:01:56,700 representation we can have various anomalies in the graph. 23 00:01:56,935 --> 00:02:03,351 So we can we draw edges, but actually, in real data we might have multiple edges or 24 00:02:03,351 --> 00:02:09,064 we might have a self loop. And we'll take that into account when we 25 00:02:09,064 --> 00:02:12,820 look at the representation, graph representation. 26 00:02:12,820 --> 00:02:16,027 So here is the API that we're going to use. 27 00:02:16,027 --> 00:02:21,546 So again our graph processing algorithms are going to be clients of this API. 28 00:02:21,770 --> 00:02:25,499 The idea is that will use this to build graphs. 29 00:02:25,723 --> 00:02:30,496 And then our processing programs will be client at this program, 30 00:02:30,720 --> 00:02:35,493 In this, of this API. And the idea is most of them have a very 31 00:02:35,493 --> 00:02:41,833 simple operations that they need to do, and those are the one's that we put in the 32 00:02:41,833 --> 00:02:44,220 API. So we have two constructors. 33 00:02:44,448 --> 00:02:48,248 One that creates an empty graph would be vertices. 34 00:02:48,248 --> 00:02:52,352 Another one that creates a graph from an input stream. 35 00:02:52,580 --> 00:02:58,812 Then the basic operation for building a graph is just add edge, so that adds an 36 00:02:58,812 --> 00:03:04,838 edge connecting two vertices, V and W. The basic operations for processing our 37 00:03:04,838 --> 00:03:09,597 graph, well there's V and E that gives a number of edges and vertices. 38 00:03:09,804 --> 00:03:14,080 But then there's an iterator that takes the vertex's argument, 39 00:03:14,080 --> 00:03:19,260 Then iterates through the vertices adjacent to that vertex. 40 00:03:19,260 --> 00:03:25,456 All of our graph processing can be cast in terms of this iterator. 41 00:03:25,456 --> 00:03:33,061 So down here is an example of a client program that prints out every edge in the 42 00:03:33,061 --> 00:03:37,756 graph. So we created an input stream, maybe with 43 00:03:37,756 --> 00:03:43,108 a given final name, create a graph from that input stream, 44 00:03:43,108 --> 00:03:49,492 And we'll look at the code for that. And then the client program for every 45 00:03:49,492 --> 00:03:53,276 vertex, So remember the vertices are numbered 46 00:03:53,276 --> 00:03:57,467 between zero and G dot the number of vertices minus one. 47 00:03:57,468 --> 00:04:04,058 So for every vertex, we iterate through all the vertices adjacent to that vertex 48 00:04:04,293 --> 00:04:11,420 and print out an edge, if, if there's an edge connecting v and w we print out v and 49 00:04:11,420 --> 00:04:15,805 then a little dash to indicate an edge and then w. 50 00:04:15,805 --> 00:04:22,618 This actually prints out every edge twice and in an undirected graph because if the 51 00:04:22,618 --> 00:04:28,570 v and w are connected by an edge then w appears in v's adjacency list and v 52 00:04:28,570 --> 00:04:34,707 appears in w's adjacency list. So heres an example of a running that 53 00:04:34,707 --> 00:04:41,938 client, if, if we have a file tinyG.txt, our standard is we have a number of 54 00:04:41,938 --> 00:04:48,137 vertices as an integer in the first is the first integer in the file. 55 00:04:48,137 --> 00:04:55,110 Number of edges is the second integer in the file and then pairs of vertex names. 56 00:04:55,680 --> 00:05:00,626 And so. The constructor will read those two things 57 00:05:00,626 --> 00:05:07,271 and then call that edge for all of these pairs of things and that enables this 58 00:05:07,271 --> 00:05:12,737 client, to, if we run it for that graph, to print out all the edges. 59 00:05:12,737 --> 00:05:19,128 So everybody adjacent to zero, 61 and five, everybody adjacent to one is just 60 00:05:19,128 --> 00:05:24,008 zero, So you notice the edge 0-1 appears twice in the list. 61 00:05:24,008 --> 00:05:28,550 So that's the sample client of our basic graph in API. 62 00:05:28,550 --> 00:05:35,853 And so here's some typical and simpicle, simple graph processing code that uses the 63 00:05:35,853 --> 00:05:40,039 API. So you can write a static method that 64 00:05:40,039 --> 00:05:46,768 takes a graph and a vertex as argument. And returns the degree, the number of 65 00:05:46,768 --> 00:05:53,580 edges that are connect, number of vertices that are connected by an edge to V. 66 00:05:53,806 --> 00:06:00,454 So all it does is set a local variable degree to zero, And then iterate through 67 00:06:00,454 --> 00:06:05,290 all the vertices adjacent to V and increment that and return it. 68 00:06:05,290 --> 00:06:11,067 Similarly, you can compute the maximum degree of a vertex in a graph. 69 00:06:11,067 --> 00:06:17,779 And that's for every vertex, compute the degree and find the biggest one or the 70 00:06:17,779 --> 00:06:22,197 average degree. Well, the average degree, if you think 71 00:06:22,197 --> 00:06:28,381 about it, it's just twice the number of edges divided by the vertex or you could 72 00:06:28,381 --> 00:06:33,928 go through all the way through every vertex and edge and compute the total and 73 00:06:33,928 --> 00:06:39,545 divide, but this is probably a much more efficient way to do it or say number of 74 00:06:39,545 --> 00:06:45,232 self loops. and so that involves going through the whole graph of every vertex 75 00:06:45,232 --> 00:06:51,621 for every edge adjacent to that vertex you check whether it's v and we've got a self 76 00:06:51,621 --> 00:06:56,045 loop. And if it does then your return the number 77 00:06:56,045 --> 00:07:00,960 of self loops that divided by two because every edge is counted twice. 78 00:07:00,960 --> 00:07:07,158 So those are examples of static methods that a client might use. 79 00:07:07,423 --> 00:07:13,958 In just example of the use of the ATI. So now how we going to implement that? 80 00:07:14,166 --> 00:07:20,067 That's our usual standard of lets look at some clients and now let's talk about a 81 00:07:20,067 --> 00:07:24,510 representation that we can use to implement the graph API. 82 00:07:24,510 --> 00:07:30,626 So one possible representation is, set of edges representation, where, for every 83 00:07:30,626 --> 00:07:36,815 edge, we just maintain a list, maybe an array of edges or a linked list, of edges. 84 00:07:37,034 --> 00:07:42,568 So for every edge in the graph, there's, a representation of it. 85 00:07:42,786 --> 00:07:48,320 And that one is a possible representation, but it leads to, inefficient, 86 00:07:49,121 --> 00:07:55,092 implementation, much less efficient, that would make it unusable for, the huge 87 00:07:55,092 --> 00:08:01,263 graphs that we see in practice. Another one is called the adjacency matrix 88 00:08:01,263 --> 00:08:06,423 representation, Here we maintain a 2-dimensional v x v 89 00:08:06,423 --> 00:08:10,426 array, It's a boolean array, 0-1 or true or 90 00:08:10,426 --> 00:08:16,168 false. And for every edge v-w in the graph you 91 00:08:16,168 --> 00:08:22,930 put true for row v in column w and for row w in column v. 92 00:08:22,930 --> 00:08:30,251 So it's actually two representations of each edge in an adjacency matrix graph 93 00:08:30,251 --> 00:08:35,187 representation. S, you can immediately, give a v and w 94 00:08:35,187 --> 00:08:39,688 test whether there's an edge connecting v and w. 95 00:08:39,688 --> 00:08:47,470 But that's one of the few operations that's efficient with this representation 96 00:08:47,470 --> 00:08:54,596 and it's not very widely used, because for a huge graph, say with billions of, of, 97 00:08:54,596 --> 00:08:58,581 vertices, You would have to have billion squared 98 00:08:58,581 --> 00:09:04,489 number of entries in this array, which is going to be too big for your computer, 99 00:09:04,489 --> 00:09:08,527 most likely. So this one actually isn't that widely 100 00:09:08,527 --> 00:09:11,566 used. A, The one that is most widely used in 101 00:09:11,566 --> 00:09:15,699 practice, and the one that we'll stick with, is called the adjacency list 102 00:09:15,699 --> 00:09:20,778 representation. And that's where we keep a vertex index 103 00:09:20,778 --> 00:09:28,849 array where, for every vertex, we maintain a list of the vertices that are adjacent 104 00:09:28,849 --> 00:09:33,589 to that. So, for example, vertex four, Has, five, 105 00:09:33,596 --> 00:09:39,184 is connected to five, six, and three, so its list has five, six, and three on it. 106 00:09:39,184 --> 00:09:44,982 Now, on lower level representations we've talk about using linked width or array for 107 00:09:44,982 --> 00:09:48,893 these, But actually in modern lingo with the 108 00:09:48,893 --> 00:09:54,761 background that we'd built with algorisms what we're going to use is an abstract 109 00:09:54,761 --> 00:09:58,742 data type. Our bag representation for this, which is 110 00:09:58,742 --> 00:10:04,190 implemented with a linked list, but we don't have to think about it when we're 111 00:10:04,190 --> 00:10:08,540 talking about graphs. we'd keep the vertices, 112 00:10:08,793 --> 00:10:14,107 Names of, numbers of the vertices that are adjacent to each given vertex in a bag. 113 00:10:14,296 --> 00:10:19,440 And we know that we can implement it, such that we can iterate through and time 114 00:10:19,440 --> 00:10:24,678 proportional to the number of entries and the space taken is also proportional to 115 00:10:24,678 --> 00:10:28,703 the number of entries,, And that's going to enable us to process 116 00:10:28,703 --> 00:10:35,751 huge graphs. So here's the full implementation of the 117 00:10:36,152 --> 00:10:40,697 Jason C, List graph representation. 118 00:10:41,098 --> 00:10:49,190 So, the private instance variables that we're gonna use area the number of 119 00:10:49,190 --> 00:10:57,236 vertices in the graph and then a array of nags of integers. so data the types and 120 00:10:57,236 --> 00:11:03,911 set of values, set operations on those values, so those are the sets of values 121 00:11:04,161 --> 00:11:09,084 for a graph. So here's the constructor of an empty 122 00:11:09,084 --> 00:11:15,807 graph with V vertices. We keep the value v in the instance 123 00:11:15,807 --> 00:11:18,850 variable as usual. Then we create. 124 00:11:19,145 --> 00:11:23,979 An array of size V. And, of bags of integers. 125 00:11:24,275 --> 00:11:31,969 And, store that array in, [inaudible] as the adjacency array of the graph. 126 00:11:31,969 --> 00:11:36,527 Adjacency lists array. And then, as usual. 127 00:11:36,527 --> 00:11:42,395 When we create, a, a, an array of objects. We go through. 128 00:11:42,395 --> 00:11:49,954 And for every entry in the array, we initialize with, a, empty object. 129 00:11:49,954 --> 00:11:54,430 So, after this code, we have the empty bags. 130 00:11:54,430 --> 00:12:02,557 And so that's the constructor and then the other main engine and building graphs is 131 00:12:02,557 --> 00:12:09,963 at edge and so, to add an edge between V and W, we add W to V's bag, and we add V 132 00:12:09,963 --> 00:12:11,860 to W's bag. That's it. 133 00:12:11,860 --> 00:12:19,583 And to iterate through all the vertices adjacent to a given vertex we simply 134 00:12:19,583 --> 00:12:26,184 return the bag which is iterable. This is a nice example, illustrating the 135 00:12:26,184 --> 00:12:30,934 power of abstraction. Because we did the low level processing, 136 00:12:31,157 --> 00:12:37,021 for, that, that's involved, with our bag implementation in one of the early, 137 00:12:37,244 --> 00:12:41,103 lectures. And now we, we get to use that to give a 138 00:12:41,103 --> 00:12:47,116 very compact implementation, and efficient implementation, of the, graph data 139 00:12:47,116 --> 00:12:50,753 structure. So it's really important to understand 140 00:12:50,753 --> 00:12:54,687 this code. And you should make sure, that you study 141 00:12:54,687 --> 00:12:57,234 it. So, as I mentioned in practice. 142 00:12:57,446 --> 00:13:01,391 We're gonna be using this adjacency list representation. 143 00:13:01,602 --> 00:13:07,379 Because all the algorithms are based on iterating over the vertices adjacent to V. 144 00:13:07,379 --> 00:13:12,099 And this gets that done in time proportional to the number of such 145 00:13:12,099 --> 00:13:15,904 vertices. So that's number one and number two, in 146 00:13:15,904 --> 00:13:20,412 the real world, the graphs have lots of num, lots of vertices, 147 00:13:20,412 --> 00:13:27,901 But a pretty small vertex degre. so number one, we can afford to represent, represent 148 00:13:27,901 --> 00:13:34,262 the graph when we use the adjacency list representation because basically our space 149 00:13:34,262 --> 00:13:40,258 is proportional to the number of edges. And number two, we can afford to process 150 00:13:40,258 --> 00:13:46,619 it because our time taken is proportional to the number of edges that, that we 151 00:13:46,619 --> 00:13:52,176 examined, with the ray of edges representation in the adjacency matrix 152 00:13:52,176 --> 00:13:57,062 representation it gets very slow for some very simple task, 153 00:13:57,062 --> 00:14:03,080 But, mostly, it's very slow for iterating over the vertices given to, adjacent to a 154 00:14:03,080 --> 00:14:06,055 given vertex. Which is the key operation. 155 00:14:06,263 --> 00:14:11,936 A list of edges, you have to go through all the edges to find the ones adjacent to 156 00:14:11,936 --> 00:14:15,741 a given vertex. Adjacency matrix, you have to go through, 157 00:14:15,948 --> 00:14:21,206 all possible, vertices adjacent and that's just going to be much too slow in 158 00:14:21,206 --> 00:14:24,734 practice, Because adjacency list gets it done, in 159 00:14:24,734 --> 00:14:30,879 time proportional degree of v, which is much smaller, for the huge graph that we 160 00:14:30,879 --> 00:14:34,682 see in the real world. So that's our basic API. 161 00:14:34,682 --> 00:14:39,760 Next, we'll look at some algorithms that are clients of this API.