1 00:00:00,300 --> 00:00:04,240 So here's the Pregel model. 2 00:00:04,240 --> 00:00:08,040 It's an internal project which is used within Google. 3 00:00:08,040 --> 00:00:16,220 It distributes vertices to processors or nodes in this diagram, initialization you 4 00:00:16,220 --> 00:00:19,060 partition the vertices across the processors 5 00:00:19,060 --> 00:00:22,662 and then vertices are given edges. 6 00:00:22,662 --> 00:00:25,680 So then they read edges and then edges are given to the appropriate 7 00:00:25,680 --> 00:00:32,170 vertex, so vertex will get all the edges which outgo from that vertex. 8 00:00:32,170 --> 00:00:34,995 And each processor will have a whole bunch of vertices. 9 00:00:36,830 --> 00:00:37,530 And then 10 00:00:39,650 --> 00:00:45,175 vertices will exchange messages as per the Pregel program, that is written by 11 00:00:45,175 --> 00:00:45,450 [UNKNOWN] 12 00:00:45,450 --> 00:00:46,740 application. 13 00:00:46,740 --> 00:00:51,560 The nodes will some messages will be exchanged within a node itself when the 14 00:00:51,560 --> 00:00:58,260 destination vertex is within that node and others will have to travel between nodes. 15 00:00:58,260 --> 00:01:02,420 When the destination vertex is actually on another processor. 16 00:01:03,520 --> 00:01:10,645 This coordination of messages precedes in what I call super steps, 17 00:01:10,645 --> 00:01:14,825 we will come to that in the later chart. 18 00:01:14,825 --> 00:01:20,530 So, after all nodes have sort of delivered and received one round of messages. 19 00:01:20,530 --> 00:01:25,920 The master ensures that there's a synchronization across all processors. 20 00:01:25,920 --> 00:01:29,180 And then the next step proceeds. 21 00:01:29,180 --> 00:01:33,030 So everybody's sends messages and then the messages get delivered. 22 00:01:33,030 --> 00:01:34,908 Then again it does some computation. 23 00:01:34,908 --> 00:01:39,669 There's, everybody sends messages, messages get delivered 24 00:01:40,780 --> 00:01:45,230 and so on and so forth until there are no more messages getting delivered. 25 00:01:46,500 --> 00:01:50,100 We'll, we'll see examples of algorithms implemented using 26 00:01:50,100 --> 00:01:54,030 this Pregel model in, in a, in a bit. 27 00:01:55,460 --> 00:01:57,940 There are different implementations, Pregel itself is 28 00:01:57,940 --> 00:02:00,340 internal to Google as we mentioned just 29 00:02:00,340 --> 00:02:01,690 a minute ago. 30 00:02:01,690 --> 00:02:05,530 Giraph is an open source implementation of the 31 00:02:05,530 --> 00:02:08,790 Pregel model built on top of Hadoop, but there's, 32 00:02:08,790 --> 00:02:12,660 they're essentially only mappers and not reducers and all 33 00:02:12,660 --> 00:02:15,650 the message parsing communication happens through remote procedure calls. 34 00:02:17,060 --> 00:02:21,710 It, like Pregel itself, Giraph distributes vertices across nodes. 35 00:02:23,420 --> 00:02:24,630 GPS is another, 36 00:02:26,035 --> 00:02:31,850 implementation of the, of the graph model, based on the Pregel paradigm. 37 00:02:31,850 --> 00:02:34,120 It also distributes vertices at, across processors, 38 00:02:34,120 --> 00:02:37,600 but it also takes care of appropriately 39 00:02:37,600 --> 00:02:41,740 repartitioning the vertices between processors to make 40 00:02:41,740 --> 00:02:45,260 sure there is less communication, between processors. 41 00:02:45,260 --> 00:02:51,320 So this is an important additional, feature that GPS does, for you whereas 42 00:02:51,320 --> 00:02:54,685 in Giraffe you have to do anything like that yourself. 43 00:02:54,685 --> 00:03:00,010 GraphLab is yet another, graph processing model. 44 00:03:00,010 --> 00:03:05,070 very similar, from the programming perspective, to Pregel, in terms of every 45 00:03:05,070 --> 00:03:11,180 vertex is a program, but, in here, edges are distributed across nodes. 46 00:03:11,180 --> 00:03:14,850 So the information about a single vertex is actually distributed across processors. 47 00:03:15,860 --> 00:03:16,340 this is quite 48 00:03:16,340 --> 00:03:17,450 an interesting model. 49 00:03:17,450 --> 00:03:20,730 Obviously the architecture is very different from, from this one, 50 00:03:20,730 --> 00:03:26,010 because you're not distributing nodes, to vertices to nodes, but edges. 51 00:03:27,170 --> 00:03:30,020 We won't go to the great details of the implementation architectures, 52 00:03:30,020 --> 00:03:34,690 but will look at a few applications of the Pregel model. 53 00:03:34,690 --> 00:03:40,420 to some interesting problems. So, here is the Pregel programming model. 54 00:03:40,420 --> 00:03:41,620 Just like we had the map produce 55 00:03:41,620 --> 00:03:44,224 programming model where you had to write a map function and 56 00:03:44,224 --> 00:03:48,890 a reduced function, here we have to write code for every vertex. 57 00:03:50,260 --> 00:03:55,370 Each vertex executes code once the graph is noded and the assumption is that 58 00:03:55,370 --> 00:03:57,980 the process of loading the graph makes 59 00:03:57,980 --> 00:04:01,220 sure that each vertex knows it's outgoing edges. 60 00:04:01,220 --> 00:04:02,740 It may not have any idea of what it's 61 00:04:02,740 --> 00:04:05,870 incoming edge is, but it knows its own outgoing edges. 62 00:04:06,960 --> 00:04:09,610 Vertices are distributed across processors, so each 63 00:04:09,610 --> 00:04:13,260 processor executes the code for all its vertices. 64 00:04:13,260 --> 00:04:18,690 So, it doesn't update for every vertex that it, has. 65 00:04:19,730 --> 00:04:21,110 in what is called a superstep. 66 00:04:22,720 --> 00:04:28,310 After the superstep, is completed, a second superstep starts. 67 00:04:28,310 --> 00:04:31,950 And so on until the program halts, as we shall see in a minute. 68 00:04:33,410 --> 00:04:38,826 So each vertex first receives messages from all its in-neighbors. 69 00:04:38,826 --> 00:04:40,380 So, who where is sending those messages 70 00:04:40,380 --> 00:04:43,420 on its incoming edges, that's some computation 71 00:04:43,420 --> 00:04:46,250 using those, these messages, and then decides 72 00:04:46,250 --> 00:04:48,860 to send some messages to its out-neighbors, possibly. 73 00:04:50,810 --> 00:04:52,560 After that, it decides whether or not to halt. 74 00:04:54,000 --> 00:04:58,460 Or it means, what a halt means is that a vertex stays 75 00:04:58,460 --> 00:05:00,990 inactive once it's halted. 76 00:05:00,990 --> 00:05:04,430 Unless it receives a message from some other vertex in the future. 77 00:05:05,640 --> 00:05:11,160 If this does not happen, and all vertices are halted, then the computation halts. 78 00:05:11,160 --> 00:05:16,540 So the master, as you mentioned in the previous, chart, is responsible for making 79 00:05:16,540 --> 00:05:23,550 sure that the messages are sent and received from vertices, 80 00:05:24,870 --> 00:05:29,710 as well as check it's, where there are not, all the vertices are in halt state. 81 00:05:29,710 --> 00:05:32,370 If they are, then it ends the computation. 82 00:05:32,370 --> 00:05:35,460 So let's do a couple of, more examples using 83 00:05:35,460 --> 00:05:38,669 the Pregel programming model to solve a few important problems. 84 00:05:40,050 --> 00:05:45,110 A simple one is called max-node. We're essentially given 85 00:05:45,110 --> 00:05:50,330 values across all the vertices, we would like to find out the maximum 86 00:05:50,330 --> 00:05:53,690 of these and make sure that every vertex at the end 87 00:05:53,690 --> 00:05:57,280 has the maximum value across all the vertices in the graph. 88 00:05:58,550 --> 00:06:00,730 So this is a simple program. 89 00:06:00,730 --> 00:06:05,960 You send your value along your outgoing edges 90 00:06:05,960 --> 00:06:10,190 and then start receiving values from incoming edges. 91 00:06:10,190 --> 00:06:13,430 For each incoming message, if you get a value 92 00:06:13,430 --> 00:06:15,390 which is greater than your value then you replace 93 00:06:15,390 --> 00:06:16,652 yourself with that value. 94 00:06:16,652 --> 00:06:23,330 And and continue and don't, don't halt but if, if you never 95 00:06:23,330 --> 00:06:28,640 got, if you did not get any message which was greater than your value then you halt. 96 00:06:28,640 --> 00:06:31,940 Let's see what happens if this is executed for every node. 97 00:06:32,990 --> 00:06:35,762 In the first step what happens is that these 98 00:06:35,762 --> 00:06:38,931 are bi directional edges, which are mentioned over here. 99 00:06:38,931 --> 00:06:40,456 So this means there's an edge 100 00:06:40,456 --> 00:06:42,800 from three to six as well as from six to three. 101 00:06:42,800 --> 00:06:51,190 In the first step three comes to six, six doesn't do anything be because it's 102 00:06:51,190 --> 00:06:55,590 already the maximum and it's never going to get a message greater than itself. 103 00:06:55,590 --> 00:06:57,550 But three becomes a six, right? 104 00:06:58,560 --> 00:07:05,970 Similarly two gets a message from one. but it doesn't change because one 105 00:07:05,970 --> 00:07:09,890 is less than two. But one gets six, so it changes. 106 00:07:09,890 --> 00:07:12,720 So, six and two did not change. 107 00:07:12,720 --> 00:07:16,430 And therefore they halt. So they're shown as shaded. 108 00:07:36,470 --> 00:07:40,880 So it sends so one sends its value to six, six does nothing with it 109 00:07:42,155 --> 00:07:47,560 six sends its value to two, which now wakes up because it, it was 110 00:07:47,560 --> 00:07:51,230 sleeping, it was halted, but it got an incoming message so it wakes up and 111 00:07:51,230 --> 00:07:55,970 this super, next super step and checks that, yes it did get a new value. 112 00:07:55,970 --> 00:08:01,580 So it becomes six and then it sends it's value out to it's outgoing edges 113 00:08:01,580 --> 00:08:03,130 which doesn't do anything over there. 114 00:08:03,130 --> 00:08:08,650 So at the end, all the vertices are still, are halted, the computation 115 00:08:08,650 --> 00:08:13,600 ends, and as you can see, all the vertices have the maximum value of six.