1 00:00:00,240 --> 00:00:03,130 Let's look at one final problem and solve it in 2 00:00:03,130 --> 00:00:07,200 pregel, which is the problem of finding the shortest path 3 00:00:07,200 --> 00:00:10,260 from a particular source vertex in the graph, such as 4 00:00:10,260 --> 00:00:14,070 the left-most vertex here, to every other vertex in the graph. 5 00:00:15,150 --> 00:00:20,100 We also assume that in general, the edges can have weights, so we would like 6 00:00:20,100 --> 00:00:22,810 to compute the shortest path, taking into account 7 00:00:22,810 --> 00:00:25,420 the distance traveled along every edge, but for 8 00:00:25,420 --> 00:00:27,460 the moment, let's assume that these weights are just 1. 9 00:00:27,460 --> 00:00:31,190 So, here's how we compute the shortest path 10 00:00:31,190 --> 00:00:34,650 to every vertex from the source in pregel. 11 00:00:34,650 --> 00:00:37,540 Each vertex has an estimate of its shortest path. 12 00:00:37,540 --> 00:00:41,390 It starts off at infinity, and it waits 13 00:00:41,390 --> 00:00:45,519 for messages from its neighboring nodes via incoming edges. 14 00:00:46,520 --> 00:00:51,420 And if the estimate that it gets from its incoming 15 00:00:51,420 --> 00:00:59,266 messages is less than its current estimate replaces itself by the new estimate. 16 00:00:59,266 --> 00:01:00,820 And then what, what it sends out is its 17 00:01:00,820 --> 00:01:05,500 current estimate plus the edge weight of every outgoing edge. 18 00:01:05,500 --> 00:01:09,730 So, if we do that, then the messages work out to be the correct ones, and 19 00:01:09,730 --> 00:01:13,090 eventually we get the shortest path from the 20 00:01:13,090 --> 00:01:16,490 source to every vertex sitting inside the vertex values. 21 00:01:17,640 --> 00:01:20,770 So, at the end, every vertex ends up 22 00:01:20,770 --> 00:01:26,300 having the shortest distance from the source to itself. 23 00:01:27,300 --> 00:01:30,436 I'll leave it to you to homework to figure out now how one 24 00:01:30,436 --> 00:01:32,830 can compute the shortest path from the 25 00:01:32,830 --> 00:01:35,300 source to every vertex using this information. 26 00:01:36,730 --> 00:01:40,440 Let's walk through this example step-by-step. 27 00:01:40,440 --> 00:01:42,770 Initially, all the vertices have infinity. 28 00:01:43,920 --> 00:01:49,480 The source vertex figures out that it's, it is the source. 29 00:01:49,480 --> 00:01:54,205 And its m becomes 0 and therefore it assigns 30 00:01:54,205 --> 00:01:59,391 itself a new value which is 0. And sends out 31 00:01:59,391 --> 00:02:04,352 0 plus e, which is just 1, so 0 plus 1, out on 32 00:02:04,352 --> 00:02:09,323 its outgoing edge. Every other vertex does nothing, and 33 00:02:09,323 --> 00:02:13,010 at the end all the vertices halt at the end of the super step. 34 00:02:15,750 --> 00:02:19,160 In the second super step, all the vertices are halted, but 35 00:02:20,170 --> 00:02:24,410 the second vertex over here gets a message from its neighbor 36 00:02:24,410 --> 00:02:29,100 and therefore, it wakes up and checks whether that message is 37 00:02:29,100 --> 00:02:33,130 less than infinity, it is, and it assigns itself to be 1. 38 00:02:33,130 --> 00:02:38,160 And sends out 1 plus 1, which is 2, on this outgoing edge. 39 00:02:39,360 --> 00:02:41,620 Similarly, in the third step, 40 00:02:41,620 --> 00:02:43,830 initially, everybody is asleep. 41 00:02:43,830 --> 00:02:47,018 This vertex gets something, a 2 on its incoming edge. 42 00:02:47,018 --> 00:02:50,845 Figures out it's less than infinity, assigns itself to 43 00:02:50,845 --> 00:02:54,410 2, and sends out 3 on its outgoing edge. 44 00:02:54,410 --> 00:02:58,950 And finally the middle vertex would be this vertex over 45 00:02:58,950 --> 00:03:03,160 here assigns itself a 3, and then goes to sleep. 46 00:03:05,180 --> 00:03:12,350 It sends out before that, 3 plus 1, 4 on both these edges, but that 47 00:03:14,500 --> 00:03:20,430 doesn't cause any change to either of these vertices the 48 00:03:20,430 --> 00:03:25,529 second one and the last one and therefore they don't send any further messages out. 49 00:03:25,529 --> 00:03:29,890 and no further computation takes place because there are 50 00:03:29,890 --> 00:03:33,320 no messages flowing to wake nodes up, everybody's asleep. 51 00:03:33,320 --> 00:03:34,340 And the computation halts. 52 00:03:35,610 --> 00:03:39,600 Shortest path computations like this are quite useful in many 53 00:03:39,600 --> 00:03:43,828 tasks including planning. For example, think of a automated 54 00:03:43,828 --> 00:03:49,350 self-driving car trying to get from one end of a parking lot, O 55 00:03:49,350 --> 00:03:55,300 to its designated parking space, G navigating the the 56 00:03:55,300 --> 00:03:59,952 parking lot in a manner so that it doesn't bump into any existing parked cars. 57 00:03:59,952 --> 00:04:05,100 One can think of this as a graph problem with each cell 58 00:04:05,100 --> 00:04:10,790 being a vertex and the edges going from allowed parts from 59 00:04:10,790 --> 00:04:16,438 one vertex to another, according to this grid, avoiding the the parked cars. 60 00:04:16,438 --> 00:04:23,805 And then to find the shortest path from O to G one could just find the shortest path 61 00:04:23,805 --> 00:04:29,090 from O to every vertex and which includes G and then, one can just follow that path. 62 00:04:29,090 --> 00:04:30,140 Of course, 63 00:04:30,140 --> 00:04:34,160 this particular algorithm would find the shortest distance from O to G. 64 00:04:34,160 --> 00:04:37,130 And as I've already mentioned, it's a little piece of homework for you to 65 00:04:37,130 --> 00:04:43,490 figure out how to backtrack effectively, and find the shortest path from O to G. 66 00:04:43,490 --> 00:04:46,530 Or, for that matter, from the source to any other vertex. 67 00:04:47,910 --> 00:04:50,630 That's the end of our discussion of pregel. 68 00:04:50,630 --> 00:04:52,690 let's summarize and see what we've learned. 69 00:04:54,490 --> 00:04:55,740 pregel is very 70 00:04:55,740 --> 00:04:58,930 appropriate if the problem can be mapped to a graph. 71 00:04:58,930 --> 00:05:02,065 But the caveat is that all the data and especially the graph itself, 72 00:05:02,065 --> 00:05:05,678 with all it's edges, need to fit in total memory, across all the machines. 73 00:05:07,560 --> 00:05:10,350 If you can do that however, you avoid reloading data between iterations. 74 00:05:10,350 --> 00:05:14,166 And so, we often find out that the pregel implementation beats 75 00:05:14,166 --> 00:05:19,620 map reduce especially on things like page rank, shortest paths, et cetera. 76 00:05:19,620 --> 00:05:20,740 In fact, for the case, 77 00:05:20,740 --> 00:05:25,892 case of shortest paths, it's especially ba-, good because in page rank at 78 00:05:25,892 --> 00:05:31,150 least, every vertex was doing everything iteration, but in, in shortest parts. 79 00:05:31,150 --> 00:05:36,190 In the beginning phase only, the vertices near the, the origin or the goal 80 00:05:36,190 --> 00:05:38,390 are actually doing anything, and, and, over 81 00:05:38,390 --> 00:05:40,900 time the activity spreads throughout the graph. 82 00:05:40,900 --> 00:05:43,800 So a large part of the graph are not even active. 83 00:05:43,800 --> 00:05:47,320 So there's no point reloading the entire graph. 84 00:05:47,320 --> 00:05:48,060 as is done in map-reduce. 85 00:05:48,060 --> 00:05:52,150 Or, as it is required in map-reduce, is really adds to a lot of inefficiency. 86 00:05:53,650 --> 00:05:56,320 But the caveat for the pregel, an important part is that you 87 00:05:56,320 --> 00:06:00,250 need to properly distribute vertices to machines because they are there forever. 88 00:06:00,250 --> 00:06:02,720 I mean, in map-reduce at least you can rely on the fact that every 89 00:06:02,720 --> 00:06:05,300 time you're reloading things randomly, so hopefully 90 00:06:05,300 --> 00:06:06,858 most of the time you'll get things right. 91 00:06:06,858 --> 00:06:09,028 But in pregel you do it wrong once 92 00:06:09,028 --> 00:06:12,668 and, and, and you have an in-balanced computation and 93 00:06:12,668 --> 00:06:15,639 you're stuck for the entire set of iterations. 94 00:06:15,639 --> 00:06:18,939 And so, you need to do some kind of graph partitioning. 95 00:06:18,939 --> 00:06:21,399 As mentioned earlier, GPS is an implementation 96 00:06:21,399 --> 00:06:24,159 of the pregel model which does graph partitioning 97 00:06:24,159 --> 00:06:25,959 automatically as a first step in the 98 00:06:25,959 --> 00:06:28,688 iteration other, other cases, you do it yourself. 99 00:06:28,688 --> 00:06:33,444 Many important features of pregel of the pregel model we haven't covered. 100 00:06:33,444 --> 00:06:34,218 sometimes 101 00:06:34,218 --> 00:06:34,648 [COUGH] 102 00:06:34,648 --> 00:06:41,880 the messages that are being sent out from vertices to others are actually identical. 103 00:06:43,162 --> 00:06:47,747 so there's no point sending them again and again across processors. 104 00:06:47,747 --> 00:06:51,597 So messages which are actually identical that means their, 105 00:06:51,597 --> 00:06:55,307 their content is identical, they can be just grouped together 106 00:06:55,307 --> 00:06:55,657 [COUGH] 107 00:06:55,657 --> 00:06:58,737 and only their, their actual, their addresses can 108 00:06:58,737 --> 00:07:01,440 be stored and sent to the other processor. 109 00:07:01,440 --> 00:07:06,580 Or the, the destination vertices can be can be kept, kept track of. 110 00:07:06,580 --> 00:07:11,332 Second thing that we haven't covered is aggregation, that I mentioned earlier 111 00:07:11,332 --> 00:07:16,150 that how the aggregation allows vertices to compute a global function together. 112 00:07:17,730 --> 00:07:19,790 and that allows you to check for things like convergence. 113 00:07:19,790 --> 00:07:21,390 We haven't talked about that. 114 00:07:21,390 --> 00:07:25,520 And very important feature is called, is graph mutations. 115 00:07:25,520 --> 00:07:29,060 Vertices can actually add new vertices, new edges into the graph, and 116 00:07:29,060 --> 00:07:35,550 this is actually very useful for doing interesting things using the graph model. 117 00:07:35,550 --> 00:07:36,770 We won't talk about those here, 118 00:07:36,770 --> 00:07:39,350 but they're extreme interesting things which become 119 00:07:39,350 --> 00:07:41,380 very cumbersome and map-reduce get done very 120 00:07:41,380 --> 00:07:43,691 easily by using things like graph mutation.