1 00:00:00,361 --> 00:00:05,350 We've heard about page-rank in the very early lectures, as 2 00:00:05,350 --> 00:00:09,560 a way of finding which web page is most important. 3 00:00:09,560 --> 00:00:15,310 Page-rank is really a property of a graph and the graph structure. 4 00:00:16,870 --> 00:00:20,720 this is an example of one of the ways of computing the page 5 00:00:20,720 --> 00:00:24,300 rank, and it's a bit complicated, so I'll go through it a bit slowly. 6 00:00:24,300 --> 00:00:25,798 So the goal is to compute 7 00:00:25,798 --> 00:00:32,750 the page-rank for every node in a vertex, so there, there's, there are n vertices, 8 00:00:32,750 --> 00:00:38,470 in the, in a graph then, there'll be n values pi. 9 00:00:38,470 --> 00:00:39,720 One for every vertex. 10 00:00:41,500 --> 00:00:45,700 Further, every vertex has, l outgoing edges. 11 00:00:48,720 --> 00:00:52,900 So l of j is the number of outgoing edges 12 00:00:52,900 --> 00:00:57,970 that vertex j has. Now, the, 13 00:00:57,970 --> 00:01:02,580 the way you compute page-rank, one of the ways is, you start off with some random 14 00:01:02,580 --> 00:01:07,160 values at every vertex for p, just make everything one. 15 00:01:07,160 --> 00:01:13,790 Or, or, or half, or, or typically 1 by N is taken to the starting point and then 16 00:01:13,790 --> 00:01:17,350 you can, you, you, you do this kind of an update. 17 00:01:17,350 --> 00:01:23,690 So, you p of i at the next step becomes p of i times a factor. 18 00:01:23,690 --> 00:01:29,090 Typically d is something like 0.85, so this 19 00:01:29,090 --> 00:01:34,529 becomes 0.15 over N. So it's p of i multiplied 20 00:01:34,529 --> 00:01:39,900 by some factor, plus d which is say 0.85 21 00:01:39,900 --> 00:01:45,271 times the sum of the page ranks of all the vertices 22 00:01:45,271 --> 00:01:50,927 that point to i. That's j is pointing to i, divide 23 00:01:50,927 --> 00:01:56,260 by the number of links outgoing from j. 24 00:01:56,260 --> 00:02:00,140 So there's a vertex j which is pointing to i, which has a huge 25 00:02:00,140 --> 00:02:04,560 number of outgoing vertices, so the i is very unimportant for that vertex j. 26 00:02:04,560 --> 00:02:05,010 It's just one 27 00:02:05,010 --> 00:02:08,200 of millions of outgoing vertices for j, then its page 28 00:02:08,200 --> 00:02:12,670 rank, when it flows into j is not considered very important. 29 00:02:12,670 --> 00:02:19,970 However, if i is the only vertex that j points to then you know 30 00:02:19,970 --> 00:02:27,080 l is one and it, it's importance is much higher as it flows in to to i. 31 00:02:27,080 --> 00:02:30,161 So that's the, the intuition that you know, 32 00:02:30,161 --> 00:02:34,506 if I'm, if I'm being linked to from a page it's importance 33 00:02:34,506 --> 00:02:38,710 flows to me in a proportion to how important it thinks I am. 34 00:02:38,710 --> 00:02:43,310 Which is measured by what fraction of the total edges. 35 00:02:43,310 --> 00:02:48,200 That it points outwards to mi, that means, you know I just divided it. 36 00:02:48,200 --> 00:02:52,400 It, it divides its page rank equally along all 37 00:02:52,400 --> 00:02:55,520 its outgoing edges and, and I get that fraction 38 00:02:55,520 --> 00:02:56,420 of it. 39 00:02:56,420 --> 00:02:59,380 So if you can, if you, if you, 40 00:02:59,380 --> 00:03:04,560 if you iterate this again and again Ultimately these 41 00:03:04,560 --> 00:03:07,660 p's converge so they all settle down into a 42 00:03:07,660 --> 00:03:10,870 fixed value so they don't change after some time. 43 00:03:10,870 --> 00:03:14,860 and that value is the page rank of vertex i 44 00:03:14,860 --> 00:03:16,980 and similarly the same page rank of vertex j, etcetera. 45 00:03:16,980 --> 00:03:20,929 It's also global literation, because all the vertices are exchanging messages. 46 00:03:23,160 --> 00:03:25,360 Now let's see how you would do this in Pregel. 47 00:03:25,360 --> 00:03:27,780 To find the vertex, I have some estimate 48 00:03:27,780 --> 00:03:31,260 of my page rank, initially some random value. 49 00:03:31,260 --> 00:03:33,690 I, I send my value divided by the l, 50 00:03:33,690 --> 00:03:37,011 which is the number outgoing edges that I have, I 51 00:03:37,011 --> 00:03:39,927 send it to everybody that I can see on, on, 52 00:03:39,927 --> 00:03:44,500 an outgoing edge, and then I look for incoming messages. 53 00:03:44,500 --> 00:03:48,260 That would come from other vertices and then conduct this iteration. 54 00:03:48,260 --> 00:03:52,720 In other words, I collect all the incoming messages from all my incoming 55 00:03:54,770 --> 00:03:59,240 edges, and I simply add them up, multiply them by 56 00:03:59,240 --> 00:04:02,660 t and add them into one minus d by N. 57 00:04:02,660 --> 00:04:07,090 Times p and replace p with this value as per this formula. 58 00:04:08,090 --> 00:04:13,875 After a fixed number of such super steps, for example 30 which is often 59 00:04:13,875 --> 00:04:19,750 sufficient when we didn't stop and the page rank should have converted. 60 00:04:19,750 --> 00:04:22,810 On the other hand we could conduct convergence test to 61 00:04:22,810 --> 00:04:26,435 make sure that successive p values are hardly changing at all. 62 00:04:26,435 --> 00:04:28,885 But then we have to do a little bit more than 63 00:04:28,885 --> 00:04:33,514 is provided by the Pregel model that we have covered so far. 64 00:04:33,514 --> 00:04:39,362 which is that, we have to see whether the maximum difference between successive 65 00:04:39,362 --> 00:04:45,545 values across all vertices is still small. And that is the process of aggregation. 66 00:04:45,545 --> 00:04:50,081 Pregel allows for aggregation, we're not going to cover those, 67 00:04:50,081 --> 00:04:54,806 that feature much in detail here I'll cover it shortly. 68 00:04:54,806 --> 00:05:01,190 to see how this is this works, there's a sequential implementation of Pregel or 69 00:05:01,190 --> 00:05:07,220 the Pregel model, a very simple sequential implementation at this location. 70 00:05:07,220 --> 00:05:10,590 And I'll show you that code and how it computes 71 00:05:10,590 --> 00:05:11,700 the page rank. 72 00:05:11,700 --> 00:05:15,640 And I encourage you to study that and modify to compute 73 00:05:15,640 --> 00:05:18,390 a few other things to get to know Pregel a bit better. 74 00:05:19,750 --> 00:05:24,940 Here is the lightweight sequential simulation of 75 00:05:24,940 --> 00:05:27,700 Pregel from the link that I just mentioned. 76 00:05:29,040 --> 00:05:34,702 As you can see it's a toy single-machine version of Google's Pregel paradigm. 77 00:05:34,702 --> 00:05:35,806 clearly it's not for large 78 00:05:35,806 --> 00:05:39,080 scale processing, it's just a simulation. Let's see how it works. 79 00:05:39,080 --> 00:05:43,550 It has a class called Vertex, which has, it's obviously various things, including 80 00:05:43,550 --> 00:05:48,340 the value of the vertex, as well as a list of outgoing vertices. 81 00:05:48,340 --> 00:05:51,040 That means the ID, the IDs of the outgoing vertices. 82 00:05:51,040 --> 00:05:52,930 It also keeps track of things like incoming 83 00:05:52,930 --> 00:05:55,720 messages, and the number of supersteps being performed. 84 00:05:56,870 --> 00:05:59,399 Then Pregel, which is the class which does all the work. 85 00:06:01,020 --> 00:06:05,840 first sort of does things like partitioning the vertices across 86 00:06:08,880 --> 00:06:12,850 across different workers and each worker actually is, 87 00:06:14,018 --> 00:06:18,844 performing super steps. So if you look at the 88 00:06:18,844 --> 00:06:23,920 super step itself super step essentially looks at 89 00:06:23,920 --> 00:06:29,060 every partition. And tells the 90 00:06:29,060 --> 00:06:34,450 worker to do its work in every partition and the worker essentially ends up, 91 00:06:34,450 --> 00:06:37,570 running a super step and the super step in the end. 92 00:06:37,570 --> 00:06:40,820 After all this class hierarchy essentially for all the 93 00:06:40,820 --> 00:06:43,780 vertexes that it happens to own in that partition 94 00:06:46,680 --> 00:06:49,610 it cause vertex to update. 95 00:06:49,610 --> 00:06:54,168 There is no function vertex update provided in this class, and 96 00:06:54,168 --> 00:06:58,575 that's the function that you as a programmer have to write. 97 00:06:58,575 --> 00:07:01,239 By subclassing vertex, with a class of your 98 00:07:01,239 --> 00:07:04,935 own and and, and providing that update function. 99 00:07:04,935 --> 00:07:07,767 I won't go into the detail implementation of this, I mean those of you 100 00:07:07,767 --> 00:07:09,447 who know a little bit of object-oriented 101 00:07:09,447 --> 00:07:11,703 programming, it's fairly straight forward to figure 102 00:07:11,703 --> 00:07:17,714 out what's happening in this code. Or what we will do is look at how this 103 00:07:17,714 --> 00:07:22,509 basic Pregel module is 104 00:07:22,509 --> 00:07:27,852 been used to stimulate the page rank program, 105 00:07:27,852 --> 00:07:32,928 which we just described in our charts. 106 00:07:32,928 --> 00:07:37,068 So if we look at the page rank program using this module 107 00:07:37,068 --> 00:07:42,240 essentially import this module. and, then you simply 108 00:07:42,240 --> 00:07:47,386 write a page rank vertex class with subclasses vertex and or 109 00:07:47,386 --> 00:07:52,990 inherits from vertex and then it, you implement the update function. 110 00:07:52,990 --> 00:07:54,520 So what does the update function do? 111 00:07:54,520 --> 00:07:56,560 Very similar to our chart. 112 00:07:56,560 --> 00:07:59,880 if the number of super steps are less than 50 it 113 00:08:02,710 --> 00:08:07,380 changes its own value 2.5 by n times 0.85, times 114 00:08:07,380 --> 00:08:12,000 sum of everything that it gets from the incoming vertices. 115 00:08:12,000 --> 00:08:17,490 And then it sends it's own page rank value out after dividing by l. 116 00:08:18,890 --> 00:08:19,470 So very simple. 117 00:08:21,400 --> 00:08:23,620 And then it stops, after 50 steps it stops. 118 00:08:23,620 --> 00:08:26,700 There's no convergence test in this, this example. 119 00:08:26,700 --> 00:08:27,840 So when you when 120 00:08:27,840 --> 00:08:30,010 you, when you run this the way you 121 00:08:30,010 --> 00:08:33,159 run this is essentially called this program, it says. 122 00:08:34,160 --> 00:08:36,990 with instance of pregel, with the number of ver, with 123 00:08:36,990 --> 00:08:38,990 the, with the list of vertices and the number of 124 00:08:38,990 --> 00:08:42,240 workers that you're trying to simulate, then you say run 125 00:08:42,240 --> 00:08:46,870 and and then return the values of all the vertices. 126 00:08:46,870 --> 00:08:47,790 Right. 127 00:08:47,790 --> 00:08:51,660 the main program essentially does, just creates vertices, creates edges. 128 00:08:51,660 --> 00:08:53,240 There's a, there's a function 129 00:08:53,240 --> 00:08:59,310 below there which creates a random matrix, a random graph of ten vertices. 130 00:08:59,310 --> 00:09:03,450 With each vertex having four random edges and then you simply, say page 131 00:09:03,450 --> 00:09:08,740 rank Pregel which is this function on the number of vertices and we're done. 132 00:09:08,740 --> 00:09:14,370 Let's see how that actually works. If I call that, a program it computes, 133 00:09:14,370 --> 00:09:19,910 the page rank of a random craft. 134 00:09:19,910 --> 00:09:22,980 I have no idea whether this is a good page rank. 135 00:09:22,980 --> 00:09:25,110 so there is actually a test provided. 136 00:09:25,110 --> 00:09:31,300 So what the test does is it uses a closed form formula for the page rank which 137 00:09:31,300 --> 00:09:36,090 is not an easy formula to compute, because it involves the inverse of a matrix. 138 00:09:36,090 --> 00:09:38,870 But, it's there on Wikipedia if you want to look it up. 139 00:09:38,870 --> 00:09:42,010 it, it computes the test page rank and then 140 00:09:42,010 --> 00:09:45,180 compares the page rank computed using Pregel to the test 141 00:09:45,180 --> 00:09:47,290 page rank and looks at the difference. 142 00:09:47,290 --> 00:09:49,050 So if you print, print, so if you learn this 143 00:09:49,050 --> 00:09:53,430 program after printing the, after uncommenting the the test part. 144 00:09:53,430 --> 00:09:57,240 You see that the difference between the two page rank vectors is very small. 145 00:09:57,240 --> 00:09:58,770 So the, this one was the 146 00:09:58,770 --> 00:10:01,140 Pregel computation upgraded rank, the test computation 147 00:10:01,140 --> 00:10:04,530 which is using the closed form formula and we get the right answer. 148 00:10:06,710 --> 00:10:09,980 Returning now to the Pregel module itself one 149 00:10:09,980 --> 00:10:12,200 interesting exercise that some of you might want 150 00:10:12,200 --> 00:10:14,230 to try is see whether you can actually 151 00:10:15,240 --> 00:10:20,510 write a parallel version of this of this program. 152 00:10:20,510 --> 00:10:24,175 Possibly building on some of the the way mince 153 00:10:24,175 --> 00:10:28,130 meat.python actually built a lightweight parallel implementation of map reduce. 154 00:10:28,130 --> 00:10:31,800 That would be as an interesting contribution to the open source 155 00:10:31,800 --> 00:10:34,040 community, if any of you are interested to try it out.