We've heard about PageRank in the very early lectures, as a way of finding which web page is most important. PageRank is really a property of a graph and its structure. This is an example of one of the ways of computing PageRank, and it's a bit complicated, so I'll go through it slowly. The goal is to compute a page rank for every vertex in a graph: if there are N vertices in the graph, there will be N values p_i, one for every vertex. Further, every vertex has some number of outgoing edges, so l_j is the number of outgoing edges that vertex j has. Now, one way to compute PageRank is to start off with some initial value of p at every vertex; you could make everything one, or a half, but typically 1/N is taken as the starting point. Then you repeatedly do this kind of update: p_i at the next step becomes (1 - d)/N plus d times the sum, over all vertices j that point to i, of p_j divided by l_j, the number of links going out of j. Typically d is something like 0.85, so the first term becomes 0.15/N. The intuition is this: suppose there is a vertex j pointing to i that has a huge number of outgoing edges; then i is just one of millions of destinations for j, and j's page rank counts for little when it flows into i. However, if i is the only vertex that j points to, then l_j is one, and j's importance counts for much more as it flows into i. In other words, if I'm being linked to from a page, its importance flows to me in proportion to how important it thinks I am, which is measured by what fraction of its total outgoing edges points to me: each vertex divides its page rank equally along all its outgoing edges, and I get that fraction of it. If you iterate this again and again, ultimately these p's converge; they settle down into fixed values and stop changing after some time, and that value is the page rank of vertex i, and similarly for vertex j, and so on. It's also a global iteration, because all the vertices are exchanging messages. Now let's see how you would do this in Pregel. From the point of view of a vertex, I have some estimate of my page rank, initially some starting value. I send my value divided by l, the number of outgoing edges that I have, to every vertex I can see on an outgoing edge, and then I look for incoming messages coming from other vertices and carry out this iteration. In other words, I collect all the incoming messages arriving on my incoming edges, add them up, multiply the sum by d, add (1 - d)/N, and replace p with this value, as per the formula. After a fixed number of such supersteps, for example 30, which is often sufficient, we simply stop, and the page rank should have converged. Alternatively, we could conduct a convergence test to make sure that successive p values are hardly changing at all, but then we have to do a little more than is provided by the Pregel model we have covered so far: we have to check whether the maximum difference between successive values across all vertices is small. And that is the process of aggregation.
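To make the update concrete, here is a minimal sketch of that iteration in plain Python (not the Pregel version yet). The graph representation and the names pagerank, graph, and num_iters are my own for illustration, not taken from the lecture code:

    def pagerank(graph, d=0.85, num_iters=30):
        # graph maps each vertex j to the list of vertices j points to;
        # assumes every vertex has at least one outgoing edge.
        n = len(graph)
        p = {v: 1.0 / n for v in graph}            # start every vertex at 1/N
        for _ in range(num_iters):
            new_p = {}
            for i in graph:
                # sum of p_j / l_j over all vertices j that point to i
                incoming = sum(p[j] / len(graph[j]) for j in graph if i in graph[j])
                new_p[i] = (1 - d) / n + d * incoming
            p = new_p                               # values settle down after enough iterations
        return p

Starting at 1/N and iterating a fixed number of times, here 30, matches what we just described; a convergence test would replace the fixed iteration count.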
Pregel allows for aggregation; we're not going to cover that feature in much detail here, but I'll come back to it shortly. To see how this works, there's a very simple sequential implementation of the Pregel model at this location. I'll show you that code and how it computes the page rank, and I encourage you to study it and modify it to compute a few other things, to get to know Pregel a bit better. Here is the lightweight sequential simulation of Pregel from the link I just mentioned. As you can see, it's a toy single-machine version of Google's Pregel paradigm; clearly it's not for large-scale processing, it's just a simulation. Let's see how it works. It has a class called Vertex, which holds various things, including the value of the vertex as well as a list of outgoing vertices, that is, the IDs of the vertices it points to. It also keeps track of things like incoming messages and the number of supersteps performed. Then there is Pregel, the class which does all the work. It first does things like partitioning the vertices across different workers, and each worker performs supersteps. If you look at the superstep itself, it essentially goes over every partition and tells the worker for that partition to do its work; the worker runs a superstep, and the superstep, at the end of all this class hierarchy, calls vertex update for every vertex that it happens to own in that partition. There is no update function provided in the Vertex class; that's the function that you as a programmer have to write, by subclassing Vertex with a class of your own and providing that update function. I won't go into the detailed implementation of this; those of you who know a little object-oriented programming will find it fairly straightforward to figure out what's happening in this code. What we will do is look at how this basic Pregel module has been used to simulate the page rank program we just described in our charts. If you look at the page rank program using this module, you essentially import the module, then write a page rank vertex class which subclasses, or inherits from, Vertex, and implement the update function. So what does the update function do? Very similar to our chart: if the number of supersteps is less than 50, the vertex changes its own value to 0.15/N plus 0.85 times the sum of everything it gets from the incoming vertices, and then it sends its own page rank value out after dividing by l. Very simple. After 50 steps it stops; there's no convergence test in this example. The way you run this is essentially to create an instance of Pregel with the list of vertices and the number of workers you're trying to simulate, then say run, and then read off the values of all the vertices. The main program just creates vertices and creates edges; there's a function below that which creates a random graph of ten vertices, with each vertex having four random outgoing edges, and then you simply call the page rank Pregel function on those vertices, and we're done. Let's see how that actually works. If I run that program, it computes the page rank of a random graph.
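For reference, the update function in such a page rank vertex subclass looks roughly like the sketch below. I'm assuming attribute names along the lines of what the module keeps per vertex (superstep, incoming_messages, out_vertices, outgoing_messages) and a global num_vertices; check the actual file for the exact names:

    class PageRankVertex(Vertex):
        def update(self):
            # Iterate for 50 supersteps, then go quiet.
            if self.superstep < 50:
                # incoming_messages is assumed to be a list of (sender, value) pairs
                self.value = 0.15 / num_vertices + 0.85 * sum(
                    value for (sender, value) in self.incoming_messages)
                # Send my page rank, split equally, along every outgoing edge.
                share = self.value / len(self.out_vertices)
                self.outgoing_messages = [(v, share) for v in self.out_vertices]
            else:
                self.outgoing_messages = []

This is exactly the update rule from the previous slide: 0.15/N plus 0.85 times the sum of incoming values, with the vertex's own value divided equally among its outgoing edges.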
I have no idea whether this is a good page rank, so there is actually a test provided. What the test does is use a closed-form formula for the page rank, which is not an easy formula to compute because it involves the inverse of a matrix, but it's there on Wikipedia if you want to look it up. It computes the test page rank, then compares the page rank computed using Pregel to the test page rank and looks at the difference. So if you run this program after uncommenting the test part, you see that the difference between the two page rank vectors is very small. This one was the Pregel computation of page rank, and this the test computation using the closed-form formula, and we get the right answer. Returning now to the Pregel module itself, one interesting exercise that some of you might want to try is to see whether you can actually write a parallel version of this program, possibly building on the way mincemeat.py built a lightweight parallel implementation of MapReduce. That would be an interesting contribution to the open-source community, if any of you are interested in trying it out.
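If you want to reproduce such a check yourself, a small sketch of the closed-form computation using numpy could look like this; the function name and the assumption that vertices are numbered 0 to N-1 are mine, not the module's test code:

    import numpy as np

    def pagerank_closed_form(graph, d=0.85):
        # graph maps each vertex index j (0..n-1) to the list of vertices it points to.
        n = len(graph)
        # M[i, j] = 1/l_j if j points to i, else 0 (the link matrix)
        M = np.zeros((n, n))
        for j, outs in graph.items():
            for i in outs:
                M[i, j] = 1.0 / len(outs)
        # p = (1-d)/n * 1 + d * M p  rearranges to  (I - d M) p = (1-d)/n * 1
        return np.linalg.solve(np.eye(n) - d * M, (1 - d) / n * np.ones(n))

Comparing this vector against the Pregel result, for instance by taking the maximum absolute difference between the two, should give a very small number, which is essentially what the provided test reports.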