You'll recall that we began our discussion of parallel computing with the notion of parallel efficiency, so it's worth figuring out the parallel efficiency of MapReduce. Let's assume that the total data emitted by all the mappers put together is D. If we use combiners effectively, D may actually be significantly less than the total amount of data input to the MapReduce program. We also assume there are P processors in total, divided between mappers and reducers, and we'll assume there is no overlap between the two, so that we have P/2 mappers and P/2 reducers.

Clearly parallel efficiency depends on overheads such as communication between processors, and so a natural way to compute the minimum parallel efficiency is to assume that no real useful work is being done by the algorithm itself. Instead, we consider only the efficiency of the MapReduce platform, where all the work it really does is sort P/2 blocks, each of size D divided by P/2, that is 2D/P, and then communicate the results to the different reducers. The actual efficiency of any algorithm using MapReduce will be at least this much, provided that the algorithm requires this sorting process for its completion. That need not always be the case, but for the moment we assume that it is.

Now let's consider the communication costs. The 2D/P of data in each mapper's block has to be distributed across the P/2 reducers. Each reducer gets one (P/2)-th of each mapper's data, so each reducer gets about 4D/P² from each mapper, and this has to be collected from all P/2 mappers at each reducer. This requires P/2 separate steps of communication.
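To make this data accounting concrete, here is a minimal sketch under the assumptions above (the function name and variable names are illustrative, not from any MapReduce API), checking that each reducer ends up collecting 2D/P of data in total:

```python
def shuffle_sizes(D, P):
    """Data accounting for the shuffle phase with P/2 mappers and P/2 reducers.

    D is the total data emitted by all mappers; P is the total processor count.
    """
    per_mapper = D / (P / 2)           # each mapper holds 2D/P of emitted data
    per_pair = per_mapper / (P / 2)    # one mapper sends 4D/P^2 to one reducer
    per_reducer = per_pair * (P / 2)   # collected from all P/2 mappers: 2D/P
    return per_mapper, per_pair, per_reducer

per_mapper, per_pair, per_reducer = shuffle_sizes(D=1e9, P=100)
print(per_mapper, per_pair, per_reducer)  # 2e7, 4e5, 2e7
```

Note that per_reducer equals per_mapper: the shuffle merely redistributes the same total volume of data, which is why the communication cost per processor comes out to 2D/P.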
So you multiply 4D/P² by P/2 to get 2D/P as the amount of data communicated per reducer; multiplied by a per-unit communication constant, say σ, this is the communication cost. The serial cost is, of course, the cost of sorting P/2 blocks of size 2D/P each, which works out to a constant multiplier c times D log(D/P). The efficiency formula has the serial time in the numerator, and in the denominator P times the parallel time, which is the serial time divided by P plus the communication cost:

efficiency = c D log(D/P) / (c D log(D/P) + 2 σ D) = 1 / (1 + 2σ / (c log(D/P))),

that is, one over one plus twice the communication constant divided by the work constant times log(D/P). Notice that we have liberally dropped constants, such as the two that might occur inside the log, because it only adds a constant additive factor; we have also been fairly liberal about factors of two here and there, just to get the form of this equation right.

Remember what we said about an algorithm being scalable: if efficiency approaches one as D/P grows, the algorithm is said to be scalable. As D/P becomes very large, the term 2σ/(c log(D/P)) becomes extremely small, and efficiency does indeed become close to one. So the MapReduce platform is scalable in this sense. The caveat, of course, is that the algorithm we run on this platform must actually require this serial work; in other words, there must be a need for sorting of some kind in the algorithm. This is not always the case. For example, there are many tasks that can just as well be performed by only a map phase, requiring no reduce and no sorting. Such tasks may be called embarrassingly parallel, in which case simply using the MapReduce platform adds overhead that is avoidable.
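The scalability claim can be sketched numerically from the closed form above; here sigma and c are hypothetical constants standing in for the per-unit communication and sort costs:

```python
import math

def mapreduce_efficiency(D, P, sigma=1.0, c=1.0):
    """Efficiency = 1 / (1 + 2*sigma / (c * log(D/P))), as derived above."""
    return 1.0 / (1.0 + 2.0 * sigma / (c * math.log(D / P)))

# Holding P fixed and growing D/P, the overhead term 2*sigma/(c*log(D/P))
# shrinks and efficiency approaches one: the platform is scalable.
for dp in (1e2, 1e6, 1e12):
    print(f"D/P = {dp:g}: efficiency = {mapreduce_efficiency(D=dp * 64, P=64):.3f}")
```

With these illustrative constants, efficiency rises from roughly 0.70 at D/P = 100 toward 0.93 at D/P = 10¹², consistent with the limit of one as D/P grows.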