Let's now turn to the map-reduce paradigm itself. As we just mentioned, it's a data parallel paradigm with message passing. It's also a pipelined procedure with two phases: the map phase and the reduce phase. Further, as we also mentioned, it's a higher-level abstraction, so programmers only need to specify what each mapper actually does and what each reducer actually does, and the map-reduce implementation takes care of all the message passing required between the two phases. In fact, one doesn't even need to worry about how many processors there are, or which one should be doing what kind of job; that's all left to the map-reduce implementation. All the programmer needs to do is define their task in terms of the map-reduce paradigm, which we'll come to in a moment, and point the system at the data on which they would like this task performed. The map-reduce implementation figures out which processors can be made available to the task, and does the rest without the programmer having to worry about the details. That's what makes map-reduce exceedingly powerful for large-scale data parallel processing: it bridges the gap between the high-level specification of an algorithm and its massively parallel implementation, in a manner that had never been successfully done before in the history of parallel computing. As we shall see, many complex web intelligence algorithms for machine learning and processing big data are easy to express using the map-reduce paradigm, making their large-scale parallel implementation much easier than if one had to program them using either shared memory or low-level message passing, as was the case before map-reduce.

The most common example used to explain map-reduce is that of counting words in documents, something we saw last week when we computed inverse document frequencies, for example. Suppose we are given many documents. Say there are only ten documents here, but there could be hundreds or thousands or even millions of documents, and we need to compute the frequency of each word across this whole set of documents. We'd like to do it in parallel, using many processors, so as to reduce the overall execution time. In our simple example we will take only ten documents, three mapper processors which execute map tasks, followed by two reducer processors that execute reduce tasks. Of course, once the mapper processors are done they can start doing the reduce work, so the total number of processors need not be five; it might be as few as three, or maybe four. The mappers are given data, in this case documents identified by document IDs, which the map-reduce system randomly distributes across the different mapper processors. Each mapper processor thus gets a bunch of documents. It doesn't really know which documents it's getting; it just gets a bunch of them in some random order.

Now, here's how the problem is defined in terms of map and reduce tasks. Remember, a mapper simply needs to perform a whole bunch of map tasks on all the documents that it gets. So it's getting a sequence of document IDs, each followed by the contents of the document. In other words, it's getting keys and values, as a sequence of key-value pairs. A map task works on one key-value pair, such as a document ID and its contents, and produces a bunch of key-value pairs, in this case words and the frequencies with which they occur in that particular document.
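To make this concrete, here is a minimal sketch of such a map task in Python. The function name map_task and the whitespace tokenization are my own illustrative choices, not part of any particular map-reduce framework:

```python
from collections import Counter

def map_task(doc_id, contents):
    """One map task: takes one (document ID, contents) key-value pair
    and emits one (word, count) pair per distinct word in that document."""
    for word, count in Counter(contents.split()).items():
        yield (word, count)

# For a document D1 containing the words w1, w2, and w4 once each,
# this would emit ("w1", 1), ("w2", 1), ("w4", 1).
```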
Remember that a map task acts on one key-value pair, so in this case it acts on one document. It is, however, free to emit many key-value pairs as its output, in this case the many words that occur in that document along with their frequencies. For example, the map task for document D1 would emit a W1 followed by a one, a W2 followed by a one, and a W4 followed by a one, and similarly for the other documents. In some cases, such as the sixth document, we would emit a W2 followed by a two, but in most other cases it would be a count of one.

The second part of our problem definition is the reduce phase, in which the list of key-value pairs corresponding to the same key, in this case a word, is reduced, or compressed, or combined, into one key-value pair for that particular key. So, for example, all the counts for a particular word are summed up in the reduce phase to create one key-value pair for each word. Now clearly, in order to perform the reduce function, all the key-value pairs for a particular word need to be together if they are to be summed up. Unfortunately, since each mapper acted on an independent set of documents, it only has access to the word-count pairs for the particular documents it happens to see. The best that a mapper can do is compute the reduce function using the pairs that it happens to have available. So, for example, this mapper only had the two documents D1 and D2 which contained a W1, so it computes a W1 comma two by performing the map and reduce functions on the set of documents available to it. Similarly, this mapper adds up the counts for W2 across all its documents to get a value of W2, four.

Of course that is not enough; all these partial summations themselves need to be summed up, and this is where the map-reduce implementation comes in. It sends all of the key-value pairs with the same key to the same reducer, so that they are all together and can finally be reduced to get the right answer. So, all the pairs having W1 manage to get to this reducer, where they can be added up to get the correct value for W1: we got a W1 from here, a W1 from here, and a W1 from here, and we added all of these up to get a value of seven. Similarly, the second reducer also manages to get all the key-value pairs for each of its words together, so that they can be added up. In this case the first reducer works on pairs where the keys are W1 and W2, and the second reducer works on those where the key is W3 or W4.

So now we're ready to define map-reduce formally. Mappers read data in the form of key-value pairs (K1, V1), and for each such (K1, V1) pair, a mapper may emit one or more pairs of the form (K2, V2). In our word-counting example, K1 was the document ID and V1 was the contents of the document, whereas K2 was a word and V2 the number of times that word occurred in a particular document. The reduce function, on the other hand, receives all the pairs for some value of K2 and combines them: the reduce of K2 and all the values that came out of the map phase for that particular value of K2 are combined using a function F_R. In our word-counting example, F_R is just a summation across all the occurrences of a particular word in all the documents that we process. Some of you might have noticed that in our example we actually did some reduce work within the mappers, but we'll come to that point in a short while.
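Continuing the sketch, the shuffle and reduce steps might look like this in Python. The names shuffle and reduce_task are again illustrative, and a real implementation would do this routing across machines rather than in a single in-memory dictionary:

```python
from collections import defaultdict

def shuffle(mapped_pairs):
    """The platform's job: group all (key, value) pairs by key, so that
    every pair for a given word lands at the same reducer."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_task(word, counts):
    """One reduce task: combine all values for one key using F_R,
    which for word counting is just summation."""
    return (word, sum(counts))
```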
For the moment, the strict definition of a map function is just that: from one key-value pair, produce one or more key-value pairs. And the reduce function is: take all the key-value pairs for a particular value of the key, and apply the reduce function to them. The map-reduce platform is responsible for routing pairs to reducers, so that all the pairs for the same key end up at the same reducer; map-reduce essentially sorts the key-value pairs that come out of the map phase, so that they land up in the right place. Finally, a very important point to note is that the map-reduce process reads data, possibly from a file system or a database, and writes fresh data. So it's a batch process that always produces more data, very different from a database, where one is querying data and just getting an answer.
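Putting the pieces together, here is a single-process sketch of the whole batch, reusing map_task, shuffle, and reduce_task from the sketches above. It illustrates only the dataflow of the two phases, not the parallelism that a real map-reduce implementation provides:

```python
def map_reduce(documents, map_fn, reduce_fn):
    # Map phase: apply the map function to every input key-value pair.
    mapped = []
    for doc_id, contents in documents.items():
        mapped.extend(map_fn(doc_id, contents))
    # Shuffle: route all pairs with the same key into the same group.
    grouped = shuffle(mapped)
    # Reduce phase: one reduce task per distinct key.
    return dict(reduce_fn(key, values) for key, values in grouped.items())

# Hypothetical three-document input, echoing the lecture's example.
docs = {"D1": "w1 w2 w4", "D2": "w1 w3", "D6": "w2 w2 w3"}
print(map_reduce(docs, map_task, reduce_task))
# {'w1': 2, 'w2': 3, 'w4': 1, 'w3': 2}
```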