[MUSIC] Now let's step back and summarize MapReduce in terms of the execution overview. In particular, we assume we have some large data set, and we're going to split this data across machines. Then on each machine, in our map phase, we're going to do some operation that's data parallel, so it splits across the data divided onto these different machines, and emit some key-value pairs. So for example, maybe we emit a key-value pair ("uw", 1). That might be what k1 represents, "uw", and v1 the number 1, in the example that we showed on the last slide.

After emitting all these key-value pairs, we do what's called the shuffle phase. The shuffle phase is where we take these key-value pairs and assign all of the pairs that have the same key to the same machine for the reduce phase. Every key-value pair gets assigned to machine h(key), where h is a hash function.

Then, in the reduce phase, there's going to be some aggregation step where, for every key, we look at all of the values associated with that key. So we look at these key-value pairs and, for a given key, we aggregate across all of those values. For example, in our word counting this might produce ("uw", 30); that might be what k1, v1 represent here. But I want to mention that in the map phase, the keys that appear across the different machines are not necessarily unique: the same key may appear on multiple machines. In the reduce phase, on the other hand, we end up with unique keys on every machine. That's why, in terms of our notation, the map phase uses k1', k2', and so on; the primes indicate that a key on one machine could take the same value as a key on another machine. In the reduce phase we just write k1, k2, k3, k4, k5, k6, and so on, because these are each unique values from the entire set of possible keys. And remember that this reduce phase has to be an operation that's data parallel across keys. Then finally, the last step is that you save your results to disk. So this is the summary of the standard MapReduce execution flow.

But I quickly wanted to mention that we can often improve performance through something called combiners, because this naive implementation of MapReduce can be really wasteful in terms of communication in the shuffle phase, where you're taking all these key-value pairs and sending them to different machines. For example, if you have a large set of documents and some reasonable vocabulary, you're going to have lots and lots of these counts of (word, 1), (word, 1), (word, 1), and all of those have to be sent to a machine, so that's a lot of communication. Just to make this really explicit, let's say we're looking at machine 1, we're counting words in all the documents that were allocated to machine 1, and we emit ("uw", 1) for every occurrence. Let's say we end up with a whole bunch of "uw" words in the documents on this machine; to make this concrete, say 4,002 instances of ("uw", 1) on machine 1. All of these would be shuffled to machine 7 for our reduce phase if h("uw"), our hash function applied to the key "uw", is equal to 7. That's how we would shuffle all of these instances to machine 7.
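Before turning to the combiner, here is a minimal, single-process sketch of this naive word-count flow, just to make the three phases concrete. The names map_phase, shuffle, reduce_phase, and NUM_MACHINES are invented for this illustration; a real framework such as Hadoop would carry out the distribution and shuffle across actual machines.

```python
from collections import defaultdict

NUM_MACHINES = 10  # assumed cluster size, purely for illustration

def map_phase(documents):
    """Map: for every word in every document, emit the pair (word, 1)."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: route each pair to machine h(key), so every pair with the
    same key lands on the same machine."""
    machines = defaultdict(list)
    for key, value in pairs:
        machines[hash(key) % NUM_MACHINES].append((key, value))
    return machines

def reduce_phase(pairs_on_one_machine):
    """Reduce: aggregate (here, sum) all values associated with each key."""
    counts = defaultdict(int)
    for key, value in pairs_on_one_machine:
        counts[key] += value
    return dict(counts)

docs = ["uw uw cse", "uw data"]
shuffled = shuffle(map_phase(docs))
word_counts = {}
for machine_pairs in shuffled.values():
    word_counts.update(reduce_phase(machine_pairs))  # keys are unique per reduce machine
print(word_counts)  # e.g., {'uw': 3, 'cse': 1, 'data': 1}
```

In a real cluster, the per-machine lists built inside shuffle would be sent over the network, one pair at a time, which is exactly the communication cost the combiner is designed to cut down.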
Okay, well, a combiner represents a very simple solution to this problem, which is to perform a reduce step before sending things over to another machine. That is, we perform this reduce locally on each one of the machines in the map phase, before doing the shuffle and the subsequent reduce. In particular, what we would do is take our machine 1 and all of its counts of ("uw", 1), call reduce on them locally so that the call emits ("uw", 4002), and then send just this one pair ("uw", 4002) over to machine 7. And note that the reason this works, the fact that we can do this local reduce and then do another reduce after we shuffle from the other machines, is that we have a commutative and associative reduce step. [MUSIC]
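To make the combiner concrete, here is a small sketch under the same toy setup as above. The combiner function is hypothetical; it simply performs the summation locally on one machine before the shuffle, which is safe only because summation is commutative and associative.

```python
from collections import defaultdict

def combiner(local_pairs):
    """Local reduce on one machine: collapse its (word, 1) pairs into a single
    (word, partial_count) pair per word before anything crosses the network."""
    partial = defaultdict(int)
    for key, value in local_pairs:
        partial[key] += value
    return list(partial.items())

# Say machine 1's map phase produced 4,002 copies of ('uw', 1):
machine_1_output = [("uw", 1)] * 4002
print(combiner(machine_1_output))  # [('uw', 4002)] -- one pair is shuffled instead of 4,002
```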