[MUSIC] Now let's step back and summarize MapReduce in terms of the execution overview. In particular, we assume we have some large data set, and we're going to split this data across machines. Then on each machine, in our map phase, we're going to do some operation that's data parallel, so it splits across the data divided onto these different machines, and emit some key-value pairs. So for example, maybe we emit a key-value pair ("uw", 1). That might be what k1 represents, "uw", and v1 the number 1, in the example that we showed on the last slide.

After emitting all these key-value pairs, we do what's called the shuffle phase. The shuffle phase is where we take these key-value pairs and assign all of the pairs that have the same key to the same machine for the reduce phase. Every key-value pair gets assigned to machine h(key), where h is a hash function.

Then, in the reduce phase, there's going to be some aggregation step where, for every key, we look at all of the values associated with that key. So we look at these key-value pairs and, for a given key, we aggregate across all of those values. For example, in our word counting this might produce ("uw", 30); that might be what k1, v1 represent here. But I want to mention that in the map phase, the keys that appear across the different machines are not necessarily unique: the same key may appear on multiple machines. In the reduce phase, on the other hand, we end up with unique keys on every machine. That's why, in terms of our notation, the map phase uses k1', k2', and so on; the primes indicate that a key on one machine could take the same value as a key on another machine. In the reduce phase we just write k1, k2, k3, k4, k5, k6, and so on, because these are each unique values from the entire set of possible keys. And remember that this reduce phase has to be an operation that's data parallel across keys. Then finally, the last step is that you save your results to disk. So this is the summary of the standard MapReduce execution flow.

But I quickly wanted to mention that we can often improve performance through something called combiners, because this naive implementation of MapReduce can be really wasteful in terms of communication in the shuffle phase, where you're taking all these key-value pairs and sending them to different machines. For example, if you have a large set of documents and some reasonable vocabulary, you're going to have lots and lots of these counts of (word, 1), (word, 1), (word, 1), and all of those have to be sent to a machine, so that's a lot of communication. Just to make this really explicit, let's say we're looking at machine 1, we're counting words in all the documents that were allocated to machine 1, and we emit ("uw", 1) for every occurrence. Let's say we end up with a whole bunch of "uw" words in the documents on this machine; to make this concrete, say 4,002 instances of ("uw", 1) on machine 1. All of these would be shuffled to machine 7 for our reduce phase if h("uw"), our hash function applied to the key "uw", is equal to 7. That's how we would shuffle all of these instances to machine 7.
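Before turning to the combiner, here is a minimal, single-process sketch of this naive word-count flow, just to make the three phases concrete. The names map_phase, shuffle, reduce_phase, and NUM_MACHINES are invented for this illustration; a real framework such as Hadoop would carry out the distribution and shuffle across actual machines.

```python
from collections import defaultdict

NUM_MACHINES = 10  # assumed cluster size, purely for illustration

def map_phase(documents):
    """Map: for every word in every document, emit the pair (word, 1)."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: route each pair to machine h(key), so every pair with the
    same key lands on the same machine."""
    machines = defaultdict(list)
    for key, value in pairs:
        machines[hash(key) % NUM_MACHINES].append((key, value))
    return machines

def reduce_phase(pairs_on_one_machine):
    """Reduce: aggregate (here, sum) all values associated with each key."""
    counts = defaultdict(int)
    for key, value in pairs_on_one_machine:
        counts[key] += value
    return dict(counts)

docs = ["uw uw cse", "uw data"]
shuffled = shuffle(map_phase(docs))
word_counts = {}
for machine_pairs in shuffled.values():
    word_counts.update(reduce_phase(machine_pairs))  # keys are unique per reduce machine
print(word_counts)  # e.g., {'uw': 3, 'cse': 1, 'data': 1}
```

In a real cluster, the per-machine lists built inside shuffle would be sent over the network, one pair at a time, which is exactly the communication cost the combiner is designed to cut down.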
Okay, well, a combiner represents a very simple solution to this problem, which is to perform a reduce step before sending things over to another machine. That is, we perform this reduce locally on each one of the machines in the map phase, before doing the shuffle and the subsequent reduce. In particular, what we would do is take our machine 1 and all of its counts of ("uw", 1), call reduce on them locally so that the call emits ("uw", 4002), and then send just this one pair ("uw", 4002) over to machine 7. And note that the reason this works, the fact that we can do this local reduce and then do another reduce after we shuffle from the other machines, is that we have a commutative and associative reduce step. [MUSIC]
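To make the combiner concrete, here is a small sketch under the same toy setup as above. The combiner function is hypothetical; it simply performs the summation locally on one machine before the shuffle, which is safe only because summation is commutative and associative.

```python
from collections import defaultdict

def combiner(local_pairs):
    """Local reduce on one machine: collapse its (word, 1) pairs into a single
    (word, partial_count) pair per word before anything crosses the network."""
    partial = defaultdict(int)
    for key, value in local_pairs:
        partial[key] += value
    return list(partial.items())

# Say machine 1's map phase produced 4,002 copies of ('uw', 1):
machine_1_output = [("uw", 1)] * 4002
print(combiner(machine_1_output))  # [('uw', 4002)] -- one pair is shuffled instead of 4,002
```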