[MUSIC] So, the MapReduce abstraction has two
different steps, one is the map step and the other one is the reduce step,
thus the name MapReduce. Well in the map step, we're going to
assume that the operation we want to perform is data-parallel over
elements like our documents. And the result of the map call is
going to be to generate a key value pair. So, for example,
in our word count example, we're going to take in a document. And then for every word in that document, we're just going to emit the pair (word,
1). So we'd cycle through our document. If we see the word, uw, we emit (uw, 1). We see the word, machine, we put (machine, 1). Then we happen to see uw again,
so we put (uw, 1). And then we see learning, so we emit (learning, 1), and so on. So in this example the key are the
different words in our vocabulary, uw, machine, learning, and so on, and
the value is just the count of that word. And for this specific implementation, we're emitting just a count of 1
every time we see a word instance. But when we're talking about
the MapReduce abstraction, the value can be any data type. Then for the reduce step here,
this step is going to be one that aggregates over the values
associated with every key. And this must be a commutative,
associative operation. So commutative means that the order
of the operations doesn't matter. So for example, a+ b = b + a. So addition is commutative, and it's also associative which means that
the grouping of operations doesn't matter. So again, addition satisfies this, where a + b,
if we do that first and then add c, that's the same as if we had
done a + the operation of b + c. Okay, and for a reduce step,
we need this operation to be one that's data-parallel over keys. So, let me emphasis what
the object is data-parallel over. In the map step, it's over the elements
like documents in our corpus. But here once we produce
these key value pairs, we take all the keys, and
the operation has to be able to be performed independently for
each key. And so just to make this a little bit more
explicit, in the map step we're saying that the word count in a given document
doesn't depend on any other document. In the reduce step we're saying,
the total count of a given word doesn't depend on the total
count of any other word. So let's go through the reduce example for
word counting. Here what we're going to do is
we're just going to initialize our counts of a given word to be 0. So the reducer takes in the specific
key as well as a list of values. So maybe I'll write this here,
so this is the key, and this is a list of values. And what we're going to do is for
every value in this list, we're going to increment
our count by that value. So in this case, it's a count. It doesn't really always
have to be a count. It could just be sums of values, and then
we're going to emit this key value pair. So here's our key value pair, and in this example this was our key
value pair that was omitted. So in particular if we go along with
this example here, if we were to, Call reduce on the key uw. And associated with
that key we had counts, let's say 1, 17, 0, 0, 12, 2, then what we would emit would be (uw, 32). So MapReduce has a really long
history in functional programming, and somewhat more recently it
was popularized by Google. And then really it gained a lot of
broad adoption through an open-source implementation called Hadoop
that was made by Yahoo. [MUSIC]