[MUSIC]

So, the MapReduce abstraction has two different steps: one is the map step and the other is the reduce step, thus the name MapReduce.

In the map step, we're going to assume that the operation we want to perform is data-parallel over elements, like our documents, and the result of each map call is going to be a set of key-value pairs. So, for example, in our word count example, we're going to take in a document, and then for every word in that document we're just going to emit the pair (word, 1). We'd cycle through our document: if we see the word "uw", we emit (uw, 1); if we see the word "machine", we emit (machine, 1); then we happen to see "uw" again, so we emit (uw, 1); and then we see "learning", so we emit (learning, 1), and so on. So in this example the keys are the different words in our vocabulary (uw, machine, learning, and so on), and the value is just the count of that word. For this specific implementation, we're emitting just a count of 1 every time we see a word instance, but when we're talking about the MapReduce abstraction, the value can be any data type.

Then for the reduce step, this step is going to be one that aggregates over the values associated with every key, and this must be a commutative, associative operation. Commutative means that the order of the operands doesn't matter; so, for example, a + b = b + a, and addition is commutative. It's also associative, which means that the grouping of operations doesn't matter. Again, addition satisfies this, since (a + b) + c = a + (b + c).

Okay, and for the reduce step, we need this operation to be one that's data-parallel over keys. So let me emphasize what each step is data-parallel over. In the map step, it's over the elements, like the documents in our corpus. But here, once we produce these key-value pairs, we take all the keys, and the operation has to be able to be performed independently for each key. And just to make this a little bit more explicit: in the map step we're saying that the word count in a given document doesn't depend on any other document.
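To make the map step concrete, here is a minimal Python sketch of a word-count mapper, under the assumption that a document is just a string of whitespace-separated words; the name word_count_map is illustrative and not tied to any particular MapReduce framework.

def word_count_map(document):
    # Emit a (word, 1) pair for every word instance in the document.
    # The value is a count of 1 here, but in general it could be any data type.
    pairs = []
    for word in document.split():
        pairs.append((word, 1))
    return pairs

# For example, mapping the document "uw machine uw learning" yields
# [("uw", 1), ("machine", 1), ("uw", 1), ("learning", 1)],
# and each document can be mapped independently of every other document.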
In the reduce step we're saying that the total count of a given word doesn't depend on the total count of any other word.

So let's go through the reduce example for word counting. Here what we're going to do is initialize our count for a given word to 0. The reducer takes in a specific key as well as a list of values, and for every value in that list, we're going to increment our count by that value. In this case it's a count, but it doesn't always have to be a count; it could be any sum of values. Then we're going to emit this key-value pair, and in this example this is the key-value pair that gets emitted. So in particular, going along with this example, if we were to call reduce on the key uw, and associated with that key we had the counts, let's say, 1, 17, 0, 0, 12, 2, then what we would emit would be (uw, 32). (A short code sketch of this reducer appears at the end of this section.)

So MapReduce has a really long history in functional programming, and somewhat more recently it was popularized by Google. And then it really gained broad adoption through an open-source implementation called Hadoop that was made by Yahoo.

[MUSIC]
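Along the same lines, here is a minimal Python sketch of the reducer described above, again with illustrative names only; it assumes the framework has already grouped all the values emitted for a single key into one list.

def word_count_reduce(word, values):
    # Sum the values emitted for this word. Because addition is commutative
    # and associative, the values can be combined in any order or grouping,
    # and each key can be reduced independently of every other key.
    count = 0
    for value in values:
        count += value
    return (word, count)

# The example from the lecture:
print(word_count_reduce("uw", [1, 17, 0, 0, 12, 2]))  # -> ('uw', 32)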