Now let's step back and summarize MapReduce in terms of the execution overview. In particular, we assume we have some large data set, and we're going to split this data across machines. Then, on each machine, in our map phase, we're going to do some operation that's data parallel, so it operates separately on the pieces of data divided across these different machines, and emits some key-value pairs. So, for example, maybe we emit a key-value pair ("uw", 1). That might be what k1 and v1 represent: k1 the word "uw" and v1 the number 1, in the example that we showed on the last slide.

And then, after emitting all these key-value pairs, we do what's called the shuffle phase. The shuffle phase is where we take these key-value pairs and assign all of them that have the same key to the same machine for the reduce phase. Every key-value pair is going to get assigned to machine h(key), where h is a hash function.

So then, in the reduce phase, there's going to be some aggregation step where, for every key, we look at all the values associated with that key. That is, we look at these key-value pairs, and for a given key we aggregate across all of those values. And so, for example, in our word-counting example this might produce ("uw", 30). That might be what k1, v1 represents here.

But I want to mention that in our map phase, the keys that appear across the different machines are not necessarily unique. The same key may appear on multiple machines, whereas in the reduce phase we end up with unique keys on every machine. And that's why, in terms of our notation here, one side is written k1, k2 and the other k1 prime, k2 prime, and so on. The primes indicate that a key could take the same value as it did on another machine, whereas when we just write k1, k2, k3, k4, k5, k6, and so on, these are each unique values from the entire set of possible keys. Okay, and remember that this reduce phase has to be an operation that's data parallel across keys.

And then finally, the last step is that you're going to save your results to disk.
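To make the execution flow concrete, here is a minimal sketch in Python of the three phases for the word-count example, run in a single process rather than on a real cluster. The number of machines and the helper names (map_phase, shuffle, reduce_phase, h) are illustrative assumptions, not part of any particular MapReduce framework.

```python
from collections import defaultdict

NUM_MACHINES = 10  # hypothetical number of reduce machines

def map_phase(document):
    """Map: data parallel over documents; emit (word, 1) for every word."""
    for word in document.split():
        yield (word, 1)

def h(key):
    """Hash function assigning a key to a reduce machine.
    (Python's hash() is consistent within a single run, which is all we need here.)"""
    return hash(key) % NUM_MACHINES

def shuffle(emitted_pairs):
    """Shuffle: route every (key, value) pair to machine h(key)."""
    machines = defaultdict(list)
    for key, value in emitted_pairs:
        machines[h(key)].append((key, value))
    return machines

def reduce_phase(pairs_on_machine):
    """Reduce: data parallel over keys; aggregate all values for each key."""
    counts = defaultdict(int)
    for key, value in pairs_on_machine:
        counts[key] += value
    return dict(counts)

# Example: a few documents processed in the map phase, then shuffled and reduced.
documents = ["uw uw go dawgs", "uw stat uw ml"]
emitted = [pair for doc in documents for pair in map_phase(doc)]
for machine_id, pairs in shuffle(emitted).items():
    print(machine_id, reduce_phase(pairs))  # each reduce machine sees unique keys
```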
So, this is the summary of the standard MapReduce execution flow. But I quickly wanted to mention that we can often improve performance through something called combiners. This naive implementation of MapReduce can be really wasteful in terms of communication in the shuffle phase, where you're taking all these key-value pairs and sending them to different machines. For example, if you have a large set of documents and some reasonable vocabulary, you're going to have lots and lots of these counts of (word, 1), (word, 1), (word, 1), and all of those have to be sent to some machine, so that's a lot of communication.

Just to make this really explicit, let's say we're looking at machine 1, we're counting words in all the documents that were allocated to machine 1, and we emit ("uw", 1). We end up with, let's say, a whole bunch of "uw" words in the documents on this machine. To make this concrete, let's say we end up with 4,002 instances of ("uw", 1) on machine 1. And all of these are, let's say, going to be shuffled to machine 7 for our reduce phase. This would happen if h("uw"), our hash function applied to the key "uw", is equal to 7. That's how we would shuffle all of these instances to machine 7.

Okay, well, a combiner represents a very simple solution to this problem, which is to perform a reduce step before sending things over to another machine. So, we perform this reduce locally on each of the machines in the map phase, before doing the shuffle and the subsequent reduce. In particular, what we would do is take machine 1 and all these counts of ("uw", 1), run reduce locally, which would emit ("uw", 4002), and then send just this one pair ("uw", 4002) over to machine 7. And note that the reason this works, the fact that we can do this local reduce and then do another reduce after the shuffle from the other machines, is because we have a commutative, associative reduce step.
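As a hedged illustration of the combiner idea, the sketch below (again plain Python, not a framework API) shows machine 1's 4,002 copies of ("uw", 1) being collapsed by running the same summation-style reduce locally before the shuffle, so only a single pair has to travel to machine 7. This is only safe because the reduce operation, summation, is commutative and associative.

```python
from collections import defaultdict

def combine(local_pairs):
    """Combiner: run the reduce step locally on one map machine
    before the shuffle, collapsing repeated keys into a single pair."""
    combined = defaultdict(int)
    for key, value in local_pairs:
        combined[key] += value  # valid because + is commutative and associative
    return list(combined.items())

# Machine 1 emitted 4,002 copies of ("uw", 1) in the map phase.
machine_1_output = [("uw", 1)] * 4002

# Without a combiner: 4,002 pairs cross the network to machine 7.
# With a combiner: a single pre-aggregated pair is sent instead.
to_send = combine(machine_1_output)
print(to_send)  # [('uw', 4002)]
```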