Let's now turn to the map-reduce paradigm itself. As we just mentioned, it's a data-parallel paradigm with message passing. It's also a pipelined procedure, where there are two phases: the map phase and the reduce phase. Further, as we also mentioned, it's a higher-level abstraction. Programmers only need to specify what each mapper actually does and what each reducer actually does, and the map-reduce implementation takes care of all the message passing required between the two phases.

In fact, one doesn't even need to worry about how many processors there are, or which one should be doing what kind of job; that's all left to the map-reduce implementation. All the programmer needs to do is define their task in terms of the map-reduce paradigm, which we will come to in a moment, and point the system at the data on which they would like this task performed. The map-reduce implementation then figures out which processors can be made available to this task, and does the rest without the programmer having to worry about the details. That's what makes map-reduce exceedingly powerful for large-scale data-parallel processing: it bridges the gap between the high-level specification of an algorithm and its massively parallel implementation, in a manner that had never really been achieved before in the history of parallel computing.

As we shall see, many complex web intelligence algorithms for machine learning and processing big data are easy to express using the map-reduce paradigm, making their large-scale parallel implementation much easier than if one had to program them using either shared memory or low-level message passing, as was the case before map-reduce.

The most common example used to explain map-reduce is that of counting words in documents, something we saw last week when we computed inverse document frequencies, for example. Suppose we are given many documents. Here there are only ten, but there could be hundreds or thousands or even millions of documents. We need to compute the frequency with which each word occurs across this whole set of documents, and we would like to do it in parallel, using many processors, so as to reduce the overall execution time. In our simple example we will take only ten documents.
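To make this concrete before we walk through the example, here is a minimal sketch, in Python, of the two functions a programmer might supply for word counting. The names map_fn and reduce_fn, and the framework that would invoke them, are illustrative assumptions, not any particular system's API.

```python
from collections import Counter

def map_fn(doc_id, contents):
    # Map task: takes one (doc_id, contents) pair and emits a
    # (word, count) pair for each distinct word in that document.
    for word, count in Counter(contents.split()).items():
        yield (word, count)

def reduce_fn(word, counts):
    # Reduce task: takes all the counts emitted for one word and
    # combines them into a single (word, total) pair.
    yield (word, sum(counts))
```

Everything else (distributing documents to mappers and routing pairs to reducers) is left to the implementation, as we describe next.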
We will use three mapper processors, which execute map tasks, followed by two reducer processors that execute reduce tasks. Of course, once the mapper processors are done, they can start doing the reduce work, so the total number of processors need not be five; it might even be as low as three, or maybe four.

The mappers are given data, in this case documents identified by document IDs, which the map-reduce system randomly distributes across the different mapper processors. Each mapper processor thus gets a bunch of documents. It doesn't really know which documents it's getting; it just gets a bunch of them, in some random order.

Now, here's how the problem is defined in terms of map and reduce tasks. Remember, a mapper simply needs to perform a whole bunch of map tasks on all the documents that it gets. So it's getting a sequence of document IDs, each followed by the contents of that document; in other words, it's getting a key and a value, in a sequence of such key-value pairs. A map task works on one key-value pair, such as a document ID and its contents, and produces a bunch of key-value pairs, in this case words and the frequencies with which they occur in that particular document. Remember that a map task acts on one key-value pair, so here it acts on one document. It is, however, free to emit many key-value pairs as its output, in this case the many words that occur in that document, along with their frequencies. For example, the map task for document D1 would emit a W1 followed by a one, a W2 followed by a one, and a W4 followed by a one, and similarly for the other documents. In some cases, such as document D6, we would emit a W2 followed by a two, but in most other cases it would be a count of one.
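Running the map_fn sketched above over two toy documents shows exactly this emission pattern. The document contents here are invented so that D1 contains w1, w2, and w4 once each, while w2 appears twice in D6, matching the lecture's example.

```python
docs = {'D1': 'w1 w2 w4',   # contents invented for illustration
        'D6': 'w2 w2'}

for doc_id, contents in docs.items():
    print(doc_id, list(map_fn(doc_id, contents)))
# D1 [('w1', 1), ('w2', 1), ('w4', 1)]
# D6 [('w2', 2)]
```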
The second part of our problem definition is the reduce phase, in which a list of key-value pairs corresponding to the same key, in this case the same word, is reduced, or compressed, or combined, into one key-value pair for that particular key. So, for example, all the counts for a particular word are summed up in the reduce phase to create one key-value pair for each word. Now clearly, in order to perform the reduce function, all the key-value pairs for a particular word need to be together if they are to be summed up.

Unfortunately, since each mapper acted on an independent set of documents, it only has access to the word-count pairs for the particular documents it happens to see. The best a mapper can do is compute the reduce function using the pairs it happens to have available. So, for example, this mapper had only the two documents D1 and D2, which each contained a W1, so it computes a (W1, 2) by performing the map and reduce functions on the set of documents available to it. Similarly, this mapper adds up the counts for W2 across all its documents to get a (W2, 4). Of course that is not enough: all these partial sums themselves need to be summed up, and this is where the map-reduce implementation comes in. It sends all the key-value pairs with the same key to the same reducer, so that they are finally all together and can be reduced to get the right answer. So all the pairs having W1 make it to this reducer, where they can be added up to get the correct value for W1: we got a W1 from here, a W1 from here, and a W1 from here, and we added all of these up to get a value of seven. Similarly, the second reducer manages to get all the key-value pairs for each of its words together, so that they can be added up. In this case the first reducer works on pairs where the key is W1 or W2, and the second reducer works on those where the key is W3 or W4.

So now we're ready to define map-reduce formally. Mappers read data in the form of key-value pairs (k1, v1), and for each such (k1, v1) pair a mapper may emit one or more pairs of the form (k2, v2). In our word-counting example, k1 was the document ID and v1 was the contents of the document, whereas k2 was a word and v2 the number of times that word occurred in a particular document. The reduce function, on the other hand, receives all the pairs for some value of k2 and combines them: the key k2, together with all the values that came out of the map phase for that particular value of k2, is combined using a function f_R. In our word-counting example, f_R is just a summation across all the occurrences of a particular word in all the documents that we process.
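Returning to the routing step described above, here is a toy sketch of how an implementation might decide which reducer receives each pair. Hashing the key is one common strategy, though which words then land on which reducer depends on the hash rather than the tidy W1/W2 versus W3/W4 split in the figure. The partial counts below are invented, except that (w1, 2) and (w2, 4) echo the lecture's example, and the w1 partials are chosen to sum to seven.

```python
def partition(word, num_reducers=2):
    # The same word always hashes to the same reducer: the only
    # guarantee the reduce phase needs to see all pairs for a key together.
    return hash(word) % num_reducers

# Partial (word, count) pairs emitted by the mappers (values illustrative):
mapper_outputs = [('w1', 2), ('w2', 4), ('w1', 3), ('w3', 1), ('w1', 2)]

bins = {r: [] for r in range(2)}
for word, count in mapper_outputs:
    bins[partition(word)].append((word, count))

# Each reducer can now sum the counts it holds; the reducer that
# receives the w1 pairs computes 2 + 3 + 2 = 7.
```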
Some of you might have noticed that in our example we actually did some of the reduce work within the mappers; we'll come to that point in a short while. For the moment, the strict definition of a map function is simply a map: from one key-value pair, produce one or more key-value pairs. And the strict definition of a reduce function is: take all the key-value pairs for a particular value of the key, and apply the reduce function f_R to them.

The map-reduce platform is responsible for routing pairs to reducers, so that all the pairs for the same key end up at the same reducer. Map-reduce essentially sorts the key-value pairs that come out of the map phase, so that they land up in the right place.

Finally, a very important point to note is that the map-reduce process reads data, possibly from a file system or a database, and writes fresh data. So it's a batch process that always produces more data; very different from a database, where one is querying data and just getting an answer back.
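Putting the pieces together, here is a single-machine sketch, again illustrative rather than a real API, of the whole batch pipeline: it reads one body of data (the documents) and writes fresh data (the word counts), with an explicit sort standing in for the shuffle the platform performs between the two phases. It reuses the map_fn and reduce_fn sketched earlier.

```python
import itertools

def map_reduce(docs, map_fn, reduce_fn):
    # Map phase: one map task per (doc_id, contents) pair.
    pairs = [kv for doc_id, contents in docs.items()
                for kv in map_fn(doc_id, contents)]
    # Shuffle: sorting by key brings all pairs for the same word together,
    # mimicking the routing/sorting the platform does between phases.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: collapse each word's counts into a single pair.
    output = {}
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        counts = [count for _, count in group]
        for key, total in reduce_fn(word, counts):
            output[key] = total
    return output  # fresh data written out, not an answer to a query
```

For example, map_reduce({'D1': 'w1 w2 w4', 'D6': 'w2 w2'}, map_fn, reduce_fn) returns {'w1': 1, 'w2': 3, 'w4': 1}.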