Finally, let's take a peek inside the real implementations of MapReduce, that is, not the lightweight Octo simulator, but platforms such as Hadoop.

The input files, as we shall see in next week's lecture, are picked up from a distributed file system in chunks by worker nodes, which essentially execute the map phase. They read chunks of data, perform map operations on them, and write the results to local disk. They sort the data before writing it, and they write it in chunks, with different chunks for different reduce keys. It is very important to note that whereas the input files live in a distributed file system, the local writes made after each mapper finishes its work go to that worker's local disk. As a result, there is no inter-processor communication involved until these files are actually read by the different reducers.

When a mapper is finished with a particular reduce key, it informs a master node that it is done with that key. The master node keeps getting such messages from many different mapper tasks and assigns reduce keys to different reduce processors depending on which are available. Sometimes mappers that have finished their work can themselves be reassigned as reducers; in other cases there are mappers and reducers working in parallel.

When a reducer is informed by the master node that a particular reduce key is ready, it is also told which mapper to pick it up from. The reducer then issues a remote read to that mapper process, which has a listener sitting on it waiting for such remote read requests, which it services. That is how the reducers actually grab data from the different mappers as soon as they know where to get it from.

Notice that even though the master can tell a reducer when a mapper has finished working on a particular reduce key, the master also knows how many mappers there are, so it can figure out when all the work that needs to be done for a particular reduce key is actually over. It can then inform the reducer assigned to that key that this is the case, so that the reducer can begin performing the reduce operation on that key. Notice that a reduce worker can pick up data from multiple mappers, but it cannot really start working on a reduce key itself, that is, applying the reduce function, until it knows that it has all the data required for that key; otherwise the results might be incorrect, because reduce functions do not have to be commutative and associative, as we have seen earlier.
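To make the map-side writes and the reducers' remote reads concrete, here is a minimal sketch in Python. This is not Hadoop's actual API: the function names, the hash partitioning into a fixed NUM_REDUCERS, and the use of plain local files in place of real remote reads are simplifying assumptions of my own.

```python
import os
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 4  # assumed; real frameworks make this configurable


def partition(key):
    # Decide which reducer a key belongs to (simple hash partitioning).
    return hash(key) % NUM_REDUCERS


def run_mapper(mapper_id, records, map_fn, local_dir="."):
    """Map phase: apply map_fn, sort, and spill one local file per reducer."""
    buckets = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            buckets[partition(key)].append((key, value))
    paths = {}
    for r, pairs in buckets.items():
        pairs.sort(key=itemgetter(0))            # sort before writing, as in the lecture
        path = os.path.join(local_dir, f"map-{mapper_id}-part-{r}.txt")
        with open(path, "w") as f:
            for key, value in pairs:
                f.write(f"{key}\t{value}\n")
        paths[r] = path                          # the master would be told about these
    return paths


def run_reducer(reducer_id, mapper_outputs, reduce_fn):
    """Reduce phase: read each mapper's partition file, then reduce per key."""
    pairs = []
    for paths in mapper_outputs:                 # one dict of file paths per finished mapper
        path = paths.get(reducer_id)
        if path is None:
            continue
        with open(path) as f:                    # stands in for the remote read
            for line in f:
                key, value = line.rstrip("\n").split("\t")
                pairs.append((key, value))
    pairs.sort(key=itemgetter(0))                # merge all mappers' data by key
    return {key: reduce_fn(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))}
```

For word count, for example, map_fn could be `lambda line: [(w, 1) for w in line.split()]` and reduce_fn could be `lambda key, values: sum(int(v) for v in values)`; run_mapper would be called once per input chunk, and run_reducer once per reducer id, after the master has collected the path dictionaries returned by all the mappers.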
Last but not least, the most important part of MapReduce implementations that scale is their fault tolerance. This was the reason why traditional database technologies could not scale to thousands or hundreds of thousands of processors. The trouble is that processors are bound to fail in long tasks involving large numbers of machines; simple probability ensures that this is almost certain to happen. For example, suppose the chance that a processor will never fail during a particular time period is 99%. Then if you have, say, 100 such machines, the chance that none of them will fail is only about 37%, and if you have 1,000 machines, the chance that at least one will fail is almost 100%. You can work this out using math very similar to what we used in the locality-sensitive hashing example.

So MapReduce computations, which can take hours or even days on large volumes of big data, need to be aware of this fact, indeed certain, that at least some of their machines will fail during the computation; yet the computation itself should continue and complete successfully.

Here is how MapReduce ensures this. The master node maintains regular communication with each of the worker nodes, whether they are mappers or reducers. It also keeps track of which key ranges have actually been successfully processed by each mapper and reducer. Now, if a mapper fails, the master simply reassigns that mapper's key ranges to some other processor whenever one becomes available. Since the data had to be read from the input in any case, the newly assigned mapper simply rereads the data and reprocesses it. Similarly, if a reducer fails, the key ranges for whatever tasks it has not yet finished are assigned to some newly appointed reducer. However, for the tasks it had already completed before failing, we don't have to worry, because their results have already been written out to the output. This is how the master keeps track of how much of the MapReduce task is completed, and when a failure is detected, it merely assigns those key ranges to other workers for the appropriate map or reduce phase.
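The failure-probability argument above is easy to check numerically. Here is a small sketch of my own, assuming machines fail independently of one another:

```python
# Probability that no machine fails, when each machine independently
# survives the time period with probability p = 0.99.
p = 0.99

for n in (100, 1000):
    none_fail = p ** n               # all n machines survive
    at_least_one = 1 - none_fail     # at least one machine fails
    print(f"n={n}: P(none fail) = {none_fail:.5f}, "
          f"P(at least one fails) = {at_least_one:.5f}")

# n=100:  P(none fail) ≈ 0.366   -> roughly a 37% chance of a clean run
# n=1000: P(none fail) ≈ 0.00004 -> at least one failure is essentially certain
```

The same arithmetic is what makes a single master, discussed next, an acceptable risk: the chance that one particular machine fails is tiny, whereas the chance that some machine among a thousand fails is essentially one.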
In addition, the master also figures out whether there are situations where some mappers are performing very slowly for some unforeseen reason. It treats those as failures and simply reassigns their work to somebody else; whichever worker finishes first, its data is taken as done, since it doesn't really matter if something gets computed twice by mistake.

Now, you might think that this master, which is a single node, is an obvious point of failure, and that is indeed the case: if the master fails, the MapReduce task does fail. However, MapReduce works on large numbers and probabilities. The chance of one amongst a thousand processors failing is almost 100%, as we argued a while back, even though the chance that any one particular machine fails might be less than a percent, or even 0.1%. By that argument, the chance that this particular master fails might be a tenth of a percent or even less, whereas the chance that any of the other processors fails, when there are large numbers of them, is almost one. This still doesn't remove the fact that the master is a single point of failure, but the probability of the MapReduce task failing because of it is still very, very small.

So let's see what we have learnt this week. We began with a review of parallel computing ideas: the concepts of speed-up, parallel efficiency, and scalable algorithms. We then went into the MapReduce programming paradigm as a higher-level abstraction compared to hand-coded parallel programming using shared memory and message passing. We did a few examples, discussed how the Octo simulator works, and showed how MapReduce can be applied to various machine learning and search tasks that we have studied in past weeks. We concluded with some discussion of the parallel efficiency and internals of MapReduce.

Next week, we will turn our attention to the big data technology that is required to make MapReduce actually work: in particular, distributed file systems, distributed NoSQL databases and how they differ from the traditional relational variety, and finally, emerging trends as to where this technology is actually going in the future.