Let's return now to map-reduce and ask where the data that a map-reduce program requires is read from, and where its output goes. Clearly, map-reduce reads and writes data; as we've seen before, it's a batch process. The question is, where does it read this data from, and where does it write the output to? Even if map-reduce is a parallel computing paradigm, what about the data reads and writes? Do they form a potential bottleneck?

The solution is that the data itself needs to lie in a distributed format, whether in a distributed file system or a database, so that we can support parallel reading and writing, with more than one processor actively reading or writing a large file. Otherwise, input and output itself will become a bottleneck for the map-reduce parallel computing program.

Further, we have discussed processing-node failures, and seen how map-reduce recovers from these without letting the overall map-reduce job fail. But what if the nodes on which data is stored fail? They also need to be fault tolerant.

To solve both these problems, distributed file systems were developed, first by Google and then by Yahoo, and that's what we are going to study next. This diagram illustrates the architecture of these distributed file systems, both the Google File System (GFS) as well as the Hadoop Distributed File System (HDFS), which is now open source and is essentially the foundation of big data technology wherever it's used. Whether there is a database built on it or it is used directly, HDFS is fundamental to big data.

This is how it works. Large files are divided into chunks, typically 64 megabytes in size, and each of these chunks is stored on what are called chunk servers. Further, chunks are replicated so that each chunk is stored on three different servers, typically on multiple racks and in different network subnets, to take care of all types of hardware failures. As a result of this replication, the possibility of data being lost due to hardware failure is minimized significantly. Of course, the challenge now is to maintain consistency across all these replicas.
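To make the chunking and replication concrete, here is a minimal sketch in Python of how a large file might be split into 64 MB chunks, with each chunk assigned three replicas on different racks. The function names and the round-robin placement policy are illustrative assumptions, not the actual GFS or HDFS logic.

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical chunk size

    def num_chunks(file_size):
        # A file of file_size bytes occupies this many fixed-size chunks.
        return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE

    def place_replicas(chunk_id, servers_by_rack):
        # Pick three chunk servers on distinct racks for one chunk. Real
        # systems also weigh subnet, load, and free disk space; here we
        # simply rotate over racks so no two replicas share a rack.
        racks = sorted(servers_by_rack)
        replicas = []
        for i in range(3):
            rack = racks[(chunk_id + i) % len(racks)]
            servers = servers_by_rack[rack]
            replicas.append(servers[chunk_id % len(servers)])
        return replicas

    # Example: a 200 MB file on a small three-rack cluster.
    servers = {"rack1": ["cs1", "cs2"], "rack2": ["cs3", "cs4"], "rack3": ["cs5"]}
    for c in range(num_chunks(200 * 1024 * 1024)):
        print("chunk", c, "->", place_replicas(c, servers))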
So, here is how reading works. When a client application, say a map-reduce program, needs to read data, it first needs to figure out where a particular piece of data resides. A read request is typically issued with a file name, which is a path in the directory structure, and an offset, that is, how many bytes into the file the client wants to read.

Data about which parts of a file are on which chunk server is kept on a master node, called the master in GFS and the name node in HDFS, and the client reads the metadata from this master node. The metadata tells it where the particular offset it is asking for resides, that is, on which chunk server. The metadata itself is cached by the client when it first starts reading, so that it doesn't continuously have to go and ask the name node or the master for information.

Further, while actually reading data, the client directly contacts the chunk server, which is listening for such requests and serves up particular chunks of data from a file, rather than going through any centralized system. So, if there are many clients reading a large file, they will typically be accessing different chunk servers, and therefore these reads happen in parallel. As a result, input at least doesn't become a bottleneck in a parallel map-reduce job.

In case the reading client, that is, a mapper, fails to contact a chunk server, it looks up the master's metadata to figure out where the next replica is stored and tries again. Since there are three replicas, the chance of all of them failing is fairly low, and that lends fault tolerance to the input-output process.
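The read path just described can be modeled in a few lines of Python. This is a sketch under stated assumptions; the Master, ChunkServer, and Client classes are hypothetical stand-ins for the real RPC interfaces, but the flow (offset to chunk index, cached metadata lookup, fallback across replicas) follows the description above.

    CHUNK_SIZE = 64 * 1024 * 1024

    class Master:
        # Stand-in for the GFS master / HDFS name node: maps
        # (path, chunk index) to the list of replica server names.
        def __init__(self, table):
            self.table = table
        def locate(self, path, chunk_index):
            return self.table[(path, chunk_index)]

    class ChunkServer:
        # Stand-in for a chunk server holding chunk contents as bytes.
        def __init__(self, chunks):
            self.chunks = chunks
        def read(self, path, chunk_index, offset_in_chunk, length):
            data = self.chunks[(path, chunk_index)]
            return data[offset_in_chunk:offset_in_chunk + length]

    class Client:
        def __init__(self, master, chunk_servers):
            self.master = master
            self.chunk_servers = chunk_servers   # name -> ChunkServer
            self.cache = {}                      # cached metadata

        def read(self, path, offset, length):
            chunk_index = offset // CHUNK_SIZE
            key = (path, chunk_index)
            if key not in self.cache:
                # One metadata lookup, cached so the master is not
                # contacted again for later reads of the same chunk.
                self.cache[key] = self.master.locate(path, chunk_index)
            # Try replicas in turn: if one chunk server is unreachable,
            # fall back to the next, giving fault-tolerant reads.
            for name in self.cache[key]:
                try:
                    return self.chunk_servers[name].read(
                        path, chunk_index, offset % CHUNK_SIZE, length)
                except ConnectionError:
                    continue
            raise IOError("all replicas unreachable for " + path)

    # Example: three replicas of chunk 0 of a small file.
    master = Master({("/logs/a", 0): ["cs1", "cs3", "cs5"]})
    servers = {n: ChunkServer({("/logs/a", 0): b"hello world"})
               for n in ["cs1", "cs3", "cs5"]}
    print(Client(master, servers).read("/logs/a", 6, 5))  # b'world'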
Of course, the tricky part comes in writing, because while writing data we have to make sure that each replica contains the same data at all times. Here is how writes happen in GFS. Of the three replicas for each chunk, one is designated as the primary replica. It could be any one of them, and that information is kept in the master. Of course, the master keeps pinging these replicas to make sure that they are alive, and in case the primary is down for some reason, the master node assigns a new primary, and possibly even asks for a new replica to be created for that chunk.

Now, when writing data, the client application sends the data to be written to all three replicas for a particular chunk. One of them is the primary, and it figures out where to write this data, assigns an offset at which to write it, typically the end of the file, and sends this offset to the other two replicas. If these replicas succeed in writing at that particular point, they tell the primary that they have written, and the write succeeds, with the primary informing the client that the operation has completed successfully. On the other hand, if some of the replicas fail to write at the designated offset, which can happen, for example, because of a bad disk sector, then they return a failure to the primary, and the primary retries the write at another offset, possibly beyond the end of the file, and tries again until it succeeds at all replicas.

As a result of this process, large bulk writes can be done in parallel on a large file by multiple processors, typically reducers writing the output of a map-reduce job, while still ensuring that three replicas of every chunk are maintained synchronously and in the same state.
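Here is a sketch of that write path, again an illustrative model rather than GFS's actual implementation: the primary chooses the offset (the current end of its file), every replica attempts the write there, and on any failure the primary retries at a fresh offset until all three succeed.

    class Replica:
        def __init__(self):
            self.data = bytearray()
            self.bad_offsets = set()   # simulates bad disk sectors

        def write_at(self, offset, payload):
            if offset in self.bad_offsets:
                return False           # write failed, e.g. a bad sector
            end = offset + len(payload)
            if len(self.data) < end:
                self.data.extend(b"\x00" * (end - len(self.data)))
            self.data[offset:end] = payload
            return True

    def primary_write(primary, secondaries, payload):
        # The primary assigns the offset (the current end of its file)
        # and retries at a fresh offset until all replicas succeed, so
        # every replica ends up with the record at the same offset.
        offset = len(primary.data)
        while True:
            if all(r.write_at(offset, payload)
                   for r in [primary] + secondaries):
                return offset          # reported back to the client
            offset += len(payload)     # retry beyond the failed region

    # Example: one secondary fails at offset 0, so the record lands at 8
    # on all three replicas, leaving a padded gap at the failed offset.
    p, s1, s2 = Replica(), Replica(), Replica()
    s1.bad_offsets.add(0)
    print(primary_write(p, [s1, s2], b"record-1"))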
GFS and HDFS are the foundation of all big data technology. GFS, of course, is proprietary to Google and is used internally. HDFS was developed at Yahoo, open sourced as part of the Hadoop distribution, and has now become synonymous with map-reduce and large-scale computing using big data.

So, to summarize, what the GFS distributed file system architecture is really good at is supporting multiple parallel reads and writes from a large number of processors. The reads can be arbitrary, random accesses, but writes are best done as appends, that is, writes to the end of a large file. Because the architecture relies on the primary replica for a chunk to decide the order in which multiple append requests are processed, the data is always consistent, though not necessarily predictable in terms of which processor's data is written first. That normally doesn't matter, because the data is fairly independent, especially in map-reduce output.

At the same time, it's important to realize that random writes into the middle of a file, while they can be handled by the GFS architecture in exactly the same way as we have described, are not as efficient as bulk writes towards the end of the file. Imagine a large number of reducers writing their output to a file which will eventually become large. As soon as a reducer decides to write, say, a 64 MB chunk, it figures out that this is going to be a new chunk and informs the master, or the name node, that it's writing a new chunk. Similarly, other reducers figure out that they're writing new chunks and inform the master. The master updates its metadata, and all the writes happen in parallel. The master does become a bit of a bottleneck, but because the writes are fairly large, that doesn't normally create a problem.

On the other hand, if we have random writes into the middle of a file, the challenge becomes what happens when a chunk overflows and impinges on the next chunk in the file. This can create problems in the way the file is laid out, and it requires much more synchronization.

Without going into too much detail, it is also shown in the paper on GFS that the degree of replica consistency one gets with large bulk appends is much stronger than the degree of replica consistency one gets with random writes in parallel. Nonetheless, even with the reduced degree of consistency, each reader in the GFS architecture always sees consistent data, regardless of which replica it ends up reading from. That's one of the powers of the GFS architecture.

Still, GFS and HDFS, while supporting bulk parallel reads and writes, are nevertheless file system architectures, and not databases as we have come to understand them over the past 30 to 40 years. Much of the current debate between the big data and traditional BI communities is about databases. The big data databases are built on top of distributed file systems like GFS and HDFS. But before we turn to these, let's first take a look at traditional databases and see why they were developed and how they've actually been used.