Let's return now to map-reduce and ask where the data that a map-reduce program requires is read from, and where its output goes. Clearly, map-reduce reads and writes a lot of data; as we've seen before, it's a batch process. The question is, where does it read this data from, and where does it write the output to? Even if map-reduce is a parallel computing paradigm, what about the data reads and writes? Do they form a potential bottleneck? The solution is that the data itself needs to lie in a distributed form, whether in a distributed file system or a database, so that we can support parallel reading and writing, that is, so that more than one processor is actively reading or writing a large file. Otherwise, input and output themselves become a bottleneck for the map-reduce parallel computing program. Further, we have discussed processing-node failures and seen how map-reduce recovers from them without letting the overall map-reduce job fail. But what if the nodes on which data is stored fail? They also need to be fault tolerant. To solve both these problems, distributed file systems were developed, first by Google and then at Yahoo, and that's what we are going to study next.

This diagram illustrates the architecture of these distributed file systems, both the Google File System (GFS) and the Hadoop Distributed File System (HDFS), which is now open source and is essentially the foundation of big data technology wherever it's used. Whether there is a database built on it or it is used directly, HDFS is fundamental to big data. This is how it works. Large files are divided into chunks, typically 64 megabytes in size, and each of these chunks is stored on what are called chunk servers. Further, chunks are replicated so that each chunk is stored on three different servers, typically on multiple racks and in different network subnets, to take care of all types of hardware failure. As a result of this replication, the possibility of data being lost due to hardware failure is minimized significantly. Of course, the challenge now is to maintain consistency across all these replicas.

So, here is how that works. When a client application, say a map-reduce program, needs to read data, it first has to figure out where a particular piece of data resides. A read command is typically issued with a file name, which is a path in the directory structure, and an offset, that is, how many bytes into the file the client wants to read. Metadata about which parts of a file are on which chunk server is kept on a master node, which is called the master in GFS and the name node in HDFS, and the client reads this metadata from the master node. The metadata tells it on which chunk server the particular offset it is asking for resides. The metadata itself is cached by the client when it first starts reading, so that it doesn't continuously have to go and ask the name node or master for information. Further, when actually reading data, the client directly contacts the chunk server, which is listening for such requests and serves up particular chunks of a file, rather than going through any centralized system. So, if there are many clients reading a large file, they will typically be accessing different chunk servers, and therefore these reads happen in parallel. As a result, input, at least, does not become a bottleneck in a parallel map-reduce job.
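To make this read path concrete, here is a minimal sketch in Python of the flow just described. All of the names here (NameNode, ChunkServer, Client, locate, read_chunk) are hypothetical illustrations and not the actual GFS or HDFS APIs; it is a sketch of the idea, not an implementation.

    # Minimal sketch of the GFS/HDFS read path: look up chunk locations at the
    # master/name node, cache them, then read directly from a chunk server.
    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as in GFS

    class ChunkServer:
        """Stores whole chunks and serves reads directly to clients."""
        def __init__(self):
            self.chunks = {}  # {(file_path, chunk_index): bytes}

        def read_chunk(self, path, chunk_index, offset_in_chunk, length):
            data = self.chunks[(path, chunk_index)]
            return data[offset_in_chunk:offset_in_chunk + length]

    class NameNode:
        """Master/name node: holds only metadata, never the file data itself."""
        def __init__(self, chunk_map):
            # chunk_map maps (file_path, chunk_index) to its three replica servers
            self.chunk_map = chunk_map

        def locate(self, path, chunk_index):
            return self.chunk_map[(path, chunk_index)]

    class Client:
        def __init__(self, name_node):
            self.name_node = name_node
            self.cache = {}  # metadata cache, so the master is not asked on every read

        def read(self, path, offset, length):
            chunk_index = offset // CHUNK_SIZE
            if (path, chunk_index) not in self.cache:
                self.cache[(path, chunk_index)] = self.name_node.locate(path, chunk_index)
            # Data flows directly from a chunk server; if one replica is
            # unreachable, fall back to the next one.
            for server in self.cache[(path, chunk_index)]:
                try:
                    return server.read_chunk(path, chunk_index, offset % CHUNK_SIZE, length)
                except ConnectionError:
                    continue
            raise IOError("all three replicas failed")

Note how the master only hands out locations; the data bytes never pass through it, which is what keeps it from becoming a bottleneck when many clients read in parallel.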
If the reading client, that is, a mapper, fails to contact the chunk server, it looks up the master's metadata to figure out where the next replica is stored and tries again. Since there are three replicas, the chance of all of them failing is fairly low, and that lends fault tolerance to the input/output process. Of course, the tricky part comes in writing, because while writing data we have to make sure that each replica contains the same data at all times.

Here is how writes happen in GFS. Of the three replicas for each chunk, one is designated as the primary replica. It could be any one of them, and that information is kept at the master. Of course, the master keeps pinging these replicas to make sure that they are alive, and if the primary is down for some reason, the master node assigns a new primary, possibly even asking for a new replica of that chunk to be created. Now, when writing data, the client application sends the data to be written to all three replicas for a particular chunk. One of them is the primary, and it figures out where to write this data, assigns an offset at which to write it, typically the end of the file, and sends this offset to the other two replicas. If these replicas succeed in writing at that particular point, they tell the primary that they have written, and the write succeeds, with the primary informing the client that the operation has completed successfully. On the other hand, if some of the replicas fail to write at the designated offset, which can happen, for example, because of a bad disk sector, they return a failure to the primary, and the primary retries the write at another offset, possibly beyond the current end of the file, trying again until it succeeds at all replicas. As a result of this process, large bulk writes can be done in parallel on a large file by multiple processors, typically reducers writing the output of a map-reduce job, while still ensuring that the three replicas of every chunk are maintained synchronously and in the same state.

GFS and HDFS are the foundation of all big data technology. GFS, of course, is proprietary to Google and is used internally. HDFS was developed at Yahoo and open sourced as part of the Hadoop distribution, and has now become synonymous with map-reduce and large-scale computing using big data. So, to summarize, what the GFS distributed file system architecture is really good at is supporting multiple parallel reads and writes from a large number of processors. The reads can be arbitrary, random-access reads, but writes are best done as appends, that is, writes to the end of a large file. Because the architecture relies on the primary replica for a chunk to decide the order in which multiple append requests are processed, the data is always consistent, though not necessarily predictable in terms of whose data gets written first. That normally doesn't matter, because the data is fairly independent, especially in map-reduce output. At the same time, it's important to realize that random writes in the middle of a file, while they can be handled by the GFS architecture exactly as we have described, are not as efficient as bulk writes towards the end of the file.
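To recap the append protocol just described, here is a minimal sketch in Python. It collapses some steps: in the actual protocol the data is first pushed to all three replicas and the primary then assigns the offset and orders the writes. The names (Replica, write_at, append) are hypothetical and do not correspond to a real GFS or HDFS interface.

    # Minimal sketch of a primary-driven append across three replicas.
    class Replica:
        """One of the three copies of a chunk, living on some chunk server."""
        def __init__(self):
            self.data = bytearray()
            self.bad_offsets = set()   # simulate failures such as bad disk sectors

        def end_of_file(self):
            return len(self.data)

        def write_at(self, offset, payload):
            if offset in self.bad_offsets:
                return False           # report failure back to the primary
            end = offset + len(payload)
            if end > len(self.data):
                self.data.extend(b"\x00" * (end - len(self.data)))
            self.data[offset:end] = payload
            return True

    def append(primary, secondaries, payload, max_retries=5):
        """The primary picks the offset (normally end of file) and drives all replicas to it."""
        offset = primary.end_of_file()
        for _ in range(max_retries):
            ok = primary.write_at(offset, payload)
            ok = all(s.write_at(offset, payload) for s in secondaries) and ok
            if ok:
                return offset          # only now is success reported back to the client
            offset += len(payload)     # some replica failed: retry at a later offset
        raise IOError("append failed at all retried offsets")

The key design choice is that the ordering decision is made in exactly one place, the primary, so concurrent appenders never disagree about where their records ended up, even though nobody can predict in advance whose record lands first.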
Imagine a large number of reducers writing their output to a file, which will eventually become large. As soon as a reducer is about to write, say, a 64 MB chunk, it has to figure out that this is going to be a new chunk and inform the master, or name node, that it is writing a new chunk. Similarly, the other reducers figure out that they are writing new chunks and inform the master. The master updates its metadata, and all the writes happen in parallel. The master does become a bit of a bottleneck, but because the writes are fairly large, that doesn't normally create a problem. On the other hand, if we have random writes to the middle of a file, the challenge becomes what happens when a chunk essentially overflows and impinges on the next chunk in the file. This can create problems in the way the file is laid out and requires much more synchronization. Without going into too much detail, it is also shown in the paper on GFS that the degree of replica consistency one gets with large bulk appends is much stronger than the degree of replica consistency one gets with random writes in parallel. Nonetheless, even with a reduced degree of consistency, each reader in the GFS architecture always sees consistent data, regardless of which replica it ends up reading from. That is one of the strengths of the GFS architecture.

Still, GFS and HDFS, while supporting bulk parallel reads and writes, are nevertheless file system architectures and not databases as we have come to understand them over the past 30 to 40 years. Much of the current debate in the big data and traditional BI communities is about databases. The big data databases are built on top of distributed file systems like GFS and HDFS. But before we turn to these, let's first take a look at traditional databases and see why they were developed and how they've actually been used.