Let's return now to map-reduce and ask where the data that a map-reduce program requires is read from, and where its output goes. Clearly, map-reduce reads and writes a lot of data; as we've seen before, it's a batch process. The question is, where does it read this data from, and where does it write the output to? Even if map-reduce is a parallel computing paradigm, what about the data reads and writes? Do they form a potential bottleneck? The solution is that the data itself needs to lie in a distributed form, whether in a distributed file system or a database, so that we can support parallel reading and writing, that is, so that more than one processor is actively reading or writing a large file. Otherwise, input and output themselves become a bottleneck for the map-reduce parallel computing program. Further, we have discussed processing-node failures and seen how map-reduce recovers from them without letting the overall map-reduce job fail. But what if the nodes on which data is stored fail? They also need to be fault tolerant. To solve both these problems, distributed file systems were developed, first by Google and then at Yahoo, and that's what we are going to study next.

This diagram illustrates the architecture of these distributed file systems, both the Google File System (GFS) and the Hadoop Distributed File System (HDFS), which is now open source and is essentially the foundation of big data technology wherever it's used. Whether there is a database built on it or it is used directly, HDFS is fundamental to big data. This is how it works. Large files are divided into chunks, typically 64 megabytes in size, and each of these chunks is stored on what are called chunk servers. Further, chunks are replicated so that each chunk is stored on three different servers, typically on multiple racks and in different network subnets, to take care of all types of hardware failure. As a result of this replication, the possibility of data being lost due to hardware failure is minimized significantly. Of course, the challenge now is to maintain consistency across all these replicas.

So, here is how that works. When a client application, say a map-reduce program, needs to read data, it first has to figure out where a particular piece of data resides. A read command is typically issued with a file name, which is a path in the directory structure, and an offset, that is, how many bytes into the file the client wants to read. Metadata about which parts of a file are on which chunk server is kept on a master node, which is called the master in GFS and the name node in HDFS, and the client reads this metadata from the master node. The metadata tells it on which chunk server the particular offset it is asking for resides. The metadata itself is cached by the client when it first starts reading, so that it doesn't continuously have to go and ask the name node or master for information. Further, when actually reading data, the client directly contacts the chunk server, which is listening for such requests and serves up particular chunks of a file, rather than going through any centralized system. So, if there are many clients reading a large file, they will typically be accessing different chunk servers, and therefore these reads happen in parallel. As a result, input, at least, does not become a bottleneck in a parallel map-reduce job.
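To make this read path concrete, here is a minimal sketch in Python of the flow just described. All of the names here (NameNode, ChunkServer, Client, locate, read_chunk) are hypothetical illustrations and not the actual GFS or HDFS APIs; it is a sketch of the idea, not an implementation.

    # Minimal sketch of the GFS/HDFS read path: look up chunk locations at the
    # master/name node, cache them, then read directly from a chunk server.
    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as in GFS

    class ChunkServer:
        """Stores whole chunks and serves reads directly to clients."""
        def __init__(self):
            self.chunks = {}  # {(file_path, chunk_index): bytes}

        def read_chunk(self, path, chunk_index, offset_in_chunk, length):
            data = self.chunks[(path, chunk_index)]
            return data[offset_in_chunk:offset_in_chunk + length]

    class NameNode:
        """Master/name node: holds only metadata, never the file data itself."""
        def __init__(self, chunk_map):
            # chunk_map maps (file_path, chunk_index) to its three replica servers
            self.chunk_map = chunk_map

        def locate(self, path, chunk_index):
            return self.chunk_map[(path, chunk_index)]

    class Client:
        def __init__(self, name_node):
            self.name_node = name_node
            self.cache = {}  # metadata cache, so the master is not asked on every read

        def read(self, path, offset, length):
            chunk_index = offset // CHUNK_SIZE
            if (path, chunk_index) not in self.cache:
                self.cache[(path, chunk_index)] = self.name_node.locate(path, chunk_index)
            # Data flows directly from a chunk server; if one replica is
            # unreachable, fall back to the next one.
            for server in self.cache[(path, chunk_index)]:
                try:
                    return server.read_chunk(path, chunk_index, offset % CHUNK_SIZE, length)
                except ConnectionError:
                    continue
            raise IOError("all three replicas failed")

Note how the master only hands out locations; the data bytes never pass through it, which is what keeps it from becoming a bottleneck when many clients read in parallel.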
If the reading client, that is, a mapper, fails to contact the chunk server, it looks up the master's metadata to figure out where the next replica is stored and tries again. Since there are three replicas, the chance of all of them failing is fairly low, and that lends fault tolerance to the input/output process. Of course, the tricky part comes in writing, because while writing data we have to make sure that each replica contains the same data at all times.

Here is how writes happen in GFS. Of the three replicas for each chunk, one is designated as the primary replica. It could be any one of them, and that information is kept at the master. Of course, the master keeps pinging these replicas to make sure that they are alive, and if the primary is down for some reason, the master node assigns a new primary, possibly even asking for a new replica of that chunk to be created. Now, when writing data, the client application sends the data to be written to all three replicas for a particular chunk. One of them is the primary, and it figures out where to write this data, assigns an offset at which to write it, typically the end of the file, and sends this offset to the other two replicas. If these replicas succeed in writing at that particular point, they tell the primary that they have written, and the write succeeds, with the primary informing the client that the operation has completed successfully. On the other hand, if some of the replicas fail to write at the designated offset, which can happen, for example, because of a bad disk sector, they return a failure to the primary, and the primary retries the write at another offset, possibly beyond the current end of the file, trying again until it succeeds at all replicas. As a result of this process, large bulk writes can be done in parallel on a large file by multiple processors, typically reducers writing the output of a map-reduce job, while still ensuring that the three replicas of every chunk are maintained synchronously and in the same state.

GFS and HDFS are the foundation of all big data technology. GFS, of course, is proprietary to Google and is used internally. HDFS was developed at Yahoo and open sourced as part of the Hadoop distribution, and has now become synonymous with map-reduce and large-scale computing using big data. So, to summarize, what the GFS distributed file system architecture is really good at is supporting multiple parallel reads and writes from a large number of processors. The reads can be arbitrary, random-access reads, but writes are best done as appends, that is, writes to the end of a large file. Because the architecture relies on the primary replica for a chunk to decide the order in which multiple append requests are processed, the data is always consistent, though not necessarily predictable in terms of whose data gets written first. That normally doesn't matter, because the data is fairly independent, especially in map-reduce output. At the same time, it's important to realize that random writes in the middle of a file, while they can be handled by the GFS architecture exactly as we have described, are not as efficient as bulk writes towards the end of the file.
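To recap the append protocol just described, here is a minimal sketch in Python. It collapses some steps: in the actual protocol the data is first pushed to all three replicas and the primary then assigns the offset and orders the writes. The names (Replica, write_at, append) are hypothetical and do not correspond to a real GFS or HDFS interface.

    # Minimal sketch of a primary-driven append across three replicas.
    class Replica:
        """One of the three copies of a chunk, living on some chunk server."""
        def __init__(self):
            self.data = bytearray()
            self.bad_offsets = set()   # simulate failures such as bad disk sectors

        def end_of_file(self):
            return len(self.data)

        def write_at(self, offset, payload):
            if offset in self.bad_offsets:
                return False           # report failure back to the primary
            end = offset + len(payload)
            if end > len(self.data):
                self.data.extend(b"\x00" * (end - len(self.data)))
            self.data[offset:end] = payload
            return True

    def append(primary, secondaries, payload, max_retries=5):
        """The primary picks the offset (normally end of file) and drives all replicas to it."""
        offset = primary.end_of_file()
        for _ in range(max_retries):
            ok = primary.write_at(offset, payload)
            ok = all(s.write_at(offset, payload) for s in secondaries) and ok
            if ok:
                return offset          # only now is success reported back to the client
            offset += len(payload)     # some replica failed: retry at a later offset
        raise IOError("append failed at all retried offsets")

The key design choice is that the ordering decision is made in exactly one place, the primary, so concurrent appenders never disagree about where their records ended up, even though nobody can predict in advance whose record lands first.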
Imagine a large number of reducers writing their output to a file, which will eventually become large. As soon as a reducer is about to write, say, a 64 MB chunk, it has to figure out that this is going to be a new chunk and inform the master, or name node, that it is writing a new chunk. Similarly, the other reducers figure out that they are writing new chunks and inform the master. The master updates its metadata, and all the writes happen in parallel. The master does become a bit of a bottleneck, but because the writes are fairly large, that doesn't normally create a problem. On the other hand, if we have random writes to the middle of a file, the challenge becomes what happens when a chunk essentially overflows and impinges on the next chunk in the file. This can create problems in the way the file is laid out and requires much more synchronization. Without going into too much detail, it is also shown in the paper on GFS that the degree of replica consistency one gets with large bulk appends is much stronger than the degree of replica consistency one gets with random writes in parallel. Nonetheless, even with a reduced degree of consistency, each reader in the GFS architecture always sees consistent data, regardless of which replica it ends up reading from. That is one of the strengths of the GFS architecture.

Still, GFS and HDFS, while supporting bulk parallel reads and writes, are nevertheless file system architectures and not databases as we have come to understand them over the past 30 to 40 years. Much of the current debate in the big data and traditional BI communities is about databases. The big data databases are built on top of distributed file systems like GFS and HDFS. But before we turn to these, let's first take a look at traditional databases and see why they were developed and how they've actually been used.