Let's return now to map-reduce and ask where the data that a map-reduce program requires is read from, and where its output goes. Clearly, map-reduce reads and writes data; as we've seen before, it's a batch process. The question is, where does it read this data from, and where does it write the output to? Even if map-reduce is a parallel computing paradigm, what about the data reads and writes? Do they form a potential bottleneck?

The solution is that the data itself needs to lie in a distributed format, whether in a distributed file system or a database, so that we can support parallel reading and writing, with more than one processor actively reading or writing a large file. Otherwise, input and output itself will become a bottleneck for the map-reduce parallel computing program.

Further, we have discussed processing-node failures, and seen how map-reduce recovers from these without letting the overall map-reduce job fail. But what if the nodes on which data is stored fail? They also need to be fault tolerant.

To solve both these problems, distributed file systems were developed, first by Google and then by Yahoo, and that's what we are going to study next. This diagram illustrates the architecture of these distributed file systems, both the Google File System (GFS) as well as the Hadoop Distributed File System (HDFS), which is now open source and is essentially the foundation of big data technology wherever it's used. Whether there is a database built on it or it is used directly, HDFS is fundamental to big data.

This is how it works. Large files are divided into chunks, typically 64 megabytes in size, and each of these chunks is stored on what are called chunk servers. Further, chunks are replicated so that each chunk is stored on three different servers, typically on multiple racks and in different network subnets, to take care of all types of hardware failures. As a result of this replication, the possibility of data being lost due to hardware failure is minimized significantly. Of course, the challenge now is to maintain consistency across all these replicas.
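To make the chunking and replication concrete, here is a minimal sketch in Python of how a large file might be split into 64 MB chunks, with each chunk assigned three replicas on different racks. The function names and the round-robin placement policy are illustrative assumptions, not the actual GFS or HDFS logic.

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical chunk size

    def num_chunks(file_size):
        # A file of file_size bytes occupies this many fixed-size chunks.
        return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE

    def place_replicas(chunk_id, servers_by_rack):
        # Pick three chunk servers on distinct racks for one chunk. Real
        # systems also weigh subnet, load, and free disk space; here we
        # simply rotate over racks so no two replicas share a rack.
        racks = sorted(servers_by_rack)
        replicas = []
        for i in range(3):
            rack = racks[(chunk_id + i) % len(racks)]
            servers = servers_by_rack[rack]
            replicas.append(servers[chunk_id % len(servers)])
        return replicas

    # Example: a 200 MB file on a small three-rack cluster.
    servers = {"rack1": ["cs1", "cs2"], "rack2": ["cs3", "cs4"], "rack3": ["cs5"]}
    for c in range(num_chunks(200 * 1024 * 1024)):
        print("chunk", c, "->", place_replicas(c, servers))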
So, here is how reading works. When a client application, say a map-reduce program, needs to read data, it first needs to figure out where a particular piece of data resides. A read request is typically issued with a file name, which is a path in the directory structure, and an offset, that is, how many bytes into the file the client wants to read.

Data about which parts of a file are on which chunk server is kept on a master node, called the master in GFS and the name node in HDFS, and the client reads the metadata from this master node. The metadata tells it where the particular offset it is asking for resides, that is, on which chunk server. The metadata itself is cached by the client when it first starts reading, so that it doesn't continuously have to go and ask the name node or the master for information.

Further, while actually reading data, the client directly contacts the chunk server, which is listening for such requests and serves up particular chunks of data from a file, rather than going through any centralized system. So, if there are many clients reading a large file, they will typically be accessing different chunk servers, and therefore these reads happen in parallel. As a result, input at least doesn't become a bottleneck in a parallel map-reduce job.

In case the reading client, that is, a mapper, fails to contact a chunk server, it looks up the master's metadata to figure out where the next replica is stored and tries again. Since there are three replicas, the chance of all of them failing is fairly low, and that lends fault tolerance to the input-output process.
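The read path just described can be modeled in a few lines of Python. This is a sketch under stated assumptions; the Master, ChunkServer, and Client classes are hypothetical stand-ins for the real RPC interfaces, but the flow (offset to chunk index, cached metadata lookup, fallback across replicas) follows the description above.

    CHUNK_SIZE = 64 * 1024 * 1024

    class Master:
        # Stand-in for the GFS master / HDFS name node: maps
        # (path, chunk index) to the list of replica server names.
        def __init__(self, table):
            self.table = table
        def locate(self, path, chunk_index):
            return self.table[(path, chunk_index)]

    class ChunkServer:
        # Stand-in for a chunk server holding chunk contents as bytes.
        def __init__(self, chunks):
            self.chunks = chunks
        def read(self, path, chunk_index, offset_in_chunk, length):
            data = self.chunks[(path, chunk_index)]
            return data[offset_in_chunk:offset_in_chunk + length]

    class Client:
        def __init__(self, master, chunk_servers):
            self.master = master
            self.chunk_servers = chunk_servers   # name -> ChunkServer
            self.cache = {}                      # cached metadata

        def read(self, path, offset, length):
            chunk_index = offset // CHUNK_SIZE
            key = (path, chunk_index)
            if key not in self.cache:
                # One metadata lookup, cached so the master is not
                # contacted again for later reads of the same chunk.
                self.cache[key] = self.master.locate(path, chunk_index)
            # Try replicas in turn: if one chunk server is unreachable,
            # fall back to the next, giving fault-tolerant reads.
            for name in self.cache[key]:
                try:
                    return self.chunk_servers[name].read(
                        path, chunk_index, offset % CHUNK_SIZE, length)
                except ConnectionError:
                    continue
            raise IOError("all replicas unreachable for " + path)

    # Example: three replicas of chunk 0 of a small file.
    master = Master({("/logs/a", 0): ["cs1", "cs3", "cs5"]})
    servers = {n: ChunkServer({("/logs/a", 0): b"hello world"})
               for n in ["cs1", "cs3", "cs5"]}
    print(Client(master, servers).read("/logs/a", 6, 5))  # b'world'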
Of course, the tricky part comes in writing, because while writing data we have to make sure that each replica contains the same data at all times. Here is how writes happen in GFS. Of the three replicas for each chunk, one is designated as the primary replica. It could be any one of them, and that information is kept in the master. Of course, the master keeps pinging these replicas to make sure that they are alive, and in case the primary is down for some reason, the master node assigns a new primary, and possibly even asks for a new replica to be created for that chunk.

Now, when writing data, the client application sends the data to be written to all three replicas for a particular chunk. One of them is the primary, and it figures out where to write this data, assigns an offset at which to write it, typically the end of the file, and sends this offset to the other two replicas. If these replicas succeed in writing at that particular point, they tell the primary that they have written, and the write succeeds, with the primary informing the client that the operation has completed successfully. On the other hand, if some of the replicas fail to write at the designated offset, which can happen, for example, because of a bad disk sector, then they return a failure to the primary, and the primary retries the write at another offset, possibly beyond the end of the file, and tries again until it succeeds at all replicas.

As a result of this process, large bulk writes can be done in parallel on a large file by multiple processors, typically reducers writing the output of a map-reduce job, while still ensuring that three replicas of every chunk are maintained synchronously and in the same state.
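Here is a sketch of that write path, again an illustrative model rather than GFS's actual implementation: the primary chooses the offset (the current end of its file), every replica attempts the write there, and on any failure the primary retries at a fresh offset until all three succeed.

    class Replica:
        def __init__(self):
            self.data = bytearray()
            self.bad_offsets = set()   # simulates bad disk sectors

        def write_at(self, offset, payload):
            if offset in self.bad_offsets:
                return False           # write failed, e.g. a bad sector
            end = offset + len(payload)
            if len(self.data) < end:
                self.data.extend(b"\x00" * (end - len(self.data)))
            self.data[offset:end] = payload
            return True

    def primary_write(primary, secondaries, payload):
        # The primary assigns the offset (the current end of its file)
        # and retries at a fresh offset until all replicas succeed, so
        # every replica ends up with the record at the same offset.
        offset = len(primary.data)
        while True:
            if all(r.write_at(offset, payload)
                   for r in [primary] + secondaries):
                return offset          # reported back to the client
            offset += len(payload)     # retry beyond the failed region

    # Example: one secondary fails at offset 0, so the record lands at 8
    # on all three replicas, leaving a padded gap at the failed offset.
    p, s1, s2 = Replica(), Replica(), Replica()
    s1.bad_offsets.add(0)
    print(primary_write(p, [s1, s2], b"record-1"))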
GFS and HDFS are the foundation of all big data technology. GFS, of course, is proprietary to Google and is used internally. HDFS was developed at Yahoo, open sourced as part of the Hadoop distribution, and has now become synonymous with map-reduce and large-scale computing using big data.

So, to summarize, what the GFS distributed file system architecture is really good at is supporting multiple parallel reads and writes from a large number of processors. The reads can be arbitrary, random accesses, but writes are best done as appends, that is, writes to the end of a large file. Because the architecture relies on the primary replica for a chunk to decide the order in which multiple append requests are processed, the data is always consistent, though not necessarily predictable in terms of which processor's data is written first. That normally doesn't matter, because the data is fairly independent, especially in map-reduce output.

At the same time, it's important to realize that random writes into the middle of a file, while they can be handled by the GFS architecture in exactly the same way as we have described, are not as efficient as bulk writes towards the end of the file. Imagine a large number of reducers writing their output to a file which will eventually become large. As soon as a reducer decides to write, say, a 64 MB chunk, it figures out that this is going to be a new chunk and informs the master, or the name node, that it's writing a new chunk. Similarly, other reducers figure out that they're writing new chunks and inform the master. The master updates its metadata, and all the writes happen in parallel. The master does become a bit of a bottleneck, but because the writes are fairly large, that doesn't normally create a problem.

On the other hand, if we have random writes into the middle of a file, the challenge becomes what happens when a chunk overflows and impinges on the next chunk in the file. This can create problems in the way the file is laid out, and it requires much more synchronization.

Without going into too much detail, it is also shown in the paper on GFS that the degree of replica consistency one gets with large bulk appends is much stronger than the degree of replica consistency one gets with random writes in parallel. Nonetheless, even with the reduced degree of consistency, each reader in the GFS architecture always sees consistent data, regardless of which replica it ends up reading from. That's one of the powers of the GFS architecture.

Still, GFS and HDFS, while supporting bulk parallel reads and writes, are nevertheless file system architectures, and not databases as we have come to understand them over the past 30 to 40 years. Much of the current debate between the big data and traditional BI communities is about databases. The big data databases are built on top of distributed file systems like GFS and HDFS. But before we turn to these, let's first take a look at traditional databases and see why they were developed and how they've actually been used.