Finally, let's take a peek inside the real implementations of MapReduce, that is, not the lightweight Octo simulator, but platforms such as Hadoop.

The input files, as we shall see in next week's lecture, are picked up from a distributed file system in chunks by worker nodes, which essentially execute the map phase. They read chunks of data, perform map operations on them, and write the results to local disk. They sort the data before writing it, and they write it in chunks, with different chunks for different reduce keys. It is very important to note that whereas the input files live in a distributed file system, the local writes made after each mapper finishes its work go to that worker's local disk. As a result, there is no inter-processor communication involved until these files are actually read by the different reducers.

When a mapper is finished with a particular reduce key, it informs a master node that it is done with that key. The master node keeps getting such messages from many different mapper tasks and assigns reduce keys to different reduce processors depending on which are available. Sometimes mappers that have finished their work can themselves be reassigned as reducers; in other cases there are mappers and reducers working in parallel.

When a reducer is informed by the master node that a particular reduce key is ready, it is also told which mapper to pick it up from. The reducer then issues a remote read to that mapper process, which has a listener sitting on it waiting for such remote read requests, which it services. That is how the reducers actually grab data from the different mappers as soon as they know where to get it from.

Notice that even though the master can tell a reducer when a mapper has finished working on a particular reduce key, the master also knows how many mappers there are, so it can figure out when all the work that needs to be done for a particular reduce key is actually over. It can then inform the reducer assigned to that key that this is the case, so that the reducer can begin performing the reduce operation on that key. Notice that a reduce worker can pick up data from multiple mappers, but it cannot really start working on a reduce key itself, that is, applying the reduce function, until it knows that it has all the data required for that key; otherwise the results might be incorrect, because reduce functions do not have to be commutative and associative, as we have seen earlier.
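To make the map-side writes and the reducers' remote reads concrete, here is a minimal sketch in Python. This is not Hadoop's actual API: the function names, the hash partitioning into a fixed NUM_REDUCERS, and the use of plain local files in place of real remote reads are simplifying assumptions of my own.

```python
import os
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 4  # assumed; real frameworks make this configurable


def partition(key):
    # Decide which reducer a key belongs to (simple hash partitioning).
    return hash(key) % NUM_REDUCERS


def run_mapper(mapper_id, records, map_fn, local_dir="."):
    """Map phase: apply map_fn, sort, and spill one local file per reducer."""
    buckets = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            buckets[partition(key)].append((key, value))
    paths = {}
    for r, pairs in buckets.items():
        pairs.sort(key=itemgetter(0))            # sort before writing, as in the lecture
        path = os.path.join(local_dir, f"map-{mapper_id}-part-{r}.txt")
        with open(path, "w") as f:
            for key, value in pairs:
                f.write(f"{key}\t{value}\n")
        paths[r] = path                          # the master would be told about these
    return paths


def run_reducer(reducer_id, mapper_outputs, reduce_fn):
    """Reduce phase: read each mapper's partition file, then reduce per key."""
    pairs = []
    for paths in mapper_outputs:                 # one dict of file paths per finished mapper
        path = paths.get(reducer_id)
        if path is None:
            continue
        with open(path) as f:                    # stands in for the remote read
            for line in f:
                key, value = line.rstrip("\n").split("\t")
                pairs.append((key, value))
    pairs.sort(key=itemgetter(0))                # merge all mappers' data by key
    return {key: reduce_fn(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))}
```

For word count, for example, map_fn could be `lambda line: [(w, 1) for w in line.split()]` and reduce_fn could be `lambda key, values: sum(int(v) for v in values)`; run_mapper would be called once per input chunk, and run_reducer once per reducer id, after the master has collected the path dictionaries returned by all the mappers.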
Last but not least, the most important part of MapReduce implementations that scale is their fault tolerance. This was the reason why traditional database technologies could not scale to thousands or hundreds of thousands of processors. The trouble is that processors are bound to fail in long tasks involving large numbers of machines; simple probability ensures that this is almost certain to happen. For example, suppose the chance that a processor will never fail during a particular time period is 99%. Then if you have, say, 100 such machines, the chance that none of them will fail is only about 37%, and if you have 1,000 machines, the chance that at least one will fail is almost 100%. You can work this out using math very similar to what we used in the locality-sensitive hashing example.

So MapReduce computations, which can take hours or even days on large volumes of big data, need to be aware of this fact, indeed certain, that at least some of their machines will fail during the computation; yet the computation itself should continue and complete successfully.

Here is how MapReduce ensures this. The master node maintains regular communication with each of the worker nodes, whether they are mappers or reducers. It also keeps track of which key ranges have actually been successfully processed by each mapper and reducer. Now, if a mapper fails, the master simply reassigns that mapper's key ranges to some other processor whenever one becomes available. Since the data had to be read from the input in any case, the newly assigned mapper simply rereads the data and reprocesses it. Similarly, if a reducer fails, the key ranges for whatever tasks it has not yet finished are assigned to some newly appointed reducer. However, for the tasks it had already completed before failing, we don't have to worry, because their results have already been written out to the output. This is how the master keeps track of how much of the MapReduce task is completed, and when a failure is detected, it merely assigns those key ranges to other workers for the appropriate map or reduce phase.
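The failure-probability argument above is easy to check numerically. Here is a small sketch of my own, assuming machines fail independently of one another:

```python
# Probability that no machine fails, when each machine independently
# survives the time period with probability p = 0.99.
p = 0.99

for n in (100, 1000):
    none_fail = p ** n               # all n machines survive
    at_least_one = 1 - none_fail     # at least one machine fails
    print(f"n={n}: P(none fail) = {none_fail:.5f}, "
          f"P(at least one fails) = {at_least_one:.5f}")

# n=100:  P(none fail) ≈ 0.366   -> roughly a 37% chance of a clean run
# n=1000: P(none fail) ≈ 0.00004 -> at least one failure is essentially certain
```

The same arithmetic is what makes a single master, discussed next, an acceptable risk: the chance that one particular machine fails is tiny, whereas the chance that some machine among a thousand fails is essentially one.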
In addition, the master also figures out whether there are situations where some mappers are performing very slowly for some unforeseen reason. It treats those as failures and simply reassigns their work to somebody else; whichever worker finishes first, its data is taken as done, since it doesn't really matter if something gets computed twice by mistake.

Now, you might think that this master, which is a single node, is an obvious point of failure, and that is indeed the case: if the master fails, the MapReduce task does fail. However, MapReduce works on large numbers and probabilities. The chance of one amongst a thousand processors failing is almost 100%, as we argued a while back, even though the chance that any one particular machine fails might be less than a percent, or even 0.1%. By that argument, the chance that this particular master fails might be a tenth of a percent or even less, whereas the chance that any of the other processors fails, when there are large numbers of them, is almost one. This still doesn't remove the fact that the master is a single point of failure, but the probability of the MapReduce task failing because of it is still very, very small.

So let's see what we have learnt this week. We began with a review of parallel computing ideas: the concepts of speed-up, parallel efficiency, and scalable algorithms. We then went into the MapReduce programming paradigm as a higher-level abstraction compared to hand-coded parallel programming using shared memory and message passing. We did a few examples, discussed how the Octo simulator works, and showed how MapReduce can be applied to various machine learning and search tasks that we have studied in past weeks. We concluded with some discussion of the parallel efficiency and internals of MapReduce.

Next week, we will turn our attention to the big data technology that is required to make MapReduce actually work: in particular, distributed file systems, distributed NoSQL databases and how they differ from the traditional relational variety, and finally, emerging trends as to where this technology is actually going in the future.