Let's now turn to the map-reduce paradigm itself. As we just mentioned, it's a data-parallel paradigm with message passing. It's also a pipelined procedure, where there are two phases: the map phase and the reduce phase. Further, as we also mentioned, it's a higher-level abstraction. Programmers only need to specify what each mapper actually does and what each reducer actually does, and the map-reduce implementation takes care of all the message passing required between the two phases.

In fact, one doesn't even need to worry about how many processors there are, or which one should be doing what kind of job; that's all left to the map-reduce implementation. All the programmer needs to do is define their task in terms of the map-reduce paradigm, which we will come to in a moment, and point the system at the data on which they would like this task performed. The map-reduce implementation then figures out which processors can be made available to this task, and does the rest without the programmer having to worry about the details. That's what makes map-reduce exceedingly powerful for large-scale data-parallel processing: it bridges the gap between the high-level specification of an algorithm and its massively parallel implementation, in a manner that had never really been achieved before in the history of parallel computing.

As we shall see, many complex web intelligence algorithms for machine learning and processing big data are easy to express using the map-reduce paradigm, making their large-scale parallel implementation much easier than if one had to program them using either shared memory or low-level message passing, as was the case before map-reduce.

The most common example used to explain map-reduce is that of counting words in documents, something we saw last week when we computed inverse document frequencies, for example. Suppose we are given many documents. Here there are only ten, but there could be hundreds or thousands or even millions of documents. We need to compute the frequency with which each word occurs across this whole set of documents, and we would like to do it in parallel, using many processors, so as to reduce the overall execution time. In our simple example we will take only ten documents.
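To make this concrete before we walk through the example, here is a minimal sketch, in Python, of the two functions a programmer might supply for word counting. The names map_fn and reduce_fn, and the framework that would invoke them, are illustrative assumptions, not any particular system's API.

```python
from collections import Counter

def map_fn(doc_id, contents):
    # Map task: takes one (doc_id, contents) pair and emits a
    # (word, count) pair for each distinct word in that document.
    for word, count in Counter(contents.split()).items():
        yield (word, count)

def reduce_fn(word, counts):
    # Reduce task: takes all the counts emitted for one word and
    # combines them into a single (word, total) pair.
    yield (word, sum(counts))
```

Everything else (distributing documents to mappers and routing pairs to reducers) is left to the implementation, as we describe next.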
We will use three mapper processors, which execute map tasks, followed by two reducer processors that execute reduce tasks. Of course, once the mapper processors are done, they can start doing the reduce work, so the total number of processors need not be five; it might even be as low as three, or maybe four.

The mappers are given data, in this case documents identified by document IDs, which the map-reduce system randomly distributes across the different mapper processors. Each mapper processor thus gets a bunch of documents. It doesn't really know which documents it's getting; it just gets a bunch of them, in some random order.

Now, here's how the problem is defined in terms of map and reduce tasks. Remember, a mapper simply needs to perform a whole bunch of map tasks on all the documents that it gets. So it's getting a sequence of document IDs, each followed by the contents of that document; in other words, it's getting a key and a value, in a sequence of such key-value pairs. A map task works on one key-value pair, such as a document ID and its contents, and produces a bunch of key-value pairs, in this case words and the frequencies with which they occur in that particular document. Remember that a map task acts on one key-value pair, so here it acts on one document. It is, however, free to emit many key-value pairs as its output, in this case the many words that occur in that document, along with their frequencies. For example, the map task for document D1 would emit a W1 followed by a one, a W2 followed by a one, and a W4 followed by a one, and similarly for the other documents. In some cases, such as document D6, we would emit a W2 followed by a two, but in most other cases it would be a count of one.
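Running the map_fn sketched above over two toy documents shows exactly this emission pattern. The document contents here are invented so that D1 contains w1, w2, and w4 once each, while w2 appears twice in D6, matching the lecture's example.

```python
docs = {'D1': 'w1 w2 w4',   # contents invented for illustration
        'D6': 'w2 w2'}

for doc_id, contents in docs.items():
    print(doc_id, list(map_fn(doc_id, contents)))
# D1 [('w1', 1), ('w2', 1), ('w4', 1)]
# D6 [('w2', 2)]
```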
The second part of our problem definition is the reduce phase, in which a list of key-value pairs corresponding to the same key, in this case the same word, is reduced, or compressed, or combined, into one key-value pair for that particular key. So, for example, all the counts for a particular word are summed up in the reduce phase to create one key-value pair for each word. Now clearly, in order to perform the reduce function, all the key-value pairs for a particular word need to be together if they are to be summed up.

Unfortunately, since each mapper acted on an independent set of documents, it only has access to the word-count pairs for the particular documents it happens to see. The best a mapper can do is compute the reduce function using the pairs it happens to have available. So, for example, this mapper had only the two documents D1 and D2, which each contained a W1, so it computes a (W1, 2) by performing the map and reduce functions on the set of documents available to it. Similarly, this mapper adds up the counts for W2 across all its documents to get a (W2, 4). Of course that is not enough: all these partial sums themselves need to be summed up, and this is where the map-reduce implementation comes in. It sends all the key-value pairs with the same key to the same reducer, so that they are finally all together and can be reduced to get the right answer. So all the pairs having W1 make it to this reducer, where they can be added up to get the correct value for W1: we got a W1 from here, a W1 from here, and a W1 from here, and we added all of these up to get a value of seven. Similarly, the second reducer manages to get all the key-value pairs for each of its words together, so that they can be added up. In this case the first reducer works on pairs where the key is W1 or W2, and the second reducer works on those where the key is W3 or W4.

So now we're ready to define map-reduce formally. Mappers read data in the form of key-value pairs (k1, v1), and for each such (k1, v1) pair a mapper may emit one or more pairs of the form (k2, v2). In our word-counting example, k1 was the document ID and v1 was the contents of the document, whereas k2 was a word and v2 the number of times that word occurred in a particular document. The reduce function, on the other hand, receives all the pairs for some value of k2 and combines them: the key k2, together with all the values that came out of the map phase for that particular value of k2, is combined using a function f_R. In our word-counting example, f_R is just a summation across all the occurrences of a particular word in all the documents that we process.
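Returning to the routing step described above, here is a toy sketch of how an implementation might decide which reducer receives each pair. Hashing the key is one common strategy, though which words then land on which reducer depends on the hash rather than the tidy W1/W2 versus W3/W4 split in the figure. The partial counts below are invented, except that (w1, 2) and (w2, 4) echo the lecture's example, and the w1 partials are chosen to sum to seven.

```python
def partition(word, num_reducers=2):
    # The same word always hashes to the same reducer: the only
    # guarantee the reduce phase needs to see all pairs for a key together.
    return hash(word) % num_reducers

# Partial (word, count) pairs emitted by the mappers (values illustrative):
mapper_outputs = [('w1', 2), ('w2', 4), ('w1', 3), ('w3', 1), ('w1', 2)]

bins = {r: [] for r in range(2)}
for word, count in mapper_outputs:
    bins[partition(word)].append((word, count))

# Each reducer can now sum the counts it holds; the reducer that
# receives the w1 pairs computes 2 + 3 + 2 = 7.
```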
Some of you might have noticed that in our example we actually did some of the reduce work within the mappers; we'll come to that point in a short while. For the moment, the strict definition of a map function is simply a map: from one key-value pair, produce one or more key-value pairs. And the strict definition of a reduce function is: take all the key-value pairs for a particular value of the key, and apply the reduce function f_R to them.

The map-reduce platform is responsible for routing pairs to reducers, so that all the pairs for the same key end up at the same reducer. Map-reduce essentially sorts the key-value pairs that come out of the map phase, so that they land up in the right place.

Finally, a very important point to note is that the map-reduce process reads data, possibly from a file system or a database, and writes fresh data. So it's a batch process that always produces more data; very different from a database, where one is querying data and just getting an answer back.
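Putting the pieces together, here is a single-machine sketch, again illustrative rather than a real API, of the whole batch pipeline: it reads one body of data (the documents) and writes fresh data (the word counts), with an explicit sort standing in for the shuffle the platform performs between the two phases. It reuses the map_fn and reduce_fn sketched earlier.

```python
import itertools

def map_reduce(docs, map_fn, reduce_fn):
    # Map phase: one map task per (doc_id, contents) pair.
    pairs = [kv for doc_id, contents in docs.items()
                for kv in map_fn(doc_id, contents)]
    # Shuffle: sorting by key brings all pairs for the same word together,
    # mimicking the routing/sorting the platform does between phases.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: collapse each word's counts into a single pair.
    output = {}
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        counts = [count for _, count in group]
        for key, total in reduce_fn(word, counts):
            output[key] = total
    return output  # fresh data written out, not an answer to a query
```

For example, map_reduce({'D1': 'w1 w2 w4', 'D6': 'w2 w2'}, map_fn, reduce_fn) returns {'w1': 1, 'w2': 3, 'w4': 1}.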