Now let's step back and summarize MapReduce in terms of the execution overview. In particular, we assume we have some large data set, and we're going to split this data across machines. Then, on each machine, in our map phase, we're going to do some operation that's data parallel, so it operates separately on the pieces of data divided across these different machines, and emits some key-value pairs. So, for example, maybe we emit a key-value pair ("uw", 1). That might be what k1 and v1 represent: k1 the word "uw" and v1 the number 1, in the example that we showed on the last slide.

And then, after emitting all these key-value pairs, we do what's called the shuffle phase. The shuffle phase is where we take these key-value pairs and assign all of them that have the same key to the same machine for the reduce phase. Every key-value pair is going to get assigned to machine h(key), where h is a hash function.

So then, in the reduce phase, there's going to be some aggregation step where, for every key, we look at all the values associated with that key. That is, we look at these key-value pairs, and for a given key we aggregate across all of those values. And so, for example, in our word-counting example this might produce ("uw", 30). That might be what k1, v1 represents here.

But I want to mention that in our map phase, the keys that appear across the different machines are not necessarily unique. The same key may appear on multiple machines, whereas in the reduce phase we end up with unique keys on every machine. And that's why, in terms of our notation here, one side is written k1, k2 and the other k1 prime, k2 prime, and so on. The primes indicate that a key could take the same value as it did on another machine, whereas when we just write k1, k2, k3, k4, k5, k6, and so on, these are each unique values from the entire set of possible keys. Okay, and remember that this reduce phase has to be an operation that's data parallel across keys.

And then finally, the last step is that you're going to save your results to disk.
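To make the execution flow concrete, here is a minimal sketch in Python of the three phases for the word-count example, run in a single process rather than on a real cluster. The number of machines and the helper names (map_phase, shuffle, reduce_phase, h) are illustrative assumptions, not part of any particular MapReduce framework.

```python
from collections import defaultdict

NUM_MACHINES = 10  # hypothetical number of reduce machines

def map_phase(document):
    """Map: data parallel over documents; emit (word, 1) for every word."""
    for word in document.split():
        yield (word, 1)

def h(key):
    """Hash function assigning a key to a reduce machine.
    (Python's hash() is consistent within a single run, which is all we need here.)"""
    return hash(key) % NUM_MACHINES

def shuffle(emitted_pairs):
    """Shuffle: route every (key, value) pair to machine h(key)."""
    machines = defaultdict(list)
    for key, value in emitted_pairs:
        machines[h(key)].append((key, value))
    return machines

def reduce_phase(pairs_on_machine):
    """Reduce: data parallel over keys; aggregate all values for each key."""
    counts = defaultdict(int)
    for key, value in pairs_on_machine:
        counts[key] += value
    return dict(counts)

# Example: a few documents processed in the map phase, then shuffled and reduced.
documents = ["uw uw go dawgs", "uw stat uw ml"]
emitted = [pair for doc in documents for pair in map_phase(doc)]
for machine_id, pairs in shuffle(emitted).items():
    print(machine_id, reduce_phase(pairs))  # each reduce machine sees unique keys
```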
So, this is the summary of the standard MapReduce execution flow. But I quickly wanted to mention that we can often improve performance through something called combiners. This naive implementation of MapReduce can be really wasteful in terms of communication in the shuffle phase, where you're taking all these key-value pairs and sending them to different machines. For example, if you have a large set of documents and some reasonable vocabulary, you're going to have lots and lots of these counts of (word, 1), (word, 1), (word, 1), and all of those have to be sent to some machine, so that's a lot of communication.

Just to make this really explicit, let's say we're looking at machine 1, we're counting words in all the documents that were allocated to machine 1, and we emit ("uw", 1). We end up with, let's say, a whole bunch of "uw" words in the documents on this machine. To make this concrete, let's say we end up with 4,002 instances of ("uw", 1) on machine 1. And all of these are, let's say, going to be shuffled to machine 7 for our reduce phase. This would happen if h("uw"), our hash function applied to the key "uw", is equal to 7. That's how we would shuffle all of these instances to machine 7.

Okay, well, a combiner represents a very simple solution to this problem, which is to perform a reduce step before sending things over to another machine. So, we perform this reduce locally on each of the machines in the map phase, before doing the shuffle and the subsequent reduce. In particular, what we would do is take machine 1 and all these counts of ("uw", 1), run reduce locally, which would emit ("uw", 4002), and then send just this one pair ("uw", 4002) over to machine 7. And note that the reason this works, the fact that we can do this local reduce and then do another reduce after the shuffle from the other machines, is because we have a commutative, associative reduce step.
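As a hedged illustration of the combiner idea, the sketch below (again plain Python, not a framework API) shows machine 1's 4,002 copies of ("uw", 1) being collapsed by running the same summation-style reduce locally before the shuffle, so only a single pair has to travel to machine 7. This is only safe because the reduce operation, summation, is commutative and associative.

```python
from collections import defaultdict

def combine(local_pairs):
    """Combiner: run the reduce step locally on one map machine
    before the shuffle, collapsing repeated keys into a single pair."""
    combined = defaultdict(int)
    for key, value in local_pairs:
        combined[key] += value  # valid because + is commutative and associative
    return list(combined.items())

# Machine 1 emitted 4,002 copies of ("uw", 1) in the map phase.
machine_1_output = [("uw", 1)] * 4002

# Without a combiner: 4,002 pairs cross the network to machine 7.
# With a combiner: a single pre-aggregated pair is sent instead.
to_send = combine(machine_1_output)
print(to_send)  # [('uw', 4002)]
```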