[MUSIC]

So, the MapReduce abstraction has two different steps: one is the map step and the other is the reduce step, thus the name MapReduce.

In the map step, we're going to assume that the operation we want to perform is data-parallel over elements, like our documents, and the result of each map call is going to be a set of key-value pairs. So, for example, in our word count example, we're going to take in a document, and then for every word in that document we're just going to emit the pair (word, 1). We'd cycle through our document: if we see the word "uw", we emit (uw, 1); if we see the word "machine", we emit (machine, 1); then we happen to see "uw" again, so we emit (uw, 1); and then we see "learning", so we emit (learning, 1), and so on. So in this example the keys are the different words in our vocabulary (uw, machine, learning, and so on), and the value is just the count of that word. For this specific implementation, we're emitting just a count of 1 every time we see a word instance, but when we're talking about the MapReduce abstraction, the value can be any data type.

Then for the reduce step, this step is going to be one that aggregates over the values associated with every key, and this must be a commutative, associative operation. Commutative means that the order of the operands doesn't matter; so, for example, a + b = b + a, and addition is commutative. It's also associative, which means that the grouping of operations doesn't matter. Again, addition satisfies this, since (a + b) + c = a + (b + c).

Okay, and for the reduce step, we need this operation to be one that's data-parallel over keys. So let me emphasize what each step is data-parallel over. In the map step, it's over the elements, like the documents in our corpus. But here, once we produce these key-value pairs, we take all the keys, and the operation has to be able to be performed independently for each key. And just to make this a little bit more explicit: in the map step we're saying that the word count in a given document doesn't depend on any other document.
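To make the map step concrete, here is a minimal Python sketch of a word-count mapper, under the assumption that a document is just a string of whitespace-separated words; the name word_count_map is illustrative and not tied to any particular MapReduce framework.

def word_count_map(document):
    # Emit a (word, 1) pair for every word instance in the document.
    # The value is a count of 1 here, but in general it could be any data type.
    pairs = []
    for word in document.split():
        pairs.append((word, 1))
    return pairs

# For example, mapping the document "uw machine uw learning" yields
# [("uw", 1), ("machine", 1), ("uw", 1), ("learning", 1)],
# and each document can be mapped independently of every other document.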
In the reduce step we're saying that the total count of a given word doesn't depend on the total count of any other word.

So let's go through the reduce example for word counting. Here what we're going to do is initialize our count for a given word to 0. The reducer takes in a specific key as well as a list of values, and for every value in that list, we're going to increment our count by that value. In this case it's a count, but it doesn't always have to be a count; it could be any sum of values. Then we're going to emit this key-value pair, and in this example this is the key-value pair that gets emitted. So in particular, going along with this example, if we were to call reduce on the key uw, and associated with that key we had the counts, let's say, 1, 17, 0, 0, 12, 2, then what we would emit would be (uw, 32). (A short code sketch of this reducer appears at the end of this section.)

So MapReduce has a really long history in functional programming, and somewhat more recently it was popularized by Google. And then it really gained broad adoption through an open-source implementation called Hadoop that was made by Yahoo.

[MUSIC]
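Along the same lines, here is a minimal Python sketch of the reducer described above, again with illustrative names only; it assumes the framework has already grouped all the values emitted for a single key into one list.

def word_count_reduce(word, values):
    # Sum the values emitted for this word. Because addition is commutative
    # and associative, the values can be combined in any order or grouping,
    # and each key can be reduced independently of every other key.
    count = 0
    for value in values:
        count += value
    return (word, count)

# The example from the lecture:
print(word_count_reduce("uw", [1, 17, 0, 0, 12, 2]))  # -> ('uw', 32)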