You'll recall we began our discussion of parallel computing with the notion of parallel efficiency, so it is worth trying to figure out the parallel efficiency of MapReduce. Let's assume that the data emitted by all the mappers put together is D. If we use combiners efficiently, D may actually be significantly less than the total amount of data input to the MapReduce program. We'll also assume that there are P processors in all, comprising both mappers and reducers, and that there is no overlap between the two, so that we have P/2 mappers and P/2 reducers.

Clearly, parallel efficiency depends on overheads such as communication between processors, so a natural way to compute the minimum parallel efficiency is to assume that no real useful work is being done in terms of an algorithm. Instead, we consider only the efficiency of the MapReduce platform itself, where all the work it really does is to sort P/2 blocks, each of size D divided by P/2, and then communicate the results to the different reducers. The actual efficiency of any algorithm using MapReduce will be at least this much, provided that the algorithm requires this sorting process for its completion. That need not always be the case, but for the moment we assume that it is.

Now let's consider the communication costs. The 2D/P of data in each block, at each mapper, has to be communicated to the P/2 reducers. Each reducer gets one (P/2)-th of each mapper's block, which is about 4D/P² of data, and this has to be collected from P/2 mappers at each reducer. That requires P/2 separate steps of communication, so multiplying 4D/P² by P/2 gives 2D/P as the communication cost. The serial cost is, of course, the cost required to sort P/2 blocks of this size, which works out to a work constant times D/P, times log(D/P), times P, i.e. roughly D log(D/P). So we get efficiency as this sorting term in the numerator, and the same term in the denominator plus the communication costs.
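Putting the pieces just described together, a minimal sketch of the calculation might look as follows; the symbols c_w (the sort-work constant) and c_c (the per-unit communication constant) are labels introduced here for the "work constant" and "communication constant" spoken of in the lecture, not notation from it.

```latex
% Sketch of the calculation described above; c_w and c_c are assumed labels.
\begin{align*}
  T_{\text{sort}} &\approx c_w \cdot \tfrac{P}{2}\cdot \tfrac{2D}{P}\log\tfrac{2D}{P}
                   \;=\; c_w\, D \log\tfrac{2D}{P}
     && \text{(serial cost of sorting $P/2$ blocks of size $2D/P$)}\\
  T_{\text{comm}} &\approx c_c \cdot \tfrac{P}{2}\cdot\tfrac{4D}{P^{2}}
                   \;=\; c_c\,\tfrac{2D}{P}
     && \text{(each reducer collects $4D/P^{2}$ from each of $P/2$ mappers)}\\
  \varepsilon &= \frac{T_{\text{sort}}}{P\left(\tfrac{T_{\text{sort}}}{P} + T_{\text{comm}}\right)}
   \;=\; \frac{1}{1 + \dfrac{2\,c_c}{c_w \log(D/P)}}
     && \text{(dropping stray factors of two, as in the lecture)}
\end{align*}
```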
This, of course, is the formula for efficiency, where you have the serial time in the numerator, and in the denominator the parallel time multiplied by P; the parallel time in this case is the serial time divided by P, plus the communication cost. Simplifying, you get an expression that looks like one over one plus two times a communication constant divided by a work constant times log(D/P). Notice that we have liberally removed constants, such as the two that might occur inside the log, because it only contributes a constant additive factor; we have also been fairly liberal with factors of two here and there, just to get the form of this equation right.

Remember what we said about an algorithm being scalable: if efficiency approaches one as D/P grows, then the algorithm is said to be scalable. As D/P becomes very large, this overhead term becomes extremely small, and efficiency does indeed come close to one. So the MapReduce platform is scalable in this sense. The caveat, of course, is that the algorithm we run on this platform must actually require this serial work to be performed; in other words, there must be a need for sorting of some kind in the algorithm. This is not always the case. For example, there are many tasks which can just as well be performed by a map phase alone, requiring no reduce and no sorting. Such tasks may be called embarrassingly parallel, and for them simply using the MapReduce platform adds overhead which is avoidable.
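To make the scalability claim concrete, here is a small numerical sketch (an illustration added here, not part of the lecture): plugging assumed constants c_w = c_c = 1 into the simplified efficiency formula shows the efficiency creeping towards one as D/P grows.

```python
import math

# Evaluate the simplified efficiency  1 / (1 + 2*c_c / (c_w * log(D/P)))
# for growing D/P.  The constants c_w (sort work per item) and c_c
# (communication cost per item) are arbitrary assumed values.

def mapreduce_efficiency(D, P, c_w=1.0, c_c=1.0):
    """Simplified parallel efficiency of the MapReduce sort/shuffle phase."""
    return 1.0 / (1.0 + 2.0 * c_c / (c_w * math.log(D / P)))

P = 1000  # 500 mappers + 500 reducers
for D in (1e5, 1e7, 1e9, 1e12):
    print(f"D/P = {D / P:10.0e}   efficiency = {mapreduce_efficiency(D, P):.3f}")
```

With these assumed constants the efficiency rises from roughly 0.70 at D/P = 100 to above 0.90 once D/P reaches a billion, which is the sense in which the platform is scalable, so long as the sort is genuinely needed.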