Now let's look at some examples of algorithms implemented in parallel using map-reduce.

For those of you used to relational databases, here's how one can implement a join using map-reduce. Suppose we are given two tables, sales by address and cities by address, and we are interested in joining these tables to find the sales by city. In SQL, this is how you would do it: you would simply select the sum of sales from these two tables, where the addresses match, and group the results by city.

In map-reduce, however, this requires two map-reduce jobs. In the first job you join, or essentially sort, the data by address, so that each reducer is able to emit the city for every sale. In other words, the first map-reduce job only joins the sales and cities tables; it does not do the grouping. In the second map-reduce job we sort on the city key, so that all the records for a particular city come to the same reducer, where the addition takes place. Notice that we could not have used a single map-reduce phase, because at the end of the first phase it is not possible to guarantee that all the data for a particular city resides in the same reducer, and so the city-wise summation cannot be done in a single phase.
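As an illustration (not from the lecture itself), here is a minimal Python sketch of the two-job join. The run_mapreduce helper, the table contents, and the field layout are all made up for the example; it is a tiny in-memory stand-in for a real map-reduce run, not an actual framework.

    from collections import defaultdict

    def run_mapreduce(records, mapper, reducer):
        # Tiny in-memory stand-in for a map-reduce run: map, shuffle by key, reduce.
        groups = defaultdict(list)
        for rec in records:
            for key, value in mapper(rec):
                groups[key].append(value)
        output = []
        for key, values in groups.items():
            output.extend(reducer(key, values))
        return output

    # Hypothetical inputs; "address" is the join key.
    sales  = [("addr1", 100), ("addr2", 250), ("addr1", 75)]   # (address, sale amount)
    cities = [("addr1", "Paris"), ("addr2", "Lyon")]           # (address, city)

    # Job 1: join on address. Each reducer sees the city plus all the sales for one
    # address and emits (city, sale) pairs: the join, but no grouping yet.
    def map_join(record):
        table, row = record                  # row is (address, payload)
        yield row[0], (table, row[1])

    def reduce_join(address, tagged_values):
        # Assumes every address has exactly one matching city row.
        city = next(v for tag, v in tagged_values if tag == "cities")
        for tag, v in tagged_values:
            if tag == "sales":
                yield (city, v)

    # Job 2: group by city and add up the sales.
    def map_by_city(record):
        city, sale = record
        yield city, sale

    def reduce_sum(city, sale_amounts):
        yield (city, sum(sale_amounts))

    tagged = [("sales", r) for r in sales] + [("cities", r) for r in cities]
    joined = run_mapreduce(tagged, map_join, reduce_join)      # [('Paris', 100), ...]
    print(run_mapreduce(joined, map_by_city, reduce_sum))      # [('Paris', 175), ('Lyon', 250)]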
I'll relate to you a real-world story that illustrates the difference between the map-reduce way of thinking and a traditional database approach. This is a real example where an organization had lots of data which could be abstracted as data about papers, their authors, and their contents. The actual nature of the data is proprietary, so I won't relate it in detail, but it is pretty much the same problem as this one. There are millions of papers, written by many millions of authors, containing millions of possible technical terms: phrases, multi-word terms. One of the tasks was to figure out the top ten technical terms used by each author, the top ten authors that used a particular term, and so on. You can see that this problem is quite close to our word-counting example, and it probably reflects a real-life requirement as well.

This is how a database person actually proposed to solve it. You start out with a bunch of documents represented as a table with, say, a paper ID, the content of the document itself, and its author. Then you get rid of the actual content and replace it by words. Notice what has happened here: instead of one row per paper, you now have a row for every word and author pair that occurs in a particular paper, so one row in the original table produces many, many rows in the new one.

No matter, says our database expert. All we need to do now is a group-by, so that we can count how many times a particular word is used by a particular author. The resulting table would let you answer questions like which are the top ten technical terms used by a particular author: simply find all the entries for that author, sort them by word count, and you have your answer.

Problems start to occur when you consider how large these tables can get. If the paper table has a million, or a few million, documents, then take a look at the word-count table. At worst you will have an entry for every word and every author, that is, if you allow zero word counts. That works out to a million times a million, or a trillion entries. Even if you avoid the zeros by suitably modifying your joining and counting process, you will still have many billions of entries in this table merely to solve this kind of problem.

Now suppose you wanted to find the top ten authors for two terms, that is, for a combination of terms, or you wanted to find the top ten terms for two authors who wrote together. This solution doesn't work any more. So we see that the database mindset doesn't quite work in such high-dimensional spaces, where you have millions of terms and millions of authors and you are trying to count combinations between them. This is really a case for map-reduce, and it is exactly these kinds of challenges that led the web companies to develop new technologies and new programming paradigms. It was not just that the old technologies didn't scale; the relational paradigm itself needed a fresh look when dealing with large volumes of unstructured data.

Okay, so let's try doing top-k words per author in map-reduce. Suppose we use a map function that emits a word and its author every time we see the word in a document, just as we emitted a word and its count earlier, and the reduce key is the word-author combination, so that every reducer gets all the records for one such combination and can simply count them.
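In code, that first attempt would look roughly like this. It is a minimal sketch that reuses the illustrative run_mapreduce helper from the join example above and assumes each document is a dict with "author" and "content" fields; both the helper and the field names are mine, not from the lecture.

    def map_word_author(doc):
        # Emit ((word, author), 1) for every word occurrence, assuming a document
        # looks like {"author": ..., "content": ...}.
        for word in doc["content"].split():
            yield (word, doc["author"]), 1

    def reduce_count(word_author_pair, ones):
        yield (word_author_pair, sum(ones))

    # run_mapreduce(docs, map_word_author, reduce_count)
    # The output has one row per (word, author) pair that occurs anywhere,
    # which is exactly the blow-up discussed above.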
Well, does this solve the problem? Actually, it's the same problem as before: a database mindset being used with map-reduce. The programming paradigm is not a solution by itself; the approach really needs to change from a database-oriented one to something else.

Let's try it again. Again we have a map function, but this time, instead of emitting the author and the word, we emit the author and the contents. So essentially all the map phase is doing is extracting the author field and emitting it along with the contents. In the reduce phase we group everything by author, so that all the documents for a particular author end up in the same reducer, and the reduce function is a custom function which works as follows. For each author, it grabs all the elements in that reducer for that author, computes the word counts, inserts them into some temporary array W, sorts it, emits the top K words for that author, deletes W, and starts again for the next author.

Essentially, you are computing the top k for each author, not storing the count for every word-author combination, because you are only interested in the top K: the top ten, the top five, the top twenty, whatever. Notice what we've done here. We are bringing some element of programming back into the reduce function, and not relying entirely on the system, map-reduce in this case or the database in the previous case, to do all our work.
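Here is a minimal sketch of this second approach, in the same illustrative style as the earlier examples and with the same assumed document layout. Using a Counter in place of the temporary array W is my own simplification.

    from collections import Counter

    TOP_K = 10    # top ten terms per author, say

    def map_by_author(doc):
        # The map phase only extracts the author field and passes the contents along.
        yield doc["author"], doc["content"]

    def reduce_top_k(author, contents):
        # All documents for one author arrive at the same reducer. Count the words
        # in a temporary structure, keep only the top K, then discard the counts.
        counts = Counter()
        for text in contents:
            counts.update(text.split())
        yield (author, counts.most_common(TOP_K))

    # run_mapreduce(docs, map_by_author, reduce_top_k)
    # yields one (author, [(word, count), ...]) record per author, top K words only.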
Now let's look at some more examples, such as those we've considered in the past few weeks: indexing; locality-sensitive hashing, in particular how to assemble the results from many different hash functions; how to compute the likelihoods while building a Bayesian classifier; how to evaluate a Bayesian classifier, that is, compute the likelihood ratio (do you actually need parallelism to do this?); how to compute TF-IDF; and how to compute the joint probabilities needed for information gain or mutual information. I'll leave the last three as homework for you, and we'll just do the first three right now.

Let's start with indexing. In the map phase we produce a partial index for a partial set of documents; that is, for every word we emit a postings list built from the documents that we see. This is not a full index, so we have to reduce, again by word, and merge the partial indexes, that is, merge these per-word postings lists into a common postings list. An interesting question is what happens if you have to sort each postings list, either by document ID or by page rank, as we had in our first homework assignment. What do you think you would need to do then? Would you need another map-reduce phase? Think about it.

Now let's consider locality-sensitive hashing. In the map phase you start with documents and compute K hash values for every document. These could be the K hash values we used for fingerprints, or they could be others, as described for documents in the Rajaraman and Ullman book. The reduce key is now the set of K hash values: each reducer gets all the documents that have the same K hash values and figures out all the pairs they could possibly form. So if there are three such documents it emits three pairs; if there are four documents it emits six pairs, and so on. As a result, the map-reduce LSH produces a list of document pairs which are potentially similar. An interesting question is whether a document pair can be emitted by more than one reducer. And suppose we wanted each pair to be emitted only once; what would we do then?

Lastly, let's come to our Bayesian classifier, where we need to compute the likelihoods, or rather the conditional probabilities of a yes or a no occurring for each possible feature. The map phase merely produces counts of (feature, yes) or (feature, no), which are then grouped by feature in the reduce phase, so that all the counts emitted for a particular feature arrive at one reducer. There we merely sum up the respective counts and divide by the total number of instances we have seen for that feature. Finally, we emit the log of this likelihood, so that we get the elements of our Bayesian classifier. Notice that once we have the log likelihoods for each feature, do we really need parallelism for testing new documents or new objects using the naive Bayes classifier? Think about it: is that computation really so large as to require parallel computing?

Well, we've seen a number of examples of map-reduce. Admittedly, the last few have been covered rather quickly, so please review the material to make sure you understand it, and discuss it in the forum if you don't.
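To help with that review, here is one last minimal sketch, of the likelihood job for the Bayesian classifier described above, in the same illustrative style as the earlier examples. It assumes each training instance is a (label, features) pair with label "yes" or "no", and it normalises exactly as described in the lecture, by the total number of records seen for each feature; the input layout and the handling of zero counts are my assumptions.

    import math

    def map_feature_label(instance):
        # instance is assumed to be (label, features): label is "yes" or "no",
        # features is the list of feature names present in that instance.
        label, features = instance
        for f in features:
            yield f, label

    def reduce_log_likelihood(feature, labels):
        # All the yes/no records for one feature arrive here. Sum the respective
        # counts, divide by the total seen for this feature, and emit the logs.
        total = len(labels)
        yes = labels.count("yes")
        no = total - yes
        log_yes = math.log(yes / total) if yes else float("-inf")
        log_no = math.log(no / total) if no else float("-inf")
        yield (feature, (log_yes, log_no))

    # run_mapreduce(training_instances, map_feature_label, reduce_log_likelihood)
    # yields one (feature, (log P(yes), log P(no))) record per feature.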