Now let's look at some examples of algorithms implemented in parallel using map-reduce. For those of you used to relational databases, here's how one can implement a join using map-reduce. Suppose we are given two tables, sales by address and cities by address, and we are interested in joining these tables to find the sales by city. In SQL, this is how you would do it: you would simply select the sum of sales from these two tables, matching rows where the addresses are equal, and group the result by city. In map-reduce, however, this requires two map-reduce jobs. In the first job you essentially sort the data by address, so that each reducer is able to emit the city for every sale; in other words, the first map-reduce job only joins the sales and city tables but doesn't do the grouping. In the second map-reduce job we sort on the city key, so that all the records for a particular city come to the same reducer, where the addition takes place. Notice that we couldn't have used a single map-reduce job, because at the end of the first phase it's not possible to guarantee that all the data for a particular city resides in the same reducer, and so the summation can't be done city-wise in a single phase.
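To make this concrete, here is a minimal sketch of such a two-phase join, written as plain Python over in-memory lists rather than for a real cluster. The record layouts (sales as (address, amount) pairs, cities as (address, city) pairs) and the tiny run_job driver that stands in for the shuffle-and-sort between map and reduce are illustrative assumptions, not details from the lecture.

```python
from collections import defaultdict

def job1_map(record):
    # Tag each record with the table it came from and key it by address.
    table, fields = record
    if table == "sales":
        address, amount = fields
        yield address, ("sale", amount)
    else:  # "cities"
        address, city = fields
        yield address, ("city", city)

def job1_reduce(address, values):
    # All records for one address arrive together: attach the city to
    # every sale, i.e. do the join but not the grouping.
    city, amounts = None, []
    for tag, value in values:
        if tag == "city":
            city = value
        else:
            amounts.append(value)
    for amount in amounts:
        yield city, amount

def job2_map(record):
    # Identity map: the joined record is already keyed by city.
    city, amount = record
    yield city, amount

def job2_reduce(city, amounts):
    # Only now are all sales for one city guaranteed to be in one reducer.
    yield city, sum(amounts)

def run_job(map_fn, reduce_fn, records):
    # A tiny in-memory stand-in for the shuffle-and-sort between phases.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

if __name__ == "__main__":
    data = [("sales", ("12 Main St", 100)),
            ("sales", ("9 Oak Ave", 50)),
            ("sales", ("12 Main St", 25)),
            ("cities", ("12 Main St", "Springfield")),
            ("cities", ("9 Oak Ave", "Shelbyville"))]
    joined = run_job(job1_map, job1_reduce, data)
    print(run_job(job2_map, job2_reduce, joined))
    # [('Shelbyville', 50), ('Springfield', 125)]
```

Each city's total is only computed after the second shuffle has brought all of that city's sales together, which is exactly why one map-reduce phase is not enough.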
I'll relate to you a real-world story that illustrates the difference between the map-reduce way of thinking and a traditional database approach. This is a real example where an organization had lots of data which could be abstracted as data about papers, their authors, and their contents. The actual nature of the data is proprietary, so I won't relate it in detail, but it's pretty much the same problem as this: there are millions of papers, written by many millions of authors, containing millions of possible technical terms, phrases, multi-word terms. One of the tasks was to figure out the top ten technical terms used by each author, the top ten authors that used a particular term, and so on. You can see that this problem is quite close to our word-counting example, and it possibly indicates a real-life requirement as well.

This is how a database person actually proposed a solution. You start out with a bunch of documents represented as a table with, say, a paper ID, the content of the document itself, and its author. Then you get rid of the actual content and replace it by words. Notice what has happened here: instead of one row per paper, you now have a row for every word and author pair that occurs in a particular paper, so one row in the original table produces many, many rows in the new table. No matter, says our database expert: all we need to do now is a group-by, so that we can count how many times a particular word is used by a particular author. The resulting table would let you answer questions about the top ten technical terms used by a particular author by simply finding all the entries for that author, sorting them by word count, and reading off the answer.

Problems start to occur when you consider how large these tables can get. If the paper table has a million or a few million documents, take a look at the grouped table: at worst you'll have an entry for every word and every author, that is, if you allow zero word counts. That works out to a million times a million, or a trillion entries. Even if you don't have zeros, and you get rid of them by suitable modifications to your joining and counting process, you'll still have many billions of entries in this table merely to solve this kind of problem.

Now suppose you wanted to find the top ten authors for a combination of two terms, or the top ten terms for two authors who wrote together. This solution doesn't work anymore. So we see that the database mindset doesn't quite work in such high-dimensional spaces, where you have millions of terms and millions of authors and you are trying to find counts of combinations between them. This is really a case for map-reduce, and it is exactly these kinds of challenges which led the web companies to develop new technologies and new programming paradigms. It was not just that the old technologies didn't scale; the relational paradigm itself needed to be looked at afresh when dealing with large volumes of unstructured data.

Okay, so let's try doing top-K words per author in map-reduce. Suppose we use a map function that emits a word and its author every time we see the word in a document, just as we emitted a word and its count earlier, and the reduce key is the word-author combination, so that every reducer gets all the records for one such combination and can simply count them. Well, this doesn't quite solve the problem, does it? Actually it's the same problem as before: a database mindset being used with map-reduce. The programming paradigm is not a solution by itself; the approach really needs to change from a database-oriented one to something else.

Let's try it again. Again we have a map function, but this time, instead of emitting the author and the word, we emit the author and the contents. Essentially all the map phase is doing is extracting the author field and emitting it along with the contents. In the reduce phase we group everything by author, so that all the documents for a particular author end up in the same reducer, and the reduce function is a custom one which works as follows. For each author, it grabs all the elements in that reducer for that author, computes the word counts, inserts them into some temporary array W, sorts it, gives out the top K words for that author, deletes W, and starts again for the next author. Essentially you're computing the top K for each author, not storing the count for every word-author combination, because you're only interested in the top K: the top ten, the top five, the top twenty, whatever. Notice what we've done here: we are bringing some element of programming back into the reduce function, and not relying entirely on the system, map-reduce in this case, or the database in the previous case, to do all our work.

Now let's look at some more examples, such as those we've considered in the past few weeks: indexing; locality-sensitive hashing, in particular how to assemble the results from many different hash functions; how to compute the likelihoods while coming up with a Bayesian classifier; and how to evaluate a Bayesian classifier to compute the likelihood ratio. Do you actually need parallelism to do that last one? How about computing the TF-IDF, or the joint probabilities for computing the information gain or mutual information? I'll leave these last three as homework for you, and we'll just do the first three right now.

Let's start with indexing. In the map phase we produce a partial index for a partial set of documents; that is, we emit, for every word, a postings list over the documents that we see. This is not a full index, so we have to reduce, again by word, and merge the partial indexes, that is, merge these postings lists per word into a common postings list.
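Going back to the top-K-terms-per-author job for a moment, here is a minimal sketch of the custom reduce function described earlier. The (paper_id, author, contents) record format, whitespace tokenization, the value K = 10, and the small driver standing in for the shuffle are all illustrative assumptions.

```python
from collections import Counter, defaultdict

K = 10

def map_fn(document):
    # The map phase just extracts the author field and passes the
    # contents along with it.
    paper_id, author, contents = document
    yield author, contents

def reduce_fn(author, all_contents):
    # All documents by one author arrive at the same reducer.  Count the
    # words in a temporary structure, emit only the top K, then discard
    # it before moving on to the next author.
    word_counts = Counter()
    for contents in all_contents:
        word_counts.update(contents.lower().split())
    yield author, word_counts.most_common(K)

def top_k_per_author(documents):
    # In-memory stand-in for the shuffle-and-sort between map and reduce.
    groups = defaultdict(list)
    for doc in documents:
        for author, contents in map_fn(doc):
            groups[author].append(contents)
    for author, all_contents in groups.items():
        yield from reduce_fn(author, all_contents)
```

Notice that only the top K entries per author ever leave the reducer; the full word-by-author table from the database approach is never materialized.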
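The indexing job described above can be sketched in the same style. The (doc_id, text) record format and whitespace tokenization are again assumptions made only for illustration; each call to map_fn plays the role of one mapper building a partial index over its own chunk of documents.

```python
from collections import defaultdict

def map_fn(chunk):
    # chunk: the list of (doc_id, text) pairs handled by one mapper.
    partial = defaultdict(list)          # word -> partial postings list
    for doc_id, text in chunk:
        for word in set(text.lower().split()):
            partial[word].append(doc_id)
    for word, postings in partial.items():
        yield word, postings

def reduce_fn(word, partial_postings_lists):
    # All partial postings lists for one word arrive together; merge them
    # into a single postings list, here simply sorted by document id.
    merged = sorted(d for postings in partial_postings_lists for d in postings)
    yield word, merged

def build_index(chunks):
    # In-memory stand-in for the shuffle-and-sort between map and reduce.
    groups = defaultdict(list)
    for chunk in chunks:
        for word, postings in map_fn(chunk):
            groups[word].append(postings)
    for word, lists in groups.items():
        yield from reduce_fn(word, lists)
```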
An interesting question is: what if you have to sort the postings lists, either by document ID or by page rank, as we had in our first homework assignment? What do you think you would need to do then? Would you need another map-reduce phase? Think about it.

Now let's consider locality-sensitive hashing. In the map phase you start with documents and compute K hash values for every document. These could be the K hashes we used for fingerprints, or they could be others, as described in the Rajaraman and Ullman book for documents. The reduce key is now the set of K hash values, so each reducer gets all the documents that have the same K hash values and figures out all the pairs they could possibly form: if there are three such documents it emits three pairs, if there are four it emits six pairs, and so on. As a result, the map-reduce LSH produces a list of document pairs which are potentially similar. An interesting question is: will a document pair be emitted by more than one reducer? And suppose we wanted each pair to be emitted only once; what would we do then?

Lastly, let's come to our Bayesian classifier, where we need to compute the likelihoods, or rather the conditional probabilities of a yes or a no occurring for each possible feature. The map phase merely produces the counts of (feature, yes) or (feature, no), which are then grouped by feature in the reduce phase, so that all the counts emitted for a particular feature arrive at one reducer. There we merely sum up the respective counts and divide by the total number of instances we've seen for that feature. Finally we emit the log of this likelihood, so that we get the elements of our Bayesian classifier. Notice that once we have the log likelihoods for each feature, do we really need parallelism for testing new documents or new objects using the naive Bayes classifier? Think about it. Is that computation really so large as to require parallel computing?

Well, we've seen a number of examples of map-reduce. Admittedly the last few have been covered rather quickly. Please review the material so that you understand it, and discuss it in the forum if you don't.
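To help with that review, here is a minimal sketch of the LSH job described above: the map phase computes K hash values per document, the reduce key is that set of K values, and each reducer emits every pair of documents that agree on all of them. The placeholder min-hash-style signature and K = 3 below are illustrative assumptions, not the hash functions from the lectures.

```python
from itertools import combinations
from collections import defaultdict

K = 3

def signature(text, k=K):
    # Placeholder "K hash values": for each of k salted hash functions,
    # take the minimum hash over the document's words (documents are
    # assumed non-empty).
    words = text.lower().split()
    return tuple(min(hash((i, w)) for w in words) for i in range(k))

def map_fn(document):
    doc_id, text = document
    yield signature(text), doc_id

def reduce_fn(sig, doc_ids):
    # Documents with identical signatures land in the same reducer; emit
    # all pairs they can form (3 documents -> 3 pairs, 4 -> 6 pairs, ...).
    for pair in combinations(sorted(set(doc_ids)), 2):
        yield pair

def candidate_pairs(documents):
    # In-memory stand-in for the shuffle-and-sort between map and reduce.
    groups = defaultdict(list)
    for doc in documents:
        for sig, doc_id in map_fn(doc):
            groups[sig].append(doc_id)
    for sig, doc_ids in groups.items():
        yield from reduce_fn(sig, doc_ids)
```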
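And finally, a similarly minimal sketch of the likelihood computation for the Bayesian classifier, with counts of (feature, label) pairs grouped by feature and turned into log likelihoods using the per-feature normalization described above. The input format, each training instance given as a list of features plus a "yes" or "no" label, is an assumption made only for illustration.

```python
import math
from collections import defaultdict

def map_fn(instance):
    # Emit a count of 1 for every (feature, label) observed.
    features, label = instance            # label is "yes" or "no"
    for feature in features:
        yield feature, (label, 1)

def reduce_fn(feature, labelled_counts):
    # Sum the yes and no counts for this feature, divide by the total
    # number of instances seen for it, and emit the log likelihoods.
    totals = {"yes": 0, "no": 0}
    for label, count in labelled_counts:
        totals[label] += count
    n = totals["yes"] + totals["no"]
    for label in ("yes", "no"):
        if totals[label] > 0:              # a real job would smooth zeros
            yield (feature, label), math.log(totals[label] / n)

def likelihoods(instances):
    # In-memory stand-in for the shuffle-and-sort between map and reduce.
    groups = defaultdict(list)
    for instance in instances:
        for feature, labelled in map_fn(instance):
            groups[feature].append(labelled)
    for feature, counts in groups.items():
        yield from reduce_fn(feature, counts)
```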