SQL, the standard language for processing relational databases, has also evolved along with the rise of map-reduce as a programming paradigm. SQL by itself is difficult to compile directly into map-reduce, and a finer level of programming is required to get reasonable efficiency. Two languages are popular in this regard: Pig, which came out of Yahoo, and Hive, which came out of Facebook. Each of these offers a form of SQL primitives, but executed procedurally rather than as a single functional statement.

So, for example, take the calculation we did last week using map-reduce: first figuring out which sales and cities match for a particular address ID by joining the sale and city tables on address, and then grouping the result by city to get sales by city. Pig and Hive make this kind of programming a little easier. The COGROUP command in Pig is essentially a map-reduce join. The FOREACH ... GENERATE (PARALLEL n) construct is essentially a map construct, which takes the output of the first join and produces the keys. Then there is a second map-reduce, done by grouping on city, and a FOREACH that generates the desired output, which is the reduce. So, without going into great detail, what Pig does is take these statements and compile them into a map-reduce program. Hive is similar, just a little closer to SQL: you can actually write a SELECT statement with a limited form of join, and a similar SELECT for the second, grouping step. (A rough HiveQL sketch of this two-step query appears below.)

Both of these are quite popular now. Pig has the advantage that it can work directly on the distributed file system, HDFS, if one wants to, or it can also work with HBase. Hive, on the other hand, requires data to be stored in its own format on top of HDFS; it has its own database rather than HBase and cannot work directly on flat files. But both have become reasonably popular for coding map-reduce programs, as opposed to writing map and reduce functions in a programming language like Python or Java, the way you did in your assignment.

Another direction in which SQL is evolving is to introduce statistical primitives within the database itself. One example of this is the MADlib library, which works on the Greenplum big data database; this work emanated from the University of California at Berkeley. I'll illustrate MADlib with a simple example which you can see directly on the MADlib website; this is essentially taken from there. Suppose one has training data, just as we discussed last week for a naive Bayes classifier: we have attributes which take values one, two, three, one, two, one, and so on, and their classes — yes, maybe, no, or positive sentiment, negative sentiment, whatever. And one wants to create a classifier. The way we did it last week, one would have to compute this using programs, perhaps using map-reduce. But in MADlib, we can create a classifier using extensions to SQL itself. So, suppose we want to create a classifier using this training set and, in the end, classify two new samples which are present in another table called toclassify. This is how one does it: first a preparation statement, where one computes the likelihood values from the training set along with the prior probabilities.
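To make the Pig and Hive discussion above concrete, here is a minimal sketch in HiveQL of the two-step sales-by-city computation. The table and column names (sale, city, addrid, amount, cityname) are hypothetical stand-ins for last week's example, not an actual schema; Hive compiles each statement below into one or more map-reduce jobs.

```sql
-- Step 1: a map-reduce join of the sale and city tables on the address id
-- (hypothetical schema: sale(saleid, addrid, amount), city(addrid, cityname)).
CREATE TABLE sale_city AS
SELECT s.saleid, s.amount, c.cityname
FROM sale s
JOIN city c ON (s.addrid = c.addrid);

-- Step 2: a second map-reduce pass that groups the joined records by city
-- and aggregates, giving total sales per city.
SELECT cityname, SUM(amount) AS total_sales
FROM sale_city
GROUP BY cityname;
```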
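For the MADlib classifier just described, the sequence of calls looks roughly like the following sketch. It is based on the naive Bayes interface documented for MADlib on Greenplum (create_nb_prepared_data_tables, create_nb_classify_view, create_nb_probs_view); the exact function names, argument order, and output layout may differ between MADlib versions, and the nb_training and toclassify tables are assumed to have an id column, a class column, and an integer-array attributes column as in the website example.

```sql
-- Preparation step: compute per-attribute likelihood values and class priors
-- from the training table nb_training(id, class, attributes).
SELECT madlib.create_nb_prepared_data_tables(
    'nb_training',        -- training source table
    'class',              -- column holding the class label
    'attributes',         -- integer-array column of attribute values
    3,                    -- number of attributes
    'nb_feature_probs',   -- output table: likelihood values
    'nb_class_priors');   -- output table: prior probabilities

-- Classification step: classify the rows of toclassify(id, attributes).
SELECT madlib.create_nb_classify_view(
    'nb_feature_probs', 'nb_class_priors',
    'toclassify', 'id', 'attributes', 3,
    'nb_classify_view');
SELECT * FROM nb_classify_view;

-- Optionally, expose the per-class probabilities behind each classification.
SELECT madlib.create_nb_probs_view(
    'nb_feature_probs', 'nb_class_priors',
    'toclassify', 'id', 'attributes', 3,
    'nb_probs_view');
SELECT * FROM nb_probs_view;
```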
Next, one actually performs the classification using the toclassify table holding the samples one wants to classify, and returns the actual classification. So, internally it has created a naive Bayes classifier. If one wants more detail, one can actually examine what the likelihood values were. One finds that for the first element of the toclassify table there is a 0.4 chance of class one and a 0.6 chance of class two, which is why it was classified as two; whereas for the second element there is a 0.75 chance of class one and a 0.25 chance of class two, and it was therefore classified as one. We haven't gone into the details of the syntax, but the idea is that there are similar primitives, for example, for computing regressions and other types of machine learning, which we will study in the lecture following this one. But this is one direction in which databases are evolving: trying to include within SQL some statistical or machine learning primitives.

Next, map-reduce itself is also evolving, and the main element that one needs to add to map-reduce is iteration. Essentially, many applications require one to apply map-reduce again and again. Continuous PageRank calculation is one example; continuously updating an index is another; continuous machine learning, by updating the training set with fresh values and corrections, is a third. And there are many such examples. Of course, there are many ways of iterating. The simplest is simply to run map-reduce again and again. The trouble with that is that every iteration writes to the distributed file system, which is costly, so there have been some attempts at making this approach of iterating map-reduce directly more efficient, essentially by avoiding the data copy and pipelining the output of the reducers directly into the map phase of the next iteration of map-reduce.

Another direction is generalizing the map-reduce paradigm to a general dataflow graph, where map and reduce are simply two types of operations, and one can have a general directed graph of such tasks with data flowing across the graph. The key feature here is that the tasks block, just like map and reduce tasks, so that each task processes all of its data before the computation moves on. That is critical for fault tolerance: as we've seen in map-reduce, if a mapper fails, one can simply restart the mapper with whatever data it was given, precisely because it hasn't written anything out yet. So, blocking tasks are essential for fault tolerance. Examples of such dataflow systems are the DryadLINQ system from Microsoft, the Hyracks research project, and a few others.

Lastly, there are implementations of recursion directly in map-reduce. Here, the challenge is how to recover when tasks fail, tasks which are now no longer blocking, because when one does iteration using recursion, recursion requires every step to keep producing output; otherwise, the next recursive step cannot begin. So, how does one recover from non-blocking tasks failing? There are two main models here: a graph-based model, exemplified by Pregel and Giraph, and a stream model, exemplified by S4. We're not going to go into the details of these in this course; we just don't have the time. But for the brighter students amongst you looking for research work in map-reduce, these are three essential ways in which iteration is being made more efficient, and they are promising areas for doing fresh research.