SQL, which is the standard language for database processing on relational databases, has also evolved along with the evolution of map-reduce as a programming paradigm. SQL by itself is difficult to compile directly into map-reduce, and a finer level of programming is required to get reasonable efficiency. There are two languages which are popular in this regard: Pig, which came out of Yahoo, and Hive, which came out of Facebook. Each of these is a form of SQL-like primitives, but executed procedurally rather than in a single functional statement.

So, for example, consider the calculation we did last week using map-reduce: first figuring out the sales and cities that match for a particular address ID by joining the sale and city tables on address, and then joining the result on city to essentially get sales by city. This is how Pig and Hive make that programming a little easier. The COGROUP command in Pig is essentially a map-reduce join. The FOREACH ... GENERATE ... PARALLEL construct is essentially a map construct, which deals with the output of the first join and creates the keys. Then there is a second map-reduce, done by grouping on city, and a FOREACH that generates the output one wants, which is the reduce. So essentially, without going into great detail, what Pig does is take these statements and compile them into a map-reduce program. Hive is similar, just a little closer to SQL: you can actually do a SELECT statement with a limited form of join, and a similar kind of SELECT statement for the second join.

Both of these are quite popular now. Pig has one advantage in that it can work directly on the distributed file system, HDFS, if one wants to, or it can also work with HBase. Hive, on the other hand, requires data to be stored in its own format on top of HDFS, so it has its own data store rather than HBase, and cannot work directly on flat files. But both of these have become reasonably popular for coding map-reduce programs, as opposed to writing map and reduce functions in a programming language like Python or Java, the way you did in your assignment.
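To make the Hive half of this concrete, here is a minimal HiveQL sketch of the two steps just described. The table names sale and city come from the example, but the column names (sale_id, amount, address_id, city_name) and the intermediate table sale_city are assumptions made purely for illustration:

    -- Step 1: join sale and city on the address id
    -- (Hive compiles this into a map-reduce join)
    CREATE TABLE sale_city AS
    SELECT s.sale_id, s.amount, c.city_name
    FROM sale s JOIN city c ON (s.address_id = c.address_id);

    -- Step 2: group the joined rows by city to get total sales per city
    -- (compiled into a second map-reduce job)
    SELECT city_name, SUM(amount) AS total_sales
    FROM sale_city
    GROUP BY city_name;

Hive turns each of these statements into one or more map-reduce jobs behind the scenes, which is exactly the point: the programmer writes SQL-like statements, and the system generates the map and reduce phases.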
Another direction in which SQL is evolving is to introduce statistical primitives within the database itself. One example of this is the MADlib library, which works on the Greenplum big data database; this work emanated from the University of California at Berkeley. I'll illustrate MADlib with a simple example which you can see directly on the MADlib website, and this is essentially taken from there. Suppose one has training data, just as we discussed last week for the naive Bayes classifier: we have attributes which take values one, two, three, one, two, one, and so on, and their classes, yes, maybe, no, or positive sentiment, negative sentiment, whatever. And one wants to create a classifier. The way we did it last week, one would have to compute this using programs, maybe using map-reduce. But in MADlib, we can create a classifier using extensions to SQL itself.

So, suppose we want to create a classifier using this training set and, in the end, classify two new samples which are present in another table called toclassify. This is how one does it. First, one does a preparation statement, where essentially one is computing the likelihood values from the training set and the prior probabilities. Next, one actually creates the classification using the toclassify table that one wants to classify, and then returns the actual classification. So it has internally created a naive Bayes classifier. If one wants more details, one can actually figure out what the likelihood values were. One finds that for the first element in the toclassify table, we have a 0.4 chance of a one and a 0.6 chance of a two, and that's why it was classified as a two; whereas for the second one, we had a 0.75 chance of a one and a 0.25 chance of a two, and therefore it was classified as a one. We haven't gone into the details of the syntax, but the idea is that there are similar primitives, for example, for computing regressions and other types of machine learning, which we will study in the lecture following this one. But this is one direction in which databases are evolving: to try to include within SQL some statistical or machine learning primitives.
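For reference, the calls in the MADlib documentation's naive Bayes example look roughly like the following. This is a hedged sketch based on an older version of those docs: the table and column names (nb_training, class, attributes, id, toclassify, and the output names) and the exact argument order are assumptions that may differ across MADlib versions:

    -- Preparation step: compute feature likelihoods and class priors
    -- from the training table (here, 3 attributes per row)
    SELECT madlib.create_nb_prepared_data_tables(
        'nb_training', 'class', 'attributes', 3,
        'nb_feature_probs', 'nb_class_priors');

    -- Classification step: label the rows of the toclassify table
    SELECT madlib.create_nb_classify_view(
        'nb_feature_probs', 'nb_class_priors',
        'toclassify', 'id', 'attributes', 3,
        'nb_classify_view');
    SELECT * FROM nb_classify_view;

    -- Optional: expose the per-class probabilities behind each decision
    SELECT madlib.create_nb_probs_view(
        'nb_feature_probs', 'nb_class_priors',
        'toclassify', 'id', 'attributes', 3,
        'nb_probs_view');
    SELECT * FROM nb_probs_view;

The preparation call corresponds to computing the likelihoods and priors, the classify view returns the predicted class for each new sample, and the probabilities view is what lets one see values like the 0.4/0.6 and 0.75/0.25 mentioned above.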
Next, map-reduce itself is also evolving, and the main element that one needs to add to map-reduce is iteration. Essentially, many applications require one to apply map-reduce again and again. Continuous PageRank calculation is one example; continuously updating an index is another; continuous machine learning, by updating the training set with fresh values and corrections, is a third. And there are many such examples.

Of course, there are many ways of iterating. The simplest is simply to run map-reduce again and again. The trouble with that is that every time, one is writing to the distributed file system, which is costly. So there have been some attempts at making this approach of iterating map-reduce directly more efficient, essentially by avoiding the data copy and pipelining the output of the reducers directly into the map phase of the next iteration of map-reduce.

Another approach is generalizing the map-reduce paradigm to a general dataflow graph, where map and reduce are simply two types of operations, one can have a general directed graph of such tasks, and data can flow across such a graph. The key feature that is important here is that the tasks block, just like map and reduce tasks, so that each task needs to process all of its data before moving on. That is critical for fault tolerance: as we've seen in map-reduce, if a mapper fails, one can simply restart the mapper with whatever data it was given, simply because it hasn't written anything out yet. So, blocking tasks are essential for fault tolerance. Examples of such dataflow systems are the DryadLINQ system from Microsoft, the Hyracks research project, and a few others.

Lastly, there are implementations of recursion directly in map-reduce. Here the challenge is how to recover when tasks that are no longer blocking fail, because doing iteration using recursion requires every step to keep producing output, otherwise the next recursion cannot begin. So, how does one recover from non-blocking tasks failing? There are two main models here: a graph-based model, which is exemplified by Pregel and Giraph, and a stream model, which is exemplified by S4. We're not going to go into the details of these in this course; we just don't have the time. But for those brighter students amongst you looking for research work in map-reduce, these are three essential ways in which iteration is being made more efficient in map-reduce, and they are promising areas for doing fresh research.