SQL, the standard language for processing relational databases, has also evolved along with the rise of map-reduce as a programming paradigm. SQL by itself is difficult to compile directly into map-reduce, and a finer level of programming is required to get reasonable efficiency. Two languages are popular in this regard: Pig, which came out of Yahoo, and Hive, which came out of Facebook. Each of these offers a form of SQL primitives, but executed procedurally rather than as a single functional statement.

So, for example, take the calculation we did last week using map-reduce: first figuring out which sales and cities match for a particular address ID by joining the sale and city tables on address, and then grouping the result by city to get sales by city. Pig and Hive make this kind of programming a little easier. The COGROUP command in Pig is essentially a map-reduce join. The FOREACH ... GENERATE (PARALLEL n) construct is essentially a map construct, which takes the output of the first join and produces the keys. Then there is a second map-reduce, done by grouping on city, and a FOREACH that generates the desired output, which is the reduce. So, without going into great detail, what Pig does is take these statements and compile them into a map-reduce program. Hive is similar, just a little closer to SQL: you can actually write a SELECT statement with a limited form of join, and a similar SELECT for the second, grouping step. (A rough HiveQL sketch of this two-step query appears below.)

Both of these are quite popular now. Pig has the advantage that it can work directly on the distributed file system, HDFS, if one wants to, or it can also work with HBase. Hive, on the other hand, requires data to be stored in its own format on top of HDFS; it has its own database rather than HBase and cannot work directly on flat files. But both have become reasonably popular for coding map-reduce programs, as opposed to writing map and reduce functions in a programming language like Python or Java, the way you did in your assignment.

Another direction in which SQL is evolving is to introduce statistical primitives within the database itself. One example of this is the MADlib library, which works on the Greenplum big data database; this work emanated from the University of California at Berkeley. I'll illustrate MADlib with a simple example which you can see directly on the MADlib website; this is essentially taken from there. Suppose one has training data, just as we discussed last week for a naive Bayes classifier: we have attributes which take values one, two, three, one, two, one, and so on, and their classes — yes, maybe, no, or positive sentiment, negative sentiment, whatever. And one wants to create a classifier. The way we did it last week, one would have to compute this using programs, perhaps using map-reduce. But in MADlib, we can create a classifier using extensions to SQL itself. So, suppose we want to create a classifier using this training set and, in the end, classify two new samples which are present in another table called toclassify. This is how one does it: first a preparation statement, where one computes the likelihood values from the training set along with the prior probabilities.
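To make the Pig and Hive discussion above concrete, here is a minimal sketch in HiveQL of the two-step sales-by-city computation. The table and column names (sale, city, addrid, amount, cityname) are hypothetical stand-ins for last week's example, not an actual schema; Hive compiles each statement below into one or more map-reduce jobs.

```sql
-- Step 1: a map-reduce join of the sale and city tables on the address id
-- (hypothetical schema: sale(saleid, addrid, amount), city(addrid, cityname)).
CREATE TABLE sale_city AS
SELECT s.saleid, s.amount, c.cityname
FROM sale s
JOIN city c ON (s.addrid = c.addrid);

-- Step 2: a second map-reduce pass that groups the joined records by city
-- and aggregates, giving total sales per city.
SELECT cityname, SUM(amount) AS total_sales
FROM sale_city
GROUP BY cityname;
```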
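For the MADlib classifier just described, the sequence of calls looks roughly like the following sketch. It is based on the naive Bayes interface documented for MADlib on Greenplum (create_nb_prepared_data_tables, create_nb_classify_view, create_nb_probs_view); the exact function names, argument order, and output layout may differ between MADlib versions, and the nb_training and toclassify tables are assumed to have an id column, a class column, and an integer-array attributes column as in the website example.

```sql
-- Preparation step: compute per-attribute likelihood values and class priors
-- from the training table nb_training(id, class, attributes).
SELECT madlib.create_nb_prepared_data_tables(
    'nb_training',        -- training source table
    'class',              -- column holding the class label
    'attributes',         -- integer-array column of attribute values
    3,                    -- number of attributes
    'nb_feature_probs',   -- output table: likelihood values
    'nb_class_priors');   -- output table: prior probabilities

-- Classification step: classify the rows of toclassify(id, attributes).
SELECT madlib.create_nb_classify_view(
    'nb_feature_probs', 'nb_class_priors',
    'toclassify', 'id', 'attributes', 3,
    'nb_classify_view');
SELECT * FROM nb_classify_view;

-- Optionally, expose the per-class probabilities behind each classification.
SELECT madlib.create_nb_probs_view(
    'nb_feature_probs', 'nb_class_priors',
    'toclassify', 'id', 'attributes', 3,
    'nb_probs_view');
SELECT * FROM nb_probs_view;
```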
Next, one actually performs the classification using the toclassify table holding the samples one wants to classify, and returns the actual classification. So, internally it has created a naive Bayes classifier. If one wants more detail, one can actually examine what the likelihood values were. One finds that for the first element of the toclassify table there is a 0.4 chance of class one and a 0.6 chance of class two, which is why it was classified as two; whereas for the second element there is a 0.75 chance of class one and a 0.25 chance of class two, and it was therefore classified as one. We haven't gone into the details of the syntax, but the idea is that there are similar primitives, for example, for computing regressions and other types of machine learning, which we will study in the lecture following this one. But this is one direction in which databases are evolving: trying to include within SQL some statistical or machine learning primitives.

Next, map-reduce itself is also evolving, and the main element that one needs to add to map-reduce is iteration. Essentially, many applications require one to apply map-reduce again and again. Continuous PageRank calculation is one example; continuously updating an index is another; continuous machine learning, by updating the training set with fresh values and corrections, is a third. And there are many such examples. Of course, there are many ways of iterating. The simplest is simply to run map-reduce again and again. The trouble with that is that every iteration writes to the distributed file system, which is costly, so there have been some attempts at making this approach of iterating map-reduce directly more efficient, essentially by avoiding the data copy and pipelining the output of the reducers directly into the map phase of the next iteration of map-reduce.

Another direction is generalizing the map-reduce paradigm to a general dataflow graph, where map and reduce are simply two types of operations, and one can have a general directed graph of such tasks with data flowing across the graph. The key feature here is that the tasks block, just like map and reduce tasks, so that each task processes all of its data before the computation moves on. That is critical for fault tolerance: as we've seen in map-reduce, if a mapper fails, one can simply restart the mapper with whatever data it was given, precisely because it hasn't written anything out yet. So, blocking tasks are essential for fault tolerance. Examples of such dataflow systems are the DryadLINQ system from Microsoft, the Hyracks research project, and a few others.

Lastly, there are implementations of recursion directly in map-reduce. Here, the challenge is how to recover when tasks fail, tasks which are now no longer blocking, because when one does iteration using recursion, recursion requires every step to keep producing output; otherwise, the next recursive step cannot begin. So, how does one recover from non-blocking tasks failing? There are two main models here: a graph-based model, exemplified by Pregel and Giraph, and a stream model, exemplified by S4. We're not going to go into the details of these in this course; we just don't have the time. But for the brighter students amongst you looking for research work in map-reduce, these are three essential ways in which iteration is being made more efficient, and they are promising areas for doing fresh research.