SQL, which is the standard language for database processing on relational databases, has also evolved along with the evolution of map-reduce as a programming paradigm. SQL by itself is difficult to compile directly into map-reduce, and a finer level of programming is required to get reasonable efficiency. There are two languages which are popular in this regard: Pig, which came out of Yahoo, and Hive, which came out of Facebook. Each of these is a form of SQL-like primitives, but executed procedurally rather than in a single functional statement.

So, for example, consider the calculation we did last week using map-reduce: first figuring out the sales and cities that match for a particular address ID by joining the sale and city tables on address, and then joining the result on city to essentially get sales by city. This is how Pig and Hive make that programming a little easier. The COGROUP command in Pig is essentially a map-reduce join. The FOREACH ... GENERATE ... PARALLEL construct is essentially a map construct, which deals with the output of the first join and creates the keys. Then there is a second map-reduce, done by grouping on city, and a FOREACH that generates the output one wants, which is the reduce. So essentially, without going into great detail, what Pig does is take these statements and compile them into a map-reduce program. Hive is similar, just a little closer to SQL: you can actually do a SELECT statement with a limited form of join, and a similar kind of SELECT statement for the second join.

Both of these are quite popular now. Pig has one advantage in that it can work directly on the distributed file system, HDFS, if one wants to, or it can also work with HBase. Hive, on the other hand, requires data to be stored in its own format on top of HDFS, so it has its own data store rather than HBase, and cannot work directly on flat files. But both of these have become reasonably popular for coding map-reduce programs, as opposed to writing map and reduce functions in a programming language like Python or Java, the way you did in your assignment.
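To make the Hive half of this concrete, here is a minimal HiveQL sketch of the two steps just described. The table names sale and city come from the example, but the column names (sale_id, amount, address_id, city_name) and the intermediate table sale_city are assumptions made purely for illustration:

    -- Step 1: join sale and city on the address id
    -- (Hive compiles this into a map-reduce join)
    CREATE TABLE sale_city AS
    SELECT s.sale_id, s.amount, c.city_name
    FROM sale s JOIN city c ON (s.address_id = c.address_id);

    -- Step 2: group the joined rows by city to get total sales per city
    -- (compiled into a second map-reduce job)
    SELECT city_name, SUM(amount) AS total_sales
    FROM sale_city
    GROUP BY city_name;

Hive turns each of these statements into one or more map-reduce jobs behind the scenes, which is exactly the point: the programmer writes SQL-like statements, and the system generates the map and reduce phases.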
Another direction in which SQL is evolving is to introduce statistical primitives within the database itself. One example of this is the MADlib library, which works on the Greenplum big data database; this work emanated from the University of California at Berkeley. I'll illustrate MADlib with a simple example which you can see directly on the MADlib website, and this is essentially taken from there. Suppose one has training data, just as we discussed last week for the naive Bayes classifier: we have attributes which take values one, two, three, one, two, one, and so on, and their classes, yes, maybe, no, or positive sentiment, negative sentiment, whatever. And one wants to create a classifier. The way we did it last week, one would have to compute this using programs, maybe using map-reduce. But in MADlib, we can create a classifier using extensions to SQL itself.

So, suppose we want to create a classifier using this training set and, in the end, classify two new samples which are present in another table called toclassify. This is how one does it. First, one does a preparation statement, where essentially one is computing the likelihood values from the training set and the prior probabilities. Next, one actually creates the classification using the toclassify table that one wants to classify, and then returns the actual classification. So it has internally created a naive Bayes classifier. If one wants more details, one can actually figure out what the likelihood values were. One finds that for the first element in the toclassify table, we have a 0.4 chance of a one and a 0.6 chance of a two, and that's why it was classified as a two; whereas for the second one, we had a 0.75 chance of a one and a 0.25 chance of a two, and therefore it was classified as a one. We haven't gone into the details of the syntax, but the idea is that there are similar primitives, for example, for computing regressions and other types of machine learning, which we will study in the lecture following this one. But this is one direction in which databases are evolving: to try to include within SQL some statistical or machine learning primitives.
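For reference, the calls in the MADlib documentation's naive Bayes example look roughly like the following. This is a hedged sketch based on an older version of those docs: the table and column names (nb_training, class, attributes, id, toclassify, and the output names) and the exact argument order are assumptions that may differ across MADlib versions:

    -- Preparation step: compute feature likelihoods and class priors
    -- from the training table (here, 3 attributes per row)
    SELECT madlib.create_nb_prepared_data_tables(
        'nb_training', 'class', 'attributes', 3,
        'nb_feature_probs', 'nb_class_priors');

    -- Classification step: label the rows of the toclassify table
    SELECT madlib.create_nb_classify_view(
        'nb_feature_probs', 'nb_class_priors',
        'toclassify', 'id', 'attributes', 3,
        'nb_classify_view');
    SELECT * FROM nb_classify_view;

    -- Optional: expose the per-class probabilities behind each decision
    SELECT madlib.create_nb_probs_view(
        'nb_feature_probs', 'nb_class_priors',
        'toclassify', 'id', 'attributes', 3,
        'nb_probs_view');
    SELECT * FROM nb_probs_view;

The preparation call corresponds to computing the likelihoods and priors, the classify view returns the predicted class for each new sample, and the probabilities view is what lets one see values like the 0.4/0.6 and 0.75/0.25 mentioned above.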
Next, map-reduce itself is also evolving, and the main element that one needs to add to map-reduce is iteration. Essentially, many applications require one to apply map-reduce again and again. Continuous PageRank calculation is one example; continuously updating an index is another; continuous machine learning, by updating the training set with fresh values and corrections, is a third. And there are many such examples.

Of course, there are many ways of iterating. The simplest is simply to run map-reduce again and again. The trouble with that is that every time, one is writing to the distributed file system, which is costly. So there have been some attempts at making this approach of iterating map-reduce directly more efficient, essentially by avoiding the data copy and pipelining the output of the reducers directly into the map phase of the next iteration of map-reduce.

Another approach is generalizing the map-reduce paradigm to a general dataflow graph, where map and reduce are simply two types of operations, one can have a general directed graph of such tasks, and data can flow across such a graph. The key feature that is important here is that the tasks block, just like map and reduce tasks, so that each task needs to process all of its data before moving on. That is critical for fault tolerance: as we've seen in map-reduce, if a mapper fails, one can simply restart the mapper with whatever data it was given, simply because it hasn't written anything out yet. So, blocking tasks are essential for fault tolerance. Examples of such dataflow systems are the DryadLINQ system from Microsoft, the Hyracks research project, and a few others.

Lastly, there are implementations of recursion directly in map-reduce. Here the challenge is how to recover when tasks that are no longer blocking fail, because doing iteration using recursion requires every step to keep producing output, otherwise the next recursion cannot begin. So, how does one recover from non-blocking tasks failing? There are two main models here: a graph-based model, which is exemplified by Pregel and Giraph, and a stream model, which is exemplified by S4. We're not going to go into the details of these in this course; we just don't have the time. But for those brighter students amongst you looking for research work in map-reduce, these are three essential ways in which iteration is being made more efficient in map-reduce, and they are promising areas for doing fresh research.