This video provides an overview of some NoSQL systems. I want to say right up front that it's being made in November 2011. This is a field that's changing very fast, so this is an overview of what's going on right now. As a reminder from the previous video, NoSQL systems have arisen because it was recognized that not every problem involving large-scale management or analysis of data is best solved by using a relational database system. Some problems still are, but there are others that are more suitable for the different types of systems we're going to talk about. NoSQL as a term has evolved to mean "not only SQL," where SQL doesn't really mean the SQL language; it means a traditional database management system. Again, as a reminder from the previous video, NoSQL systems differ from traditional systems in that they tend to provide a flexible schema rather than a rigid structure; they tend to be quicker, or cheaper, or both, to set up; they're geared toward really massive scalability; and they tend to use relaxed consistency models in order to give higher performance and higher availability. The downsides are that there's no declarative query language, so more programming is typically involved in manipulating the data, and that the relaxed consistency models, while giving better performance, give fewer guarantees about the consistency of the data. There are a number of incarnations of NoSQL systems, and I've chosen, as of November 2011, to divide them into four categories: the MapReduce framework, key-value stores, document stores, and graph database systems. For the first two, one rough way to think about it is that the MapReduce framework is typically used for the kinds of applications that would have used relational OLAP, or online analytical processing: analysis applications that touch large amounts of the data to perform complex analyses. Key-value stores, in contrast, tend to sit in the OLTP world (as a reminder, that's online transaction processing), which tends to involve a lot of small operations, each touching a very small part of the data. The other two, document stores and graph database systems, are self-explanatory: they involve documents and graphs. Now, you might wonder why I didn't mention column stores, since column stores are often discussed in connection with NoSQL. Column stores are, in one sense, just a way of organizing relational database systems for higher performance for certain types of applications, but we'll also see that some key-value stores (not all of them) have a model in which columns are an important concept. I'll discuss each of these categories in turn, although I'm going to spend the most time on MapReduce. We can think of MapReduce as a framework. It came originally from Google, who invented the term MapReduce, and now there's a widely used open-source implementation of the MapReduce framework called Hadoop. The first aspect of MapReduce is that there is no data model at all: the data in the MapReduce framework is stored in files, both as input and as output. In the Google MapReduce implementation, that's the Google File System, GFS; in the Hadoop open-source implementation, it's the Hadoop Distributed File System, HDFS. What the user provides to process data using the MapReduce framework is a set of specific functions. Not surprisingly, one of those functions is called map, and one of them is called reduce.
Other functions the user needs to provide are a reader function, which reads data from files and provides it as records; a writer function, which takes the output records and writes them into files; and finally an optional function called the combiner, which we'll discuss shortly. So the user just provides this set of functions, and what the system provides is the glue that processes the data through the functions. The system also provides fault tolerance for the processing: if there's a crash, or a node goes down during execution, the result is guaranteed to be as if that hadn't happened. And finally, the system provides scalability, so the MapReduce framework can be used for very, very large data analysis. Let's talk about the two most important functions, the map function and the reduce function. The map function is used to take the data analysis problem and divide it into sub-problems. Very specifically, the function the user provides, called map, takes a data item as input and produces as output zero or more key-value pairs. What I mean by a sub-problem here is that we're going to deal separately with the set of records associated with each key, and that's the job of the reduce function. The reduce function, which the user also writes, takes as its parameters a key and a list of values for that key, and it produces as output zero or more records.
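Written out, the two signatures might look like the following minimal Python-style sketch. This is purely illustrative: the names extract_key, extract_value, and make_output_record are hypothetical placeholders for the user's own logic, and the real Google and Hadoop interfaces differ in their details.

```python
def map_fn(record):
    # The "map" function: take one input record and emit zero or more
    # (key, value) pairs. extract_key and extract_value are hypothetical
    # placeholders for whatever the analysis problem requires.
    yield extract_key(record), extract_value(record)

def reduce_fn(key, values):
    # The "reduce" function: take a key and the list of all values emitted
    # for that key, and produce zero or more output records.
    yield make_output_record(key, values)
```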
We'll shortly see a concrete example that will hopefully make this more understandable, but before we do that, let's look at the overall architecture of how these functions are used to process data. We'll start with our map function, which we'll put inside a box, with input records going into it. As a reminder, what the map function produces from each input record is an output record that's a key-value pair, and we're going to direct these records differently depending on the key. So let's say this is the way the records go for key 1, this for key 2, and so on up to key n; the records have values associated with them as well, of course. We'll send each batch of records for a given key into our reduce function, so let me draw a few reduce boxes here, one for each set of records with a given key. And then, as we mentioned before, the reduce function produces output records. At the highest level, that's it; that's our data processing. We start with a bunch of input, we divide it into sub-problems based on a key that we extract from the input records somehow (we'll see an example), and each sub-problem, associated with a particular key, is sent through the reduce function, which produces the output. That's the end of our processing. Now, things are of course a bit more complex than that. First of all, there's no reason to have only one map box: the map function takes each input record and processes it separately, so we can parallelize the mapping as much as we want. Let's change the picture to have a whole set of map boxes. Now each map box takes its records and produces records with given keys, so we'll still send key 1 over to the first reducer, key 2 to the second, and so on; every map box sends records to every reducer. Now, you might wonder what happened to those reader and writer functions I talked about. The reality is that we don't actually start with input records; we start with our data in files. So here's the real original data: we'll draw files here and erase the input records, because the job of the reader is to take the files, extract the records from them, and provide those records to the map functions. We have a similar thing on the other end: the output records come out of the reducers but are then provided to the writer functions, which write the output to final files. So there it is: our original input in files on one side, our final output in files on the other. Now let me remind you what the user provides and what the system provides. The user provides a single map function that takes records and emits zero or more key-value pairs per record (it doesn't have to be a one-to-one mapping), and a single reduce function that takes a key and a set of values for that key and produces zero or more outputs. The user also provides the reader function, to extract data from files, and the writer function, to write data to the output. And there's one more optional function I mentioned, called the combiner. The combiner is attached to the mapper, so we can put it right there in the picture. What the combiner does, inside the mapper, is take a set of records for a given key, say key 1, and send a combined version of those records to the reducer. You can think of it as a sort of pre-reduce phase that occurs with the mapper, to make things more efficient by sending less data to the reducer; we'll see an example of this shortly. So the user provides these pieces, and the system infrastructure takes the pieces and distributes them to multiple machines, because a lot of this can go on in parallel: all of the mapping can proceed in parallel, and the reducing can too. Data may have to be exchanged between machines in the middle, but once that's done, the reducers run in parallel as well. The system distributes the work across machines, and you can add more machines to make it all run faster. The system also provides fault tolerance, so if something goes wrong at a mapper or a reducer, the system will redo that function. And finally, as I mentioned before, it provides scalability. But I should add that one of the most important things the MapReduce architecture provides is the glue that puts this all together. Again, the user only provides these functions, and the system takes care of all of the execution: moving the data around and calling the functions over the large amounts of data being processed.
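To make that glue concrete, here is a toy, single-machine sketch of what the framework does between the user's functions. It is only a thought aid; the real systems do this distributed, in parallel, and with fault tolerance.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    # Toy stand-in for the framework's glue: apply map to every record,
    # bring together all values emitted under the same key (the shuffle),
    # and hand each key plus its list of values to reduce.
    pairs = [pair for record in records for pair in map_fn(record)]
    pairs.sort(key=itemgetter(0))          # group equal keys together
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield from reduce_fn(key, [value for _, value in group])
```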
Well, all of that is pretty abstract, so let's look at a concrete example. Let's go back to the domain I introduced in the previous video of analyzing a web log, where we have, in each record, a user ID, a URL, the time of the access, and maybe some additional information. We'll start with a fairly simple task: we want to count the number of accesses for each domain, where the domain is inside the URL. For example, the domain might be stanford.edu, where we have accesses to many different URLs within that domain, and we're just going to count how many accesses there have been to Stanford. To perform this task, the user has to provide a map function and a reduce function; let's look at what they do. The map function takes a record; we'll assume the reader has already extracted the record from the file and provides it in this format, with these four fields. What the map function does is simply look inside the record, extract the domain from the URL, and produce as output the domain as the key; for the value we can just use null, since we're not actually going to need a value. That's the job of the mapper, and it's pretty simple. Now what does the reduce function do? The reduce function takes a domain, because the domain is the key and the key is the first argument, and it takes a list of values, in this case a list of nulls. What's interesting is that each of those nulls represents one access to the domain, so all the reduce function needs to do is count up how many nulls there are, and it produces as its result the domain and the count. And believe it or not, we've solved the problem with just a little bit of code: the code to find the domain inside the URL of a record, and the simple code to count up the nulls. The system takes care of shipping the records to the right nodes so the map tasks run in parallel, and then re-shipping the outputs so that all of the outputs for a particular domain end up in the same place and can be counted. Now let me give an example of how the combiner function can be used. The combiner, as a reminder, operates at the same node as a mapper and does some sort of pre-aggregation of the data. For example, we could put a combiner right here after the mapper: the combine function takes a domain and a list of nulls, does exactly what the reduce function was doing, and produces the domain and a count. That way, each individual node counts up how many accesses there were to each domain in the data processed at that node. When we get to the reduce function, we may receive a bunch of those records, so what arrives at the reduce function for each domain is now a list of counts, the outputs of the combiners, one from each of the mappers. Instead of counting at the reducer, we sum that list of counts, which gives us the right answer as well, and it's more efficient because of the pre-aggregation that occurs right at the node running the map function.
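Here is what those pieces might look like in a minimal Python sketch. The four-field record layout and the use of urlsplit to pull out the domain are assumptions for illustration, not part of any real MapReduce API.

```python
from urllib.parse import urlsplit

def map_fn(record):
    # Assumed record layout: (user_id, url, access_time, additional_info).
    user_id, url, access_time, info = record
    domain = urlsplit(url).netloc        # e.g. "www.stanford.edu"
    yield domain, None                   # key = domain, value = null

def combine_fn(domain, nulls):
    # Optional pre-aggregation at the mapper's node: collapse this node's
    # nulls for a domain into one partial count.
    yield domain, len(nulls)

def reduce_fn(domain, counts):
    # With the combiner in place, each incoming value is a partial count,
    # so the reducer sums. Without the combiner, the values would be nulls
    # and the reducer would simply take len(counts) instead.
    yield domain, sum(counts)
```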
Now let's modify the problem. We'll take the same data, but instead of just counting how many accesses there are to each domain, let's compute some total value of the accesses for each domain. We might do that based on something we see in the additional information: for example, how valuable the user is, or whether the user went off and bought something, something like that. Let's modify our map and reduce functions for this slightly enhanced problem. Our map function again takes a record, but this time it doesn't look only at the URL; it also looks inside the additional information. What it produces is the domain extracted from the URL, together with some kind of score for how valuable that access was, based on whatever it sees inside the additional information. The reduce function then takes a domain and a list of scores for that domain, and, similar to what we had previously, its output is the domain and the sum of those scores. One of the interesting things here is how the map function interacts with the additional information: the map function contains code that looks at the information and determines a score based on what it sees. If we change what's available in the additional information, or if we refine how we extract the score, we modify the map function, but everything else stays the same. That is one benefit, to some extent, of the MapReduce framework: the computation of the score is embedded in this one piece of code.
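Concretely, only the bodies of the two functions change from the counting version. In the sketch below, the four-field record layout is again an assumption, and score_from is a hypothetical placeholder for whatever scoring logic the application embeds in the mapper.

```python
from urllib.parse import urlsplit

def map_fn(record):
    # Assumed layout: (user_id, url, access_time, additional_info).
    user_id, url, access_time, info = record
    domain = urlsplit(url).netloc
    yield domain, score_from(info)   # score_from: hypothetical scoring logic

def reduce_fn(domain, scores):
    # Total value of all accesses to this domain.
    yield domain, sum(scores)
```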
Now let's modify our example further, similar to the modification we made in the earlier video. Suppose that in addition to the web log we have separate information about the users: in a different data set we have, for each user ID, the name, the age, the gender, and so forth. And let's say we again want to find the total value of the accesses for each domain, but now the value is computed using the user attributes from the separate data set. This, frankly, is hard to do in MapReduce. It effectively involves joining the two data sets, which is not something that's supported natively in MapReduce. So now we've hit the limit of what's convenient to do in the MapReduce framework, but we'll see momentarily that there are solutions to this as well. To summarize: the MapReduce framework has no built-in data model; the data starts in files and ends in files. The user provides a set of specific functions: map, reduce, reader, writer, and optionally a combiner. And the system provides all of the execution glue, guarantees tolerance to system failures, and provides scalability by assigning the processing tasks to, say, an increasing number of computing nodes. When the MapReduce framework came out of Google and the Hadoop open-source implementation was released, there was a lot of excitement. It was pretty exciting: you could write a couple of simple functions, and the system would process massive amounts of data through those functions, scalably, efficiently, and fault-tolerantly. But over time, people realized that they don't always want that low-level programming, and our favorite traditional notions of database schemas and declarative queries started to be missed. So languages were developed that sit on top of Hadoop, or more generally the MapReduce framework. One of them is called Hive; Hive offers schemas and a language that looks very much like SQL. Another language is called Pig. Pig is a little more imperative; in other words, it's more of a statement language, but the fundamental constructs in Pig are still relational operators, and you could almost think of a Pig script as being a little like those sequences of relational algebra statements we saw way back when, with the addition of loops and so forth. Both of these languages are what the user sees, and they compile to a workflow (you can think of that as a graph) of Hadoop jobs, Hadoop again being the open-source implementation of MapReduce, and each job being one instance of map and reduce, like the big picture I showed before. One thing I should mention: as of November 2011 (which it is now), a really significant portion of Hadoop jobs are actually generated by Hive or Pig. So more and more users are choosing to use a higher-level language rather than program the MapReduce framework directly. I'd be remiss if I didn't also mention one other system. There's a system called Dryad that allows users to specify a workflow, similar to the workflow that might be generated by Hive or Pig, so it's more general than a single MapReduce job. And there's a language called DryadLINQ that sits on top of Dryad and compiles to Dryad, much the way Hive and Pig compile to workflows of MapReduce jobs. Now let's move on to key-value stores. As a reminder, the Hadoop or MapReduce framework is designed for OLAP-type operations, analytical operations that involve scanning most of the data, and I think that was very clear from what the MapReduce framework does. Key-value stores, on the other hand, are designed for OLTP-style applications, where you're doing small operations, maybe over a single record, in a massive database. Key-value stores are extremely simple. The data model is just pairs of keys and values, not surprisingly, and the basic operations are simply to insert a new record (you provide a key and a value), to fetch a record by its key, to update the value in the record for a given key, and to delete the record with a given key. That's it, and with that simple set of operations, as you can imagine, the implementation focuses on performing these simple operations over massive databases very, very quickly. Again, as with Hadoop, efficiency, scalability, and fault tolerance are the most important things, because we're looking at applications with massive amounts of data and very stringent performance requirements. The way the implementation works, at a very high level (it's actually quite complicated to make it work well), is that the records are distributed to the computing nodes based on the key, probably using a hash value of the key, so finding the record for a given key can be very quick: you go straight to the right node. In fact, the records may be replicated across multiple nodes, which gives you both efficiency, since you can go to, say, a lightly loaded node, and fault tolerance if a node fails.
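As a thought aid, here is a toy sketch of that hash-based routing idea, with each "node" standing in as a plain Python dictionary. Real systems such as Dynamo or Cassandra use consistent hashing and replicate each record across several nodes, none of which this sketch attempts.

```python
import hashlib

class ToyKeyValueStore:
    # Toy sketch: route every key to the node that owns it by hashing.
    def __init__(self, num_nodes):
        self.nodes = [{} for _ in range(num_nodes)]   # each "node" is a dict

    def _node_for(self, key):
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def insert(self, key, value):
        self._node_for(key)[key] = value

    def fetch(self, key):
        return self._node_for(key).get(key)

    def update(self, key, value):
        self._node_for(key)[key] = value

    def delete(self, key):
        self._node_for(key).pop(key, None)
```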
The notion of transactions in key-value stores is very simple: a single operation is itself a transaction, so we don't have the idea of grouping a bunch of operations into a transaction. Furthermore, these systems implement something called eventual consistency, which allows the replicas of a single record to diverge in their values for some period of time. What eventual consistency specifies is that if all operations stop, the system will eventually become consistent, with all copies of each record being the same. Now, unfortunately, as is sometimes the case, these very simple operations and this simple data model weren't always quite enough, and so some key-value stores, though not all of them, have a concept called columns that occur within the value. The value then has a little more structure to it than just a blob of bits, and the columns typically act like an embedded key-value store. One important point is that they don't require uniform columns across records: none of the key-value stores are as strict in their structure as a relational database system would be. The other addition some of them allow is a fetch over a range of keys. Such a fetch might say, for instance, "give me all keys between two and ten." That requires a different type of implementation, as you can imagine, but it allows the operation to be performed efficiently if it's something the application needs. Just a few examples of key-value stores; this is not an exhaustive list, there are many more, and this is only November 2011, so things will change over time. Some of the more prominent ones are Google's Bigtable, Amazon's Dynamo, Cassandra (which is open source), Voldemort, and HBase; again, there are many others, and these are just a few examples. Now let's talk about document stores. Document stores are actually very much like key-value stores, except the value is a document. So the data model is key-document pairs, and what's interesting is that the document in a document store typically has a known type of structure: it might contain JSON-formatted data (JSON is JavaScript Object Notation), it might contain XML, which we've learned about, or other semi-structured formats. The basic operations are very similar to what we saw in key-value stores: you can insert a new document with a key, fetch a document by its key, modify the contents associated with a key, and delete the document associated with a key. But, very importantly, there's also a fetch operation based on the document contents. What those content-based fetches look like is very system- and format-specific; there's no standardized query language for this at this point in time. Again, a few example systems, not an exhaustive list: CouchDB, MongoDB, SimpleDB; they all seem to have DB in their names. And again, this is November 2011, so things will undoubtedly change.
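Since there's no standard content-based query language, here is a hypothetical in-memory sketch of what a fetch by contents might look like. The interface is invented for illustration and doesn't correspond to any particular system's API.

```python
class ToyDocumentStore:
    # Toy sketch of a document store whose documents are JSON-like dicts.
    def __init__(self):
        self.docs = {}                          # key -> document

    def insert(self, key, document):
        self.docs[key] = document

    def fetch_by_key(self, key):
        return self.docs.get(key)

    def fetch_by_contents(self, **conditions):
        # Return every document whose fields match all the conditions,
        # e.g. store.fetch_by_contents(grade=9, name="Amy").
        return [doc for doc in self.docs.values()
                if all(doc.get(field) == value
                       for field, value in conditions.items())]
```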
The last NoSQL systems I'd like to cover are graph database systems. Graph database systems, as the name implies, are designed for storing, and running queries or other operations over, very large graphs. The data model is that every object is either a node or an edge between nodes. Nodes may have properties (very often an ID is a required property of a node), and edges may have labels, so you can think of them as roles. I think the best way to understand this is to see an example. My example is a very small social network, a tiny one actually, similar to the one used for some of our SQL exercises. We'll start with three nodes representing people, where the properties of each node are an ID, a name, and a grade, and each node has a value for each of those properties. So here are the three nodes representing three people in our social graph: ID 1 is Amy in grade nine, ID 2 is Ben in grade nine, and ID 3 is Carol in grade ten. Depending on the system, the nodes may or may not have to have uniform properties, though most systems won't be that stringent. In addition to the nodes, we have the edges between the nodes, typically directed edges. Let's make two different types of edges: friend edges and likes edges. Say, for example, Amy likes Ben; that's a directed edge labeled likes. Maybe Ben likes Carol, so we add another one. And maybe Amy and Carol are both friends with each other, so we have edges of a different type, labeled friend, between them. One might wonder how long those friendships will last with this complicated likes relationship, but in any case, this gives you an idea of the type of data that's stored in a graph database; there's a small code sketch of this example at the end. The data model is very specifically about storing nodes with properties inside them, like key-value pairs, and edges, typically with labels or roles on them, though those aren't required. In graph database systems, currently, the interfaces and query languages vary a lot; there's no standardization at all. The queries might be single-step queries, like asking for someone's friends; they might be path expressions, like asking for the women friends of the men friends of someone (we saw that example in the earlier video); or they might involve full recursion, where you can traverse to arbitrary depths through the graph. A few example systems, again as of November 2011 (you knew I was going to say that), are Neo4j, FlockDB, and Pregel, and these systems actually differ quite a lot from each other. I also want to mention RDF, the Resource Description Framework, and what are known as RDF triple stores. RDF is based on objects having relationships to other objects, so you can almost think of a relationship as two nodes with an edge between them, and you can imagine how RDF can be mapped to graph databases. So, those were four examples of NoSQL systems, the most prominent categories at this point in time: the MapReduce framework, with languages such as Hive and Pig sitting on top of it; key-value stores, for small transactions over massive databases, touching just small bits of them at a time; document stores; and graph database systems. NoSQL stands for "not only SQL," recognizing that for some applications these frameworks work better than traditional database systems, but that for many applications, a vast number of applications, traditional databases are still used.
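As promised, here is a minimal code sketch of that tiny social network, using plain Python dictionaries and tuples. The layout (a dict of nodes plus a list of labeled, directed edges) is invented purely for illustration; real systems such as Neo4j have their own storage formats and query interfaces.

```python
# Nodes: ID -> properties (key-value pairs inside each node).
nodes = {
    1: {"name": "Amy",   "grade": 9},
    2: {"name": "Ben",   "grade": 9},
    3: {"name": "Carol", "grade": 10},
}

# Directed, labeled edges: (source ID, label, destination ID).
edges = [
    (1, "likes", 2),    # Amy likes Ben
    (2, "likes", 3),    # Ben likes Carol
    (1, "friend", 3),   # Amy and Carol are friends
    (3, "friend", 1),   #   (mutual, so one directed edge each way)
]

def friends_of(node_id):
    # A single-step query: follow outgoing "friend" edges.
    return [nodes[dst]["name"]
            for src, label, dst in edges
            if src == node_id and label == "friend"]

print(friends_of(1))    # ['Carol']
```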