Bigtable and HBase were among the first NoSQL databases, designed essentially for large-scale batch processing, and therefore didn't really have strong support for indexes. Recently, a database called MongoDB has become very popular, because it does, in fact, include significant support for indexes. Instead of records, MongoDB uses the term documents, which are structured documents, so they look a lot like records, but they can be more complex than records in the sense that one might have multiple titles or multiple categories, for example. This is somewhat like HBase records, because HBase can also have multiple values for a particular column. Like HBase, MongoDB is a sharded database, so that documents, or records, are kept in different shards on different processors, with a set of keys going to one shard. Additionally, just like HBase, there are three replicas for each shard. MongoDB, however, doesn't rely on an underlying distributed file system like HDFS, but keeps track of its own replicas, in a manner reminiscent of traditional parallel relational databases. The underlying file system can be any operating system file system, such as a Linux file system. Further, the indexes that MongoDB supports include full-text indexes, so it essentially has support for inverted indexing, as we saw in the first week of this course. Lastly, it also supports a form of map-reduce within the database itself, so one can effectively write map-reduce programs that process data in MongoDB using JavaScript, which is the only language it supports for this purpose. Another important feature of MongoDB, which is shared by some other NoSQL databases, is its consistency model for writes. Recall that in HBase, for example, a write relies on a write to the distributed file system, which essentially completes writing a record only when all three replicas write successfully.
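The ideas of multi-valued document fields and full-text (inverted) indexing can be sketched in plain Python. This is a conceptual illustration, not MongoDB's actual API; the document structure and field names are hypothetical:

```python
# A minimal sketch of document-style records with a multi-valued
# "categories" field, plus a hand-built inverted index over titles
# (the essence of what a full-text index maintains for you).
from collections import defaultdict

documents = {
    1: {"title": "NoSQL Databases", "categories": ["databases", "nosql"]},
    2: {"title": "Indexing Text", "categories": ["databases", "search"]},
}

# Inverted index: each word in a title maps to the set of
# document ids whose title contains it.
inverted = defaultdict(set)
for doc_id, doc in documents.items():
    for word in doc["title"].lower().split():
        inverted[word].add(doc_id)

print(sorted(inverted["databases"]))  # [1]
```

A real MongoDB text index is built and queried by the server itself, but the word-to-document mapping it maintains is conceptually the same as this dictionary.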
This essentially makes writes into GFS and HDFS, and therefore into HBase, rather costly, because a write succeeds only after all the replicas agree. That's quite okay for large bulk writes, but doesn't quite add up for smaller writes, or random writes, because the extra overhead of making sure the replicas are consistent is relatively high for small operations. So instead, MongoDB uses something called eventual consistency, where writes or updates are propagated to replicas eventually, not synchronously with the write. It's therefore possible for a write to complete without all its replicas having been updated with that write. The first database to use eventual consistency was Amazon SimpleDB, where the idea of vector clocks was used to achieve eventual consistency. Other NoSQL databases, such as MongoDB, still use vector clocks. This is quite an important concept, and we'll illustrate it through an example. Each record is represented by a key, which could be, for example, a transaction ID, and it needs to be replicated at three different replicas. In Amazon SimpleDB, for example, these replicas are chosen by hashing each key and distributing the hashed values in a round-robin fashion to different physical nodes. A different replication technique is used in MongoDB, but that's not relevant to the vector clock discussion. Nonetheless, let's assume that there is a record which is mapped to three physical nodes x, y, and z, and let's see what happens when we try to write this record, which is replicated at those three locations. Suppose we happen to write the record at x; we associate it with the vector timestamp (1,0,0). This is a vector timestamp consisting of three elements, which tells us that this was written at timestamp one at physical node x, while the timestamps for the other nodes remain at zero.
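The vector timestamps just described can be sketched as fixed-length tuples, one counter per replica. The helper names here are illustrative, not any database's actual API:

```python
# A minimal vector-clock sketch for three replicas x, y, z,
# represented as 3-tuples of per-replica write counters.

def dominates(a, b):
    """True if timestamp a is >= b in every component and
    strictly greater in at least one (a supersedes b)."""
    return all(ai >= bi for ai, bi in zip(a, b)) and a != b

def concurrent(a, b):
    """True if neither timestamp dominates the other: a conflict."""
    return not dominates(a, b) and not dominates(b, a)

# First write at replica x: counters start at (0,0,0); x's slot ticks to 1.
t1 = (1, 0, 0)
print(dominates(t1, (0, 0, 0)))  # True: the write supersedes the empty state
```

Note that `dominates` is a partial order: two timestamps can be incomparable, which is exactly how concurrent writes at different replicas are detected.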
Next, we may want to write it again, again at physical node x, in which case we simply update the vector timestamp to (2,0,0) and forget about the old (1,0,0) version written earlier. That's because, in vector terms, (2,0,0) is greater than (1,0,0): at least one of its elements is strictly greater, and the others are at least equal. On the other hand, suppose we were to first write the record at location x, and the next write, just because of parallelism, happened at location y. The vector timestamp for this second write would then be (0,1,0), because the value being overwritten at y had timestamp (0,0,0), and the write is at y, so it is assigned the timestamp (0,1,0). Next, both writes are propagated to the other replicas: x gets the value with timestamp (0,1,0) from y, y gets the value with (1,0,0) from x, and z gets the values from both x and y. Now, neither of the two values of the record is greater than the other in vector terms, so each replica has to maintain both values until something more interesting happens. For example, suppose we now read the record at z, getting both versions, (1,0,0) and (0,1,0), do some computation, and write a fresh value. Because we read both versions, figured out which one to use using some application logic, and then wrote a new value, the value written gets the timestamp (1,1,1), because it is based on values that have seen (1,0,0) and (0,1,0) as timestamps. Now, (1,1,1) is greater than both of these, so the value written overrides the previous two, which can be discarded. And when this value propagates to x and y, the corresponding older values there also get discarded.
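The sequence of writes above can be traced in a short sketch. The `dominates` and `merge` helpers are illustrative names, assuming the three tuple components count writes at x, y, and z respectively:

```python
# Tracing the example: concurrent writes at x and y, then a
# read-and-resolve at z that supersedes both.

def dominates(a, b):
    return all(ai >= bi for ai, bi in zip(a, b)) and a != b

def merge(a, b):
    """Component-wise maximum: the smallest timestamp that has seen both."""
    return tuple(max(ai, bi) for ai, bi in zip(a, b))

write_at_x = (1, 0, 0)
write_at_y = (0, 1, 0)

# Neither write dominates the other, so every replica keeps both versions.
conflict = (not dominates(write_at_x, write_at_y)
            and not dominates(write_at_y, write_at_x))

# z reads both versions, resolves them, and writes:
# merge the timestamps it has seen, then tick z's own slot.
seen = merge(write_at_x, write_at_y)        # (1, 1, 0)
resolved = (seen[0], seen[1], seen[2] + 1)  # (1, 1, 1)

# The new timestamp dominates both old ones, so they can be discarded.
print(conflict, resolved)  # True (1, 1, 1)
```

The key step is that a write based on several concurrent versions merges their timestamps first, which is what guarantees the new version dominates everything it has read.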
As a result, what we achieve is eventual consistency: at some point in time, when you read a record, you might get two values instead of just one, because you are seeing an inconsistent state, and it is up to you, as the application, to resolve that inconsistency, usually using some knowledge of what these values are all about. For example, suppose each write is essentially adding another value to the record. In this case, the previous two values simply need to be added together, along with the third value being written, and the inconsistency resolves itself, because none of the previous writes is actually thrown away; they are simply summed up by the third write. A second example is where the write is a blind write, which doesn't care what was written in the record earlier. In that case, it doesn't really matter which of the previous two values one chooses, because they are ignored in any case, and the third write simply writes its value and resolves the conflict by assigning a higher vector timestamp. The final case is the most complicated one, where the write depends on the previous value, such as an update. Suppose you're incrementing the value of a record, perhaps by different amounts each time. Then you have to choose which of the previous two versions to use, and do this consistently, so that the next time this kind of conflict occurs, the same rule is applied. What is typically done in such situations is to use a lexicographical, or alphabetical, ordering of the replicas, so that, for example, replica x is treated as higher than replica y, which is treated as higher than replica z. Then, while choosing between (1,0,0) and (0,1,0), the third replica chooses the value with timestamp (1,0,0), increments it with whatever it chooses to do, and then writes with timestamp (1,1,1). Since we use a consistent ordering of the replicas, by alphabetical order of some kind, we will always get consistent results.
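The deterministic tie-breaking rule just described, preferring the version from the alphabetically highest replica, can be sketched as follows; the function name is illustrative:

```python
# A sketch of deterministic conflict resolution: among concurrent
# versions, every replica picks the one written at the alphabetically
# highest-priority replica (x before y before z), so all replicas
# independently make the same choice.

def winning_version(versions):
    """versions: list of (timestamp, value) pairs for concurrent writes.
    Comparing timestamps lexicographically prefers a larger count in
    x's slot first, then y's, then z's."""
    return max(versions, key=lambda tv: tv[0])

versions = [((1, 0, 0), "written at x"), ((0, 1, 0), "written at y")]
chosen = winning_version(versions)
print(chosen)  # ((1, 0, 0), 'written at x')
```

Because `max` over tuples is deterministic, any replica applying this rule to the same set of concurrent versions picks the same winner, which is all the rule needs to guarantee.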
Now, none of this is obvious, and the theory of logical timestamps dates back to the 1970s, to classical work by Leslie Lamport. The original paper is really worth reading. It is from 1978, and while there are some earlier versions of it, the most accessible one is the version that appeared in the Communications of the ACM. The theory of why a lexicographical ordering always works is elegantly laid out in this paper, and interestingly, more than 30 years later, we are seeing this concept being used for eventual consistency in multiple NoSQL databases: Amazon SimpleDB to start with, MongoDB, CouchDB, Cassandra, and many others.