Google invented MapReduce, BigTable, and distributed file systems, but it has since moved on and now uses something called Dremel. Recall that, in the early days of databases, the relational database was used both for transaction processing, that is, inserting new records, and for answering complex queries. Storage was expensive; it was too expensive, for example, to create a fresh copy of the entire data in a better form, one more suited for efficient query processing. So one was happy with the compromise of the one-size-fits-all model of the relational database.

Over the years storage became very cheap, and one started to move data into specialized column-oriented databases, as well as to run analytical queries that touched all the data using MapReduce, where large volumes of data would be read and equally large volumes freshly written, again and again, as one performed more and more processing on them. This was fine as long as you had terabytes, or hundreds of terabytes, of data, even at Google. But once one started dealing with petabytes of data, and wanted queries on such volumes, one could not afford to produce a new petabyte of data every time one processed the old petabyte. So the challenge of storage being a constraint enters the arena once again when one is dealing with very large volumes. At the same time, writing extremely large volumes is itself costly, so by avoiding writing again and again one gains further efficiency. This is essentially what Dremel does. Dremel today powers Google's BigQuery, a service one can access over the web: one can define extremely large tables, populate them through computations or by importing data from various sources, and execute extremely fast queries that process large volumes of data using the Dremel structure underneath.

There are two important innovations in Dremel, which was published only in 2010. First, it uses column-oriented storage, just like a column-oriented database in some sense, but for nested and possibly non-unique fields. For example, you could have a document with a field A; within A there could be another field B, which itself has two different fields C and D that actually contain values. So the nested field A.B.C or A.B.D is how you would access this data. Further, in a particular record there could be multiple values for A.B.C; for example, you could have multiple names or multiple IP addresses for this particular nested field. This is very common in web-oriented, unstructured data, and not that common in structured relational data, but it is exactly the kind of large, petabyte-volume data that Google needs to process. The column-oriented storage is fairly unique in that each nested field is stored contiguously: all the values of A.B.C for record one and record two are stored close together on disk and processed by leaf servers, the values of A.B.D are similarly stored contiguously, and so on. So the first innovation is that the storage is column-oriented for nested and possibly non-unique fields.
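To make this first innovation concrete, here is a minimal sketch in Python, under the assumption that records arrive as nested dictionaries using the hypothetical fields A, B, C and D from the example above. It only shows how every value reachable through a dotted path such as A.B.C ends up in its own contiguous column; the real Dremel format additionally stores repetition and definition levels so records can be reconstructed from the columns, which this sketch omits.

    # Minimal sketch (not Dremel's actual format) of storing nested,
    # repeated fields column by column. Field names A, B, C, D are the
    # hypothetical ones from the example above.
    from collections import defaultdict

    records = [
        {"A": {"B": [{"C": "alice", "D": "10.0.0.1"},
                     {"C": "bob",   "D": "10.0.0.2"}]}},   # record 1: two values for A.B.C
        {"A": {"B": [{"C": "carol", "D": "10.0.0.3"}]}},   # record 2: one value for A.B.C
    ]

    def flatten(value, path, columns):
        """Walk a nested record and append every leaf value to the
        column named by its dotted path, e.g. 'A.B.C'."""
        if isinstance(value, dict):
            for key, child in value.items():
                flatten(child, path + [key], columns)
        elif isinstance(value, list):
            for child in value:          # repeated (non-unique) field
                flatten(child, path, columns)
        else:
            columns[".".join(path)].append(value)

    columns = defaultdict(list)
    for rec in records:
        flatten(rec, [], columns)

    print(columns["A.B.C"])   # ['alice', 'bob', 'carol']        -- stored contiguously
    print(columns["A.B.D"])   # ['10.0.0.1', '10.0.0.2', '10.0.0.3']

Because all the values of A.B.C sit next to each other, a query that touches only that field never has to read the rest of each record.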
The second innovation is that, instead of reading and writing data repeatedly as in MapReduce, one assumes that the intermediate data one produces is always much, much smaller than the original data. This is quite obvious if you are dealing with petabytes of data, or you would be producing ever more petabytes; instead you will be summarizing the data in some form, or selecting it, or querying it, exactly as in the traditional relational database, where you query and get small results from large data. So the second innovation is that there is a tree of query servers that pass intermediate results from the root to the leaves and back (a toy sketch of this serving tree appears at the end of this section). The intermediate servers essentially execute a complex query plan, very similar in some respects to traditional SQL engines. However, they operate at a different scale: SQL engines predominantly operated in memory, whereas these servers operate in a distributed fashion across a tree of query servers, passing results back and forth across a network. As a result, Google is able to demonstrate orders of magnitude better performance than MapReduce when performing queries on petabytes of data. Not only does it give more speed, it also clearly saves storage as compared to MapReduce. The underlying storage layer remains the distributed GFS system, but Dremel is now widely used within Google and is available publicly through the BigQuery service. There is some effort at creating an open-source equivalent of Dremel; it is in its infancy right now, it is under Apache, and it is called Drill, but beyond the name I don't think they have made too much progress so far.

So we can now summarize our picture of how database technology has evolved over the years. We started out with the relational row store, which was essentially one-size-fits-all and still works fine for gigabytes of data. Then we moved on to column-oriented data warehouse technologies, specifically designed for OLAP queries, which scaled up to terabytes of data but required us to move off the relational row store into a data warehouse. In parallel, the web side created distributed NoSQL databases, which were a mix of row and column stores and also allowed MapReduce processing for bulk analysis; these scaled to tens of terabytes, or sometimes even larger volumes of data. In parallel with this, we have had in-memory databases emerging in the past few years, which can now do what the one-size-fits-all relational row stores did, again on gigabytes of data, but with an order of magnitude more performance. And for large-scale processing of petabytes of data, Google has evolved Dremel, which again is a one-size-fits-all model at that scale.

So we have three models today. We have Dremel, which only Google uses. We have in-memory databases, which are fine for doing OLAP on reasonably small databases. And for intermediate processing, to do things like computing classifiers on terabytes of data, distributed NoSQL is the preferred choice. At the same time, when you have terabytes of data and you want to do OLAP queries very fast using SQL, there still remains a place for the column-store data warehouses; the special-purpose appliances like Netezza, which use parallel computing and column storage, also have a place. This place occupied by the column-store warehouses might evolve to a Dremel-like architecture in the future, once we actually have a publicly available version of Dremel. This is a space to watch carefully, and that is what big data technology is looking forward to in the next three years.
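Finally, here is the toy sketch of the serving tree referred to earlier. It is not Google's implementation: the shards, the fan-out of two, and the single aggregate (a global average) are all invented for illustration. The point it makes is that the leaf servers scan the column data, while everything that travels up the tree is a small partial result, which is exactly why the intermediate data stays so much smaller than the petabytes being scanned.

    # Toy sketch of the serving-tree idea: the root pushes a query down,
    # leaf servers scan their shard of a column, and each intermediate
    # server merges the partial results of its children on the way back up.
    # Shards, fan-out, and the aggregate (a global average) are hypothetical.

    def leaf_server(shard):
        """Scan one shard of a column and return a partial aggregate."""
        return (sum(shard), len(shard))          # (partial sum, partial count)

    def intermediate_server(partials):
        """Merge partial aggregates received from child servers."""
        total = sum(s for s, _ in partials)
        count = sum(c for _, c in partials)
        return (total, count)

    def root_server(shards, fanout=2):
        # Bottom level: leaves scan their shards (in parallel in reality).
        level = [leaf_server(shard) for shard in shards]
        # Higher levels: merge small partial results, never the raw data.
        while len(level) > 1:
            level = [intermediate_server(level[i:i + fanout])
                     for i in range(0, len(level), fanout)]
        total, count = level[0]
        return total / count                     # final answer at the root

    shards = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]   # hypothetical column shards
    print(root_server(shards))                          # 5.5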