One of the first NoSQL databases was Google's Bigtable, along with its Hadoop equivalent, which is called HBase. Bigtable is, in many ways, probably the origin of the term big data. The paper describing it appeared only a few years ago and is one of the earliest in this area, and it is Google's implementation on top of the distributed GFS file system, which led to the development of many NoSQL databases in the first place.

A Bigtable table is distributed across many different servers by row first: the table is broken up into many tablets, each containing multiple rows. Tablet is the Bigtable term; in HBase, by the way, tablets are called regions. Each tablet in turn is broken up into column families, each containing a set of columns from among those in the table. Each column family, which contains a particular set of columns spanning multiple rows, is stored as a chunk of the distributed file system, GFS or HDFS. Each such chunk is served by a tablet server, with three separate replicas maintained in GFS as we have seen before, and the tablet servers together form the entire Bigtable.

Of course, in order to access any particular record, we need to know which tablet server the row and column we are looking for fall on. For that purpose there is a metadata table which keeps track of which tablet server a particular row lies on. The metadata table is itself another Bigtable, and therefore also comprises tablets, or regions, which are maintained on a separate set of tablet servers. One particular tablet is the root tablet, and all searches start from there. To look for a particular row and column, one looks up the root tablet, which tells us which child tablet among the metadata tablets contains the information about that row and column. The child tablet will in turn tell us which tablet server to look at, and there we can pick up the chunk where the required row and column family is actually stored.

Let's take a look at what data in a Bigtable looks like, to get a feel for how it works. Each row is indexed by some key, for example a transaction ID when one is storing sales transactions. Different column families can have multiple columns within them. For example, the location column family might contain a whole bunch of columns to do with location, all stored together in a single chunk; the sale column family might have another set of columns, and similarly for products and whatever other column families there might be. Each of them is stored separately on different chunk servers.

An important point about Bigtable is that, while the number of column families is fixed when you create a table, the number of columns within a column family can vary, so you can dynamically add new columns to a particular column family. Further, each column family for a particular row can have multiple entries. For example, the region could be both "US East Coast" and "US Northeast", and that is perfectly fine, unlike a relational database where a particular row-column combination can have only one value. Additionally, each of these different values for a particular row-column combination can be timestamped, so the location for this transaction might be "US East Coast" today, but tomorrow one is free to change it to "US Northeast". That change is not made by updating the value "US East Coast", but by inserting a new value with a new timestamp, so that one can always look up whichever version of the region one wants, depending on which timestamp one is searching for.
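To make the versioned-cell idea concrete, here is a minimal sketch using the HBase Java client (2.x API). The table name "invoices", the "location" column family, and the row key "txn-0001" are hypothetical names chosen for illustration, and the table is assumed to already exist with the column family configured to retain more than one version per cell.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedCellExample {
    public static void main(String[] args) throws IOException {
        // Hypothetical "invoices" table with a "location" column family,
        // assumed to have been created with VERSIONS > 1 so old cells are kept.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table invoices = conn.getTable(TableName.valueOf("invoices"))) {

            byte[] row = Bytes.toBytes("txn-0001");       // row key: transaction ID
            byte[] location = Bytes.toBytes("location");  // column family
            byte[] region = Bytes.toBytes("region");      // column qualifier

            // Two values for the same row/column, distinguished only by timestamp:
            // nothing is overwritten, a new version is inserted alongside the old one.
            Put put = new Put(row);
            put.addColumn(location, region, 1000L, Bytes.toBytes("US East Coast"));
            put.addColumn(location, region, 2000L, Bytes.toBytes("US Northeast"));
            invoices.put(put);

            // Read back all stored versions of location:region for this row.
            Get get = new Get(row);
            get.readVersions(10);  // setMaxVersions(10) on older 1.x clients
            Result result = invoices.get(get);
            for (Cell cell : result.getColumnCells(location, region)) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```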
In effect, one can keep multiple snapshots of one's categorization of the data together in the same Bigtable, which is a tremendous advantage compared to a traditional relational database.

Because Bigtable and HBase rely on the underlying distributed file systems, GFS and HDFS respectively, they also inherit some of the properties of these systems. In particular, large parallel reads and inserts are supported efficiently, even simultaneously on the same table, unlike in a traditional relational database. Similarly, reading all the rows for a small number of column families in a large table, such as for an aggregation query that sums them all up, is efficient in a manner similar to column-oriented databases.

Of course, one of the downsides of the Bigtable or HBase architecture is that there is really only one key, the primary key, by which data is distributed across different processors or different chunk servers. If one wants to access data by any other column family, one cannot rely on any other index, and the only way to do it is by reading all the data. So for such queries Bigtable is not that efficient, and one needs to add additional structure to it to enable efficient queries. In fact, this is true for any mechanism one uses to store data in sharded form. By sharding, one means storing different rows on different pieces of disk or different servers, such as tablet servers on a distributed file system. So, if you have sharded data, where data is distributed across machines by some key, and you want to access it using another key, you need to do something smarter, which is essentially to create an index of some kind.

For example, let's take our Bigtable of records representing invoice transactions, which have to do with billing of some kind for some products. Our main table is the invoice table, whose keys are transaction IDs, and which might have different values for different timestamps as we discussed earlier. Now, if you want to search this table by some other column, such as by product, you would need to create index tables. These would also be Bigtables, but their keys would be the different product values, each entry telling us which key in the original table that particular product actually lies in. Similarly, one could create index tables for amounts, as well as for combinations of, say, the city and the status of the transaction.

It is also useful to create such an index in sorted form, so that when records are inserted into an index table they land in sorted order. As a result, a query which asks us to find all transactions with amounts in some range, say between 50 and 90, becomes easier, since all the values in such a range lie in the same contiguous piece of the amount index table. Once you have retrieved these index values, one knows which keys to access, and one can then fetch them directly from the original Bigtable or HBase table.
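As a rough sketch of how such an index table might be maintained and queried with the HBase Java client (2.x API), the following assumes a hypothetical main table "invoices" and a hand-maintained index table "invoices_by_amount" whose row keys are zero-padded amounts followed by the transaction ID, so that lexicographic key order matches numeric order; the range query then becomes a single contiguous scan of the index.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class AmountIndexExample {
    // Zero-pad amounts so lexicographic row-key order matches numeric order.
    static byte[] indexKey(long amount, String txnId) {
        return Bytes.toBytes(String.format("%010d#%s", amount, txnId));
    }

    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table index = conn.getTable(TableName.valueOf("invoices_by_amount"));
             Table invoices = conn.getTable(TableName.valueOf("invoices"))) {

            byte[] ref = Bytes.toBytes("ref");  // column family in the index table
            byte[] key = Bytes.toBytes("key");  // qualifier holding the main-table row key

            // Maintain the index: one index row per (amount, transaction).
            Put p = new Put(indexKey(75, "txn-0001"));
            p.addColumn(ref, key, Bytes.toBytes("txn-0001"));
            index.put(p);

            // Range query: all transactions with amount between 50 and 90.
            // Because index rows are stored in sorted key order, this is a
            // single contiguous scan of the index table.
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes(String.format("%010d", 50)))
                    .withStopRow(Bytes.toBytes(String.format("%010d", 91)));  // stop row is exclusive
            try (ResultScanner hits = index.getScanner(scan)) {
                for (Result hit : hits) {
                    byte[] mainKey = hit.getValue(ref, key);
                    // Use the key found in the index to fetch the full record
                    // from the main table.
                    Result txn = invoices.get(new Get(mainKey));
                    System.out.println(Bytes.toString(mainKey) + " -> " + txn);
                }
            }
        }
    }
}
```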
One example where exactly such a structure is likely used is Google App Engine's Datastore, which, as many have speculated, is probably based on Bigtable and uses indexes in exactly this way to query the Bigtable efficiently.