Over the past 40 years or so, relational databases have become the workhorse of most enterprise data management systems. Relational databases, as you all know, comprise tables, which are nothing but collections of records with attributes or fields, laid out on disk in some way. The most common layout is that of a row-oriented database, where disk pages consist of rows of records. The major problem that relational databases address is that of transaction processing, where multiple clients or users are trying to insert or update different records in the same table, and their actions have to be protected from each other so that consistency of some form is maintained. As a result, database management systems build in many features, like locking and other forms of concurrency control, to ensure that transactions are effectively isolated from each other. This is really the primary function of a relational database. The second function, of course, is queries, where the user should be able to access any record based on a combination of WHERE clauses, that is, a combination of conditions that the record needs to satisfy. SQL is the main language for accessing relational data and essentially defines the relational model. Indexes, such as the B+ tree index depicted here, are used to speed up access to records so that SQL queries can actually be efficient. The B+ tree is the most common form of index. It is essentially like a binary tree, except that the interior nodes don't actually store data values; they only store key values, and the leaf nodes point to blocks of disk where records with particular combinations of keys are actually available to retrieve. Other types of indexes, like hash indexes and bitmap indexes, are also used, but B+ trees are by far the most popular for indexing a row-oriented database.
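To make the B+ tree idea concrete, here is a minimal Python sketch of a two-level lookup. The keys, record names, and leaf contents are all invented for illustration; the point is only that the interior node holds separator keys, while the leaves hold the actual (key, record) pairs:

```python
import bisect

# Hypothetical leaves: sorted (key, record) pairs, one list per "disk block".
leaves = [
    [(5, "rec-a"), (12, "rec-b")],
    [(23, "rec-c"), (31, "rec-d")],
    [(47, "rec-e"), (58, "rec-f")],
]
# Interior node stores only keys: the smallest key of each leaf after the first.
separators = [23, 47]

def lookup(key):
    # Interior node: decide which leaf (disk block) to fetch, using keys only.
    i = bisect.bisect_right(separators, key)
    # Leaf node: scan the block for the actual record.
    for k, rec in leaves[i]:
        if k == key:
            return rec
    return None

print(lookup(31))  # -> rec-d
```

A real B+ tree would have many levels, high fan-out, and linked leaves for range scans, but the key-only interior nodes are what keep the tree shallow and lookups cheap.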
Now, the other important thing about relational databases is that they don't really rely on the file systems provided by the operating system to manage disk. They do their own disk management, figuring out which pages to prefetch and which pages to replace from memory, based on the traffic they see coming in, in the form of queries, or for that matter transactions, from multiple clients. For the most part, when data was reasonably small and queries using indexes could be performed fairly efficiently on exactly the same data stores where transactions would insert and update data, row-oriented data stores were perfectly fine for both transaction processing and answering questions about data. As the volume of data grew, and in particular as the width of data, in terms of the number of columns one would store per record, became very large, the amount of storage required clearly grew as well. But at the same time, a query that only requires data from, say, two columns ends up needing to fetch data from maybe 100 columns, just because they happen to come along with the disk block as it is fetched. This makes query processing extremely inefficient, from a storage perspective as well as a performance perspective, and for such situations column-oriented databases have been developed in the past decade or so, especially to take care of situations where the number of columns is very large and only queries need to be supported, rather than transactions. In a column-oriented database, unlike a row store, sets of columns are stored together in disk blocks. In other words, a particular record is actually split across multiple disk blocks, with different sets of columns occurring in different blocks. Within each block, the data is sorted along particular sets of columns.
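The wide-table cost difference can be sketched in a few lines of Python. The table shape and values below are made up; the comparison simply counts how many stored values a two-column query would touch under each layout:

```python
num_rows, num_cols = 1000, 100

# Row-oriented layout: each record carries all 100 columns, so fetching a
# record's block drags in every column whether the query needs it or not.
row_store = [[r * num_cols + c for c in range(num_cols)] for r in range(num_rows)]

# Column-oriented layout: each column is stored contiguously, so a query
# touching two columns reads only those two arrays.
col_store = {c: [r * num_cols + c for r in range(num_rows)] for c in range(num_cols)}

# A query needing only columns 3 and 7:
row_values_touched = sum(len(rec) for rec in row_store)     # 100,000 values
col_values_touched = len(col_store[3]) + len(col_store[7])  # 2,000 values
print(row_values_touched, col_values_touched)  # 100000 2000
```

The 50x gap here is exactly the ratio of total columns to columns actually queried, which is why column stores pay off as tables get wider.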
So, unlike a row store, where data essentially gets stored as it's inserted, a column-oriented database can process the data from a row store into different sort orders for different columns. Of course, by doing so one loses the identity of an individual record, and therefore a column-oriented database needs to create extra information in the form of what is called a join index, where rows in different disk blocks are tied together to indicate that they actually form the same record. Column stores are ideally suited for what are known as analytical processing queries. Not that such queries could not be done using row stores as well; it's just that when you have large numbers of columns, column stores become more efficient. Now, this is an example of an OLAP, or Online Analytical Processing, query. Essentially, there's a bunch of sales figures, the quantity and amount of some product being sold to certain customers, at certain addresses where the retail outlets are, on particular days, and the sales obviously are of different products. Now, each of these fields is a dimension which describes the particular sale record. The dimensions themselves can be hierarchical, in the sense that a particular product, say a particular type of mattress, may actually be of category mattress or bed and so on. Similarly, a particular day is part of a week, which is part of a month, which is part of a quarter, a year, et cetera, and similarly for location. The kind of query one tries to answer in OLAP is, say, selecting the total sales and grouping them by category, so you get results which can be shown in a report or a pie chart of this form. Various types of reporting, including different types of charts, are what OLAP queries actually produce. You can figure out how the SQL actually works; it's fairly straightforward.
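A runnable sketch of that group-by-category query, using Python's built-in SQLite. The table name, columns, products, and figures are all invented stand-ins for the sales example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales
                (product TEXT, category TEXT, city TEXT, day TEXT,
                 quantity INTEGER, amount REAL)""")
conn.executemany("INSERT INTO sales VALUES (?,?,?,?,?,?)", [
    ("DreamFoam", "mattress", "Mumbai", "2013-01-05", 2, 300.0),
    ("SoftRest",  "mattress", "Delhi",  "2013-01-06", 1, 180.0),
    ("OakFrame",  "bed",      "Mumbai", "2013-01-06", 1, 450.0),
])

# Total sales grouped by category: the classic OLAP aggregation.
rows = conn.execute("""SELECT category, SUM(amount)
                       FROM sales
                       GROUP BY category
                       ORDER BY category""").fetchall()
print(rows)  # [('bed', 450.0), ('mattress', 480.0)]
```

The same pattern extends up the dimension hierarchies: grouping by week instead of day, or by region instead of city, just changes the column in the GROUP BY clause.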
But as far as what constitutes business intelligence for most of the enterprise world, it is all about OLAP queries. Analytics is largely thought of as being OLAP by a large part of the community. However, as we have seen, there's much more to analytics than simply queries. So, let's reflect now on why we actually need databases. First of all, we need to add data as it's produced, and data is added from multiple sources in parallel, like multiple people executing sales transactions, or e-commerce transactions, or banking transactions. These need to be isolated from each other so that they don't create inconsistent data, and so you need transaction processing. The second reason you need databases is to query efficiently, which leads to indexing and a high-level language like SQL. But the point we need to think about is whether or not you actually need transaction processing for analytics. Clearly we don't, and that's why column-oriented databases were developed. At the same time, there may be some advantages in not having to move the data out of a transaction store and still being able to perform analytical queries, or OLAP queries, on it, so that you don't need to incur the overhead of transferring data from a row-oriented transaction-processing database to, say, a column-oriented analytical database. Queries are, of course, important for OLAP, as we have just seen. But the important point to note here is that for other types of analytics apart from OLAP, such as large-scale counting like the programming assignment that we've seen, building classifiers using say Bayesian techniques, or executing large complex joins to compute sales by cities, as we've seen in the map-reduce example last week, large volumes of data are actually going to be touched. In fact, the entire data set is going to be touched by such programs. In such situations, indexes become far less relevant.
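The sales-by-city computation above can be sketched as a toy map-reduce in plain Python. The records and city names are invented; the structure, a mapper emitting (key, value) pairs, a shuffle grouping by key, and a reducer aggregating each group, is the point, and note that every record is touched, so no index helps:

```python
from collections import defaultdict
from itertools import chain

# Hypothetical sales records: (city, amount).
records = [("Mumbai", 300.0), ("Delhi", 180.0), ("Mumbai", 450.0)]

def mapper(record):
    city, amount = record
    yield (city, amount)          # emit (key, value)

def reducer(key, values):
    return (key, sum(values))     # aggregate all values for one key

# Shuffle: group the mapped pairs by key, then reduce each group.
groups = defaultdict(list)
for k, v in chain.from_iterable(mapper(r) for r in records):
    groups[k].append(v)
sales_by_city = dict(reducer(k, vs) for k, vs in groups.items())
print(sales_by_city)  # {'Mumbai': 750.0, 'Delhi': 180.0}
```

In a real map-reduce framework the mapper and reducer run in parallel across many machines, and it is the framework, not this loop, that provides the fault tolerance discussed next.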
What becomes very relevant instead is resilience to hardware failure, because programs which touch large volumes of data are likely to take a lot of time, and therefore hardware can fail during the execution, and so we need to figure out how to achieve fault tolerance. MapReduce essentially provides that fault tolerance and therefore becomes vital. Lastly, it's important to recognize that OLAP is also a form of analytics, and can really be viewed as computing a part of the joint distribution across all the features, which are the columns of a table, where you use intuition to select which set of features you want to actually look at. So, to summarize, there are two contradictory requirements here in analytics. If you have large volumes of data being touched, you don't need indexes. But if you want to find out pieces of the joint distribution using OLAP queries and human intuition, you do need indexes. And it is this dichotomy which is driving most of the debate in big data technology above, say, the basic distributed file system layer.