Of course, as we have seen, parallel computing is not new; it has been around essentially since the early 80s, and databases have traditionally evolved to exploit parallel computing over the years.

In the beginning we had shared-memory databases, which still persist today: a large multiprocessor system with multiple CPUs sharing memory, a single operating system scheduling jobs or processes across the different CPUs, and a common disk or storage area network where all the data is stored. Examples of such systems abound; almost all servers today are multiprocessing shared-memory systems. But they have at most a few dozen processors, and even with each of them having multiple cores, we find shared-memory systems supporting at most a few hundred processing units. The shared-memory model simply doesn't scale beyond this level.

Databases have therefore exploited other parallel architectures, such as the shared-disk architecture and the shared-nothing architecture, which scale to a greater number of processors than shared memory. In a shared-disk architecture you may have multiple processors which communicate over a network but access a common disk system, which could be a storage area network or network-attached storage, using two different networks: one for communication between processors, and the other for accessing storage. The shared-nothing architecture, on the other hand, relies on local disks attached to each processor, so that the only communication that takes place over the network is between processors.

In parallel databases, both shared-nothing and the other architectures, SQL queries are executed in parallel by multiple processors. In the shared-nothing architecture, in addition, the data itself is distributed across different disks using a variety of partitioning schemes, such as different sets of rows on different disks, or, in the case of column-oriented specialized engines for analytical processing, different sets of columns on different disks; a small sketch of these two schemes follows below.

All this has happened in the database community, and parallel databases are now almost a given. They all support SQL. Some of them support transaction processing, where of course there is the additional overhead of managing transaction isolation and consistency across multiple processors, but we won't get into that right now.

The thing that is not handled properly by parallel databases is fault tolerance. They didn't have to handle fault tolerance because, with just a few dozen processors, you don't need to worry about a processor failing while executing a SQL query, which in any case will take a few seconds or a few minutes at most. When you are executing a large batch job which touches virtually all the data, however, the chances of a processor failing are much higher, especially with a large number of processors. In such situations the parallel database architectures are simply not fault tolerant. Fault tolerance in the parallel database world relies on having a hot-standby deployment which is identical to the primary, and essentially replicating data over a high-speed network between the primary and the hot standby. This is a very costly and still not completely fault-tolerant architecture compared to, say, the highly distributed, fault-tolerant MapReduce system running on a distributed file system such as GFS or HDFS. So here is where the dichotomy comes in: when you are doing large-volume analytical processing which touches all the data, the parallel database architecture simply doesn't work.
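To make the row- versus column-partitioning idea above concrete, here is a minimal sketch in Python. The "disks", column names, and three-way split are illustrative assumptions, not the layout of any particular parallel database; real shared-nothing engines use catalog metadata and far more sophisticated placement.

```python
# Sketch of shared-nothing partitioning, with each "disk" modelled as a Python list/dict.
NUM_DISKS = 3

# Row partitioning: each row is hashed on its key and placed on exactly one disk.
row_disks = [[] for _ in range(NUM_DISKS)]

def place_row(row):
    disk = hash(row["id"]) % NUM_DISKS        # hash partitioning on the row key
    row_disks[disk].append(row)

# Column partitioning: each column's values are kept together on their own disk.
col_disks = {"id": [], "name": [], "amount": []}

def place_columns(row):
    for col, value in row.items():
        col_disks[col].append(value)

if __name__ == "__main__":
    sample = [
        {"id": 1, "name": "alice", "amount": 120.0},
        {"id": 2, "name": "bob",   "amount":  75.5},
        {"id": 3, "name": "carol", "amount": 310.0},
    ]
    for r in sample:
        place_row(r)
        place_columns(r)

    # A row-partitioned scan can run on all disks in parallel, while an analytical
    # query like SUM(amount) only needs to read the single "amount" column disk.
    print("rows per disk:", [len(d) for d in row_disks])
    print("sum(amount):", sum(col_disks["amount"]))
```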
So, where databases have been evolving over the past few years, because of the big data technology that has emerged from the web, is in two directions.

On the one hand we have the NoSQL databases, which are based on big data technology. They are called NoSQL because, first, they don't support full ACID transactions, or rather fully isolated serializable transactions in the traditional sense. Second, instead of having complex indexing, they have sharded indexing, which is essentially partitioning of the data into different chunks or different blocks of disk; we'll come to sharded indexing shortly, and a small sketch appears below. Third, they don't support full joins; for a variety of reasons, they are restricted in the kinds of joins that they can perform efficiently. And lastly, they support column-oriented storage if needed, so that for very wide records, different parts of a record can be stored on different servers.

The other side of database evolution is in-memory databases, which has been driven by the increasing volume of main memory available on today's servers and the falling cost of memory. Today servers have, you know, 64 gigabytes or even 128 gigabytes of main memory, which allows many ordinary enterprise transaction systems, for most practical purposes, to actually reside in main memory. So real-time transactions become possible at much higher rates than before. Varieties of indexes can be supported just as in traditional databases, but now in memory, so different kinds of indexing structures are possible, and of course complex joins are possible. Of course the data still has to be small compared to web-scale petabytes, or many hundreds of terabytes, of data. But if you are talking about the gigabytes of data that a traditional enterprise transaction system rarely exceeds, an in-memory database is a happy compromise where OLAP queries, as well as all kinds of bulk data processing, can be performed fairly efficiently.

However, this is not the big data world; on the web that doesn't quite work, and therefore NoSQL databases have become quite popular. We will study some of the NoSQL databases and their concepts very shortly. As far as in-memory databases are concerned, many of the techniques that worked in traditional databases simply carry over, except that, because everything is in memory, one doesn't have to deal with the additional complexity of some parts of the index being cached versus being on disk, and things like that.
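Here is a minimal sketch of the sharded indexing idea mentioned above: instead of one large index, the key space is hash-partitioned into shards, and each shard keeps its own small local index. The shard count, key names, and dictionary-based "index" are illustrative assumptions, not the API of any particular NoSQL system.

```python
# Sketch of sharded indexing: route a key to one shard, then use that shard's local index.
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]   # each shard holds a local key -> record map

def shard_for(key):
    return hash(key) % NUM_SHARDS              # a key is owned by exactly one shard

def put(key, record):
    shards[shard_for(key)][key] = record       # write touches only one shard's index

def get(key):
    return shards[shard_for(key)].get(key)     # point lookup: route, then local lookup

if __name__ == "__main__":
    put("user:42", {"name": "alice"})
    put("user:99", {"name": "bob"})
    print(get("user:42"))                      # {'name': 'alice'}
    # A query that is not on the shard key (e.g. "all users named bob") would have to
    # be broadcast to every shard, which is one reason such systems restrict joins
    # and complex secondary-index lookups.
```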