Of course, as we have seen, parallel computing is not new; it has been around essentially since the early 80s, and databases have traditionally evolved to exploit parallel computing over the years. In the beginning we had shared-memory databases, which still persist to date, where you have a large multiprocessor system with multiple CPUs sharing memory, a single operating system scheduling jobs or processes across the different CPUs, and a common disk or storage area network where all the data is stored. Examples of such systems abound: almost all servers today are multiprocessing shared-memory systems. But with at most a few dozen processors, and even with each of them having multiple cores, we find shared-memory systems supporting at most a few hundred processing units. The shared-memory model simply doesn't scale beyond this level.

Databases have therefore exploited other parallel architectures, such as the shared-disk architecture and the shared-nothing architecture, which scale to a greater number of processors than shared memory. In a shared-disk architecture you may have multiple processors which communicate over a network while accessing a common disk system, which could be a storage area network or network-attached storage, using two different networks: one for communication between processors, and the other for accessing storage. The shared-nothing architecture, on the other hand, relies on local disks attached to each processor, so that the only communication that takes place over the network is between processors.

In a parallel database, in the shared-nothing architecture as well as the others, SQL queries are executed in parallel by multiple processors. In the shared-nothing architecture, in addition, the data itself is distributed across the different disks using a variety of partitioning schemes, such as different sets of rows on different disks, or, in the case of column-oriented specialized engines for analytical processing, different sets of columns on different disks.

All this has happened in the database community, and parallel databases are now almost a given. They all support SQL. Some of them support transaction processing, where of course there is the additional overhead of managing transaction isolation and consistency across multiple processors, but we won't get into that right now.
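To make the row-partitioning idea just described concrete, here is a minimal Python sketch, with purely illustrative names and data, of how a shared-nothing engine might hash-partition the rows of a table across a few local disks and then run an aggregate query in parallel, one partition per processor.

```python
# Minimal sketch of horizontal (row) partitioning in a shared-nothing setup.
# Each "disk" is just a Python list standing in for a processor's local storage.
from collections import defaultdict
import hashlib

NUM_DISKS = 4

def disk_for(key: str) -> int:
    # Stable hash so the same key always lands on the same disk/node.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_DISKS

disks = defaultdict(list)  # disk id -> rows stored on that disk
rows = [
    {"order_id": "o1", "customer": "alice", "amount": 120},
    {"order_id": "o2", "customer": "bob",   "amount": 75},
    {"order_id": "o3", "customer": "carol", "amount": 310},
]
for row in rows:
    disks[disk_for(row["order_id"])].append(row)

# A query such as SELECT SUM(amount) can now run in parallel: each processor
# scans only its local partition, and the partial sums are then combined.
partial_sums = [sum(r["amount"] for r in part) for part in disks.values()]
print(sum(partial_sums))  # 505
```

A column-oriented analytical engine would do the analogous thing with different sets of columns, rather than sets of rows, placed on different disks.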
The thing that is not handled properly by parallel databases is fault tolerance. They didn't have to handle fault tolerance because, with just a few dozen processors, you don't need to worry about a processor failing while executing a SQL query, which in any case will take a few seconds or a few minutes at most. But when you are executing a large batch job which touches virtually all the data, the chances of a processor failing are very high, especially with a large number of processors, and in such situations the parallel database architectures are simply not fault tolerant. Fault tolerance in the parallel database world relies on having a hot standby, an architecture or deployment which is identical to the primary, and essentially replicating data over a high-speed network between the primary and the hot standby: a very costly and still not completely fault-tolerant approach compared to, say, the highly distributed, fault-tolerant MapReduce system running on a distributed GFS or HDFS. So here is where the dichotomy comes in: when you're doing large-volume analytical processing which touches all the data, the parallel database architecture simply doesn't work.

So, because of the big data technology that has emerged from the web, databases have been evolving over the past few years in two directions. On the one hand we have the NoSQL databases, which are based on big data technology. They're called NoSQL because, firstly, they don't support full ACID transactions, or rather fully isolated serializable transactions in the traditional sense. Secondly, instead of having complex indexing, they have sharded indexing, which essentially partitions the data between different chunks or different blocks of disk; we'll come to sharded indexing shortly. Third, they don't support full joins: for a variety of reasons, they are restricted in the kinds of joins that they can perform efficiently. And lastly, they support column-oriented storage if needed, so that for very long, wide records, different parts of a record can be stored on different servers.
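Since sharded indexing comes up again shortly, here is a minimal Python sketch of the idea, with illustrative names only: rather than maintaining one large index, the key space is hash-partitioned into shards, and a lookup is routed to the single shard that can hold the key.

```python
# Minimal sketch of sharded indexing: each shard is a small independent index
# (a dict standing in for a chunk or block of disk on some server).
import hashlib
from typing import Any, Dict, List, Optional

class ShardedIndex:
    def __init__(self, num_shards: int = 8) -> None:
        self.shards: List[Dict[str, Any]] = [{} for _ in range(num_shards)]

    def _shard(self, key: str) -> Dict[str, Any]:
        # Route the key to exactly one shard via a stable hash.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

    def put(self, key: str, value: Any) -> None:
        self._shard(key)[key] = value

    def get(self, key: str) -> Optional[Any]:
        # Only one shard is consulted; the others are never touched.
        return self._shard(key).get(key)

index = ShardedIndex()
index.put("user:42", {"name": "alice"})
print(index.get("user:42"))  # {'name': 'alice'}
```

In a real NoSQL store each shard would live on a different server or block of disk; the in-process dictionaries here merely stand in for that.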
The other side of database evolution is in-memory databases, which has been driven by the increasing volumes of main memory available on today's servers and the falling cost of memory. Today's servers have, you know, 64 gigabytes or even 128 gigabytes of main memory, which allows, for most practical purposes, many ordinary enterprise transaction systems to actually reside in main memory. So real-time transactions become possible at much higher rates than before. Varieties of indexes can be supported just as before in traditional databases, but now in memory, so different kinds of indexing structures are possible, and of course complex joins are possible. Of course, the data still has to be small compared to web-scale petabytes or many hundreds of terabytes of data. But if you are talking about gigabytes of data, which a traditional enterprise transaction system rarely exceeds, an in-memory database is a happy compromise where OLAP queries, as well as all kinds of bulk data processing, can be performed fairly efficiently. However, this is not the big data world of the web; there this approach doesn't quite work, and therefore NoSQL databases have become quite popular. We will study some of the NoSQL databases and their concepts very shortly. As far as in-memory databases are concerned, many of the techniques that worked in traditional databases simply carry over, except that, because everything is in memory, one doesn't have to deal with the additional complexity of some parts of the index being cached versus being on disk, and things like that.
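As a small illustration of why complex joins become straightforward once the data fits in main memory, here is a sketch of an in-memory hash join in Python, using made-up tables; it is not the API of any particular in-memory database.

```python
# Classic in-memory hash join between two small, entirely memory-resident tables.
customers = [
    {"cust_id": 1, "name": "alice"},
    {"cust_id": 2, "name": "bob"},
]
orders = [
    {"order_id": "o1", "cust_id": 1, "amount": 120},
    {"order_id": "o2", "cust_id": 2, "amount": 75},
    {"order_id": "o3", "cust_id": 1, "amount": 310},
]

# Build phase: an in-memory hash index on the join key of the smaller table.
by_cust = {c["cust_id"]: c for c in customers}

# Probe phase: stream the larger table and look up each row in the index.
joined = [
    {**o, "name": by_cust[o["cust_id"]]["name"]}
    for o in orders
    if o["cust_id"] in by_cust
]
print(joined)
```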