1
00:00:00,000 --> 00:00:06,795
Over the past, 40 years or so, relational
databases have become the work horse of

2
00:00:06,795 --> 00:00:13,369
most enterprise data management systems.
Relational databases as you all know

3
00:00:13,369 --> 00:00:18,527
comprise of tables, which are nothing but
a bunch of records with attributes or

4
00:00:18,527 --> 00:00:23,057
fields which are of course laid out on
disk in some way.

5
00:00:23,057 --> 00:00:35,886
The most common layout is that of a row
oriented database, where pages of disk are

6
00:00:35,886 --> 00:00:44,086
consisting of rows of records.
The major problem that relational

7
00:00:44,086 --> 00:00:50,539
databases is address is that of
transaction processing, where multiple

8
00:00:50,539 --> 00:00:58,083
clients or users are trying to insert or
update different records in the same

9
00:00:58,083 --> 00:01:05,023
table, and their actions have to be
protected from each other so that

10
00:01:05,023 --> 00:01:12,354
consistency of some form is maintained.
As the result the database management

11
00:01:12,354 --> 00:01:17,887
systems build in many features.
Like locking and other forms of

12
00:01:17,887 --> 00:01:24,227
consistency control to ensure that
transactions are effectively isolated from

13
00:01:24,227 --> 00:01:28,879
each other.
And this is the really primary function of

14
00:01:28,879 --> 00:01:34,728
the relational database.
The second function of course is queries.

15
00:01:34,728 --> 00:01:41,225
Where the user should be able to get
access to any record based on a

16
00:01:41,225 --> 00:01:49,186
combination of where clauses or a
combination of conditions that the record

17
00:01:49,186 --> 00:01:55,003
needs to satisfy.
Sql is the main language for accessing

18
00:01:55,003 --> 00:02:00,954
relational data and essentially defines
the relational model.

19
00:02:00,954 --> 00:02:09,328
Indexes, such as the b+3 index depicted
here, are used to speed up access to

20
00:02:09,328 --> 00:02:14,331
records, so that SQL queries can actually
be efficient.

21
00:02:14,331 --> 00:02:22,354
The b+3 is the most common form of index,
which is essentially like a binary tree,

22
00:02:22,354 --> 00:02:28,151
except for the fact that.
The interior nodes don't actually store

23
00:02:28,151 --> 00:02:31,762
data values.
They only store key values.

24
00:02:31,762 --> 00:02:39,493
And the leaf nodes point to blocks of disk
where particular combinations of keys are

25
00:02:39,493 --> 00:02:48,018
actually available to retrieve.
Other types of indexes like hash indexes,

26
00:02:48,018 --> 00:02:55,559
bitmap indexes are also used but B + trees
are by far the most popular for indexing a

27
00:02:55,559 --> 00:03:01,465
row-oriented database.
Now, the other important thing about

28
00:03:01,465 --> 00:03:06,906
relational database is they, that they
that they don't really rely on file

29
00:03:06,906 --> 00:03:10,793
systems, provided by the operating system
to manage disk.

30
00:03:10,793 --> 00:03:15,842
They is, do their own disk management
figuring out what pages is to prefetch,

31
00:03:15,842 --> 00:03:21,966
and which pages is to replace from the
memory based on the traffic that they see

32
00:03:21,966 --> 00:03:27,911
coming in, in the form of queries for that
matter transactions coming from multiple

33
00:03:27,911 --> 00:03:31,061
clients.
[inaudible] For the most part when the

34
00:03:31,061 --> 00:03:35,838
data was reasonably small.
And queries using indexes could be

35
00:03:35,838 --> 00:03:42,594
performed fairly efficiently on exactly
the same data stores where transactions

36
00:03:42,594 --> 00:03:47,983
would insert an update data.
The row-oriented data stores were

37
00:03:47,983 --> 00:03:54,642
perfectly fine for both transaction
processing and answering questions about

38
00:03:54,642 --> 00:03:58,767
data.
As, the volume of data grew and in

39
00:03:58,767 --> 00:04:05,258
particular the width of data in terms of
the number of columns that one would store

40
00:04:05,258 --> 00:04:11,090
per record becomes very large.
And clearly the amount of storage required

41
00:04:11,090 --> 00:04:15,404
grows.
But at the same time, what also happens is

42
00:04:15,404 --> 00:04:21,833
that a query that only requires data from,
say, two columns, ends up needing to fetch

43
00:04:21,833 --> 00:04:28,306
data, about maybe 100 columns, just
because they happen to come along with the

44
00:04:28,306 --> 00:04:33,066
disk block as it is fetched.
This makes query processing extremely

45
00:04:33,066 --> 00:04:38,017
inefficient, both from a storage
perspective as well as performance

46
00:04:38,017 --> 00:04:43,030
perspective, and for such situations, the
column oriented databases have been

47
00:04:43,030 --> 00:04:49,030
developed in the past decade or so.
Especially to take care of situations

48
00:04:49,030 --> 00:04:55,080
where the number of columns is very large
and only queries need to be supported

49
00:04:55,080 --> 00:05:01,032
rather than transactions.
In a column-oriented database unlike a row

50
00:05:01,032 --> 00:05:05,095
store, sets of columns are stored together
in disk blocks.

51
00:05:05,095 --> 00:05:12,021
In other words, a particular record is
actually split across in multiple disk

52
00:05:12,021 --> 00:05:17,074
blocks with different sets of columns
occurring in different blocks.

53
00:05:18,007 --> 00:05:26,062
Within each block, the data is sorted
along particular sets of columns.

54
00:05:26,098 --> 00:05:34,000
So, that unlike a row store where data
essentially gets stored as it's inserted,

55
00:05:34,000 --> 00:05:41,029
the column-oriented database can process
the data from a row store into different

56
00:05:41,029 --> 00:05:48,024
sort orders for different columns.
Of course, by doing so, one sort of loses

57
00:05:48,024 --> 00:05:54,792
the identity of an individual record and
therefore a column-oriented database needs

58
00:05:54,792 --> 00:06:01,574
to create extra information in the form of
what is called a join index where

59
00:06:01,574 --> 00:06:08,869
different rows in different disk blocks
are pulled together to indicate that they

60
00:06:08,869 --> 00:06:15,069
actually form the same record.
Column stores are ideally used for what

61
00:06:15,069 --> 00:06:19,088
are known as analytical processing
queries.

62
00:06:19,088 --> 00:06:25,030
Not that such queries could not be done
using rows stores as well.

63
00:06:25,030 --> 00:06:31,077
It's just that when you have large numbers
of columns, column stores become more

64
00:06:31,077 --> 00:06:35,062
efficient.
Now, this is an example of a OLAP or

65
00:06:35,062 --> 00:06:41,053
Online Analytical Processing query.
Essentially, there's a bunch of sales

66
00:06:41,053 --> 00:06:48,066
figures which are essentially the quantity
and amount of some product being sold to

67
00:06:48,066 --> 00:06:53,094
certain customers.
Add certain addresses where the retail

68
00:06:53,094 --> 00:07:00,006
outlets are on particular days.
And the sales obviously are different

69
00:07:00,006 --> 00:07:04,042
products.
Now, each of these fields is a dimension

70
00:07:04,042 --> 00:07:08,024
which describes the particular sale
record.

71
00:07:09,010 --> 00:07:15,091
The dimensions themselves can be
hierarchical in the sense that the

72
00:07:15,091 --> 00:07:23,245
product, a particular product say, a
particular type of mattress may actually

73
00:07:23,245 --> 00:07:31,062
be of category mattress or bed and so on.
Similarly, a particular day is part of a

74
00:07:31,062 --> 00:07:38,000
week, which is part of a month, which is
part of a quarter, year, et cetera.

75
00:07:38,039 --> 00:07:43,041
And similarly for location.
The kind of queries that one tries to

76
00:07:43,041 --> 00:07:48,778
answer in OLAP while you are trying to
select the total sales and group them by

77
00:07:48,778 --> 00:07:54,743
the category so you get results which can
be shown in a report, in a pie chart of

78
00:07:54,743 --> 00:08:01,389
this form in various types of reporting
including different types of charts are

79
00:08:01,389 --> 00:08:07,642
what OLAP queries actually pro, produce.
You can figure out how the SQL actually

80
00:08:07,642 --> 00:08:14,549
works, it's fairly straightforward.
But as far as what constitutes business

81
00:08:14,549 --> 00:08:23,168
intelligence for most of the enterprise
world, it is all about OLAP queries.

82
00:08:23,168 --> 00:08:33,008
Analytics is largely thought of as being
OLAP for a large part of the community.

83
00:08:33,074 --> 00:08:40,034
However, as we have seen, there's much
more to analytics than simply queries.

84
00:08:42,079 --> 00:08:47,092
So, let's reflect now on why we actually
need databases.

85
00:08:47,092 --> 00:08:55,636
First of all, we need to add data as it's
produced, and data is added for multiple

86
00:08:55,636 --> 00:09:01,406
sources in parallel like multiple people
executing sales transactions, or

87
00:09:01,406 --> 00:09:04,992
e-commerce transactions, or banking
transactions.

88
00:09:04,992 --> 00:09:10,497
And they need to be isolated from each
other, so that they don't create

89
00:09:10,497 --> 00:09:14,890
inconsistent data, and so you need
transaction processing.

90
00:09:14,890 --> 00:09:20,913
The second reason you need databases is
querying efficiently and which it will

91
00:09:20,913 --> 00:09:25,376
lead to indexing and a high-level language
like SQL.

92
00:09:25,376 --> 00:09:33,030
But the point we need to think about is
whether or not you actually need

93
00:09:33,030 --> 00:09:40,564
transaction processing for analytics.
Clearly we don't, and that's why we

94
00:09:40,564 --> 00:09:46,807
develop column-oriented databases.
At the same time, there may be some

95
00:09:46,807 --> 00:09:54,127
advantages in not having to move the data
out of a transaction store, and still able

96
00:09:54,127 --> 00:09:59,064
to perform analytical queries or all that
queries on them.

97
00:09:59,064 --> 00:10:05,060
If you can avoid moving it, so that you'd
don't need to incur the overhead of

98
00:10:05,060 --> 00:10:11,312
transferring data from one database
row-oriented transaction processing to say

99
00:10:11,312 --> 00:10:19,176
column-oriented analytical database.
Queries are, of course, important for OLAP

100
00:10:19,176 --> 00:10:27,534
queries as we have just seen.
But the important point to note here is

101
00:10:27,534 --> 00:10:36,062
that for other types of analytics, apart
from OLAP such as large scale counting

102
00:10:36,062 --> 00:10:42,954
like the programming assignment that we've
seen building classifiers using say

103
00:10:42,954 --> 00:10:50,217
Bayesian techniques or executing large
complex joints to compute sales by cities.

104
00:10:50,217 --> 00:10:54,806
As we've seen in the map for this example
last week.

105
00:10:54,806 --> 00:10:59,264
Large volumes of data are actually going
to be touched.

106
00:10:59,264 --> 00:11:05,432
In fact, the entire data set is going to
be touched by such programs.

107
00:11:05,432 --> 00:11:10,599
In such situations, indexes become far
less relevant.

108
00:11:10,599 --> 00:11:18,079
What also becomes very relevant instead,
is resilience to hardware failure because

109
00:11:18,079 --> 00:11:24,898
program which touch large volume of data
are likely to take a lot of time, and

110
00:11:24,898 --> 00:11:30,861
therefore hardware can faill during the
execution, and so we need to figure out

111
00:11:30,861 --> 00:11:36,212
how to achieve fall tolerance.
Mapreduce essentially provides that file,

112
00:11:36,212 --> 00:11:40,894
fall tolerance and therefore becomes
vital.

113
00:11:40,894 --> 00:11:49,086
Lastly, it's important to recognize that
OLAP is also a form of analytics.

114
00:11:50,055 --> 00:11:55,771
And can rarely be viewed as computing a
part of the joint distribution across all

115
00:11:55,771 --> 00:12:01,060
the features, which are the columns of a
table, but you're using intuition to

116
00:12:01,060 --> 00:12:05,063
select which set of features we want to
actually look at.

117
00:12:07,013 --> 00:12:12,049
So, to summarize, there are two
contradictory requirements here in

118
00:12:12,049 --> 00:12:16,061
analytics.
If you have large volumes of data being

119
00:12:16,061 --> 00:12:22,063
touched, you don't need indexes.
But if you want to find out pieces of the

120
00:12:22,063 --> 00:12:29,031
joint distribution using OLAP queries and
human intuition, you do need indexes.

121
00:12:29,031 --> 00:12:36,023
And it is this dichotomy which is driving
most of the debate in big data technology

122
00:12:36,023 --> 00:12:40,036
above, say, the basic distributed file
system layer.