1 00:00:00,000 --> 00:00:06,795 Over the past, 40 years or so, relational databases have become the work horse of 2 00:00:06,795 --> 00:00:13,369 most enterprise data management systems. Relational databases as you all know 3 00:00:13,369 --> 00:00:18,527 comprise of tables, which are nothing but a bunch of records with attributes or 4 00:00:18,527 --> 00:00:23,057 fields which are of course laid out on disk in some way. 5 00:00:23,057 --> 00:00:35,886 The most common layout is that of a row oriented database, where pages of disk are 6 00:00:35,886 --> 00:00:44,086 consisting of rows of records. The major problem that relational 7 00:00:44,086 --> 00:00:50,539 databases is address is that of transaction processing, where multiple 8 00:00:50,539 --> 00:00:58,083 clients or users are trying to insert or update different records in the same 9 00:00:58,083 --> 00:01:05,023 table, and their actions have to be protected from each other so that 10 00:01:05,023 --> 00:01:12,354 consistency of some form is maintained. As the result the database management 11 00:01:12,354 --> 00:01:17,887 systems build in many features. Like locking and other forms of 12 00:01:17,887 --> 00:01:24,227 consistency control to ensure that transactions are effectively isolated from 13 00:01:24,227 --> 00:01:28,879 each other. And this is the really primary function of 14 00:01:28,879 --> 00:01:34,728 the relational database. The second function of course is queries. 15 00:01:34,728 --> 00:01:41,225 Where the user should be able to get access to any record based on a 16 00:01:41,225 --> 00:01:49,186 combination of where clauses or a combination of conditions that the record 17 00:01:49,186 --> 00:01:55,003 needs to satisfy. Sql is the main language for accessing 18 00:01:55,003 --> 00:02:00,954 relational data and essentially defines the relational model. 19 00:02:00,954 --> 00:02:09,328 Indexes, such as the b+3 index depicted here, are used to speed up access to 20 00:02:09,328 --> 00:02:14,331 records, so that SQL queries can actually be efficient. 21 00:02:14,331 --> 00:02:22,354 The b+3 is the most common form of index, which is essentially like a binary tree, 22 00:02:22,354 --> 00:02:28,151 except for the fact that. The interior nodes don't actually store 23 00:02:28,151 --> 00:02:31,762 data values. They only store key values. 24 00:02:31,762 --> 00:02:39,493 And the leaf nodes point to blocks of disk where particular combinations of keys are 25 00:02:39,493 --> 00:02:48,018 actually available to retrieve. Other types of indexes like hash indexes, 26 00:02:48,018 --> 00:02:55,559 bitmap indexes are also used but B + trees are by far the most popular for indexing a 27 00:02:55,559 --> 00:03:01,465 row-oriented database. Now, the other important thing about 28 00:03:01,465 --> 00:03:06,906 relational database is they, that they that they don't really rely on file 29 00:03:06,906 --> 00:03:10,793 systems, provided by the operating system to manage disk. 30 00:03:10,793 --> 00:03:15,842 They is, do their own disk management figuring out what pages is to prefetch, 31 00:03:15,842 --> 00:03:21,966 and which pages is to replace from the memory based on the traffic that they see 32 00:03:21,966 --> 00:03:27,911 coming in, in the form of queries for that matter transactions coming from multiple 33 00:03:27,911 --> 00:03:31,061 clients. [inaudible] For the most part when the 34 00:03:31,061 --> 00:03:35,838 data was reasonably small. And queries using indexes could be 35 00:03:35,838 --> 00:03:42,594 performed fairly efficiently on exactly the same data stores where transactions 36 00:03:42,594 --> 00:03:47,983 would insert an update data. The row-oriented data stores were 37 00:03:47,983 --> 00:03:54,642 perfectly fine for both transaction processing and answering questions about 38 00:03:54,642 --> 00:03:58,767 data. As, the volume of data grew and in 39 00:03:58,767 --> 00:04:05,258 particular the width of data in terms of the number of columns that one would store 40 00:04:05,258 --> 00:04:11,090 per record becomes very large. And clearly the amount of storage required 41 00:04:11,090 --> 00:04:15,404 grows. But at the same time, what also happens is 42 00:04:15,404 --> 00:04:21,833 that a query that only requires data from, say, two columns, ends up needing to fetch 43 00:04:21,833 --> 00:04:28,306 data, about maybe 100 columns, just because they happen to come along with the 44 00:04:28,306 --> 00:04:33,066 disk block as it is fetched. This makes query processing extremely 45 00:04:33,066 --> 00:04:38,017 inefficient, both from a storage perspective as well as performance 46 00:04:38,017 --> 00:04:43,030 perspective, and for such situations, the column oriented databases have been 47 00:04:43,030 --> 00:04:49,030 developed in the past decade or so. Especially to take care of situations 48 00:04:49,030 --> 00:04:55,080 where the number of columns is very large and only queries need to be supported 49 00:04:55,080 --> 00:05:01,032 rather than transactions. In a column-oriented database unlike a row 50 00:05:01,032 --> 00:05:05,095 store, sets of columns are stored together in disk blocks. 51 00:05:05,095 --> 00:05:12,021 In other words, a particular record is actually split across in multiple disk 52 00:05:12,021 --> 00:05:17,074 blocks with different sets of columns occurring in different blocks. 53 00:05:18,007 --> 00:05:26,062 Within each block, the data is sorted along particular sets of columns. 54 00:05:26,098 --> 00:05:34,000 So, that unlike a row store where data essentially gets stored as it's inserted, 55 00:05:34,000 --> 00:05:41,029 the column-oriented database can process the data from a row store into different 56 00:05:41,029 --> 00:05:48,024 sort orders for different columns. Of course, by doing so, one sort of loses 57 00:05:48,024 --> 00:05:54,792 the identity of an individual record and therefore a column-oriented database needs 58 00:05:54,792 --> 00:06:01,574 to create extra information in the form of what is called a join index where 59 00:06:01,574 --> 00:06:08,869 different rows in different disk blocks are pulled together to indicate that they 60 00:06:08,869 --> 00:06:15,069 actually form the same record. Column stores are ideally used for what 61 00:06:15,069 --> 00:06:19,088 are known as analytical processing queries. 62 00:06:19,088 --> 00:06:25,030 Not that such queries could not be done using rows stores as well. 63 00:06:25,030 --> 00:06:31,077 It's just that when you have large numbers of columns, column stores become more 64 00:06:31,077 --> 00:06:35,062 efficient. Now, this is an example of a OLAP or 65 00:06:35,062 --> 00:06:41,053 Online Analytical Processing query. Essentially, there's a bunch of sales 66 00:06:41,053 --> 00:06:48,066 figures which are essentially the quantity and amount of some product being sold to 67 00:06:48,066 --> 00:06:53,094 certain customers. Add certain addresses where the retail 68 00:06:53,094 --> 00:07:00,006 outlets are on particular days. And the sales obviously are different 69 00:07:00,006 --> 00:07:04,042 products. Now, each of these fields is a dimension 70 00:07:04,042 --> 00:07:08,024 which describes the particular sale record. 71 00:07:09,010 --> 00:07:15,091 The dimensions themselves can be hierarchical in the sense that the 72 00:07:15,091 --> 00:07:23,245 product, a particular product say, a particular type of mattress may actually 73 00:07:23,245 --> 00:07:31,062 be of category mattress or bed and so on. Similarly, a particular day is part of a 74 00:07:31,062 --> 00:07:38,000 week, which is part of a month, which is part of a quarter, year, et cetera. 75 00:07:38,039 --> 00:07:43,041 And similarly for location. The kind of queries that one tries to 76 00:07:43,041 --> 00:07:48,778 answer in OLAP while you are trying to select the total sales and group them by 77 00:07:48,778 --> 00:07:54,743 the category so you get results which can be shown in a report, in a pie chart of 78 00:07:54,743 --> 00:08:01,389 this form in various types of reporting including different types of charts are 79 00:08:01,389 --> 00:08:07,642 what OLAP queries actually pro, produce. You can figure out how the SQL actually 80 00:08:07,642 --> 00:08:14,549 works, it's fairly straightforward. But as far as what constitutes business 81 00:08:14,549 --> 00:08:23,168 intelligence for most of the enterprise world, it is all about OLAP queries. 82 00:08:23,168 --> 00:08:33,008 Analytics is largely thought of as being OLAP for a large part of the community. 83 00:08:33,074 --> 00:08:40,034 However, as we have seen, there's much more to analytics than simply queries. 84 00:08:42,079 --> 00:08:47,092 So, let's reflect now on why we actually need databases. 85 00:08:47,092 --> 00:08:55,636 First of all, we need to add data as it's produced, and data is added for multiple 86 00:08:55,636 --> 00:09:01,406 sources in parallel like multiple people executing sales transactions, or 87 00:09:01,406 --> 00:09:04,992 e-commerce transactions, or banking transactions. 88 00:09:04,992 --> 00:09:10,497 And they need to be isolated from each other, so that they don't create 89 00:09:10,497 --> 00:09:14,890 inconsistent data, and so you need transaction processing. 90 00:09:14,890 --> 00:09:20,913 The second reason you need databases is querying efficiently and which it will 91 00:09:20,913 --> 00:09:25,376 lead to indexing and a high-level language like SQL. 92 00:09:25,376 --> 00:09:33,030 But the point we need to think about is whether or not you actually need 93 00:09:33,030 --> 00:09:40,564 transaction processing for analytics. Clearly we don't, and that's why we 94 00:09:40,564 --> 00:09:46,807 develop column-oriented databases. At the same time, there may be some 95 00:09:46,807 --> 00:09:54,127 advantages in not having to move the data out of a transaction store, and still able 96 00:09:54,127 --> 00:09:59,064 to perform analytical queries or all that queries on them. 97 00:09:59,064 --> 00:10:05,060 If you can avoid moving it, so that you'd don't need to incur the overhead of 98 00:10:05,060 --> 00:10:11,312 transferring data from one database row-oriented transaction processing to say 99 00:10:11,312 --> 00:10:19,176 column-oriented analytical database. Queries are, of course, important for OLAP 100 00:10:19,176 --> 00:10:27,534 queries as we have just seen. But the important point to note here is 101 00:10:27,534 --> 00:10:36,062 that for other types of analytics, apart from OLAP such as large scale counting 102 00:10:36,062 --> 00:10:42,954 like the programming assignment that we've seen building classifiers using say 103 00:10:42,954 --> 00:10:50,217 Bayesian techniques or executing large complex joints to compute sales by cities. 104 00:10:50,217 --> 00:10:54,806 As we've seen in the map for this example last week. 105 00:10:54,806 --> 00:10:59,264 Large volumes of data are actually going to be touched. 106 00:10:59,264 --> 00:11:05,432 In fact, the entire data set is going to be touched by such programs. 107 00:11:05,432 --> 00:11:10,599 In such situations, indexes become far less relevant. 108 00:11:10,599 --> 00:11:18,079 What also becomes very relevant instead, is resilience to hardware failure because 109 00:11:18,079 --> 00:11:24,898 program which touch large volume of data are likely to take a lot of time, and 110 00:11:24,898 --> 00:11:30,861 therefore hardware can faill during the execution, and so we need to figure out 111 00:11:30,861 --> 00:11:36,212 how to achieve fall tolerance. Mapreduce essentially provides that file, 112 00:11:36,212 --> 00:11:40,894 fall tolerance and therefore becomes vital. 113 00:11:40,894 --> 00:11:49,086 Lastly, it's important to recognize that OLAP is also a form of analytics. 114 00:11:50,055 --> 00:11:55,771 And can rarely be viewed as computing a part of the joint distribution across all 115 00:11:55,771 --> 00:12:01,060 the features, which are the columns of a table, but you're using intuition to 116 00:12:01,060 --> 00:12:05,063 select which set of features we want to actually look at. 117 00:12:07,013 --> 00:12:12,049 So, to summarize, there are two contradictory requirements here in 118 00:12:12,049 --> 00:12:16,061 analytics. If you have large volumes of data being 119 00:12:16,061 --> 00:12:22,063 touched, you don't need indexes. But if you want to find out pieces of the 120 00:12:22,063 --> 00:12:29,031 joint distribution using OLAP queries and human intuition, you do need indexes. 121 00:12:29,031 --> 00:12:36,023 And it is this dichotomy which is driving most of the debate in big data technology 122 00:12:36,023 --> 00:12:40,036 above, say, the basic distributed file system layer.