Over the past 40 years or so, relational databases have become the workhorse of most enterprise data management systems. Relational databases, as you all know, comprise tables, which are nothing but collections of records with attributes or fields, laid out on disk in some way. The most common layout is that of a row-oriented database, where disk pages consist of rows of records. The major problem that relational databases address is that of transaction processing, where multiple clients or users are trying to insert or update different records in the same table, and their actions have to be protected from each other so that consistency of some form is maintained. As a result, database management systems build in many features, like locking and other forms of concurrency control, to ensure that transactions are effectively isolated from each other. This is really the primary function of a relational database. The second function, of course, is queries, where the user should be able to access any record based on a combination of WHERE clauses, that is, a combination of conditions that the record needs to satisfy. SQL is the main language for accessing relational data and essentially defines the relational model. Indexes, such as the B+ tree index depicted here, are used to speed up access to records so that SQL queries can actually be efficient. The B+ tree is the most common form of index. It is essentially like a binary tree, except that the interior nodes don't actually store data values; they only store key values, and the leaf nodes point to blocks of disk where records with particular combinations of keys are actually available to retrieve. Other types of indexes, like hash indexes and bitmap indexes, are also used, but B+ trees are by far the most popular for indexing a row-oriented database.
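To make the B+ tree idea concrete, here is a minimal Python sketch of a two-level lookup. The keys, record names, and leaf contents are all invented for illustration; the point is only that the interior node holds separator keys, while the leaves hold the actual (key, record) pairs:

```python
import bisect

# Hypothetical leaves: sorted (key, record) pairs, one list per "disk block".
leaves = [
    [(5, "rec-a"), (12, "rec-b")],
    [(23, "rec-c"), (31, "rec-d")],
    [(47, "rec-e"), (58, "rec-f")],
]
# Interior node stores only keys: the smallest key of each leaf after the first.
separators = [23, 47]

def lookup(key):
    # Interior node: decide which leaf (disk block) to fetch, using keys only.
    i = bisect.bisect_right(separators, key)
    # Leaf node: scan the block for the actual record.
    for k, rec in leaves[i]:
        if k == key:
            return rec
    return None

print(lookup(31))  # -> rec-d
```

A real B+ tree would have many levels, high fan-out, and linked leaves for range scans, but the key-only interior nodes are what keep the tree shallow and lookups cheap.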
Now, the other important thing about relational databases is that they don't really rely on the file systems provided by the operating system to manage disk. They do their own disk management, figuring out which pages to prefetch and which pages to replace from memory, based on the traffic they see coming in, in the form of queries, or for that matter transactions, from multiple clients. For the most part, when data was reasonably small and queries using indexes could be performed fairly efficiently on exactly the same data stores where transactions would insert and update data, row-oriented data stores were perfectly fine for both transaction processing and answering questions about data. As the volume of data grew, and in particular as the width of data, in terms of the number of columns one would store per record, became very large, the amount of storage required clearly grew as well. But at the same time, a query that only requires data from, say, two columns ends up needing to fetch data from maybe 100 columns, just because they happen to come along with the disk block as it is fetched. This makes query processing extremely inefficient, from a storage perspective as well as a performance perspective, and for such situations column-oriented databases have been developed in the past decade or so, especially to take care of situations where the number of columns is very large and only queries need to be supported, rather than transactions. In a column-oriented database, unlike a row store, sets of columns are stored together in disk blocks. In other words, a particular record is actually split across multiple disk blocks, with different sets of columns occurring in different blocks. Within each block, the data is sorted along particular sets of columns.
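The wide-table cost difference can be sketched in a few lines of Python. The table shape and values below are made up; the comparison simply counts how many stored values a two-column query would touch under each layout:

```python
num_rows, num_cols = 1000, 100

# Row-oriented layout: each record carries all 100 columns, so fetching a
# record's block drags in every column whether the query needs it or not.
row_store = [[r * num_cols + c for c in range(num_cols)] for r in range(num_rows)]

# Column-oriented layout: each column is stored contiguously, so a query
# touching two columns reads only those two arrays.
col_store = {c: [r * num_cols + c for r in range(num_rows)] for c in range(num_cols)}

# A query needing only columns 3 and 7:
row_values_touched = sum(len(rec) for rec in row_store)     # 100,000 values
col_values_touched = len(col_store[3]) + len(col_store[7])  # 2,000 values
print(row_values_touched, col_values_touched)  # 100000 2000
```

The 50x gap here is exactly the ratio of total columns to columns actually queried, which is why column stores pay off as tables get wider.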
So, unlike a row store, where data essentially gets stored as it's inserted, a column-oriented database can process the data from a row store into different sort orders for different columns. Of course, by doing so one loses the identity of an individual record, and therefore a column-oriented database needs to create extra information in the form of what is called a join index, where rows in different disk blocks are tied together to indicate that they actually form the same record. Column stores are ideally suited for what are known as analytical processing queries. Not that such queries could not be done using row stores as well; it's just that when you have large numbers of columns, column stores become more efficient. Now, this is an example of an OLAP, or Online Analytical Processing, query. Essentially, there's a bunch of sales figures, the quantity and amount of some product being sold to certain customers, at certain addresses where the retail outlets are, on particular days, and the sales obviously are of different products. Now, each of these fields is a dimension which describes the particular sale record. The dimensions themselves can be hierarchical, in the sense that a particular product, say a particular type of mattress, may actually be of category mattress or bed and so on. Similarly, a particular day is part of a week, which is part of a month, which is part of a quarter, a year, et cetera, and similarly for location. The kind of query one tries to answer in OLAP is, say, selecting the total sales and grouping them by category, so you get results which can be shown in a report or a pie chart of this form. Various types of reporting, including different types of charts, are what OLAP queries actually produce. You can figure out how the SQL actually works; it's fairly straightforward.
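A runnable sketch of that group-by-category query, using Python's built-in SQLite. The table name, columns, products, and figures are all invented stand-ins for the sales example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales
                (product TEXT, category TEXT, city TEXT, day TEXT,
                 quantity INTEGER, amount REAL)""")
conn.executemany("INSERT INTO sales VALUES (?,?,?,?,?,?)", [
    ("DreamFoam", "mattress", "Mumbai", "2013-01-05", 2, 300.0),
    ("SoftRest",  "mattress", "Delhi",  "2013-01-06", 1, 180.0),
    ("OakFrame",  "bed",      "Mumbai", "2013-01-06", 1, 450.0),
])

# Total sales grouped by category: the classic OLAP aggregation.
rows = conn.execute("""SELECT category, SUM(amount)
                       FROM sales
                       GROUP BY category
                       ORDER BY category""").fetchall()
print(rows)  # [('bed', 450.0), ('mattress', 480.0)]
```

The same pattern extends up the dimension hierarchies: grouping by week instead of day, or by region instead of city, just changes the column in the GROUP BY clause.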
But as far as what constitutes business intelligence for most of the enterprise world, it is all about OLAP queries. Analytics is largely thought of as being OLAP by a large part of the community. However, as we have seen, there's much more to analytics than simply queries. So, let's reflect now on why we actually need databases. First of all, we need to add data as it's produced, and data is added from multiple sources in parallel, like multiple people executing sales transactions, or e-commerce transactions, or banking transactions. These need to be isolated from each other so that they don't create inconsistent data, and so you need transaction processing. The second reason you need databases is to query efficiently, which leads to indexing and a high-level language like SQL. But the point we need to think about is whether or not you actually need transaction processing for analytics. Clearly we don't, and that's why column-oriented databases were developed. At the same time, there may be some advantages in not having to move the data out of a transaction store and still being able to perform analytical queries, or OLAP queries, on it, so that you don't need to incur the overhead of transferring data from a row-oriented transaction-processing database to, say, a column-oriented analytical database. Queries are, of course, important for OLAP, as we have just seen. But the important point to note here is that for other types of analytics apart from OLAP, such as large-scale counting like the programming assignment that we've seen, building classifiers using say Bayesian techniques, or executing large complex joins to compute sales by cities, as we've seen in the map-reduce example last week, large volumes of data are actually going to be touched. In fact, the entire data set is going to be touched by such programs. In such situations, indexes become far less relevant.
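The sales-by-city computation above can be sketched as a toy map-reduce in plain Python. The records and city names are invented; the structure, a mapper emitting (key, value) pairs, a shuffle grouping by key, and a reducer aggregating each group, is the point, and note that every record is touched, so no index helps:

```python
from collections import defaultdict
from itertools import chain

# Hypothetical sales records: (city, amount).
records = [("Mumbai", 300.0), ("Delhi", 180.0), ("Mumbai", 450.0)]

def mapper(record):
    city, amount = record
    yield (city, amount)          # emit (key, value)

def reducer(key, values):
    return (key, sum(values))     # aggregate all values for one key

# Shuffle: group the mapped pairs by key, then reduce each group.
groups = defaultdict(list)
for k, v in chain.from_iterable(mapper(r) for r in records):
    groups[k].append(v)
sales_by_city = dict(reducer(k, vs) for k, vs in groups.items())
print(sales_by_city)  # {'Mumbai': 750.0, 'Delhi': 180.0}
```

In a real map-reduce framework the mapper and reducer run in parallel across many machines, and it is the framework, not this loop, that provides the fault tolerance discussed next.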
What becomes very relevant instead is resilience to hardware failure, because programs which touch large volumes of data are likely to take a lot of time, and therefore hardware can fail during the execution, and so we need to figure out how to achieve fault tolerance. MapReduce essentially provides that fault tolerance and therefore becomes vital. Lastly, it's important to recognize that OLAP is also a form of analytics, and can really be viewed as computing a part of the joint distribution across all the features, which are the columns of a table, where you use intuition to select which set of features you want to actually look at. So, to summarize, there are two contradictory requirements here in analytics. If you have large volumes of data being touched, you don't need indexes. But if you want to find out pieces of the joint distribution using OLAP queries and human intuition, you do need indexes. And it is this dichotomy which is driving most of the debate in big data technology above, say, the basic distributed file system layer.