Google invented MapReduce, BigTable, and distributed file systems, but it has since moved on and now uses something called Dremel. Recall that, in the early days of databases, the relational database was used both for transaction processing, that is, inserting new records, and for answering complex queries. Storage was expensive; it was too expensive, for example, to create a fresh copy of the entire data in a better form, one more suited for efficient query processing. So one was happy with the compromise of the one-size-fits-all model of the relational database.

Over the years storage became very cheap, and one started to move data into specialized column-oriented databases, as well as to run analytical queries that touched all the data using MapReduce, where large volumes of data would be read and equally large volumes freshly written, again and again, as one performed more and more processing on them. This was fine as long as you had terabytes, or hundreds of terabytes, of data, even at Google. But once one started dealing with petabytes of data, and wanted queries on such volumes, one could not afford to produce a new petabyte of data every time one processed the old petabyte. So the challenge of storage being a constraint enters the arena once again when one is dealing with very large volumes. At the same time, writing extremely large volumes is itself costly, so by avoiding writing again and again one gains further efficiency. This is essentially what Dremel does. Dremel today powers Google's BigQuery, a service one can access over the web: one can define extremely large tables, populate them through computations or by importing data from various sources, and execute extremely fast queries that process large volumes of data using the Dremel structure underneath.

There are two important innovations in Dremel, which was published only in 2010. First, it uses column-oriented storage, just like a column-oriented database in some sense, but for nested and possibly non-unique fields. For example, you could have a document with a field A; within A there could be another field B, which itself has two different fields C and D that actually contain values. So the nested field A.B.C or A.B.D is how you would access this data. Further, in a particular record there could be multiple values for A.B.C; for example, you could have multiple names or multiple IP addresses for this particular nested field. This is very common in web-oriented, unstructured data, and not that common in structured relational data, but it is exactly the kind of large, petabyte-volume data that Google needs to process. The column-oriented storage is fairly unique in that each nested field is stored contiguously: all the values of A.B.C for record one and record two are stored close together on disk and processed by leaf servers, the values of A.B.D are similarly stored contiguously, and so on. So the first innovation is that the storage is column-oriented for nested and possibly non-unique fields.
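To make this first innovation concrete, here is a minimal sketch in Python, under the assumption that records arrive as nested dictionaries using the hypothetical fields A, B, C and D from the example above. It only shows how every value reachable through a dotted path such as A.B.C ends up in its own contiguous column; the real Dremel format additionally stores repetition and definition levels so records can be reconstructed from the columns, which this sketch omits.

    # Minimal sketch (not Dremel's actual format) of storing nested,
    # repeated fields column by column. Field names A, B, C, D are the
    # hypothetical ones from the example above.
    from collections import defaultdict

    records = [
        {"A": {"B": [{"C": "alice", "D": "10.0.0.1"},
                     {"C": "bob",   "D": "10.0.0.2"}]}},   # record 1: two values for A.B.C
        {"A": {"B": [{"C": "carol", "D": "10.0.0.3"}]}},   # record 2: one value for A.B.C
    ]

    def flatten(value, path, columns):
        """Walk a nested record and append every leaf value to the
        column named by its dotted path, e.g. 'A.B.C'."""
        if isinstance(value, dict):
            for key, child in value.items():
                flatten(child, path + [key], columns)
        elif isinstance(value, list):
            for child in value:          # repeated (non-unique) field
                flatten(child, path, columns)
        else:
            columns[".".join(path)].append(value)

    columns = defaultdict(list)
    for rec in records:
        flatten(rec, [], columns)

    print(columns["A.B.C"])   # ['alice', 'bob', 'carol']        -- stored contiguously
    print(columns["A.B.D"])   # ['10.0.0.1', '10.0.0.2', '10.0.0.3']

Because all the values of A.B.C sit next to each other, a query that touches only that field never has to read the rest of each record.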
The second innovation is that, instead of reading and writing data repeatedly as in MapReduce, one assumes that the intermediate data one produces is always much, much smaller than the original data. This is quite obvious if you are dealing with petabytes of data, or you would be producing ever more petabytes; instead you will be summarizing the data in some form, or selecting it, or querying it, exactly as in the traditional relational database, where you query and get small results from large data. So the second innovation is that there is a tree of query servers that pass intermediate results from the root to the leaves and back (a toy sketch of this serving tree appears at the end of this section). The intermediate servers essentially execute a complex query plan, very similar in some respects to traditional SQL engines. However, they operate at a different scale: SQL engines predominantly operated in memory, whereas these servers operate in a distributed fashion across a tree of query servers, passing results back and forth across a network. As a result, Google is able to demonstrate orders of magnitude better performance than MapReduce when performing queries on petabytes of data. Not only does it give more speed, it also clearly saves storage as compared to MapReduce. The underlying storage layer remains the distributed GFS system, but Dremel is now widely used within Google and is available publicly through the BigQuery service. There is some effort at creating an open-source equivalent of Dremel; it is in its infancy right now, it is under Apache, and it is called Drill, but beyond the name I don't think they have made too much progress so far.

So we can now summarize our picture of how database technology has evolved over the years. We started out with the relational row store, which was essentially one-size-fits-all and still works fine for gigabytes of data. Then we moved on to column-oriented data warehouse technologies, specifically designed for OLAP queries, which scaled up to terabytes of data but required us to move off the relational row store into a data warehouse. In parallel, the web side created distributed NoSQL databases, which were a mix of row and column stores and also allowed MapReduce processing for bulk analysis; these scaled to tens of terabytes, or sometimes even larger volumes of data. In parallel with this, we have had in-memory databases emerging in the past few years, which can now do what the one-size-fits-all relational row stores did, again on gigabytes of data, but with an order of magnitude more performance. And for large-scale processing of petabytes of data, Google has evolved Dremel, which again is a one-size-fits-all model at that scale.

So we have three models today. We have Dremel, which only Google uses. We have in-memory databases, which are fine for doing OLAP on reasonably small databases. And for intermediate processing, to do things like computing classifiers on terabytes of data, distributed NoSQL is the preferred choice. At the same time, when you have terabytes of data and you want to do OLAP queries very fast using SQL, there still remains a place for the column-store data warehouses; the special-purpose appliances like Netezza, which use parallel computing and column storage, also have a place. This place occupied by the column-store warehouses might evolve to a Dremel-like architecture in the future, once we actually have a publicly available version of Dremel. This is a space to watch carefully, and that is what big data technology is looking forward to in the next three years.
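Finally, here is the toy sketch of the serving tree referred to earlier. It is not Google's implementation: the shards, the fan-out of two, and the single aggregate (a global average) are all invented for illustration. The point it makes is that the leaf servers scan the column data, while everything that travels up the tree is a small partial result, which is exactly why the intermediate data stays so much smaller than the petabytes being scanned.

    # Toy sketch of the serving-tree idea: the root pushes a query down,
    # leaf servers scan their shard of a column, and each intermediate
    # server merges the partial results of its children on the way back up.
    # Shards, fan-out, and the aggregate (a global average) are hypothetical.

    def leaf_server(shard):
        """Scan one shard of a column and return a partial aggregate."""
        return (sum(shard), len(shard))          # (partial sum, partial count)

    def intermediate_server(partials):
        """Merge partial aggregates received from child servers."""
        total = sum(s for s, _ in partials)
        count = sum(c for _, c in partials)
        return (total, count)

    def root_server(shards, fanout=2):
        # Bottom level: leaves scan their shards (in parallel in reality).
        level = [leaf_server(shard) for shard in shards]
        # Higher levels: merge small partial results, never the raw data.
        while len(level) > 1:
            level = [intermediate_server(level[i:i + fanout])
                     for i in range(0, len(level), fanout)]
        total, count = level[0]
        return total / count                     # final answer at the root

    shards = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]   # hypothetical column shards
    print(root_server(shards))                          # 5.5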