So, as I just mentioned, analytics is about counting, not queries. What kind of counting do you need to do? Well, we've already done one example, computing these likelihoods, and there is a lot of counting to be done there. We're going to talk about things like clustering; one example we've already seen in the first lecture, computing hash functions using locality-sensitive hashing, is a kind of clustering. We'll look at data mining, finding bumps or statistically different regions in your data, finding rules, and learning latent features, which involves things like sampling distributions that our guest lecture on Markov logic networks (MLNs) will touch on. And we'll give one example using matrix multiplication as well. But the bottom line I want to make now, since we're going to cover some of these techniques only from unit five onwards, is that from the database perspective all of these techniques require you to touch all your data. It's not about querying. If you have to touch all your data, then the complex index structures for fast query processing that traditional relational databases and data warehouse technologies like Teradata have are pretty useless.

So let me try to summarize the evolution of analytical databases and current trends. On the one hand we have the NoSQL databases. They got rid of ACID transactions, that is, of transaction isolation with its conflicts and locking; none of that is there. Instead of indexes, they use shards, that is, distribution of data across different data nodes, relying on distributed processing rather than indexing. They don't have joins, because joins are better done in MapReduce, and they support columnar storage so that you can have wide data. And as we've mentioned, big data is about wide data.

On the other hand, there is another trend in the analytical database world, which is all about in-memory databases. In-memory databases allow real-time transactions, and they allow a variety of indexes and complex joins. The problem is that even though in-memory databases are great for very fast query processing and also for large-scale transaction processing, we are back to the same problem of the number of possible options you can explore while slicing and dicing. You just can't do too much if the data is very wide. The second problem is that even in memory, whether you access data sequentially or randomly makes a huge difference given current memory hierarchies. So in-memory databases are not really a panacea for large-scale data analytics today. When you have a reasonable amount of data, it's still important to access your data sequentially and touch all of it to do analytical processing, and whether it's in memory or on disk, MapReduce is still a reasonably good paradigm: an abstraction which allows you to touch all your data efficiently.

So if I were to summarize the evolution of database technology, we started out with relational databases as a sort of one-size-fits-all solution. Whether you're doing transaction processing, reporting, or even analytics on gigabytes of data, you can pretty much do it all using a relational database. Of course, whether you want to do OLAP or not will depend on the number of columns you have; if there are too many, OLAP becomes difficult because of all the possible combinations to explore, but that's an OLAP problem, not a database problem.
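To make the "touch all your data" point concrete, here is a minimal sketch, not from the lecture, of the kind of full-scan counting that MapReduce is good at: a mapper and reducer written in plain Python, with a hypothetical tab-separated record format, that count (label, word) pairs over an entire corpus. These are exactly the raw counts you would need to estimate likelihoods by counting, and notice that there is no index to skip anything; every record is read.

```python
# Minimal sketch of MapReduce-style counting in plain Python.
# The record format (label<TAB>text) is hypothetical, chosen only to
# illustrate how likelihood counts require touching every record.
from collections import defaultdict
from itertools import groupby

def mapper(record):
    """Emit one ((label, word), 1) pair per word in a labelled document."""
    label, text = record.split("\t", 1)
    for word in text.lower().split():
        yield (label, word), 1

def reducer(key, counts):
    """Sum the partial counts for one (label, word) key."""
    return key, sum(counts)

def run_mapreduce(records):
    # Map phase: every record is scanned; there is no index to skip data.
    mapped = [kv for rec in records for kv in mapper(rec)]
    # Shuffle phase: group intermediate pairs by key.
    mapped.sort(key=lambda kv: kv[0])
    # Reduce phase: one reducer call per distinct key.
    return dict(
        reducer(key, (v for _, v in group))
        for key, group in groupby(mapped, key=lambda kv: kv[0])
    )

if __name__ == "__main__":
    corpus = [
        "spam\tbuy cheap pills now",
        "ham\tmeeting at noon tomorrow",
        "spam\tcheap pills cheap pills",
    ]
    for (label, word), n in sorted(run_mapreduce(corpus).items()):
        print(label, word, n)
```

A reduce-side join follows the same pattern: the mapper tags each row with the table it came from, and the reducer combines rows sharing a key, which is why NoSQL stores can afford to leave joins to MapReduce.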
Then, as volumes and the number of columns grew, we went to column-store data warehouses handling terabytes of data. Data warehouse technologies like Teradata and Oracle's data warehouse systems can deal with column stores and terabytes of data. In parallel we had distributed NoSQL row/column stores, which also dealt with terabytes of data, but where you didn't really have to do joins; when you wanted a join, you would do it in MapReduce. Especially if you only need joins for bulk analysis, you can deal with tens of terabytes of data there. In parallel, in-memory one-size-fits-all databases are also available, but they have their own challenges, with access patterns still mattering. And last but not least there is Dremel, which is now a one-size-fits-all database for petabytes of data. You can still do batch processing on it, but you can also run fast aggregation queries where only parts of the data are touched.

In the last year Cloudera, which is one of the Hadoop distributions, released a product called Impala, which is essentially an implementation of Dremel. It still has a long way to go in terms of efficiency, but what it allows you to do is distributed query processing without having to do a MapReduce. It splits a query, just like Dremel, into multiple subqueries, and each subquery can be executed on a separate data node; a small sketch of this scatter-gather idea appears at the end of this section. The data can be in Hive, which is essentially HDFS plus some metadata, or it can be linked in, since Hive allows you to link data that is already in HDFS. For those of you who want to learn more about Impala, Cloudera has plenty of documentation about it, and of course it is freely downloadable as part of their distribution.

Okay. So that brings us to the end of the section on Load, the two units on Load. We started off talking about distributed files, which are the second basic element of big data after MapReduce. We talked about what databases are good for and why traditional databases were a happy compromise. And we covered the evolution of databases, the evolution of SQL, and the evolution of MapReduce itself. In the past few minutes I've also tried to cover the big picture of database evolution as we see it today. Next week we'll be having a guest lecture by Srikanta Bedathur of IIIT Delhi on graph databases. We won't have separate quizzes and homework on the guest lecture, so if you have time, do homework three and the programming assignment that goes with it. Then we'll get started with Unit 5, which is going to be on the subject Learn, where we'll start unearthing facts from data using more complex classification, clustering, and rule mining, all the things we just mentioned earlier in this summary.
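As promised above, here is a small, purely illustrative sketch of the scatter-gather idea behind Dremel and Impala; it is not Impala's actual code or API. A coordinator splits an aggregation query into per-node partial aggregates, each "data node" scans only its own shard, and the coordinator merges the partial results. The shard layout, column names, and function names are all hypothetical.

```python
# Illustrative sketch of Dremel/Impala-style scatter-gather aggregation.
# Shards, column names, and helper names are hypothetical; real Impala
# plans queries over data in HDFS/Hive, this only mimics the shape of it.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Three "data nodes", each holding a shard of (country, revenue) rows.
SHARDS = [
    [("IN", 10.0), ("US", 4.0), ("IN", 6.0)],
    [("US", 7.0), ("DE", 3.0)],
    [("IN", 2.0), ("DE", 5.0), ("US", 1.0)],
]

def partial_aggregate(shard):
    """Runs on one node: partial SUM(revenue), COUNT(*) grouped by country, over its shard only."""
    acc = defaultdict(lambda: [0.0, 0])
    for country, revenue in shard:
        acc[country][0] += revenue
        acc[country][1] += 1
    return dict(acc)

def coordinator(shards):
    """Scatter the partial aggregation to every node, then merge the small results."""
    merged = defaultdict(lambda: [0.0, 0])
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(partial_aggregate, shards):
            for country, (total, count) in partial.items():
                merged[country][0] += total
                merged[country][1] += count
    # Equivalent of: SELECT country, SUM(revenue), COUNT(*) FROM sales GROUP BY country
    return {c: (t, n) for c, (t, n) in merged.items()}

if __name__ == "__main__":
    for country, (total, count) in sorted(coordinator(SHARDS).items()):
        print(country, total, count)
```

The contrast with the MapReduce sketch earlier is that there is no shuffle or materialized intermediate data: each node answers a subquery over its own shard, and only small partial aggregates travel to the coordinator, which is what makes Dremel- and Impala-style aggregation queries interactive.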