So, as I just mentioned, analytics is about counting, not queries. What kind of counting do you need to do? Well, we've already done one example, computing these likelihoods, and there is a lot of counting to be done there. We're going to talk about things like clustering; one example we've already seen in the first lecture, computing hash functions using locality-sensitive hashing, is a kind of clustering. We'll look at data mining, finding bumps or statistically different regions in your data, finding rules, and learning latent features, which involves things like sampling distributions that our guest lecture on Markov logic networks (MLNs) will touch on. And we'll give one example using matrix multiplication as well. But the bottom line I want to make now, since we're going to cover some of these techniques only from unit five onwards, is that from the database perspective all of these techniques require you to touch all your data. It's not about querying. If you have to touch all your data, then the complex index structures for fast query processing that traditional relational databases and data warehouse technologies like Teradata have are pretty useless.

So let me try to summarize the evolution of analytical databases and current trends. On the one hand we have the NoSQL databases. They got rid of ACID transactions, that is, of transaction isolation with its conflicts and locking; none of that is there. Instead of indexes, they use shards, that is, distribution of data across different data nodes, relying on distributed processing rather than indexing. They don't have joins, because joins are better done in MapReduce, and they support columnar storage so that you can have wide data. And as we've mentioned, big data is about wide data.

On the other hand, there is another trend in the analytical database world, which is all about in-memory databases. In-memory databases allow real-time transactions, and they allow a variety of indexes and complex joins. The problem is that even though in-memory databases are great for very fast query processing and also for large-scale transaction processing, we are back to the same problem of the number of possible options you can explore while slicing and dicing. You just can't do too much if the data is very wide. The second problem is that even in memory, whether you access data sequentially or randomly makes a huge difference given current memory hierarchies. So in-memory databases are not really a panacea for large-scale data analytics today. When you have a reasonable amount of data, it's still important to access your data sequentially and touch all of it to do analytical processing, and whether it's in memory or on disk, MapReduce is still a reasonably good paradigm: an abstraction which allows you to touch all your data efficiently.

So if I were to summarize the evolution of database technology, we started out with relational databases as a sort of one-size-fits-all solution. Whether you're doing transaction processing, reporting, or even analytics on gigabytes of data, you can pretty much do it all using a relational database. Of course, whether you want to do OLAP or not will depend on the number of columns you have; if there are too many, OLAP becomes difficult because of all the possible combinations to explore, but that's an OLAP problem, not a database problem.
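To make the "touch all your data" point concrete, here is a minimal sketch, not from the lecture, of the kind of full-scan counting that MapReduce is good at: a mapper and reducer written in plain Python, with a hypothetical tab-separated record format, that count (label, word) pairs over an entire corpus. These are exactly the raw counts you would need to estimate likelihoods by counting, and notice that there is no index to skip anything; every record is read.

```python
# Minimal sketch of MapReduce-style counting in plain Python.
# The record format (label<TAB>text) is hypothetical, chosen only to
# illustrate how likelihood counts require touching every record.
from collections import defaultdict
from itertools import groupby

def mapper(record):
    """Emit one ((label, word), 1) pair per word in a labelled document."""
    label, text = record.split("\t", 1)
    for word in text.lower().split():
        yield (label, word), 1

def reducer(key, counts):
    """Sum the partial counts for one (label, word) key."""
    return key, sum(counts)

def run_mapreduce(records):
    # Map phase: every record is scanned; there is no index to skip data.
    mapped = [kv for rec in records for kv in mapper(rec)]
    # Shuffle phase: group intermediate pairs by key.
    mapped.sort(key=lambda kv: kv[0])
    # Reduce phase: one reducer call per distinct key.
    return dict(
        reducer(key, (v for _, v in group))
        for key, group in groupby(mapped, key=lambda kv: kv[0])
    )

if __name__ == "__main__":
    corpus = [
        "spam\tbuy cheap pills now",
        "ham\tmeeting at noon tomorrow",
        "spam\tcheap pills cheap pills",
    ]
    for (label, word), n in sorted(run_mapreduce(corpus).items()):
        print(label, word, n)
```

A reduce-side join follows the same pattern: the mapper tags each row with the table it came from, and the reducer combines rows sharing a key, which is why NoSQL stores can afford to leave joins to MapReduce.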
Then, as volumes and the number of columns grew, we went to column-store data warehouses handling terabytes of data. Data warehouse technologies like Teradata and Oracle's data warehouse systems can deal with column stores and terabytes of data. In parallel we had distributed NoSQL row/column stores, which also dealt with terabytes of data, but where you didn't really have to do joins; when you wanted a join, you would do it in MapReduce. Especially if you only need joins for bulk analysis, you can deal with tens of terabytes of data there. In parallel, in-memory one-size-fits-all databases are also available, but they have their own challenges, with access patterns still mattering. And last but not least there is Dremel, which is now a one-size-fits-all database for petabytes of data. You can still do batch processing on it, but you can also run fast aggregation queries where only parts of the data are touched.

In the last year Cloudera, which is one of the Hadoop distributions, released a product called Impala, which is essentially an implementation of Dremel. It still has a long way to go in terms of efficiency, but what it allows you to do is distributed query processing without having to do a MapReduce. It splits a query, just like Dremel, into multiple subqueries, and each subquery can be executed on a separate data node; a small sketch of this scatter-gather idea appears at the end of this section. The data can be in Hive, which is essentially HDFS plus some metadata, or it can be linked in, since Hive allows you to link data that is already in HDFS. For those of you who want to learn more about Impala, Cloudera has plenty of documentation about it, and of course it is freely downloadable as part of their distribution.

Okay. So that brings us to the end of the section on Load, the two units on Load. We started off talking about distributed files, which are the second basic element of big data after MapReduce. We talked about what databases are good for and why traditional databases were a happy compromise. And we covered the evolution of databases, the evolution of SQL, and the evolution of MapReduce itself. In the past few minutes I've also tried to cover the big picture of database evolution as we see it today. Next week we'll be having a guest lecture by Srikanta Bedathur of IIIT Delhi on graph databases. We won't have separate quizzes and homework on the guest lecture, so if you have time, do homework three and the programming assignment that goes with it. Then we'll get started with Unit 5, which is going to be on the subject Learn, where we'll start unearthing facts from data using more complex classification, clustering, and rule mining, all the things we just mentioned earlier in this summary.
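As promised above, here is a small, purely illustrative sketch of the scatter-gather idea behind Dremel and Impala; it is not Impala's actual code or API. A coordinator splits an aggregation query into per-node partial aggregates, each "data node" scans only its own shard, and the coordinator merges the partial results. The shard layout, column names, and function names are all hypothetical.

```python
# Illustrative sketch of Dremel/Impala-style scatter-gather aggregation.
# Shards, column names, and helper names are hypothetical; real Impala
# plans queries over data in HDFS/Hive, this only mimics the shape of it.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Three "data nodes", each holding a shard of (country, revenue) rows.
SHARDS = [
    [("IN", 10.0), ("US", 4.0), ("IN", 6.0)],
    [("US", 7.0), ("DE", 3.0)],
    [("IN", 2.0), ("DE", 5.0), ("US", 1.0)],
]

def partial_aggregate(shard):
    """Runs on one node: partial SUM(revenue), COUNT(*) grouped by country, over its shard only."""
    acc = defaultdict(lambda: [0.0, 0])
    for country, revenue in shard:
        acc[country][0] += revenue
        acc[country][1] += 1
    return dict(acc)

def coordinator(shards):
    """Scatter the partial aggregation to every node, then merge the small results."""
    merged = defaultdict(lambda: [0.0, 0])
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(partial_aggregate, shards):
            for country, (total, count) in partial.items():
                merged[country][0] += total
                merged[country][1] += count
    # Equivalent of: SELECT country, SUM(revenue), COUNT(*) FROM sales GROUP BY country
    return {c: (t, n) for c, (t, n) in merged.items()}

if __name__ == "__main__":
    for country, (total, count) in sorted(coordinator(SHARDS).items()):
        print(country, total, count)
```

The contrast with the MapReduce sketch earlier is that there is no shuffle or materialized intermediate data: each node answers a subquery over its own shard, and only small partial aggregates travel to the coordinator, which is what makes Dremel- and Impala-style aggregation queries interactive.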