So, let's try to summarize the bigger picture as I understand it regarding data management and big data. I often get asked questions about the difference between big data technologies, relational databases, in-memory databases, and all that, so I'll try to paint a bigger picture over the next few slides. This is not really course material, and we won't be asking questions on it, but it's probably interesting to many of you.

Databases, if you think about it, were originally designed in the financial services industry for transaction processing, essentially keeping track of everybody's money, and other industries very soon followed. In the late 70s and 80s we had Oracle, then slowly DB2, and the relational database model. The whole business of reporting and analytics on data really came as an afterthought. There were reporting databases, where people would take backups of transactional data and then run reports on them, to figure out what sales were in the past few months and slice and dice them by region, et cetera.

Big data technologies, on the other hand, were designed for analytics: computing classifiers like the Bayesian classifier we discussed earlier, in the section on Listen, not really for making queries. For example, one would rather not run a batch MapReduce job just to select, say, 5% of the rows in a table. It's much easier to use an inverted index, similar to what one would use for unstructured data, to retrieve those rows, rather than run a large-scale MapReduce job. We'll look at this in a little more detail in a few minutes, since Dremel views it in a slightly different light. But by and large, the batch MapReduce paradigm was really designed for counting, not for doing queries.

The second big difference from traditional databases is that data is captured pretty much in the raw, as the logs of transactions come in. There are no transactional overheads when data is captured: you don't have to make sure that multiple people entering data at the same time don't override each other's transactions. Since those things don't have to be worried about, the overhead of data capture is much less, and the blowup in how much extra data needs to be stored is much less. As a result, it turns out that in the enterprise world people are perceiving a price-performance advantage, even for standard extract-transform-load (ETL) tasks as well as some bulk query tasks. And that's why things like Dremel become important.

Now, as an aside: in the transaction processing world there has also been an evolution, sort of big-data-ish, though quite different from the analytical world. As an example, think about Google. They run a massive online keyword auction to sell ads, with bidding on keywords, every day and continuously. Initially they used variations of MySQL. Very quickly they moved to a Bigtable-based transactional store to handle the bids on keywords. They built something called Megastore, and then something called F1, which is really being used much more now. And very recently, just last year in 2012, they came out with Spanner, which is a large-scale, globally distributed, in-memory transactional database.
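To make the inverted-index point concrete, here is a minimal sketch, not part of the course material: a toy table and an index that maps a column value to the ids of the rows containing it, so a selective query becomes a lookup rather than a full scan or a batch MapReduce job. The column names and data here are hypothetical.

```python
from collections import defaultdict

# A toy "table": in a real system this would be millions of rows.
rows = [
    {"id": 0, "city": "delhi",  "sales": 120},
    {"id": 1, "city": "mumbai", "sales": 340},
    {"id": 2, "city": "delhi",  "sales": 75},
    {"id": 3, "city": "mumbai", "sales": 90},
]

# Build the inverted index once: city value -> ids of matching rows.
index = defaultdict(list)
for row in rows:
    index[row["city"]].append(row["id"])

# "Select the rows where city = 'delhi'" is now a dictionary lookup
# that touches only the matching rows, not a scan of the whole table.
matches = [rows[i] for i in index["delhi"]]
print(matches)  # the two delhi rows
```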
But all of these are, in some sense, really big data; they are transaction processing databases, not analytical big databases. That's why we don't talk about them too much in this course: we're talking about web intelligence, and analytics related to web intelligence, rather than capturing transactions such as a keyword auction, where you need to make sure the right highest bidder gets each keyword.

Now, for those of you who are very familiar with business intelligence using SQL, which is essentially what reporting and online analytical processing in large traditional enterprises is all about, generating reports from packages like BusinessObjects or Oracle, or data warehouses like Teradata: what is this all about? Well, think about what somebody doing business intelligence is actually up to. They have a lot of data, say data about customers, represented by these points. They look at a small slice of it: sales by region, by city, by store, by product. You take a slice of this data, analyze that subset, and try to see how the distribution of the data looks within it, looking for some interesting pattern. Then you take another slice and try to find some other interesting pattern. You try another slice, look for correlations that might lead to higher sales or better operational processes, and keep going.

The trouble is, you really can't do that for long. If you have a small amount of data, and more importantly a small amount of data about each customer, say m pieces of data about each customer, you're okay. But if this m becomes very large, or even moderately large, and the number of possible values of each of these Xs, the features that you know about your customers, becomes large too (these features could even be the words they write in their emails, or the clicks they have performed on your website), then suddenly your space becomes very large. If each of these m features takes just d values, the number of possible cells in the cube is of the order of d to the power m. You can easily work out that if m is 40 and d is 10, that is, 40 features per customer, each taking just 10 possible values, this is a huge number: 10 to the power 40, an astronomically large number. What this really means is that sampling this distribution and trying to find interesting patterns manually takes pretty close to infinite time. With an infinite number of people you could perhaps crack it, but not otherwise.

So the first message is that business intelligence folks need to learn deeper analytical techniques, which will be the subject of a later unit. And the second message is that big data is not really about having lots and lots of points. Google, for example, has petabytes of data; a large enterprise may have many hundreds of terabytes, or maybe just hundreds of gigabytes. The problem is not the number of points. The problem is how much information you have about each point. And that, in my opinion, is what is big about big data these days: the number of different sources of data that you have about your customers, or anything else. The different inputs available today, whether from social media or from sensors on mobile phones, are increasing m and d hugely.
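As a quick back-of-the-envelope check of that d-to-the-power-m figure, this is just arithmetic, but it makes the point:

```python
# m = 40 features per customer, each taking d = 10 possible values,
# gives d**m distinct cells you could slice the data into.
d, m = 10, 40
cells = d ** m
print(cells)            # 10**40
print(len(str(cells)))  # 41 digits: hopeless to explore by hand
```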
And therefore the number of possible cubes is just too large to examine manually, so you need analytical techniques. That's really what big data analytics is all about, and I hope that gives you a picture. It's not about petabytes versus terabytes versus gigabytes. It's really about how many columns you have, and how you can explore this space more efficiently, so that you find something interesting, or learn something about your data.
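To close the loop with the Bayesian classifier mentioned at the start, here is a minimal sketch of what such analytical techniques buy you: instead of manually eyeballing d**m slices, a naive Bayes model weighs all m features at once. This assumes categorical features; the feature names, data, and smoothing choice are all hypothetical illustrations, not course code.

```python
import math
from collections import Counter, defaultdict

def train(rows, labels):
    # rows: list of {feature: value} dicts; labels: parallel class labels.
    class_counts = Counter(labels)
    # counts[c][f][v] = number of class-c rows where feature f has value v
    counts = defaultdict(lambda: defaultdict(Counter))
    for row, c in zip(rows, labels):
        for f, v in row.items():
            counts[c][f][v] += 1
    return class_counts, counts

def predict(row, class_counts, counts):
    total = sum(class_counts.values())
    best, best_logp = None, -math.inf
    for c, n_c in class_counts.items():
        logp = math.log(n_c / total)  # class prior
        for f, v in row.items():
            # add-one smoothing so an unseen value does not zero things out
            seen = len(counts[c][f])
            logp += math.log((counts[c][f][v] + 1) / (n_c + seen + 1))
        if logp > best_logp:
            best, best_logp = c, logp
    return best

# Hypothetical data: did a customer respond to a campaign?
rows = [{"region": "north", "product": "A"},
        {"region": "north", "product": "B"},
        {"region": "south", "product": "A"},
        {"region": "south", "product": "B"}]
labels = ["yes", "yes", "no", "no"]
model = train(rows, labels)
print(predict({"region": "north", "product": "A"}, *model))  # -> "yes"
```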