So, let's try to summarize the bigger picture as I understand it regarding data management and big data. I often get asked questions about the difference between big data technologies, relational databases, in-memory databases, and all that, so I'll try to paint a bigger picture over the next few slides. This is not really course material, and we won't be asking questions on it, but it's probably interesting to many of you.

Databases, if you think about it, were originally designed in the financial services industry for transaction processing, essentially keeping track of everybody's money, and other industries very soon followed. In the late 70s and 80s we had Oracle, then slowly DB2, and the relational database model. The whole business of reporting and analytics on data really came as an afterthought. There were reporting databases, where people would take backups of transactional data and then run reports on them, to figure out what sales were in the past few months and slice and dice them by region, et cetera.

Big data technologies, on the other hand, were designed for analytics: computing classifiers like the Bayesian classifier we discussed earlier, in the section on Listen, not really for making queries. For example, one would rather not run a batch MapReduce job just to select, say, 5% of the rows in a table. It's much easier to use an inverted index, similar to what one would use for unstructured data, to retrieve those rows, rather than run a large-scale MapReduce job. We'll look at this in a little more detail in a few minutes, since Dremel views it in a slightly different light. But by and large, the batch MapReduce paradigm was really designed for counting, not for doing queries.

The second big difference from traditional databases is that data is captured pretty much in the raw, as the logs of transactions come in. There are no transactional overheads when data is captured: you don't have to make sure that multiple people entering data at the same time don't override each other's transactions. Since those things don't have to be worried about, the overhead of data capture is much less, and the blowup in how much extra data needs to be stored is much less. As a result, it turns out that in the enterprise world people are perceiving a price-performance advantage, even for standard extract-transform-load (ETL) tasks as well as some bulk query tasks. And that's why things like Dremel become important.

Now, as an aside: in the transaction processing world there has also been an evolution, sort of big-data-ish, though quite different from the analytical world. As an example, think about Google. They run a massive online keyword auction to sell ads, with bidding on keywords, every day and continuously. Initially they used variations of MySQL. Very quickly they moved to a Bigtable-based transactional store to handle the bids on keywords. They built something called Megastore, and then something called F1, which is really being used much more now. And very recently, just last year in 2012, they came out with Spanner, which is a large-scale, globally distributed, in-memory transactional database.
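To make the inverted-index point concrete, here is a minimal sketch, not part of the course material: a toy table and an index that maps a column value to the ids of the rows containing it, so a selective query becomes a lookup rather than a full scan or a batch MapReduce job. The column names and data here are hypothetical.

```python
from collections import defaultdict

# A toy "table": in a real system this would be millions of rows.
rows = [
    {"id": 0, "city": "delhi",  "sales": 120},
    {"id": 1, "city": "mumbai", "sales": 340},
    {"id": 2, "city": "delhi",  "sales": 75},
    {"id": 3, "city": "mumbai", "sales": 90},
]

# Build the inverted index once: city value -> ids of matching rows.
index = defaultdict(list)
for row in rows:
    index[row["city"]].append(row["id"])

# "Select the rows where city = 'delhi'" is now a dictionary lookup
# that touches only the matching rows, not a scan of the whole table.
matches = [rows[i] for i in index["delhi"]]
print(matches)  # the two delhi rows
```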
But all of these are, in some sense, really big data; they are transaction processing databases, not analytical big databases. That's why we don't talk about them too much in this course: we're talking about web intelligence, and analytics related to web intelligence, rather than capturing transactions such as a keyword auction, where you need to make sure the right highest bidder gets each keyword.

Now, for those of you who are very familiar with business intelligence using SQL, which is essentially what reporting and online analytical processing in large traditional enterprises is all about, generating reports from packages like BusinessObjects or Oracle, or data warehouses like Teradata: what is this all about? Well, think about what somebody doing business intelligence is actually up to. They have a lot of data, say data about customers, represented by these points. They look at a small slice of it: sales by region, by city, by store, by product. You take a slice of this data, analyze that subset, and try to see how the distribution of the data looks within it, looking for some interesting pattern. Then you take another slice and try to find some other interesting pattern. You try another slice, look for correlations that might lead to higher sales or better operational processes, and keep going.

The trouble is, you really can't do that for long. If you have a small amount of data, and more importantly a small amount of data about each customer, say m pieces of data about each customer, you're okay. But if this m becomes very large, or even moderately large, and the number of possible values of each of these Xs, the features that you know about your customers, becomes large too (these features could even be the words they write in their emails, or the clicks they have performed on your website), then suddenly your space becomes very large. If each of these m features takes just d values, the number of possible cells in the cube is of the order of d to the power m. You can easily work out that if m is 40 and d is 10, that is, 40 features per customer, each taking just 10 possible values, this is a huge number: 10 to the power 40, an astronomically large number. What this really means is that sampling this distribution and trying to find interesting patterns manually takes pretty close to infinite time. With an infinite number of people you could perhaps crack it, but not otherwise.

So the first message is that business intelligence folks need to learn deeper analytical techniques, which will be the subject of a later unit. And the second message is that big data is not really about having lots and lots of points. Google, for example, has petabytes of data; a large enterprise may have many hundreds of terabytes, or maybe just hundreds of gigabytes. The problem is not the number of points. The problem is how much information you have about each point. And that, in my opinion, is what is big about big data these days: the number of different sources of data that you have about your customers, or anything else. The different inputs available today, whether from social media or from sensors on mobile phones, are increasing m and d hugely.
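As a quick back-of-the-envelope check of that d-to-the-power-m figure, this is just arithmetic, but it makes the point:

```python
# m = 40 features per customer, each taking d = 10 possible values,
# gives d**m distinct cells you could slice the data into.
d, m = 10, 40
cells = d ** m
print(cells)            # 10**40
print(len(str(cells)))  # 41 digits: hopeless to explore by hand
```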
And therefore the number of possible cubes is just too large to examine manually, so you need analytical techniques. That's really what big data analytics is all about, and I hope that gives you a picture. It's not about petabytes versus terabytes versus gigabytes. It's really about how many columns you have, and how you can explore this space more efficiently, so that you find something interesting, or learn something about your data.
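To close the loop with the Bayesian classifier mentioned at the start, here is a minimal sketch of what such analytical techniques buy you: instead of manually eyeballing d**m slices, a naive Bayes model weighs all m features at once. This assumes categorical features; the feature names, data, and smoothing choice are all hypothetical illustrations, not course code.

```python
import math
from collections import Counter, defaultdict

def train(rows, labels):
    # rows: list of {feature: value} dicts; labels: parallel class labels.
    class_counts = Counter(labels)
    # counts[c][f][v] = number of class-c rows where feature f has value v
    counts = defaultdict(lambda: defaultdict(Counter))
    for row, c in zip(rows, labels):
        for f, v in row.items():
            counts[c][f][v] += 1
    return class_counts, counts

def predict(row, class_counts, counts):
    total = sum(class_counts.values())
    best, best_logp = None, -math.inf
    for c, n_c in class_counts.items():
        logp = math.log(n_c / total)  # class prior
        for f, v in row.items():
            # add-one smoothing so an unseen value does not zero things out
            seen = len(counts[c][f])
            logp += math.log((counts[c][f][v] + 1) / (n_c + seen + 1))
        if logp > best_logp:
            best, best_logp = c, logp
    return best

# Hypothetical data: did a customer respond to a campaign?
rows = [{"region": "north", "product": "A"},
        {"region": "north", "product": "B"},
        {"region": "south", "product": "A"},
        {"region": "south", "product": "B"}]
labels = ["yes", "yes", "no", "no"]
model = train(rows, labels)
print(predict({"region": "north", "product": "A"}, *model))  # -> "yes"
```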