So, let's try to summarize the bigger picture as I understand it regarding data management and big data. I often get asked questions about, you know, what's the difference between big data technologies, relational databases, in-memory databases, and all that. So I'll try to paint a bigger picture over the next few slides. This is not really course material, it's more of a bigger picture; we won't be asking questions on this material, but it's probably interesting to many of you.

Let's see. Databases, if you think about it, were originally designed in the financial services industry, to start with, for transaction processing: essentially keeping track of everybody's money. And other industries very soon followed. In the late 70s and 80s we had Oracle, then slowly DB2, and the relational database model. The whole business of reporting and analytics on data really came as an afterthought. There were reporting databases, where people would take backups of transactional data and then run reports on them, to figure out what sales were in the past few months, and slice them and dice them by region, et cetera.

Big data technologies, on the other hand, were designed for analytics: computing classifiers, like the Bayesian classifier we discussed earlier in the section on Listen, not really for making queries. For example, one would rather not run a batch MapReduce job to select, say, 5% of the rows in a table. It's much easier to use, for example, an inverted index, which is similar to what one would use for unstructured data, to retrieve those rows, rather than run a large-scale MapReduce job. We'll look at that in a little more detail in a few minutes, since Dremel looks at this in a slightly different light. But by and large, the batch MapReduce paradigm was really designed for counting, not for doing queries.
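As a minimal sketch of that idea in Python (the rows, columns, and values here are invented for illustration, not taken from the lecture): an inverted index maps each value to the set of row ids that contain it, so selecting the matching rows becomes a lookup rather than a full scan or a batch job.

```python
from collections import defaultdict

# Toy table of rows; the ids, columns, and values are hypothetical.
rows = [
    {"id": 0, "city": "Delhi",  "product": "phone"},
    {"id": 1, "city": "Mumbai", "product": "laptop"},
    {"id": 2, "city": "Delhi",  "product": "laptop"},
]

# Build the inverted index: (column, value) -> set of row ids containing it,
# much like a search engine's posting list for each word.
index = defaultdict(set)
for row in rows:
    for col in ("city", "product"):
        index[(col, row[col])].add(row["id"])

# Selecting the rows where city = 'Delhi' is now a single lookup,
# not a scan over the whole table and not a batch MapReduce job.
print(sorted(index[("city", "Delhi")]))  # -> [0, 2]
```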
And the second big difference from traditional databases is that data is captured pretty much in the raw, as the logs of transactions that come in. There are no transactional overheads: no need to ensure that when data is being entered by multiple people at the same time, they don't overwrite each other's transactions. You don't have to worry about those things, so the overhead is much less in terms of data capture, and the blowup is much less in terms of how much extra data needs to be stored. As a result, it turns out that in the enterprise world people are perceiving a price-performance advantage, even for standard extract-transform-load tasks, as well as some bulk query tasks. And that's why things like Dremel become important.

Now, as an aside, in the transaction processing world there is also an evolution, sort of big-data-ish, but not very different from the analytical world. As an example, think about Google. They run a massive online keyword auction, to sell ads using bidding on keywords, every day and continuously. Initially they used variations of MySQL. Very quickly they moved to a Bigtable-based transactional store to handle the bids on keywords. They built something called Megastore, and then they built something called F1, which is really being used much more now. And very recently, just last year in 2012, they came out with Spanner, which is a large-scale, globally distributed transactional database. But all these are, in some sense, really big data, yet not analytical big databases; they are transaction processing databases. And that's why we don't really talk about them too much in this course, because we're talking about web intelligence, and analytics related to web intelligence, rather than capturing transactions, such as a keyword auction, to make sure you get the highest bidder for a keyword.

Now, for those of you who are very familiar with business intelligence using SQL, which is essentially what reporting, and online analytical processing in large-scale traditional enterprises, is all about, generating reports from packages like Business Objects or Oracle, or other data warehouses like Teradata: what is this all about? Well, think about what somebody doing business intelligence is actually up to. They have a lot of data, say data about customers, represented by these points. What they're trying to do is look at a small slice, you know, by region, and sales by city, by store, by product. You analyze a subset, a slice of this data, and try to find the distribution of how that data looks in this small subset, trying to find some interesting patterns. Then you may use another slice, and try to find some other interesting pattern. Try another slice, and look for some correlations which might lead to higher sales, or better operational processes, and keep going.
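As a rough sketch of what one such slice looks like in practice, here is the kind of query a BI tool issues under the hood. The table, columns, and figures are made up purely for illustration:

```python
import sqlite3

# A toy sales table held in memory; schema and values are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, city TEXT, product TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("North", "Delhi",     "phone",  200.0),
     ("North", "Delhi",     "laptop", 900.0),
     ("South", "Chennai",   "phone",  150.0),
     ("South", "Bangalore", "laptop", 1100.0)],
)

# One "slice": fix the region, then "dice" that slice by product.
query = ("SELECT product, SUM(amount) FROM sales "
         "WHERE region = 'North' GROUP BY product")
for product, total in con.execute(query):
    print(product, total)  # e.g. laptop 900.0, phone 200.0
```

Each choice of slicing dimension and value gives a different such query, and the analyst inspects the resulting distribution for patterns.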
The trouble is, you really can't do that too much. If you have a small amount of data, and more importantly a small amount of data about each customer, say m pieces of data about each customer, you're okay. But suppose this m becomes large, even moderately large, and the number of possible values of each of these Xs, the features that you know about your customers, becomes large as well. These features might even be the words they're writing in their emails, or the clicks they have performed on your website. Suddenly your space becomes very large. If each of these m features takes just d values, the number of possible cubes is of the order of d^(2m). And you can very easily figure out that if m = 40 and d = 10, so you have 40 features per customer, and each of them can take just 10 possible values, this is a huge number; it's more than the number of atoms in the universe. What this really means is that sampling this distribution and trying to find some interesting patterns manually is pretty close to taking infinite time. Even if you had an infinite number of people you could probably crack it, but not otherwise.
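A quick back-of-the-envelope check of that claim, using the lecture's own figures for m and d:

```python
# m features per customer, each taking one of d values, gives on the
# order of d**(2*m) possible cubes to examine (the figure quoted above).
m, d = 40, 10
cubes = d ** (2 * m)
print(f"{cubes:.1e}")  # 1.0e+80, comparable to the number of atoms in the universe
```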
So the first message is that business intelligence folks need to learn deeper analytical techniques, which is going to be the subject of a later unit. And the second message is that big data is not really about having lots and lots of points. I mean, Google, for example, has petabytes of data; a large enterprise may have many hundreds of terabytes of data, or perhaps only a few hundred gigabytes. The problem is not the number of points. The problem is how much information you have about these points. And that, in my opinion, is what is big about big data these days: the number of different sources of data that you have about your customers, or anything else. Because of the different inputs that you have today, whether from social media or from sensors on mobile phones, both m and d are increasing hugely. And therefore the number of possible cubes is just too difficult to examine manually, and so you need analytical techniques.

And that's really what big data analytics is all about. I hope that gives you a picture. It's not about petabytes versus terabytes versus gigabytes. It's really about how many columns you have, and how you can explore this space more efficiently, so that you find something interesting, or learn something about your data.