So, as I just mentioned, analytics is about counting, not queries. What kind of counting do you need to do? Well, we've already done one example, computing these likelihoods; there is a lot of counting to be done there. We're going to talk about things like clustering; one example we've already seen in the first lecture, computing hash functions with locality-sensitive hashing, is a kind of clustering. We'll also talk about data mining, finding bumps or statistically different regions in your data, finding rules, and learning latent features, which involves things like sampling from distributions, which our guest lecture on Markov logic networks, or MLNs, will talk a little bit about. And we'll give one example using matrix multiplications as well.

But the bottom line I want to make now, since we're going to cover some of these techniques only from unit five onwards, is that from the database perspective, all of these techniques require you to touch all your data. It's not about querying. And if you have to touch all your data, then the complex index structures for fast query processing that traditional relational databases and data warehouse technologies like Teradata have are pretty useless.

So let me try to summarize the evolution of analytical databases and current trends. On the one hand we have the NoSQL databases. They got rid of ACID transactions, that is, transaction isolation with its conflict handling and locking; none of that is there. Instead of indexes, they essentially use shards, meaning distribution of data across different data nodes, with distributed processing rather than indexing. They don't have joins, because joins are better done in MapReduce, and they support columnar storage so that you can have wide data. And as we've mentioned, big data is about wide data.

On the other hand, there is another trend in the analytical database world, which is all about in-memory databases. In-memory databases allow real-time transactions, and they support a variety of indexes and complex joins. The problem is that even though in-memory databases are great for very fast query processing, and also for large-scale transaction processing, you are back to the same problem of the number of possible options you can explore while slicing and dicing. You just can't explore very much if the data is very wide (the small sketch below gives a sense of why).
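To give a rough sense of that combinatorial problem, here is a minimal Python sketch, not from the lecture itself; the column count of 30 is made up for illustration. It only counts how many slice-and-dice combinations a moderately wide table already admits:

```python
from itertools import combinations

# Hypothetical wide table: d columns you might slice and dice by.
columns = [f"col_{i}" for i in range(30)]   # a modestly "wide" table
d = len(columns)

# Every subset of columns is a possible GROUP BY, so there are 2**d of them.
print(f"{d} columns -> {2**d:,} possible group-by combinations")

# Even restricting exploration to slices over at most 3 columns is a lot.
small = sum(1 for k in range(1, 4) for _ in combinations(columns, k))
print(f"group-bys over 1-3 columns: {small:,}")
```

Before touching a single row, the space of possible group-bys is already over a billion for 30 columns, which is why OLAP-style exploration breaks down on wide data no matter how fast the underlying engine is.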
The second problem is that even in memory, whether you access data sequentially or at random makes a huge difference, given current memory hierarchies. So in-memory databases are not really a panacea for large-scale data analytics today. When you have a reasonable amount of data, it's still important to access it sequentially and touch all of it to do analytical processing, and whether the data is in memory or on disk, MapReduce is still a reasonably good paradigm: it's an abstraction that allows you to touch all your data efficiently.
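To get a feel for the sequential-versus-random point, here is a rough sketch (my own illustration, not from the lecture; the array size is arbitrary and the exact ratio depends on your hardware) that sums the same array once in scan order and once in a random order:

```python
import time
import numpy as np

n = 20_000_000                      # arbitrary size, much larger than CPU caches
data = np.random.rand(n)
perm = np.random.permutation(n)     # a random visiting order

t0 = time.perf_counter()
seq_total = data.sum()              # sequential scan: prefetch/cache friendly
t1 = time.perf_counter()
rnd_total = data[perm].sum()        # gather in random order: cache hostile
t2 = time.perf_counter()

print(f"sequential: {t1 - t0:.3f}s   random order: {t2 - t1:.3f}s")
assert np.isclose(seq_total, rnd_total)   # same answer, very different cost
```

The random-order version also pays for materializing a shuffled copy, so this is only a rough illustration, but the gap it shows is exactly the kind of memory-hierarchy effect that remains even when everything is "in memory".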
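And to make the "touch all your data" style of processing concrete, here is a minimal counting sketch in the spirit of MapReduce, again my own illustration with made-up record names rather than anything from the lecture: data is hash-partitioned into shards the way the NoSQL stores above spread it across data nodes, each shard is scanned in full and counted locally, and the partial counts are merged. It is the same scatter-gather pattern that Dremel and Impala, discussed below, apply to aggregation queries.

```python
from collections import Counter
from typing import Iterable, List

NUM_SHARDS = 4          # toy "cluster"; each shard would be a separate data node

def shard_of(key: str) -> int:
    """Hash-partition records by key, the way sharded NoSQL stores do.
    (Python's hash() is salted per run, but any consistent partition works.)"""
    return hash(key) % NUM_SHARDS

def local_count(shard: Iterable[str]) -> Counter:
    """'Map' step: each node scans *all* of its shard and counts locally."""
    return Counter(shard)

def merge(partials: Iterable[Counter]) -> Counter:
    """'Reduce' step: combine the partial counts from every node."""
    total: Counter = Counter()
    for p in partials:
        total.update(p)
    return total

# Tiny demo with made-up event records.
records = ["click", "view", "click", "buy", "view", "click"]
shards: List[List[str]] = [[] for _ in range(NUM_SHARDS)]
for r in records:
    shards[shard_of(r)].append(r)

print(merge(local_count(s) for s in shards))
# -> Counter({'click': 3, 'view': 2, 'buy': 1})
```

Notice that there is no index anywhere: every record in every shard gets touched exactly once, which is the access pattern analytical workloads want.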
So if I were to summarize the evolution of database technology: we started out with relational databases as a sort of one-size-fits-all. Whether you're doing transaction processing, reporting, or even analytics on, say, gigabytes of data, you can pretty much do it all with a relational database. Of course, whether you want to do OLAP or not will depend on the number of columns you have; with too many columns, OLAP becomes difficult because you can't go through all the possibilities, but that's an OLAP problem, not a database problem.

Then, when we had large volumes and large numbers of columns, we moved to column-store data warehouses and terabytes of data. Data warehouse technologies like Teradata and Oracle's data warehouse systems can deal with column stores and terabytes of data. In parallel, we had distributed NoSQL row and column stores, which also dealt with terabytes of data, but where you didn't really have to do joins; when you wanted a join, you would do a MapReduce, and especially for joins in bulk analysis you could handle tens of terabytes of data there. In parallel again, in-memory one-size-fits-all databases are also available, but they have their own challenges, in that access patterns still matter. And last but not least there is Dremel, which is now a one-size-fits-all database for petabytes of data: you can still do batch processing on it, but you can also do fast aggregation queries where only parts of the data are touched.

In the last year, Cloudera, which is one of the Hadoop distributions, released a product called Impala, which is essentially an implementation of Dremel. It still has a long way to go in terms of efficiency, but what it allows you to do is distributed query processing without having to run a MapReduce. It splits a query, just like Dremel, into multiple subqueries, and each subquery can be executed on a separate data node. The data can sit in Hive, which is essentially HDFS plus some metadata, or it can be linked in through Hive, since Hive allows you to link data stored in HDFS. For those of you who want to learn more about Impala, Cloudera has plenty of documentation about it, and of course it's freely downloadable as part of their distribution.

Okay. So that brings us to the end of the section on Load, the two units on Load. We started off talking about distributed file systems, which are the second basic element of big data after MapReduce. We talked about what databases are good for and why traditional databases were a happy compromise. And we covered the evolution of databases, the evolution of SQL, and the evolution of MapReduce itself. In the past few minutes I've also tried to cover the big picture of database evolution as we see it today.

Next week we'll have a guest lecture by Srikanta Bedathur of IIIT Delhi on graph databases. We won't have separate quizzes and homework on the guest lecture, but if you have time, do homework three and the programming assignment that goes with it. After that we'll get started with unit five, which is going to be on the subject Learn, where we'll start unearthing facts from data using more complex classification, clustering, and rule mining, all the things we just mentioned earlier in this summary.