So, as I just mentioned, analytics is about counting, not queries. What kind of counting do you need to do? Well, we've already done one example, computing these likelihoods; there is a lot of counting to be done there. We're going to talk about things like clustering; one example we've already seen in the first lecture, computing hash functions with locality-sensitive hashing, is a kind of clustering. We'll also talk about data mining, finding bumps or statistically different regions in your data, finding rules, and learning latent features, which involves things like sampling from distributions, which our guest lecture on Markov logic networks, or MLNs, will talk a little bit about. And we'll give one example using matrix multiplications as well.

But the bottom line I want to make now, since we're going to cover some of these techniques only from unit five onwards, is that from the database perspective, all of these techniques require you to touch all your data. It's not about querying. And if you have to touch all your data, then the complex index structures for fast query processing that traditional relational databases and data warehouse technologies like Teradata have are pretty useless.

So let me try to summarize the evolution of analytical databases and current trends. On the one hand we have the NoSQL databases. They got rid of ACID transactions, that is, transaction isolation with its conflict handling and locking; none of that is there. Instead of indexes, they essentially use shards, meaning distribution of data across different data nodes, with distributed processing rather than indexing. They don't have joins, because joins are better done in MapReduce, and they support columnar storage so that you can have wide data. And as we've mentioned, big data is about wide data.

On the other hand, there is another trend in the analytical database world, which is all about in-memory databases. In-memory databases allow real-time transactions, and they support a variety of indexes and complex joins. The problem is that even though in-memory databases are great for very fast query processing, and also for large-scale transaction processing, you are back to the same problem of the number of possible options you can explore while slicing and dicing. You just can't explore very much if the data is very wide (the small sketch below gives a sense of why).
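To give a rough sense of that combinatorial problem, here is a minimal Python sketch, not from the lecture itself; the column count of 30 is made up for illustration. It only counts how many slice-and-dice combinations a moderately wide table already admits:

```python
from itertools import combinations

# Hypothetical wide table: d columns you might slice and dice by.
columns = [f"col_{i}" for i in range(30)]   # a modestly "wide" table
d = len(columns)

# Every subset of columns is a possible GROUP BY, so there are 2**d of them.
print(f"{d} columns -> {2**d:,} possible group-by combinations")

# Even restricting exploration to slices over at most 3 columns is a lot.
small = sum(1 for k in range(1, 4) for _ in combinations(columns, k))
print(f"group-bys over 1-3 columns: {small:,}")
```

Before touching a single row, the space of possible group-bys is already over a billion for 30 columns, which is why OLAP-style exploration breaks down on wide data no matter how fast the underlying engine is.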
The second problem is that even in memory, whether you access data sequentially or at random makes a huge difference, given current memory hierarchies. So in-memory databases are not really a panacea for large-scale data analytics today. When you have a reasonable amount of data, it's still important to access it sequentially and touch all of it to do analytical processing, and whether the data is in memory or on disk, MapReduce is still a reasonably good paradigm: it's an abstraction that allows you to touch all your data efficiently.
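To get a feel for the sequential-versus-random point, here is a rough sketch (my own illustration, not from the lecture; the array size is arbitrary and the exact ratio depends on your hardware) that sums the same array once in scan order and once in a random order:

```python
import time
import numpy as np

n = 20_000_000                      # arbitrary size, much larger than CPU caches
data = np.random.rand(n)
perm = np.random.permutation(n)     # a random visiting order

t0 = time.perf_counter()
seq_total = data.sum()              # sequential scan: prefetch/cache friendly
t1 = time.perf_counter()
rnd_total = data[perm].sum()        # gather in random order: cache hostile
t2 = time.perf_counter()

print(f"sequential: {t1 - t0:.3f}s   random order: {t2 - t1:.3f}s")
assert np.isclose(seq_total, rnd_total)   # same answer, very different cost
```

The random-order version also pays for materializing a shuffled copy, so this is only a rough illustration, but the gap it shows is exactly the kind of memory-hierarchy effect that remains even when everything is "in memory".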
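And to make the "touch all your data" style of processing concrete, here is a minimal counting sketch in the spirit of MapReduce, again my own illustration with made-up record names rather than anything from the lecture: data is hash-partitioned into shards the way the NoSQL stores above spread it across data nodes, each shard is scanned in full and counted locally, and the partial counts are merged. It is the same scatter-gather pattern that Dremel and Impala, discussed below, apply to aggregation queries.

```python
from collections import Counter
from typing import Iterable, List

NUM_SHARDS = 4          # toy "cluster"; each shard would be a separate data node

def shard_of(key: str) -> int:
    """Hash-partition records by key, the way sharded NoSQL stores do.
    (Python's hash() is salted per run, but any consistent partition works.)"""
    return hash(key) % NUM_SHARDS

def local_count(shard: Iterable[str]) -> Counter:
    """'Map' step: each node scans *all* of its shard and counts locally."""
    return Counter(shard)

def merge(partials: Iterable[Counter]) -> Counter:
    """'Reduce' step: combine the partial counts from every node."""
    total: Counter = Counter()
    for p in partials:
        total.update(p)
    return total

# Tiny demo with made-up event records.
records = ["click", "view", "click", "buy", "view", "click"]
shards: List[List[str]] = [[] for _ in range(NUM_SHARDS)]
for r in records:
    shards[shard_of(r)].append(r)

print(merge(local_count(s) for s in shards))
# -> Counter({'click': 3, 'view': 2, 'buy': 1})
```

Notice that there is no index anywhere: every record in every shard gets touched exactly once, which is the access pattern analytical workloads want.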
So if I were to summarize the evolution of database technology: we started out with relational databases as a sort of one-size-fits-all. Whether you're doing transaction processing, reporting, or even analytics on, say, gigabytes of data, you can pretty much do it all with a relational database. Of course, whether you want to do OLAP or not will depend on the number of columns you have; with too many columns, OLAP becomes difficult because you can't go through all the possibilities, but that's an OLAP problem, not a database problem.

Then, when we had large volumes and large numbers of columns, we moved to column-store data warehouses and terabytes of data. Data warehouse technologies like Teradata and Oracle's data warehouse systems can deal with column stores and terabytes of data. In parallel, we had distributed NoSQL row and column stores, which also dealt with terabytes of data, but where you didn't really have to do joins; when you wanted a join, you would do a MapReduce, and especially for joins in bulk analysis you could handle tens of terabytes of data there. In parallel again, in-memory one-size-fits-all databases are also available, but they have their own challenges, in that access patterns still matter. And last but not least there is Dremel, which is now a one-size-fits-all database for petabytes of data: you can still do batch processing on it, but you can also do fast aggregation queries where only parts of the data are touched.

In the last year, Cloudera, which is one of the Hadoop distributions, released a product called Impala, which is essentially an implementation of Dremel. It still has a long way to go in terms of efficiency, but what it allows you to do is distributed query processing without having to run a MapReduce. It splits a query, just like Dremel, into multiple subqueries, and each subquery can be executed on a separate data node. The data can sit in Hive, which is essentially HDFS plus some metadata, or it can be linked in through Hive, since Hive allows you to link data stored in HDFS. For those of you who want to learn more about Impala, Cloudera has plenty of documentation about it, and of course it's freely downloadable as part of their distribution.

Okay. So that brings us to the end of the section on Load, the two units on Load. We started off talking about distributed file systems, which are the second basic element of big data after MapReduce. We talked about what databases are good for and why traditional databases were a happy compromise. And we covered the evolution of databases, the evolution of SQL, and the evolution of MapReduce itself. In the past few minutes I've also tried to cover the big picture of database evolution as we see it today.

Next week we'll have a guest lecture by Srikanta Bedathur of IIIT Delhi on graph databases. We won't have separate quizzes and homework on the guest lecture, but if you have time, do homework three and the programming assignment that goes with it. After that we'll get started with unit five, which is going to be on the subject Learn, where we'll start unearthing facts from data using more complex classification, clustering, and rule mining, all the things we just mentioned earlier in this summary.