So, let's try to summarize the bigger picture as I understand it regarding data management and big data. I often get asked questions about, you know, what's the difference between big data technologies, relational databases, in-memory databases, and all that. So I'll try to paint a bigger picture over the next few slides. This is not really course material, it's more of a bigger picture; we won't be asking questions on this material, but it's probably interesting to many of you.

Let's see. Databases, if you think about it, were originally designed in the financial services industry, to start with, for transaction processing: essentially keeping track of everybody's money. And other industries very soon followed. In the late 70s and 80s we had Oracle, then slowly DB2, and the relational database model. The whole business of reporting and analytics on data really came as an afterthought. There were reporting databases, where people would take backups of transactional data and then run reports on them, to figure out what sales were in the past few months, and slice them and dice them by region, et cetera.

Big data technologies, on the other hand, were designed for analytics: computing classifiers, like the Bayesian classifier we discussed earlier in the section on Listen, not really for making queries. For example, one would rather not run a batch MapReduce job to select, say, 5% of the rows in a table. It's much easier to use, for example, an inverted index, which is similar to what one would use for unstructured data, to retrieve those rows, rather than run a large-scale MapReduce job. We'll look at that in a little more detail in a few minutes, since Dremel looks at this in a slightly different light. But by and large, the batch MapReduce paradigm was really designed for counting, not for doing queries.
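As a minimal sketch of that idea in Python (the rows, columns, and values here are invented for illustration, not taken from the lecture): an inverted index maps each value to the set of row ids that contain it, so selecting the matching rows becomes a lookup rather than a full scan or a batch job.

```python
from collections import defaultdict

# Toy table of rows; the ids, columns, and values are hypothetical.
rows = [
    {"id": 0, "city": "Delhi",  "product": "phone"},
    {"id": 1, "city": "Mumbai", "product": "laptop"},
    {"id": 2, "city": "Delhi",  "product": "laptop"},
]

# Build the inverted index: (column, value) -> set of row ids containing it,
# much like a search engine's posting list for each word.
index = defaultdict(set)
for row in rows:
    for col in ("city", "product"):
        index[(col, row[col])].add(row["id"])

# Selecting the rows where city = 'Delhi' is now a single lookup,
# not a scan over the whole table and not a batch MapReduce job.
print(sorted(index[("city", "Delhi")]))  # -> [0, 2]
```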
And the second big difference from traditional databases is that data is captured pretty much in the raw, as the logs of transactions that come in. There are no transactional overheads: no need to ensure that when data is being entered by multiple people at the same time, they don't overwrite each other's transactions. You don't have to worry about those things, so the overhead is much less in terms of data capture, and the blowup is much less in terms of how much extra data needs to be stored. As a result, it turns out that in the enterprise world people are perceiving a price-performance advantage, even for standard extract-transform-load tasks, as well as some bulk query tasks. And that's why things like Dremel become important.

Now, as an aside, in the transaction processing world there is also an evolution, sort of big-data-ish, but not very different from the analytical world. As an example, think about Google. They run a massive online keyword auction, to sell ads using bidding on keywords, every day and continuously. Initially they used variations of MySQL. Very quickly they moved to a Bigtable-based transactional store to handle the bids on keywords. They built something called Megastore, and then they built something called F1, which is really being used much more now. And very recently, just last year in 2012, they came out with Spanner, which is a large-scale, globally distributed transactional database. But all these are, in some sense, really big data, yet not analytical big databases; they are transaction processing databases. And that's why we don't really talk about them too much in this course, because we're talking about web intelligence, and analytics related to web intelligence, rather than capturing transactions, such as a keyword auction, to make sure you get the highest bidder for a keyword.

Now, for those of you who are very familiar with business intelligence using SQL, which is essentially what reporting, and online analytical processing in large-scale traditional enterprises, is all about, generating reports from packages like Business Objects or Oracle, or other data warehouses like Teradata: what is this all about? Well, think about what somebody doing business intelligence is actually up to. They have a lot of data, say data about customers, represented by these points. What they're trying to do is look at a small slice, you know, by region, and sales by city, by store, by product. You analyze a subset, a slice of this data, and try to find the distribution of how that data looks in this small subset, trying to find some interesting patterns. Then you may use another slice, and try to find some other interesting pattern. Try another slice, and look for some correlations which might lead to higher sales, or better operational processes, and keep going.
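As a rough sketch of what one such slice looks like in practice, here is the kind of query a BI tool issues under the hood. The table, columns, and figures are made up purely for illustration:

```python
import sqlite3

# A toy sales table held in memory; schema and values are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, city TEXT, product TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("North", "Delhi",     "phone",  200.0),
     ("North", "Delhi",     "laptop", 900.0),
     ("South", "Chennai",   "phone",  150.0),
     ("South", "Bangalore", "laptop", 1100.0)],
)

# One "slice": fix the region, then "dice" that slice by product.
query = ("SELECT product, SUM(amount) FROM sales "
         "WHERE region = 'North' GROUP BY product")
for product, total in con.execute(query):
    print(product, total)  # e.g. laptop 900.0, phone 200.0
```

Each choice of slicing dimension and value gives a different such query, and the analyst inspects the resulting distribution for patterns.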
The trouble is, you really can't do that too much. If you have a small amount of data, and more importantly a small amount of data about each customer, say m pieces of data about each customer, you're okay. But suppose this m becomes large, even moderately large, and the number of possible values of each of these Xs, the features that you know about your customers, becomes large as well. These features might even be the words they're writing in their emails, or the clicks they have performed on your website. Suddenly your space becomes very large. If each of these m features takes just d values, the number of possible cubes is of the order of d^(2m). And you can very easily figure out that if m = 40 and d = 10, so you have 40 features per customer, and each of them can take just 10 possible values, this is a huge number; it's more than the number of atoms in the universe. What this really means is that sampling this distribution and trying to find some interesting patterns manually is pretty close to taking infinite time. Even if you had an infinite number of people you could probably crack it, but not otherwise.
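A quick back-of-the-envelope check of that claim, using the lecture's own figures for m and d:

```python
# m features per customer, each taking one of d values, gives on the
# order of d**(2*m) possible cubes to examine (the figure quoted above).
m, d = 40, 10
cubes = d ** (2 * m)
print(f"{cubes:.1e}")  # 1.0e+80, comparable to the number of atoms in the universe
```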
So the first message is that business intelligence folks need to learn deeper analytical techniques, which is going to be the subject of a later unit. And the second message is that big data is not really about having lots and lots of points. I mean, Google, for example, has petabytes of data; a large enterprise may have many hundreds of terabytes of data, or perhaps only a few hundred gigabytes. The problem is not the number of points. The problem is how much information you have about these points. And that, in my opinion, is what is big about big data these days: the number of different sources of data that you have about your customers, or anything else. Because of the different inputs that you have today, whether from social media or from sensors on mobile phones, both m and d are increasing hugely. And therefore the number of possible cubes is just too difficult to examine manually, and so you need analytical techniques.

And that's really what big data analytics is all about. I hope that gives you a picture. It's not about petabytes versus terabytes versus gigabytes. It's really about how many columns you have, and how you can explore this space more efficiently, so that you find something interesting, or learn something about your data.