1
00:00:00,000 --> 00:00:05,044
Welcome to week three.
As promised, we'll now begin our coverage

2
00:00:05,044 --> 00:00:10,870
of big data technology starting this week
with MapReduce and a programming

3
00:00:10,870 --> 00:00:15,090
assignment based on this new way of doing
parallel computing.

4
00:00:15,090 --> 00:00:19,059
And next week, we'll cover the big data
platforms,

5
00:00:19,059 --> 00:00:24,070
such as distributed file systems, new
database technologies, and most

6
00:00:24,070 --> 00:00:30,042
importantly, where all this is headed, with
a glimpse of some research topics

7
00:00:30,042 --> 00:00:35,031
currently being explored.
Obviously, the first question that comes

8
00:00:35,031 --> 00:00:41,153
to mind is, with 30 to 40 years of
database work behind us, why did we

9
00:00:41,153 --> 00:00:46,053
suddenly need to invent a new database
technology?

10
00:00:47,038 --> 00:00:52,068
All this was done in the web world, and they
clearly found some reasons why the

11
00:00:52,068 --> 00:00:55,088
traditional technology didn't work for
them.

12
00:00:56,021 --> 00:01:01,353
Basically, there are four reasons, which
we'll briefly summarize right now, but

13
00:01:01,353 --> 00:01:07,419
we'll get into the details next week.
First of all, the traditional technologies

14
00:01:07,419 --> 00:01:13,660
were not fault tolerant at scale.
They would handle maybe dozens or hundreds

15
00:01:13,660 --> 00:01:20,490
of processing machines, but not thousands
or tens of thousands, or even millions,

16
00:01:20,490 --> 00:01:25,384
which are common in web platforms like
Google and Facebook.

17
00:01:25,384 --> 00:01:30,793
Next, traditional technologies were not
really good at handling text, and

18
00:01:30,793 --> 00:01:38,036
certainly not handling video and images.
Third, and this is a technical point.

19
00:01:38,036 --> 00:01:44,450
Large data volumes needed to be kept
online and available all the time, which

20
00:01:44,450 --> 00:01:51,681
was simply not possible in the traditional
way of doing things where old data would

21
00:01:51,681 --> 00:01:58,024
usually be archived onto tape or some
other storage which didn't eat up the

22
00:01:58,024 --> 00:02:02,493
online space.
And most importantly, parallel computing

23
00:02:02,493 --> 00:02:09,394
was really an add-on feature bolted onto
traditional database technologies only in

24
00:02:09,394 --> 00:02:14,023
the '90s.
Whereas, the scale of parallelism that the

25
00:02:14,023 --> 00:02:19,478
web required simply didn't work with such
add-on technology.

26
00:02:19,478 --> 00:02:25,341
As a result, traditional relational
database technology simply could not

27
00:02:25,341 --> 00:02:29,076
scale.
And it was also not suited for the deep

28
00:02:29,076 --> 00:02:34,072
analytics tasks, which were
compute-intensive, that all the web companies

29
00:02:34,072 --> 00:02:40,008
needed to perform, primarily in order to
target advertisements better as we have

30
00:02:40,008 --> 00:02:46,012
seen in the past week.
What has resulted is a bunch of new

31
00:02:46,012 --> 00:02:53,095
technologies coming out of the web world,
which not only perform at a higher scale

32
00:02:53,095 --> 00:02:58,038
than traditional ones,
but because they were built using

33
00:02:58,038 --> 00:03:03,672
commodity hardware and many of them are
now open source, they present a price

34
00:03:03,672 --> 00:03:08,089
performance challenge as compared to the
traditional technologies.

35
00:03:08,089 --> 00:03:14,080
So, it's not that big data technology
should be used only if you have big

36
00:03:14,080 --> 00:03:20,064
data or huge computational requirements.
The fact is that some of these new

37
00:03:20,064 --> 00:03:25,046
technologies are just cheaper and faster than
the old technology.

38
00:03:25,046 --> 00:03:31,015
Now, the main innovations in big data
technology are the MapReduce programming

39
00:03:31,015 --> 00:03:37,027
paradigm which we'll study this week, and
the distributed file systems and databases

40
00:03:37,027 --> 00:03:43,077
which we'll come to next week.
However, the main message is a little

41
00:03:43,077 --> 00:03:47,089
different.
It's not just the technology.

42
00:03:47,089 --> 00:03:54,791
As we shall see soon, a different approach
to data processing problems is required.

43
00:03:54,791 --> 00:04:01,730
If one tries to use new technology with an
old mindset, one can still get into the

44
00:04:01,730 --> 00:04:09,167
same kinds of difficulties.
There are some important caveats, though.

45
00:04:09,167 --> 00:04:15,509
The new technology is still maturing.
Many database innovations which have come

46
00:04:15,509 --> 00:04:22,333
up over the past 40 years remain unique to
the traditional relational database stack.

47
00:04:22,333 --> 00:04:28,098
Varieties of indexing, very complex query
optimizations, storage optimizations.

48
00:04:28,098 --> 00:04:33,092
All of these are being rediscovered and
reinvented for big data.

49
00:04:33,092 --> 00:04:37,046
We'll come to some of these things next
week.

50
00:04:37,046 --> 00:04:43,023
But for the moment, let's plunge into
parallel computing.