Welcome to week three. As promised, we'll now begin our coverage of big data technology, starting this week with MapReduce and a programming assignment based on this new way of doing parallel computing. Next week, we'll cover the big data platforms, such as distributed file systems and new database technologies, and, most importantly, where all this is headed, with a glimpse of some research topics currently being explored.

Obviously, the first question that comes to mind is: with 30 to 40 years of database work behind us, why did we suddenly need to invent a new database technology? All of this was done by the web companies, and they clearly found some reasons why the traditional technology didn't work for them. There are four reasons, which we'll summarize right now and get into in detail next week.

First, the traditional technologies were not fault tolerant at scale. They would handle maybe dozens or hundreds of processing machines, but not thousands, tens of thousands, or even millions, which are common in web platforms like Google and Facebook. Next, traditional technologies were not really good at handling text, and certainly not at handling video and images. Third, and this is a technical point, large data volumes needed to be kept online and available all the time, which was simply not possible in the traditional way of doing things, where old data would usually be archived onto tape or some other storage that didn't eat up the online space. And most importantly, parallel computing was really an add-on feature, bolted onto traditional database technologies only in the '90s, and the scale of parallelism the web required simply didn't work with such add-on technology.

As a result, traditional relational database technology simply could not scale. It was also not suited for the compute-intensive, deep analytics tasks that all the web companies needed to perform, primarily in order to target advertisements better, as we saw last week. What has resulted is a bunch of new technologies coming out of the web world, which not only perform at a higher scale than the traditional ones, but, because they were built using commodity hardware and many of them are now open source, also present a price-performance challenge to the traditional technologies. So it's not that big data technology should be used only if you have big data or huge computational requirements. The fact is that some of these new technologies are simply cheaper and faster than the old ones.

Now, the main innovations in big data technology are the MapReduce programming paradigm, which we'll study this week, and the distributed file systems and databases, which we'll come to next week. However, the main message is a little different. It's not just the technology. As we shall see soon, a different approach to data processing problems is required. If one tries to use new technology with an old mindset, one can still get into the same kinds of difficulties.

There are some important caveats, though. The new technology is still maturing. Many database innovations that have come up over the past 40 years remain unique to the traditional relational database stack: varieties of indexing, very complex query optimization, storage optimizations. All of these are being rediscovered and reinvented for big data. We'll come to some of these things next week. But for the moment, let's plunge into parallel computing.
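To give a concrete flavor of the MapReduce paradigm before the programming assignment, here is a minimal word-count sketch in Python. It is only an illustration of the map, shuffle, and reduce steps running in memory on one machine; the function names and the in-memory shuffle are assumptions for this sketch, not the API of any particular framework.

```python
from collections import defaultdict

# Illustrative sketch of the MapReduce paradigm (word count).
# Function names and the in-memory "shuffle" are assumptions,
# not any specific framework's API.

def mapper(document):
    # Map: emit a (word, 1) pair for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce: sum all counts emitted for the same word.
    return (word, sum(counts))

def run_mapreduce(documents):
    # Shuffle: group mapper output by key before reducing.
    groups = defaultdict(list)
    for doc in documents:
        for word, count in mapper(doc):
            groups[word].append(count)
    return dict(reducer(w, c) for w, c in groups.items())

if __name__ == "__main__":
    docs = ["the web generates big data",
            "big data needs parallel computing"]
    print(run_mapreduce(docs))
```

In a real cluster, the map and reduce calls would run in parallel on many machines and the shuffle would move data over the network, but the programmer still only writes the two small functions, which is the essence of the paradigm we'll study this week.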