Welcome to week three. As promised, we'll now begin our coverage of big data technology, starting this week with MapReduce and a programming assignment based on this new way of doing parallel computing. Next week, we'll cover the big data platforms, such as distributed file systems and new database technologies, and, most importantly, where all this is headed, with a glimpse of some research topics currently being explored.

Obviously, the first question that comes to mind is: with 30 to 40 years of database work behind us, why did we suddenly need to invent a new database technology? All of this was done by the web companies, and they clearly found some reasons why the traditional technology didn't work for them. There are four reasons, which we'll summarize right now and get into in detail next week.

First, the traditional technologies were not fault tolerant at scale. They would handle maybe dozens or hundreds of processing machines, but not thousands, tens of thousands, or even millions, which are common in web platforms like Google and Facebook. Next, traditional technologies were not really good at handling text, and certainly not at handling video and images. Third, and this is a technical point, large data volumes needed to be kept online and available all the time, which was simply not possible in the traditional way of doing things, where old data would usually be archived onto tape or some other storage that didn't eat up the online space. And most importantly, parallel computing was really an add-on feature, bolted onto traditional database technologies only in the '90s, and the scale of parallelism the web required simply didn't work with such add-on technology.

As a result, traditional relational database technology simply could not scale. It was also not suited for the compute-intensive, deep analytics tasks that all the web companies needed to perform, primarily in order to target advertisements better, as we saw last week. What has resulted is a bunch of new technologies coming out of the web world, which not only perform at a higher scale than the traditional ones, but, because they were built using commodity hardware and many of them are now open source, also present a price-performance challenge to the traditional technologies. So it's not that big data technology should be used only if you have big data or huge computational requirements. The fact is that some of these new technologies are simply cheaper and faster than the old ones.

Now, the main innovations in big data technology are the MapReduce programming paradigm, which we'll study this week, and the distributed file systems and databases, which we'll come to next week. However, the main message is a little different. It's not just the technology. As we shall see soon, a different approach to data processing problems is required. If one tries to use new technology with an old mindset, one can still get into the same kinds of difficulties.

There are some important caveats, though. The new technology is still maturing. Many database innovations that have come up over the past 40 years remain unique to the traditional relational database stack: varieties of indexing, very complex query optimization, storage optimizations. All of these are being rediscovered and reinvented for big data. We'll come to some of these things next week. But for the moment, let's plunge into parallel computing.
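To give a concrete flavor of the MapReduce paradigm before the programming assignment, here is a minimal word-count sketch in Python. It is only an illustration of the map, shuffle, and reduce steps running in memory on one machine; the function names and the in-memory shuffle are assumptions for this sketch, not the API of any particular framework.

```python
from collections import defaultdict

# Illustrative sketch of the MapReduce paradigm (word count).
# Function names and the in-memory "shuffle" are assumptions,
# not any specific framework's API.

def mapper(document):
    # Map: emit a (word, 1) pair for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce: sum all counts emitted for the same word.
    return (word, sum(counts))

def run_mapreduce(documents):
    # Shuffle: group mapper output by key before reducing.
    groups = defaultdict(list)
    for doc in documents:
        for word, count in mapper(doc):
            groups[word].append(count)
    return dict(reducer(w, c) for w, c in groups.items())

if __name__ == "__main__":
    docs = ["the web generates big data",
            "big data needs parallel computing"]
    print(run_mapreduce(docs))
```

In a real cluster, the map and reduce calls would run in parallel on many machines and the shuffle would move data over the network, but the programmer still only writes the two small functions, which is the essence of the paradigm we'll study this week.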