1 00:00:00,000 --> 00:00:05,044 Welcome to week three. As promised, we'll now begin our coverage 2 00:00:05,044 --> 00:00:10,870 of big data technology, starting this week with MapReduce and a programming 3 00:00:10,870 --> 00:00:15,090 assignment based on this new way of doing parallel computing. 4 00:00:15,090 --> 00:00:19,059 And next week, we'll cover the big data platforms, 5 00:00:19,059 --> 00:00:24,070 such as distributed file systems, new database technologies, and most 6 00:00:24,070 --> 00:00:30,042 importantly, where all this is headed, with a glimpse of some research topics 7 00:00:30,042 --> 00:00:35,031 currently being explored. Obviously, the first question that comes 8 00:00:35,031 --> 00:00:41,153 to mind is, with 30 to 40 years of database work behind us, why did we 9 00:00:41,153 --> 00:00:46,053 suddenly need to invent a new database technology? 10 00:00:47,038 --> 00:00:52,068 All this was done by the web companies, and they clearly found some reasons why the 11 00:00:52,068 --> 00:00:55,088 traditional technology didn't work for them. 12 00:00:56,021 --> 00:01:01,353 Basically, there are four reasons, which we'll summarize right now, but 13 00:01:01,353 --> 00:01:07,419 we'll get into the details next week. First of all, the traditional technologies 14 00:01:07,419 --> 00:01:13,660 were not fault tolerant at scale. They would handle maybe dozens or hundreds 15 00:01:13,660 --> 00:01:20,490 of processing machines, but not thousands or tens of thousands, or even millions, 16 00:01:20,490 --> 00:01:25,384 which are common in web platforms like Google and Facebook. 17 00:01:25,384 --> 00:01:30,793 Next, traditional technologies were not really good at handling text, and 18 00:01:30,793 --> 00:01:38,036 certainly not at handling video and images. Third, and this is a technical point. 
19 00:01:38,036 --> 00:01:44,450 Large data volumes needed to be kept online and available all the time, which 20 00:01:44,450 --> 00:01:51,681 was simply not possible in the traditional way of doing things, where old data would 21 00:01:51,681 --> 00:01:58,024 usually be archived onto tape or some other storage which didn't eat up the 22 00:01:58,024 --> 00:02:02,493 online space. And most importantly, parallel computing 23 00:02:02,493 --> 00:02:09,394 was really an add-on feature bolted onto traditional database technologies only in 24 00:02:09,394 --> 00:02:14,023 the '90s, whereas the scale of parallelism that the 25 00:02:14,023 --> 00:02:19,478 web required simply didn't work with such add-on technology. 26 00:02:19,478 --> 00:02:25,341 As a result, traditional relational database technology simply could not 27 00:02:25,341 --> 00:02:29,076 scale. And it was also not suited for the deep 28 00:02:29,076 --> 00:02:34,072 analytics tasks, which were compute-intensive, that all the web companies 29 00:02:34,072 --> 00:02:40,008 needed to perform, primarily in order to target advertisements better, as we have 30 00:02:40,008 --> 00:02:46,012 seen in the past week. What has resulted is a bunch of new 31 00:02:46,012 --> 00:02:53,095 technologies coming out of the web world, which not only perform at a higher scale 32 00:02:53,095 --> 00:02:58,038 than traditional ones, but because they were built using 33 00:02:58,038 --> 00:03:03,672 commodity hardware and many of them are now open source, they also present a price 34 00:03:03,672 --> 00:03:08,089 performance challenge as compared to the traditional technologies. 35 00:03:08,089 --> 00:03:14,080 So, it's not that big data technology should be used only if you have big 36 00:03:14,080 --> 00:03:20,064 data or huge computational requirements. The fact is that some of this new 37 00:03:20,064 --> 00:03:25,046 technology is just cheaper and faster than the old technology. 
38 00:03:25,046 --> 00:03:31,015 Now, the main innovations in big data technology are the MapReduce programming 39 00:03:31,015 --> 00:03:37,027 paradigm, which we'll study this week, and the distributed file systems and databases, 40 00:03:37,027 --> 00:03:43,077 which we'll come to next week. However, the main message is a little 41 00:03:43,077 --> 00:03:47,089 different. It's not just the technology. 42 00:03:47,089 --> 00:03:54,791 As we shall see soon, a different approach to data processing problems is required. 43 00:03:54,791 --> 00:04:01,730 If one tries to use new technology with an old mindset, one can still get into the 44 00:04:01,730 --> 00:04:09,167 same kinds of difficulties. There are some important caveats, though. 45 00:04:09,167 --> 00:04:15,509 The new technology is still maturing. Many database innovations which have come 46 00:04:15,509 --> 00:04:22,333 up over the past 40 years remain unique to the traditional relational database stack. 47 00:04:22,333 --> 00:04:28,098 Varieties of indexing, very complex query optimizations, storage optimizations. 48 00:04:28,098 --> 00:04:33,092 All of these are being rediscovered and reinvented for big data. 49 00:04:33,092 --> 00:04:37,046 We'll come to some of these things next week. 50 00:04:37,046 --> 00:04:43,023 But for the moment, let's plunge into parallel computing.