Of course, as we have seen, parallel computing is not new; it has been around essentially since the early 80s, and databases have traditionally evolved to exploit parallel computing over the years. In the beginning we had shared-memory databases, which still persist to date, where you have a large multiprocessor system with multiple CPUs sharing memory, a single operating system scheduling jobs or processes across the different CPUs, and a common disk or storage area network where all the data is stored. Examples of such systems abound: almost all servers today are multiprocessing shared-memory systems. But with at most a few dozen processors, and even with each of them having multiple cores, we find shared-memory systems supporting at most a few hundred processing units. The shared-memory model simply doesn't scale beyond this level.

Databases have therefore exploited other parallel architectures, such as the shared-disk architecture and the shared-nothing architecture, which scale to a greater number of processors than shared memory. In a shared-disk architecture you may have multiple processors which communicate over a network while accessing a common disk system, which could be a storage area network or network-attached storage, using two different networks: one for communication between processors, and the other for accessing storage. The shared-nothing architecture, on the other hand, relies on local disks attached to each processor, so that the only communication that takes place over the network is between processors.

In a parallel database, in the shared-nothing architecture as well as the others, SQL queries are executed in parallel by multiple processors. In the shared-nothing architecture, in addition, the data itself is distributed across the different disks using a variety of partitioning schemes, such as different sets of rows on different disks, or, in the case of column-oriented specialized engines for analytical processing, different sets of columns on different disks.

All this has happened in the database community, and parallel databases are now almost a given. They all support SQL. Some of them support transaction processing, where of course there is the additional overhead of managing transaction isolation and consistency across multiple processors, but we won't get into that right now.
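To make the row-partitioning idea just described concrete, here is a minimal Python sketch, with purely illustrative names and data, of how a shared-nothing engine might hash-partition the rows of a table across a few local disks and then run an aggregate query in parallel, one partition per processor.

```python
# Minimal sketch of horizontal (row) partitioning in a shared-nothing setup.
# Each "disk" is just a Python list standing in for a processor's local storage.
from collections import defaultdict
import hashlib

NUM_DISKS = 4

def disk_for(key: str) -> int:
    # Stable hash so the same key always lands on the same disk/node.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_DISKS

disks = defaultdict(list)  # disk id -> rows stored on that disk
rows = [
    {"order_id": "o1", "customer": "alice", "amount": 120},
    {"order_id": "o2", "customer": "bob",   "amount": 75},
    {"order_id": "o3", "customer": "carol", "amount": 310},
]
for row in rows:
    disks[disk_for(row["order_id"])].append(row)

# A query such as SELECT SUM(amount) can now run in parallel: each processor
# scans only its local partition, and the partial sums are then combined.
partial_sums = [sum(r["amount"] for r in part) for part in disks.values()]
print(sum(partial_sums))  # 505
```

A column-oriented analytical engine would do the analogous thing with different sets of columns, rather than sets of rows, placed on different disks.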
The thing that is not handled properly by parallel databases is fault tolerance. They didn't have to handle fault tolerance because, with just a few dozen processors, you don't need to worry about a processor failing while executing a SQL query, which in any case will take a few seconds or a few minutes at most. But when you are executing a large batch job which touches virtually all the data, the chances of a processor failing are very high, especially with a large number of processors, and in such situations the parallel database architectures are simply not fault tolerant. Fault tolerance in the parallel database world relies on having a hot standby, an architecture or deployment which is identical to the primary, and essentially replicating data over a high-speed network between the primary and the hot standby: a very costly and still not completely fault-tolerant approach compared to, say, the highly distributed, fault-tolerant MapReduce system running on a distributed GFS or HDFS. So here is where the dichotomy comes in: when you're doing large-volume analytical processing which touches all the data, the parallel database architecture simply doesn't work.

So, because of the big data technology that has emerged from the web, databases have been evolving over the past few years in two directions. On the one hand we have the NoSQL databases, which are based on big data technology. They're called NoSQL because, firstly, they don't support full ACID transactions, or rather fully isolated serializable transactions in the traditional sense. Secondly, instead of having complex indexing, they have sharded indexing, which essentially partitions the data between different chunks or different blocks of disk; we'll come to sharded indexing shortly. Third, they don't support full joins: for a variety of reasons, they are restricted in the kinds of joins that they can perform efficiently. And lastly, they support column-oriented storage if needed, so that for very long, wide records, different parts of a record can be stored on different servers.
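Since sharded indexing comes up again shortly, here is a minimal Python sketch of the idea, with illustrative names only: rather than maintaining one large index, the key space is hash-partitioned into shards, and a lookup is routed to the single shard that can hold the key.

```python
# Minimal sketch of sharded indexing: each shard is a small independent index
# (a dict standing in for a chunk or block of disk on some server).
import hashlib
from typing import Any, Dict, List, Optional

class ShardedIndex:
    def __init__(self, num_shards: int = 8) -> None:
        self.shards: List[Dict[str, Any]] = [{} for _ in range(num_shards)]

    def _shard(self, key: str) -> Dict[str, Any]:
        # Route the key to exactly one shard via a stable hash.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

    def put(self, key: str, value: Any) -> None:
        self._shard(key)[key] = value

    def get(self, key: str) -> Optional[Any]:
        # Only one shard is consulted; the others are never touched.
        return self._shard(key).get(key)

index = ShardedIndex()
index.put("user:42", {"name": "alice"})
print(index.get("user:42"))  # {'name': 'alice'}
```

In a real NoSQL store each shard would live on a different server or block of disk; the in-process dictionaries here merely stand in for that.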
The other side of database evolution is in-memory databases, which has been driven by the increasing volumes of main memory available on today's servers and the falling cost of memory. Today's servers have, you know, 64 gigabytes or even 128 gigabytes of main memory, which allows, for most practical purposes, many ordinary enterprise transaction systems to actually reside in main memory. So real-time transactions become possible at much higher rates than before. Varieties of indexes can be supported just as before in traditional databases, but now in memory, so different kinds of indexing structures are possible, and of course complex joins are possible. Of course, the data still has to be small compared to web-scale petabytes or many hundreds of terabytes of data. But if you are talking about gigabytes of data, which a traditional enterprise transaction system rarely exceeds, an in-memory database is a happy compromise where OLAP queries, as well as all kinds of bulk data processing, can be performed fairly efficiently. However, this is not the big data world of the web; there this approach doesn't quite work, and therefore NoSQL databases have become quite popular. We will study some of the NoSQL databases and their concepts very shortly. As far as in-memory databases are concerned, many of the techniques that worked in traditional databases simply carry over, except that, because everything is in memory, one doesn't have to deal with the additional complexity of some parts of the index being cached versus being on disk, and things like that.
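As a small illustration of why complex joins become straightforward once the data fits in main memory, here is a sketch of an in-memory hash join in Python, using made-up tables; it is not the API of any particular in-memory database.

```python
# Classic in-memory hash join between two small, entirely memory-resident tables.
customers = [
    {"cust_id": 1, "name": "alice"},
    {"cust_id": 2, "name": "bob"},
]
orders = [
    {"order_id": "o1", "cust_id": 1, "amount": 120},
    {"order_id": "o2", "cust_id": 2, "amount": 75},
    {"order_id": "o3", "cust_id": 1, "amount": 310},
]

# Build phase: an in-memory hash index on the join key of the smaller table.
by_cust = {c["cust_id"]: c for c in customers}

# Probe phase: stream the larger table and look up each row in the index.
joined = [
    {**o, "name": by_cust[o["cust_id"]]["name"]}
    for o in orders
    if o["cust_id"] in by_cust
]
print(joined)
```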