BigTable and HBase were among the first NoSQL databases. They were designed essentially for large-scale batch processing, and therefore didn't really have strong support for indexes. Recently, a database called MongoDB has become very popular, because it does, in fact, include significant support for indexes.

Instead of records, MongoDB uses the term documents. These are essentially structured documents, so they look a lot like records, but they can be more complex than records, in the sense that one document might have multiple titles or multiple categories, for example. This is somewhat like HBase records, because HBase can also have multiple values for a particular column.

Like HBase, MongoDB is a sharded database, so documents, or records, are kept in different shards on different processors, with a set of keys going to one shard. Additionally, just like HBase, there are three replicas for each shard. MongoDB, however, doesn't rely on an underlying distributed file system like HDFS; it keeps track of its own replicas, in a manner reminiscent of traditional parallel relational databases. The underlying file system can be any ordinary operating-system file system, such as a Linux file system.

Further, the indexes that MongoDB supports include full-text indexes, so it essentially has support for inverted indexing, as we saw in the first week of this course.

Lastly, it also supports a form of map-reduce within the database itself. One can effectively write map-reduce programs that process data in MongoDB using JavaScript, which is the only language it supports; a sketch of such a job appears at the end of this passage.

Another important feature of MongoDB, which is shared by some other NoSQL databases, is its consistency model for writes. Recall that in HBase, for example, a write relies on a write to the distributed file system, which essentially completes writing a record only when all three replicas have written successfully. This makes writes into GFS and HDFS, as well as HBase, rather costly, because a write succeeds only after all the replicas agree. That's quite okay for large bulk writes, but it doesn't quite add up for smaller writes, or random writes, because the extra overhead of making sure the replicas are consistent is relatively high for small operations.
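Returning to the map-reduce support mentioned above, here is a minimal sketch of such an in-database job, written against the classic MongoDB shell mapReduce API (since deprecated in recent MongoDB versions). The collection name articles and the categories field are hypothetical, chosen to echo the multi-valued documents described earlier; this is an illustration of the programming model, not code from the course.

```javascript
// Hypothetical collection "articles", where each document may carry
// multiple values in its "categories" array.
var mapFn = function () {
  // "this" is bound to the current document; emit one (key, value)
  // pair per category, just as in a standard map-reduce word count.
  this.categories.forEach(function (c) {
    emit(c, 1);
  });
};

var reduceFn = function (key, values) {
  // Fold together all the counts emitted under the same category key.
  return values.reduce(function (a, b) { return a + b; }, 0);
};

// Run the job inside the database and store the results in a
// collection named "category_counts".
db.articles.mapReduce(mapFn, reduceFn, { out: "category_counts" });
```

The map function runs once per document, and the reduce function combines all values emitted under the same key, all executed by the database itself rather than by an external Hadoop-style cluster.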
Because of that overhead, MongoDB instead uses something called eventual consistency, where writes or updates are propagated to the replicas eventually, and not synchronously with the write. So it's possible for a write to complete without all of its replicas having been updated with that write. The first database to use eventual consistency was Amazon SimpleDB, where the idea of vector clocks was used to achieve it. Other NoSQL databases, such as MongoDB, still use vector clocks. So this is quite an important concept, and we'll illustrate it through an example.

Each record is identified by a key, which could be, for example, a transaction ID or something else, and it needs to be replicated at three different replicas. In Amazon SimpleDB, for example, these replicas are chosen by hashing each key and distributing the hashed values in a round-robin fashion to different physical nodes. A different replication technique is used in MongoDB, but that's not relevant to the vector clock discussion. Nonetheless, let's assume that there is a record which is mapped to three physical nodes x, y, and z, and let's see what happens when we try to write this record, which is replicated at three locations.

Suppose we happen to write the record at x; we associate it with the timestamp (1,0,0). This is a vector timestamp consisting of three elements, which tells us that the record was written at timestamp one at physical node x, while the timestamps for the other nodes remain zero.

Next, we may write the record again, again at physical node x, in which case we simply update the vector timestamp to (2,0,0) and forget about the old (1,0,0) version written earlier. That's because, in vector terms, (2,0,0) is greater than (1,0,0): at least one of its elements is greater than the corresponding element of (1,0,0), and the others are at least equal.

On the other hand, suppose we were to first write the record at location x, and the next write, just because of parallelism, happened at location y. The vector timestamp for this write would then be (0,1,0), because the value y is overwriting still carries the timestamp (0,0,0) (y has not yet seen the write at x), and the write takes place at y, so it is assigned the timestamp (0,1,0).
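To make this dominance rule concrete, here is a minimal sketch of the comparison, just an illustration rather than code from MongoDB or SimpleDB:

```javascript
// Compare two equal-length vector timestamps.
// Returns "after" if a dominates b, "before" if b dominates a,
// "equal" if they are identical, and "concurrent" otherwise.
function compareClocks(a, b) {
  var aGreater = false, bGreater = false;
  for (var i = 0; i < a.length; i++) {
    if (a[i] > b[i]) aGreater = true;
    if (a[i] < b[i]) bGreater = true;
  }
  if (aGreater && bGreater) return "concurrent";
  if (aGreater) return "after";
  if (bGreater) return "before";
  return "equal";
}

compareClocks([2, 0, 0], [1, 0, 0]); // "after": the newer write replaces the old one
compareClocks([1, 0, 0], [0, 1, 0]); // "concurrent": both versions must be kept
```

Only when one clock dominates the other can a replica safely discard the older version; concurrent clocks force it to keep both, which is exactly the situation we turn to next.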
Next, these writes are propagated to the replicas: both writes are sent to the other two replicas, so x gets the value stamped (0,1,0) from y, y gets the value from x, and z gets the values from both x and y. Now, at each of these replicas, neither of the two values of the record is greater than the other in vector terms. So each replica has to maintain both values, until something more interesting happens in the future.

For example, suppose we now read the record at location z, obtaining both versions, stamped (1,0,0) and (0,1,0), do some computation, and write a fresh value. Because we read both versions with timestamps (1,0,0) and (0,1,0), figured out which one to use with some application logic, and then wrote a value, the value written gets the timestamp (1,1,1), because it is based on values that have seen (1,0,0) and (0,1,0) as timestamps. Now (1,1,1) is greater than both of these, so the newly written value overrides the previous two, and they can be discarded. When this value propagates to x and y, the corresponding older values there also get discarded.

As a result, what we achieve is eventual consistency: at some point in time, when you read a record, you might get two values instead of just one, because you are seeing an inconsistent state, and it is up to you, as the application, to resolve that inconsistency, usually using some knowledge of what these values are all about.

For example, suppose each write essentially adds another value to the record. In this case, the previous two values simply need to be added together, to which a third value might be added; the inconsistency resolves itself, because none of the previous writes is thrown away, they are simply summed up by the third write. (The sketch following this passage shows this merge-and-write-back step in code.)

A second example might be where the write is a blind write, one that doesn't care what was written in the record earlier. In that case, it doesn't really matter which of the previous two versions one chooses, because they are ignored in any case; the third write simply writes and resolves the conflict by assigning a higher vector timestamp.

The final case is the most complicated one, where the write depends on the previous value, such as an update. Suppose, say, you're incrementing the value of a record, perhaps by different amounts each time.
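Before treating that last case, here is a minimal sketch of the read-and-write-back step just described, under the additive interpretation above, where conflicting values are simply summed. The concrete values 5 and 3 are made up for illustration:

```javascript
// Merge the clocks of all versions seen at read time (element-wise
// maximum), then bump the slot of the replica performing the new
// write (here z, at index 2).
function mergeClocks(clocks, writerIndex) {
  var merged = clocks[0].slice();
  for (var i = 1; i < clocks.length; i++) {
    for (var j = 0; j < merged.length; j++) {
      merged[j] = Math.max(merged[j], clocks[i][j]);
    }
  }
  merged[writerIndex] += 1;
  return merged;
}

// z reads the two concurrent versions and, in the additive case,
// resolves the conflict by summing their values.
var versions = [
  { value: 5, clock: [1, 0, 0] },
  { value: 3, clock: [0, 1, 0] },
];
var resolvedValue = versions[0].value + versions[1].value; // 8
var newClock = mergeClocks([versions[0].clock, versions[1].clock], 2);
// newClock is [1, 1, 1]: it dominates both old clocks, so the old
// versions can be discarded once this write propagates.
```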
For an update like that, you have to choose which of the previous two versions you want to use, and do so consistently, so that the second time this kind of conflict occurs, the rule applied is the same. What is typically done in such situations is to use a lexicographical, or alphabetical, ordering of the replicas, so that, for example, replica x is treated as higher than replica y, which in turn is treated as higher than replica z. Then, when choosing between (1,0,0) and (0,1,0), the third replica picks the value with timestamp (1,0,0), increments it however it chooses to, and writes the result with timestamp (1,1,1). Since we use a consistent ordering of the replicas, by alphabetical order of some kind, we will always get consistent results; a small sketch of this deterministic rule appears at the end of this section.

Now, why this always works is not that obvious, and the theory of vector timestamps dates back to the seventies, to classical work by Leslie Lamport, whose original paper is really worth reading. The paper, "Time, Clocks, and the Ordering of Events in a Distributed System," is from 1978; there are some earlier versions of it, but the most accessible one is the version that appeared in the Communications of the ACM. The theory of why a lexicographical ordering always works is elegantly laid out in this paper, and interestingly, more than 30 years later, we are seeing the concept being used for eventual consistency in multiple NoSQL databases: Amazon SimpleDB to start with, then MongoDB, CouchDB, Cassandra, and many others.
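To close, here is a minimal sketch of that deterministic tie-breaking rule, again an illustration rather than any particular database's code. Concurrent versions are compared slot by slot in the fixed replica order (x before y before z), so every replica picks the same winner:

```javascript
// Deterministically pick one of several concurrent versions by
// comparing their clocks lexicographically in a fixed replica order
// (index 0 = x outranks index 1 = y outranks index 2 = z).
function pickWinner(versions) {
  return versions.reduce(function (best, v) {
    for (var i = 0; i < v.clock.length; i++) {
      if (v.clock[i] !== best.clock[i]) {
        return v.clock[i] > best.clock[i] ? v : best;
      }
    }
    return best; // identical clocks
  });
}

// Every replica applying this rule to {(1,0,0), (0,1,0)} picks the
// (1,0,0) version, so the conflict is resolved the same way everywhere.
pickWinner([
  { value: "written at x", clock: [1, 0, 0] },
  { value: "written at y", clock: [0, 1, 0] },
]);
```

Because the ordering of replicas is fixed in advance, this rule needs no coordination between replicas at conflict-resolution time, which is what makes it suitable for an eventually consistent system.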