Bigtable and HBase were among the first NoSQL databases, designed essentially for large-scale batch processing, and therefore didn't really have strong support for indexes. Recently, a database called MongoDB has become very popular, because it does, in fact, include significant support for indexes. Instead of records, MongoDB uses the term documents, which are structured documents, so they look a lot like records, but they can be more complex than records in the sense that one might have multiple titles or multiple categories, for example. This is somewhat like HBase records, because HBase can also have multiple values for a particular column. Like HBase, MongoDB is a sharded database, so that documents, or records, are kept in different shards on different processors, with a set of keys going to one shard. Additionally, just like HBase, there are three replicas for each shard. MongoDB, however, doesn't rely on an underlying distributed file system like HDFS, but keeps track of its own replicas, in a manner reminiscent of traditional parallel relational databases. The underlying file system can be any operating system file system, such as a Linux file system. Further, the indexes that MongoDB supports include full-text indexes, so it essentially has support for inverted indexing, as we saw in the first week of this course. Lastly, it also supports a form of map-reduce within the database itself, so one can effectively write map-reduce programs that process data in MongoDB using JavaScript, which is the only language it supports for this purpose. Another important feature of MongoDB, which is shared by some other NoSQL databases, is its consistency model for writes. Recall that in HBase, for example, a write relies on a write to the distributed file system, which essentially completes writing a record only when all three replicas write successfully.
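The ideas of multi-valued document fields and full-text (inverted) indexing can be sketched in plain Python. This is a conceptual illustration, not MongoDB's actual API; the document structure and field names are hypothetical:

```python
# A minimal sketch of document-style records with a multi-valued
# "categories" field, plus a hand-built inverted index over titles
# (the essence of what a full-text index maintains for you).
from collections import defaultdict

documents = {
    1: {"title": "NoSQL Databases", "categories": ["databases", "nosql"]},
    2: {"title": "Indexing Text", "categories": ["databases", "search"]},
}

# Inverted index: each word in a title maps to the set of
# document ids whose title contains it.
inverted = defaultdict(set)
for doc_id, doc in documents.items():
    for word in doc["title"].lower().split():
        inverted[word].add(doc_id)

print(sorted(inverted["databases"]))  # [1]
```

A real MongoDB text index is built and queried by the server itself, but the word-to-document mapping it maintains is conceptually the same as this dictionary.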
This essentially makes writes into GFS and HDFS, and therefore into HBase, rather costly, because a write succeeds only after all the replicas agree. That's quite okay for large bulk writes, but doesn't quite add up for smaller writes, or random writes, because the extra overhead of making sure the replicas are consistent is relatively high for small operations. So instead, MongoDB uses something called eventual consistency, where writes or updates are propagated to replicas eventually, not synchronously with the write. It's therefore possible for a write to complete without all its replicas having been updated with that write. The first database to use eventual consistency was Amazon SimpleDB, where the idea of vector clocks was used to achieve eventual consistency. Other NoSQL databases, such as MongoDB, still use vector clocks. This is quite an important concept, and we'll illustrate it through an example. Each record is represented by a key, which could be, for example, a transaction ID, and it needs to be replicated at three different replicas. In Amazon SimpleDB, for example, these replicas are chosen by hashing each key and distributing the hashed values in a round-robin fashion to different physical nodes. A different replication technique is used in MongoDB, but that's not relevant to the vector clock discussion. Nonetheless, let's assume that there is a record which is mapped to three physical nodes x, y, and z, and let's see what happens when we try to write this record, which is replicated at those three locations. Suppose we happen to write the record at x; we associate it with the vector timestamp (1,0,0). This is a vector timestamp consisting of three elements, which tells us that this was written at timestamp one at physical node x, while the timestamps for the other nodes remain at zero.
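The vector timestamps just described can be sketched as fixed-length tuples, one counter per replica. The helper names here are illustrative, not any database's actual API:

```python
# A minimal vector-clock sketch for three replicas x, y, z,
# represented as 3-tuples of per-replica write counters.

def dominates(a, b):
    """True if timestamp a is >= b in every component and
    strictly greater in at least one (a supersedes b)."""
    return all(ai >= bi for ai, bi in zip(a, b)) and a != b

def concurrent(a, b):
    """True if neither timestamp dominates the other: a conflict."""
    return not dominates(a, b) and not dominates(b, a)

# First write at replica x: counters start at (0,0,0); x's slot ticks to 1.
t1 = (1, 0, 0)
print(dominates(t1, (0, 0, 0)))  # True: the write supersedes the empty state
```

Note that `dominates` is a partial order: two timestamps can be incomparable, which is exactly how concurrent writes at different replicas are detected.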
Next, we may want to write it again, again at physical node x, in which case we simply update the vector timestamp to (2,0,0) and forget about the old (1,0,0) version written earlier. That's because, in vector terms, (2,0,0) is greater than (1,0,0): at least one of its elements is strictly greater, and the others are at least equal. On the other hand, suppose we were to first write the record at location x, and the next write, just because of parallelism, happened at location y. The vector timestamp for this second write would then be (0,1,0), because the value being overwritten at y had timestamp (0,0,0), and the write is at y, so it is assigned the timestamp (0,1,0). Next, both writes are propagated to the other replicas: x gets the value with timestamp (0,1,0) from y, y gets the value with (1,0,0) from x, and z gets the values from both x and y. Now, neither of the two values of the record is greater than the other in vector terms, so each replica has to maintain both values until something more interesting happens. For example, suppose we now read the record at z, getting both versions, (1,0,0) and (0,1,0), do some computation, and write a fresh value. Because we read both versions, figured out which one to use using some application logic, and then wrote a new value, the value written gets the timestamp (1,1,1), because it is based on values that have seen (1,0,0) and (0,1,0) as timestamps. Now, (1,1,1) is greater than both of these, so the value written overrides the previous two, which can be discarded. And when this value propagates to x and y, the corresponding older values there also get discarded.
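The sequence of writes above can be traced in a short sketch. The `dominates` and `merge` helpers are illustrative names, assuming the three tuple components count writes at x, y, and z respectively:

```python
# Tracing the example: concurrent writes at x and y, then a
# read-and-resolve at z that supersedes both.

def dominates(a, b):
    return all(ai >= bi for ai, bi in zip(a, b)) and a != b

def merge(a, b):
    """Component-wise maximum: the smallest timestamp that has seen both."""
    return tuple(max(ai, bi) for ai, bi in zip(a, b))

write_at_x = (1, 0, 0)
write_at_y = (0, 1, 0)

# Neither write dominates the other, so every replica keeps both versions.
conflict = (not dominates(write_at_x, write_at_y)
            and not dominates(write_at_y, write_at_x))

# z reads both versions, resolves them, and writes:
# merge the timestamps it has seen, then tick z's own slot.
seen = merge(write_at_x, write_at_y)        # (1, 1, 0)
resolved = (seen[0], seen[1], seen[2] + 1)  # (1, 1, 1)

# The new timestamp dominates both old ones, so they can be discarded.
print(conflict, resolved)  # True (1, 1, 1)
```

The key step is that a write based on several concurrent versions merges their timestamps first, which is what guarantees the new version dominates everything it has read.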
As a result, what we achieve is eventual consistency: at some point in time, when you read a record, you might get two values instead of just one, because you are seeing an inconsistent state, and it is up to you, as the application, to resolve that inconsistency, usually using some knowledge of what these values are all about. For example, suppose each write is essentially adding another value to the record. In this case, the previous two values simply need to be added together, along with the third value being written, and the inconsistency resolves itself, because none of the previous writes is actually thrown away; they are simply summed up by the third write. A second example is where the write is a blind write, which doesn't care what was written in the record earlier. In that case, it doesn't really matter which of the previous two values one chooses, because they are ignored in any case, and the third write simply writes its value and resolves the conflict by assigning a higher vector timestamp. The final case is the most complicated one, where the write depends on the previous value, such as an update. Suppose you're incrementing the value of a record, perhaps by different amounts each time. Then you have to choose which of the previous two versions to use, and do this consistently, so that the next time this kind of conflict occurs, the same rule is applied. What is typically done in such situations is to use a lexicographical, or alphabetical, ordering of the replicas, so that, for example, replica x is treated as higher than replica y, which is treated as higher than replica z. Then, while choosing between (1,0,0) and (0,1,0), the third replica chooses the value with timestamp (1,0,0), increments it with whatever it chooses to do, and then writes with timestamp (1,1,1). Since we use a consistent ordering of the replicas, by alphabetical order of some kind, we will always get consistent results.
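The deterministic tie-breaking rule just described, preferring the version from the alphabetically highest replica, can be sketched as follows; the function name is illustrative:

```python
# A sketch of deterministic conflict resolution: among concurrent
# versions, every replica picks the one written at the alphabetically
# highest-priority replica (x before y before z), so all replicas
# independently make the same choice.

def winning_version(versions):
    """versions: list of (timestamp, value) pairs for concurrent writes.
    Comparing timestamps lexicographically prefers a larger count in
    x's slot first, then y's, then z's."""
    return max(versions, key=lambda tv: tv[0])

versions = [((1, 0, 0), "written at x"), ((0, 1, 0), "written at y")]
chosen = winning_version(versions)
print(chosen)  # ((1, 0, 0), 'written at x')
```

Because `max` over tuples is deterministic, any replica applying this rule to the same set of concurrent versions picks the same winner, which is all the rule needs to guarantee.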
Now, none of this is obvious, and the theory of logical timestamps dates back to the 1970s, to classical work by Leslie Lamport. The original paper is really worth reading. It is from 1978, and while there are some earlier versions of it, the most accessible one is the version that appeared in the Communications of the ACM. The theory of why a lexicographical ordering always works is elegantly laid out in this paper, and interestingly, more than 30 years later, we are seeing this concept being used for eventual consistency in multiple NoSQL databases: Amazon SimpleDB to start with, MongoDB, CouchDB, Cassandra, and many others.