BigTable and HBase were among the first NoSQL databases. They were designed essentially for large-scale batch processing, and therefore didn't really have strong support for indexes. Recently, a database called MongoDB has become very popular, because it does, in fact, include significant support for indexes.

Instead of records, MongoDB uses the term documents. These are essentially structured documents, so they look a lot like records, but they can be more complex than records, in the sense that one document might have multiple titles or multiple categories, for example. This is somewhat like HBase records, because HBase can also have multiple values for a particular column.

Like HBase, MongoDB is a sharded database, so documents, or records, are kept in different shards on different processors, with a set of keys going to one shard. Additionally, just like HBase, there are three replicas for each shard. MongoDB, however, doesn't rely on an underlying distributed file system like HDFS; it keeps track of its own replicas, in a manner reminiscent of traditional parallel relational databases. The underlying file system can be any ordinary operating-system file system, such as a Linux file system.

Further, the indexes that MongoDB supports include full-text indexes, so it essentially has support for inverted indexing, as we saw in the first week of this course.

Lastly, it also supports a form of map-reduce within the database itself. One can effectively write map-reduce programs that process data in MongoDB using JavaScript, which is the only language it supports; a sketch of such a job appears at the end of this passage.

Another important feature of MongoDB, which is shared by some other NoSQL databases, is its consistency model for writes. Recall that in HBase, for example, a write relies on a write to the distributed file system, which essentially completes writing a record only when all three replicas have written successfully. This makes writes into GFS and HDFS, as well as HBase, rather costly, because a write succeeds only after all the replicas agree. That's quite okay for large bulk writes, but it doesn't quite add up for smaller writes, or random writes, because the extra overhead of making sure the replicas are consistent is relatively high for small operations.
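Returning to the map-reduce support mentioned above, here is a minimal sketch of such an in-database job, written against the classic MongoDB shell mapReduce API (since deprecated in recent MongoDB versions). The collection name articles and the categories field are hypothetical, chosen to echo the multi-valued documents described earlier; this is an illustration of the programming model, not code from the course.

```javascript
// Hypothetical collection "articles", where each document may carry
// multiple values in its "categories" array.
var mapFn = function () {
  // "this" is bound to the current document; emit one (key, value)
  // pair per category, just as in a standard map-reduce word count.
  this.categories.forEach(function (c) {
    emit(c, 1);
  });
};

var reduceFn = function (key, values) {
  // Fold together all the counts emitted under the same category key.
  return values.reduce(function (a, b) { return a + b; }, 0);
};

// Run the job inside the database and store the results in a
// collection named "category_counts".
db.articles.mapReduce(mapFn, reduceFn, { out: "category_counts" });
```

The map function runs once per document, and the reduce function combines all values emitted under the same key, all executed by the database itself rather than by an external Hadoop-style cluster.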
Because of that overhead, MongoDB instead uses something called eventual consistency, where writes or updates are propagated to the replicas eventually, and not synchronously with the write. So it's possible for a write to complete without all of its replicas having been updated with that write. The first database to use eventual consistency was Amazon SimpleDB, where the idea of vector clocks was used to achieve it. Other NoSQL databases, such as MongoDB, still use vector clocks. So this is quite an important concept, and we'll illustrate it through an example.

Each record is identified by a key, which could be, for example, a transaction ID or something else, and it needs to be replicated at three different replicas. In Amazon SimpleDB, for example, these replicas are chosen by hashing each key and distributing the hashed values in a round-robin fashion to different physical nodes. A different replication technique is used in MongoDB, but that's not relevant to the vector clock discussion. Nonetheless, let's assume that there is a record which is mapped to three physical nodes x, y, and z, and let's see what happens when we try to write this record, which is replicated at three locations.

Suppose we happen to write the record at x; we associate it with the timestamp (1,0,0). This is a vector timestamp consisting of three elements, which tells us that the record was written at timestamp one at physical node x, while the timestamps for the other nodes remain zero.

Next, we may write the record again, again at physical node x, in which case we simply update the vector timestamp to (2,0,0) and forget about the old (1,0,0) version written earlier. That's because, in vector terms, (2,0,0) is greater than (1,0,0): at least one of its elements is greater than the corresponding element of (1,0,0), and the others are at least equal.

On the other hand, suppose we were to first write the record at location x, and the next write, just because of parallelism, happened at location y. The vector timestamp for this write would then be (0,1,0), because the value y is overwriting still carries the timestamp (0,0,0) (y has not yet seen the write at x), and the write takes place at y, so it is assigned the timestamp (0,1,0).
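To make this dominance rule concrete, here is a minimal sketch of the comparison, just an illustration rather than code from MongoDB or SimpleDB:

```javascript
// Compare two equal-length vector timestamps.
// Returns "after" if a dominates b, "before" if b dominates a,
// "equal" if they are identical, and "concurrent" otherwise.
function compareClocks(a, b) {
  var aGreater = false, bGreater = false;
  for (var i = 0; i < a.length; i++) {
    if (a[i] > b[i]) aGreater = true;
    if (a[i] < b[i]) bGreater = true;
  }
  if (aGreater && bGreater) return "concurrent";
  if (aGreater) return "after";
  if (bGreater) return "before";
  return "equal";
}

compareClocks([2, 0, 0], [1, 0, 0]); // "after": the newer write replaces the old one
compareClocks([1, 0, 0], [0, 1, 0]); // "concurrent": both versions must be kept
```

Only when one clock dominates the other can a replica safely discard the older version; concurrent clocks force it to keep both, which is exactly the situation we turn to next.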
Next, these writes are propagated to the replicas: both writes are sent to the other two replicas, so x gets the value stamped (0,1,0) from y, y gets the value from x, and z gets the values from both x and y. Now, at each of these replicas, neither of the two values of the record is greater than the other in vector terms. So each replica has to maintain both values, until something more interesting happens in the future.

For example, suppose we now read the record at location z, obtaining both versions, stamped (1,0,0) and (0,1,0), do some computation, and write a fresh value. Because we read both versions with timestamps (1,0,0) and (0,1,0), figured out which one to use with some application logic, and then wrote a value, the value written gets the timestamp (1,1,1), because it is based on values that have seen (1,0,0) and (0,1,0) as timestamps. Now (1,1,1) is greater than both of these, so the newly written value overrides the previous two, and they can be discarded. When this value propagates to x and y, the corresponding older values there also get discarded.

As a result, what we achieve is eventual consistency: at some point in time, when you read a record, you might get two values instead of just one, because you are seeing an inconsistent state, and it is up to you, as the application, to resolve that inconsistency, usually using some knowledge of what these values are all about.

For example, suppose each write essentially adds another value to the record. In this case, the previous two values simply need to be added together, to which a third value might be added; the inconsistency resolves itself, because none of the previous writes is thrown away, they are simply summed up by the third write. (The sketch following this passage shows this merge-and-write-back step in code.)

A second example might be where the write is a blind write, one that doesn't care what was written in the record earlier. In that case, it doesn't really matter which of the previous two versions one chooses, because they are ignored in any case; the third write simply writes and resolves the conflict by assigning a higher vector timestamp.

The final case is the most complicated one, where the write depends on the previous value, such as an update. Suppose, say, you're incrementing the value of a record, perhaps by different amounts each time.
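Before treating that last case, here is a minimal sketch of the read-and-write-back step just described, under the additive interpretation above, where conflicting values are simply summed. The concrete values 5 and 3 are made up for illustration:

```javascript
// Merge the clocks of all versions seen at read time (element-wise
// maximum), then bump the slot of the replica performing the new
// write (here z, at index 2).
function mergeClocks(clocks, writerIndex) {
  var merged = clocks[0].slice();
  for (var i = 1; i < clocks.length; i++) {
    for (var j = 0; j < merged.length; j++) {
      merged[j] = Math.max(merged[j], clocks[i][j]);
    }
  }
  merged[writerIndex] += 1;
  return merged;
}

// z reads the two concurrent versions and, in the additive case,
// resolves the conflict by summing their values.
var versions = [
  { value: 5, clock: [1, 0, 0] },
  { value: 3, clock: [0, 1, 0] },
];
var resolvedValue = versions[0].value + versions[1].value; // 8
var newClock = mergeClocks([versions[0].clock, versions[1].clock], 2);
// newClock is [1, 1, 1]: it dominates both old clocks, so the old
// versions can be discarded once this write propagates.
```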
For an update like that, you have to choose which of the previous two versions you want to use, and do so consistently, so that the second time this kind of conflict occurs, the rule applied is the same. What is typically done in such situations is to use a lexicographical, or alphabetical, ordering of the replicas, so that, for example, replica x is treated as higher than replica y, which in turn is treated as higher than replica z. Then, when choosing between (1,0,0) and (0,1,0), the third replica picks the value with timestamp (1,0,0), increments it however it chooses to, and writes the result with timestamp (1,1,1). Since we use a consistent ordering of the replicas, by alphabetical order of some kind, we will always get consistent results; a small sketch of this deterministic rule appears at the end of this section.

Now, why this always works is not that obvious, and the theory of vector timestamps dates back to the seventies, to classical work by Leslie Lamport, whose original paper is really worth reading. The paper, "Time, Clocks, and the Ordering of Events in a Distributed System," is from 1978; there are some earlier versions of it, but the most accessible one is the version that appeared in the Communications of the ACM. The theory of why a lexicographical ordering always works is elegantly laid out in this paper, and interestingly, more than 30 years later, we are seeing the concept being used for eventual consistency in multiple NoSQL databases: Amazon SimpleDB to start with, then MongoDB, CouchDB, Cassandra, and many others.
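To close, here is a minimal sketch of that deterministic tie-breaking rule, again an illustration rather than any particular database's code. Concurrent versions are compared slot by slot in the fixed replica order (x before y before z), so every replica picks the same winner:

```javascript
// Deterministically pick one of several concurrent versions by
// comparing their clocks lexicographically in a fixed replica order
// (index 0 = x outranks index 1 = y outranks index 2 = z).
function pickWinner(versions) {
  return versions.reduce(function (best, v) {
    for (var i = 0; i < v.clock.length; i++) {
      if (v.clock[i] !== best.clock[i]) {
        return v.clock[i] > best.clock[i] ? v : best;
      }
    }
    return best; // identical clocks
  });
}

// Every replica applying this rule to {(1,0,0), (0,1,0)} picks the
// (1,0,0) version, so the conflict is resolved the same way everywhere.
pickWinner([
  { value: "written at x", clock: [1, 0, 0] },
  { value: "written at y", clock: [0, 1, 0] },
]);
```

Because the ordering of replicas is fixed in advance, this rule needs no coordination between replicas at conflict-resolution time, which is what makes it suitable for an eventually consistent system.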