In this final video about transactions, we'll focus on the concept of isolation levels. As a reminder, transactions are a solution for both the concurrency control and system failure problems in databases. A transaction is a sequence of one or more operations that's treated as a unit. Transactions appear to run in isolation, and if the system fails, each transaction's changes are reflected either entirely or not at all. In this video we are going to focus on the isolation portion of transactions. As a reminder, we can have multiple clients operating on the same database, and each client will submit a sequence of transactions. So we have this client with T1, T2. Here we have T9, T10, T11, and so forth. And each transaction itself is a sequence of statements. Serializability says that it's okay for the system to interleave the execution of these statements with the statements that are being performed by other clients; however, the behavior against the database must be equivalent to the transactions themselves executing in some serial order. For example, in this case, the system may enforce behavior that's equivalent to, say, doing transaction T9 first, then maybe T1, T2, and then T10, and so on. Serializability gives us understandable behavior and consistency, but it does have some overhead involved in the locking protocols that are used, and it does reduce concurrency. As a result of the overhead and reduced concurrency, systems do offer weaker isolation levels. In the SQL standard, there are three weaker levels: read uncommitted, read committed, and repeatable read. These isolation levels have lower overhead and allow higher concurrency, but of course at a cost, which is lower consistency guarantees. I've listed the three alternative isolation levels from weakest to strongest, and to complete the picture, at the bottom we have a fourth one, serializable, which is what we've been talking about already. Before we proceed to learn about the different isolation levels, let me mention a couple of things. First of all, the isolation level is set per transaction. So each client could set different isolation levels for each of its transactions, if it wishes. Second of all, isolation levels are in the eye of the beholder, and let me show you what I mean by that. Each client submits transactions to the database and might set the isolation level for each transaction. That isolation level only affects that transaction itself; it does not affect the behavior of any other transactions that are running concurrently. So, for example, our client on the left might set its transaction to repeatable read, while our client on the right sets its transaction to read uncommitted, and those properties will be guaranteed for each of those transactions and won't affect the other. By the way, the isolation levels really are specific to reads. They specify what values might be seen in the transaction, as we'll see when we get into the details. So let's start by defining a concept called dirty reads. A data item in the database is dirty if it's been written by a transaction that has not yet committed. So for example, here are two transactions, I'll call them T1 and T2, and by the way, throughout the rest of this video I'm going to put transactions in boxes, and you can assume implicitly that there is a commit at the end of each box. I'm not going to write it each time.
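In SQL, these two boxes might look roughly like the following sketch; the College table with cName and enrollment attributes follows the course's usual sample schema, and I've written the commits explicitly here even though the boxes leave them implicit.

```sql
-- T1: add 1000 to Stanford's enrollment (the new value is dirty until the commit)
UPDATE College SET enrollment = enrollment + 1000 WHERE cName = 'Stanford';
COMMIT;

-- T2: read the average enrollment across the College table
SELECT AVG(enrollment) FROM College;
COMMIT;
```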
So our first transaction is updating Stanford's enrollment, adding 1000 to it, and our second transaction is reading the average enrollment in the College table. We're using our usual database of students applying to colleges. So after this update, Stanford's enrollment has 1,000 added to it, but before the transaction commits, at that point in time, the value is what's known as dirty. If our second transaction here reads this value, then it might be reading a value that never actually exists in the database. And why is that? Because before this transaction commits, there could be a system failure, and the transaction could be rolled back as we described before, with all of its changes undone. Meanwhile, however, the second transaction may have read that value before it was undone. Here's another example. Now we have three transactions, T1, T2, T3. Our first transaction is modifying the GPA of students whose high school size is sufficiently large. Our second transaction is finding the GPA of student number 123. And our third transaction is modifying the high school size of student 234. So if this GPA here in transaction T2 is read before the commit of transaction T1, then that would be a dirty value for the GPA. Again, because if this first transaction doesn't commit, then that value will be rolled back. There's a second case where we might have a dirty data read in this trio of transactions, and that's the high school size here. Because notice that here we're modifying a high school size, so if this high school size is read before the commit point of the third transaction, that would also be a dirty data item. One clarification about dirty reads is that there is no such thing as a dirty read within the same transaction. In T3, for example, after we've modified the high school size, we might read the high school size later in the same transaction, and that's not considered a dirty read. So a read is only dirty when it reads an uncommitted value that was modified by a different transaction. So here's our first isolation level, and it's our weakest one. It's called read uncommitted, and what it says is that a transaction that has this isolation level may perform dirty reads. It may read values that have been modified by a different transaction and not yet committed. So let's take a look at an example. It's our same example. We've dropped the third transaction, so our first transaction is modifying GPAs in the student table and our second transaction is reading the average of those GPAs. If these transactions are serializable, then the behavior is guaranteed to be equivalent to either T1 followed by T2 or T2 followed by T1. So either the second transaction will see all the GPAs before they were updated, or it will see all the GPAs after they were updated. As a reminder, we don't know which order these will occur in, only that the behavior will be equivalent to one of those orders. Now let's suppose we add to our second transaction a specification that it has isolation level read uncommitted. And by the way, this very long sentence is how we specify the isolation level in the SQL standard. When we don't specify an isolation level, as we haven't for the first transaction, the default is serializable. Although in most of our examples, it won't actually matter what the first transaction's isolation level is, as we'll see; we're going to be focusing on the data that's read in the second transaction and typically written in the first transaction. Okay, so let's see what's going on here.
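Concretely, the pair being discussed can be sketched roughly as follows; the Student table with GPA and sizeHS attributes is the course's sample schema, the particular UPDATE (and its sizeHS cutoff) is just an illustrative stand-in for the statement on the slide, and the exact placement of the SET TRANSACTION statement varies a little from system to system.

```sql
-- T1: raise GPAs for students from large high schools (default isolation: serializable)
UPDATE Student SET GPA = 1.1 * GPA WHERE sizeHS > 2500;
COMMIT;

-- T2: read the average GPA, allowing dirty (uncommitted) values to be seen
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT AVG(GPA) FROM Student;
COMMIT;
```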
Again, this is T1 and T2, and our first transaction is updating the GPAs. And now we've said in our second transaction that it's okay for this average to read dirty values, in other words, to see uncommitted GPA modifications. In that case, the average could be computed right in the middle of the set of modifications being performed by T1. In that case, we certainly don't have serializable behavior. We don't have T1 followed by T2, since T2 is reading some values that are in the middle of T1, and similarly we don't have T2 followed by T1. It might be that for our particular application, we just don't care that much about having exact consistency. It may be that we don't mind if our average is computed with some old values and some new values; we might not even mind if we compute in our average an increased GPA that ends up being undone when a transaction rolls back. So if we're just looking for a rough approximate GPA, we can use this isolation level, and we'll have increased concurrency, decreased overhead, and better performance overall, with the understanding that we will have reduced consistency guarantees. Let's go one step up to the next isolation level, which is called read committed. As you can probably guess, this one specifies that transactions may not perform dirty reads. They may only read data values whose updates by other transactions have been committed to the database. Now this isolation level is stronger, but it still doesn't guarantee global serializability. Let's take a look at an example. Our first transaction, T1, is the same one, modifying the GPA for students from large high schools. Our second transaction is the one where we are going to set the isolation level, in this case to read committed, and it is going to perform two statements. One of them is going to read the average GPA from the student table, and the other is going to read the maximum GPA from the student table. So let's look at one behavior that's consistent with this isolation level but, as we will see, is not serializable. Let's suppose that this average GPA is read before transaction T1, but the max GPA is computed after transaction T1. So the average will not take into account the increases, but the max will take them into account. So let's see if this is equivalent to any serial order. Is it equivalent to T1 followed by T2? It's certainly not, because T2's first statement is reading the state of the table before T1 and not the state of the table afterward, although T2's second statement is reading the state of the table afterward. Similarly, it's not equivalent to T2 followed by T1, because T2 is reading in its second statement the state of the database after T1. So there's no equivalent serial order. But again, perhaps that's not needed for the particular application, and by using read committed we do get somewhat more performance than we would have if we were serializable. Our next isolation level is called repeatable read, and it's our strongest one before we get to serializable. In repeatable read, a transaction may not perform dirty reads, just like in read committed. And furthermore, there is an additional constraint that if an item is read multiple times, it can't change value. You might remember that in our previous example we read the GPAs multiple times, and they did change value. So if we were using repeatable read for the isolation level there, then the behavior that I described couldn't occur.
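As a quick reference, the read committed transaction we just analyzed might be written roughly like the sketch below (same assumed Student schema as before); under repeatable read, the two readings of the GPAs would not be allowed to see different values.

```sql
-- T2: no dirty reads, but each statement may see a different committed state
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT AVG(GPA) FROM Student;  -- may run before T1 commits its GPA increases
SELECT MAX(GPA) FROM Student;  -- may run after T1 commits and see the increases
COMMIT;
```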
So, even with this stronger condition, we still don't have a guarantee of global serializability, and we'll again see that through an example. Our examples are getting a little more complicated here. We have our two transactions T1 and T2. Our first transaction is still modifying the GPA (I took away the condition about the high school size, just to keep things simple), and the second statement in our first transaction is modifying the high school size of the student with ID 123. So we first modify GPAs and then a high school size. In our second transaction, and that's the one we're setting to repeatable read, we are going to read the average GPA, as we usually do, and this time we are also going to read the average of the high school sizes. Incidentally, our first transaction is serializable, as they always are by default. Let's look at a behavior where the first statement, reading the average GPA, is executed before transaction T1, or sees the values before T1, while our second statement, reading the high school sizes, sees the values after transaction T1. So let's check our conditions. We are not performing dirty reads, because the first read here is of the committed values before T1 and the second read is of the committed values after T1. And furthermore, any items that are read multiple times have not had their value changed, because we are actually not reading any values multiple times. So executing the first statement here before T1 and the second one after is legal in the repeatable read isolation level. Yet we're still not serializable. We're not equivalent to T1 before T2, because the first statement of T2 is going before T1, or seeing the state before T1, and we're not equivalent to T2 followed by T1, because the second statement of T2 is seeing the state after T1. Now there is another situation with repeatable read that's quite important to understand. We said that a transaction can't perform dirty reads, and it can't. We also said that an item that's read multiple times can't change value. But the fact is that repeatable read does allow a relation to change if it's read multiple times, through what are known as phantom tuples. Let me explain through an example. Let's suppose our first transaction inserts a hundred new students into the database, and that's run concurrently with our second transaction, which is run at the repeatable read isolation level, and now we're just going to read the average GPA and we're going to follow that with the max GPA, similar to one of our earlier examples. Now, repeatable read actually does allow behavior where this average is computed before T1 and this max is computed after T1. The justification behind that is pretty much that when we do the second read of the GPAs, the tuples that we're reading for a second time do still have the same values; we are simply also reading some new tuples that were inserted. In fact, if this max were an average instead of a max, we might get two different answers for the average, even with repeatable read as the isolation level. But that's what it allows, and these hundred tuples here are what are known as the phantom tuples. They sort of emerge during execution out of nowhere. Now, I would have to say that my opinion is that this behavior within the repeatable read isolation level, although it's part of the standard, is really an effect of the way repeatable read is implemented using locks.
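Here is a rough sketch of that phantom scenario; the INSERT is only illustrative (NewStudents is a hypothetical staging table holding the hundred new rows), and the Student schema is assumed as before.

```sql
-- T1: insert a hundred new students (NewStudents is a hypothetical source table)
INSERT INTO Student (sID, sName, GPA, sizeHS)
  SELECT sID, sName, GPA, sizeHS FROM NewStudents;
COMMIT;

-- T2: repeatable read still permits the second query to see the new "phantom" rows
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT AVG(GPA) FROM Student;  -- may be computed before T1
SELECT MAX(GPA) FROM Student;  -- may include the hundred inserted tuples
COMMIT;
```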
The intuition is that when a value is read once, it's locked and can't be modified, but when we insert new tuples, they aren't inserted with locks, so they can be seen by a second read of the same relation. Don't worry about the implementation details, but do worry about phantom tuples, because if you're using the repeatable read isolation level, you do need to know that insertions can be made by another transaction, even between two entire readings of a table. Now on the other hand, if what we do in our first transaction is delete the hundred tuples instead of insert them, in that case we actually cannot get the behavior where the first statement is before and the second statement is after. Because once the average of the GPAs has been read, the deletion will not be allowed, because, again as a matter of implementation, those values are locked. And so in this case, the second read of that same relation will see the same tuples. So in summary, we may have phantom tuples appear between two reads of the same relation in a repeatable read transaction, but we won't have tuples disappear from the relation in between two reads of it. So that completes our three isolation levels: in addition to serializable, we had, at the weakest, read uncommitted, then read committed, and then repeatable read. I did want to mention that we can also set transactions to be read-only. That's sort of orthogonal to setting the isolation level; what it does is help the system optimize performance. So, for example, in our transaction where we were just reading the average GPA and the max GPA, we can set an isolation level, and then we can also tell the system that it's going to be a read-only transaction. That means that we are not going to perform any modifications to the database within the transaction. The system can use that as a hint to figure out its protocols to guarantee the right isolation level, but it might not have as much overhead as if the transaction had the possibility of performing modifications as well as performing reads. OK, so the behavior of transactions can be rather confusing, and it's very important to get it right or surprising things might happen. But I think we can summarize it pretty well with this table here, which lists the levels from weakest to strongest. We can classify the behavior of transactions based on, again, what happens with reading: can they read uncommitted values, can they have non-repeatable reads, where we read a value and then read a different one later in the same transaction, and can there be phantom tuples inserted during the transaction? If we set our isolation level to serializable, then we cannot have dirty reads, we cannot have non-repeatable reads, and we cannot have phantoms. If we go one step weaker for a little more performance and use repeatable read, then we still won't have dirty reads, we still won't have non-repeatable reads, but we might have phantom tuples. Moving up the table to read committed, which is weaker still, we won't have dirty reads, but we might have non-repeatable reads: we might read a value that's committed both times we read it, but another transaction wrote the value in between those two reads, so it's different each time. And we may have phantoms as well. Finally, read uncommitted is the absolute weakest, with not many guarantees at all: we might have dirty reads, we might have non-repeatable reads, and we might have phantoms. So to wrap up transactions completely: the standard default is serializable behavior, and we've specified exactly what that means.
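As a sketch of the read-only setting mentioned above, the two settings can be combined in a single SET TRANSACTION statement; the comma-separated form follows the SQL standard, though the exact syntax varies a little across systems.

```sql
-- Declare the transaction read-only and pick an isolation level in one statement
SET TRANSACTION READ ONLY, ISOLATION LEVEL REPEATABLE READ;
SELECT AVG(GPA) FROM Student;
SELECT MAX(GPA) FROM Student;
COMMIT;
```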
Weaker isolation levels allow us to increase concurrency and decrease overhead, so overall we'll get increased performance, but we have weaker consistency guarantees. I should mention that some prominent database systems have actually chosen a default weaker than serializable: MySQL, for example, defaults to repeatable read, and Oracle defaults to read committed. So, in those systems it's assumed that most applications will be willing to sacrifice a little bit of consistency in order to get higher performance. And, finally, the isolation level is set for each transaction and is in the eye of the beholder, meaning that that transaction's reads must conform to its own isolation level but won't affect any concurrent transaction's isolation level. I hope I've gotten the point across that transactions are one of the most important concepts in database systems. They are what allow multiple clients, maybe thousands or tens of thousands of clients, to operate on a database all at the same time, without concern that the actions they perform on the data will affect each other in unpredictable ways. And furthermore, transactions are what allow database systems to recover to a consistent state when there's an unexpected crash.