In this final video about transactions, we'll focus on the concept of isolation levels. As a reminder, transactions are a solution for both the concurrency control and system failure problems in databases. A transaction is a sequence of one or more operations that's treated as a unit. Transactions appear to run in isolation, and if the system fails, each transaction's changes are reflected either entirely or not at all. In this video we are going to focus on the isolation portion of transactions. As a reminder, we can have multiple clients operating on the same database, and each client will submit a sequence of transactions. So we have this client with T1, T2. Here we have T9, T10, T11, and so forth. And each transaction itself is a sequence of statements. Serializability says that it's okay for the system to interleave the execution of these statements with the statements that are being performed by other clients; however, the behavior against the database must be equivalent to the transactions themselves executing in some serial order. For example, in this case, the system may enforce behavior that's equivalent to, say, doing transaction T9 first, then maybe T1, T2, and then T10, and so on. Serializability gives us understandable behavior and consistency, but it does have some overhead involved in the locking protocols that are used, and it does reduce concurrency. As a result of the overhead and reduced concurrency, systems do offer weaker isolation levels. In the SQL standard, there are three weaker levels: read uncommitted, read committed, and repeatable read. These isolation levels have lower overhead and allow higher concurrency, but of course at a cost, which is lower consistency guarantees. I've listed the three alternative isolation levels from weakest to strongest, and to complete the picture, at the bottom we have a fourth one, serializable, which is what we've been talking about already. Before we proceed to learn about the different isolation levels, let me mention a couple of things. First of all, the isolation level is set per transaction. So each client could set different isolation levels for each of its transactions, if it wishes. Second of all, isolation levels are in the eye of the beholder, and let me show you what I mean by that. Each client submits transactions to the database and might set the isolation level for each transaction. That isolation level only affects that transaction itself; it does not affect the behavior of any other transactions that are running concurrently. So, for example, our client on the left might set its transaction to repeatable read, while our client on the right sets its transaction to read uncommitted, and those properties will be guaranteed for each of those transactions and won't affect the other. By the way, the isolation levels really are specific to reads. They specify what values might be seen in the transaction, as we'll see when we get into the details. So let's start by defining a concept called dirty reads. A data item in the database is dirty if it's been written by a transaction that has not yet committed. So for example, here are two transactions, I'll call them T1 and T2, and by the way, throughout the rest of this video I'm going to put transactions in boxes, and you can assume implicitly that there is a commit at the end of each box. I'm not going to write it each time.
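In SQL, these two boxes might look roughly like the following sketch; the College table with cName and enrollment attributes follows the course's usual sample schema, and I've written the commits explicitly here even though the boxes leave them implicit.

```sql
-- T1: add 1000 to Stanford's enrollment (the new value is dirty until the commit)
UPDATE College SET enrollment = enrollment + 1000 WHERE cName = 'Stanford';
COMMIT;

-- T2: read the average enrollment across the College table
SELECT AVG(enrollment) FROM College;
COMMIT;
```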
So our first transaction is updating Stanford's enrollment, adding 1000 to it, and our second transaction is reading the average enrollment in the College table. We're using our usual database of students applying to colleges. So after this update, Stanford's enrollment has 1,000 added to it, but before the transaction commits, at that point in time, the value is what's known as dirty. If our second transaction here reads this value, then it might be reading a value that never actually exists in the database. And why is that? Because before this transaction commits, there could be a system failure, and the transaction could be rolled back as we described before, with all of its changes undone. Meanwhile, however, the second transaction may have read that value before it was undone. Here's another example. Now we have three transactions, T1, T2, T3. Our first transaction is modifying the GPA of students whose high school size is sufficiently large. Our second transaction is finding the GPA of student number 123. And our third transaction is modifying the high school size of student 234. So if this GPA here in transaction T2 is read before the commit of transaction T1, then that would be a dirty value for the GPA. Again, because if this first transaction doesn't commit, then that value will be rolled back. There's a second case where we might have a dirty data read in this trio of transactions, and that's the high school size here. Because notice that here we're modifying a high school size, so if this high school size is read before the commit point of the third transaction, that would also be a dirty data item. One clarification about dirty reads is that there is no such thing as a dirty read within the same transaction. In T3, for example, after we've modified the high school size, we might read the high school size later in the same transaction, and that's not considered a dirty read. So a read is only dirty when it reads an uncommitted value that was modified by a different transaction. So here's our first isolation level, and it's our weakest one. It's called read uncommitted, and what it says is that a transaction that has this isolation level may perform dirty reads. It may read values that have been modified by a different transaction and not yet committed. So let's take a look at an example. It's our same example. We've dropped the third transaction, so our first transaction is modifying GPAs in the student table and our second transaction is reading the average of those GPAs. If these transactions are serializable, then the behavior is guaranteed to be equivalent to either T1 followed by T2 or T2 followed by T1. So either the second transaction will see all the GPAs before they were updated, or it will see all the GPAs after they were updated. As a reminder, we don't know which order these will occur in, only that the behavior will be equivalent to one of those orders. Now let's suppose we add to our second transaction a specification that it has isolation level read uncommitted. And by the way, this very long sentence is how we specify the isolation level in the SQL standard. When we don't specify an isolation level, as we haven't for the first transaction, the default is serializable. Although in most of our examples, it won't actually matter what the first transaction's isolation level is, as we'll see; we're going to be focusing on the data that's read in the second transaction and typically written in the first transaction. Okay, so let's see what's going on here.
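Concretely, the pair being discussed can be sketched roughly as follows; the Student table with GPA and sizeHS attributes is the course's sample schema, the particular UPDATE (and its sizeHS cutoff) is just an illustrative stand-in for the statement on the slide, and the exact placement of the SET TRANSACTION statement varies a little from system to system.

```sql
-- T1: raise GPAs for students from large high schools (default isolation: serializable)
UPDATE Student SET GPA = 1.1 * GPA WHERE sizeHS > 2500;
COMMIT;

-- T2: read the average GPA, allowing dirty (uncommitted) values to be seen
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT AVG(GPA) FROM Student;
COMMIT;
```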
Again, this is T1 and T2, and our first transaction is updating the GPAs. And now we've said in our second transaction that it's okay for this average to read dirty values, in other words, to see uncommitted GPA modifications. In that case, the average could be computed right in the middle of the set of modifications being performed by T1. In that case, we certainly don't have serializable behavior. We don't have T1 followed by T2, since T2 is reading some values that are in the middle of T1, and similarly we don't have T2 followed by T1. It might be that for our particular application, we just don't care that much about having exact consistency. It may be that we don't mind if our average is computed with some old values and some new values; we might not even mind if we compute in our average an increased GPA that ends up being undone when a transaction rolls back. So if we're just looking for a rough approximate GPA, we can use this isolation level, and we'll have increased concurrency, decreased overhead, and better performance overall, with the understanding that we will have reduced consistency guarantees. Let's go one step up to the next isolation level, which is called read committed. As you can probably guess, this one specifies that transactions may not perform dirty reads. They may only read data values whose updates by other transactions have been committed to the database. Now this isolation level is stronger, but it still doesn't guarantee global serializability. Let's take a look at an example. Our first transaction, T1, is the same one, modifying the GPA for students from large high schools. Our second transaction is the one where we are going to set the isolation level, in this case to read committed, and it is going to perform two statements. One of them is going to read the average GPA from the student table, and the other is going to read the maximum GPA from the student table. So let's look at one behavior that's consistent with this isolation level but, as we will see, is not serializable. Let's suppose that this average GPA is read before transaction T1, but the max GPA is computed after transaction T1. So the average will not take into account the increases, but the max will take them into account. So let's see if this is equivalent to any serial order. Is it equivalent to T1 followed by T2? It's certainly not, because T2's first statement is reading the state of the table before T1 and not the state of the table afterward, although T2's second statement is reading the state of the table afterward. Similarly, it's not equivalent to T2 followed by T1, because T2 is reading in its second statement the state of the database after T1. So there's no equivalent serial order. But again, perhaps that's not needed for the particular application, and by using read committed we do get somewhat more performance than we would have if we were serializable. Our next isolation level is called repeatable read, and it's our strongest one before we get to serializable. In repeatable read, a transaction may not perform dirty reads, just like in read committed. And furthermore, there is an additional constraint that if an item is read multiple times, it can't change value. You might remember that in our previous example we read the GPAs multiple times, and they did change value. So if we were using repeatable read for the isolation level there, then the behavior that I described couldn't occur.
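As a quick reference, the read committed transaction we just analyzed might be written roughly like the sketch below (same assumed Student schema as before); under repeatable read, the two readings of the GPAs would not be allowed to see different values.

```sql
-- T2: no dirty reads, but each statement may see a different committed state
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT AVG(GPA) FROM Student;  -- may run before T1 commits its GPA increases
SELECT MAX(GPA) FROM Student;  -- may run after T1 commits and see the increases
COMMIT;
```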
So, even with this stronger condition, we still don't have a guarantee of global serializability, and we'll again see that through an example. Our examples are getting a little more complicated here. We have our two transactions T1 and T2. Our first transaction is still modifying the GPA (I took away the condition about the high school size, just to keep things simple), and the second statement in our first transaction is modifying the high school size of the student with ID 123. So we first modify GPAs and then a high school size. In our second transaction, and that's the one we're setting to repeatable read, we are going to read the average GPA, as we usually do, and this time we are also going to read the average of the high school sizes. Incidentally, our first transaction is serializable, as they always are by default. Let's look at a behavior where the first statement, reading the average GPA, is executed before transaction T1, or sees the values before T1, while our second statement, reading the high school sizes, sees the values after transaction T1. So let's check our conditions. We are not performing dirty reads, because the first read here is of the committed values before T1 and the second read is of the committed values after T1. And furthermore, any items that are read multiple times have not had their value changed, because we are actually not reading any values multiple times. So executing the first statement here before T1 and the second one after is legal in the repeatable read isolation level. Yet we're still not serializable. We're not equivalent to T1 before T2, because the first statement of T2 is going before T1, or seeing the state before T1, and we're not equivalent to T2 followed by T1, because the second statement of T2 is seeing the state after T1. Now there is another situation with repeatable read that's quite important to understand. We said that a transaction can't perform dirty reads, and it can't. We also said that an item that's read multiple times can't change value. But the fact is that repeatable read does allow a relation to change if it's read multiple times, through what are known as phantom tuples. Let me explain through an example. Let's suppose our first transaction inserts a hundred new students into the database, and that's run concurrently with our second transaction, which is run at the repeatable read isolation level, and now we're just going to read the average GPA and we're going to follow that with the max GPA, similar to one of our earlier examples. Now, repeatable read actually does allow behavior where this average is computed before T1 and this max is computed after T1. The justification behind that is pretty much that when we do the second read of the GPAs, the tuples that we're reading for a second time do still have the same values; we are simply also reading some new tuples that were inserted. In fact, if this max were an average instead of a max, we might get two different answers for the average, even with repeatable read as the isolation level. But that's what it allows, and these hundred tuples here are what are known as the phantom tuples. They sort of emerge during execution out of nowhere. Now, I would have to say that my opinion is that this behavior within the repeatable read isolation level, although it's part of the standard, is really an effect of the way repeatable read is implemented using locks.
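Here is a rough sketch of that phantom scenario; the INSERT is only illustrative (NewStudents is a hypothetical staging table holding the hundred new rows), and the Student schema is assumed as before.

```sql
-- T1: insert a hundred new students (NewStudents is a hypothetical source table)
INSERT INTO Student (sID, sName, GPA, sizeHS)
  SELECT sID, sName, GPA, sizeHS FROM NewStudents;
COMMIT;

-- T2: repeatable read still permits the second query to see the new "phantom" rows
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT AVG(GPA) FROM Student;  -- may be computed before T1
SELECT MAX(GPA) FROM Student;  -- may include the hundred inserted tuples
COMMIT;
```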
The intuition is that when a value is read once, it's locked and can't be modified, but when we insert new tuples, they aren't inserted with locks, so they can be seen by a second read of the same relation. Don't worry about the implementation details, but do worry about phantom tuples, because if you're using the repeatable read isolation level, you do need to know that insertions can be made by another transaction, even between two entire readings of a table. Now on the other hand, if what we do in our first transaction is delete the hundred tuples instead of insert them, in that case we actually cannot get the behavior where the first statement is before and the second statement is after. Because once the average of the GPAs has been read, the deletion will not be allowed, because, again as a matter of implementation, those values are locked. And so in this case, the second read of that same relation will see the same tuples. So in summary, we may have phantom tuples appear between two reads of the same relation in a repeatable read transaction, but we won't have tuples disappear from the relation in between two reads of it. So that completes our three isolation levels: in addition to serializable, we had, at the weakest, read uncommitted, then read committed, and then repeatable read. I did want to mention that we can also set transactions to be read-only. That's sort of orthogonal to setting the isolation level; what it does is help the system optimize performance. So, for example, in our transaction where we were just reading the average GPA and the max GPA, we can set an isolation level, and then we can also tell the system that it's going to be a read-only transaction. That means that we are not going to perform any modifications to the database within the transaction. The system can use that as a hint to figure out its protocols to guarantee the right isolation level, but it might not have as much overhead as if the transaction had the possibility of performing modifications as well as performing reads. OK, so the behavior of transactions can be rather confusing, and it's very important to get it right or surprising things might happen. But I think we can summarize it pretty well with this table here, which lists the levels from weakest to strongest. We can classify the behavior of transactions based on, again, what happens with reading: can they read uncommitted values, can they have non-repeatable reads, where we read a value and then read a different one later in the same transaction, and can there be phantom tuples inserted during the transaction? If we set our isolation level to serializable, then we cannot have dirty reads, we cannot have non-repeatable reads, and we cannot have phantoms. If we go one step weaker for a little more performance and use repeatable read, then we still won't have dirty reads, we still won't have non-repeatable reads, but we might have phantom tuples. Moving up the table to read committed, which is weaker still, we won't have dirty reads, but we might have non-repeatable reads: we might read a value that's committed both times we read it, but another transaction wrote the value in between those two reads, so it's different each time. And we may have phantoms as well. Finally, read uncommitted is the absolute weakest, with not many guarantees at all: we might have dirty reads, we might have non-repeatable reads, and we might have phantoms. So to wrap up transactions completely: the standard default is serializable behavior, and we've specified exactly what that means.
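As a sketch of the read-only setting mentioned above, the two settings can be combined in a single SET TRANSACTION statement; the comma-separated form follows the SQL standard, though the exact syntax varies a little across systems.

```sql
-- Declare the transaction read-only and pick an isolation level in one statement
SET TRANSACTION READ ONLY, ISOLATION LEVEL REPEATABLE READ;
SELECT AVG(GPA) FROM Student;
SELECT MAX(GPA) FROM Student;
COMMIT;
```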
Weaker isolation levels allow us to increase concurrency and decrease overhead, so overall we'll get increased performance, but we have weaker consistency guarantees. I should mention that some prominent database systems have actually chosen a default weaker than serializable: MySQL, for example, defaults to repeatable read, and Oracle defaults to read committed. So, in those systems it's assumed that most applications will be willing to sacrifice a little bit of consistency in order to get higher performance. And, finally, the isolation level is set for each transaction and is in the eye of the beholder, meaning that that transaction's reads must conform to its own isolation level but won't affect any concurrent transaction's isolation level. I hope I've gotten the point across that transactions are one of the most important concepts in database systems. They are what allow multiple clients, maybe thousands or tens of thousands of clients, to operate on a database all at the same time, without concern that the actions they perform on the data will affect each other in unpredictable ways. And furthermore, transactions are what allow database systems to recover to a consistent state when there's an unexpected crash.