This video introduces a very important type of constraint known as referential integrity. As a reminder, integrity constraints in a database restrict the allowable data beyond what's already restricted by the structure and types of the database. Now the term referential integrity refers to integrity of references that appear in the database. In a relational database, a reference from one couple to another occurs through specifying values, and integrity, referential integrity says that those values are valid. It's sort of equivalent to having no dangling pointers if we used pointers in the database; we'll see a number of examples. So let's look at our simple example database with the students, colleges and students applying to colleges, that we've been using for all our demos. Let's suppose for example, that we had a student, 123, who had applied to Stanford, for some major, let's say CS, with a decision of yes. When we have this tuple, the value in the SID field here of 123 presumably refers to a value here of an actual student. So maybe 123 is student Mary with some GPA in some high school. And furthermore, the Stanford value in the C name or college name attribute is presumably referring to the college name in the college table. So we would expect to have a Stanford value here. Referential integrity talks about these values here, referencing the corresponding values in the other tables. Specifically if we have referential integrity from a attribute A say of a relation R to an attribute B of another relation S. What that's saying is that every value that appears in the A column of relation R must have some corresponding value in the B column of relation S. So if we take a look at our example, we would say then that we would have referential integrity from the SID column of apply, to the SID column of students. We would expect every value that appears in this column to also have a value in the other column, and similarly we all have referential integrity from the college name attribute been applied to the college name attribute, in college. Again we want every value that appears in this column, to also appear in this column. Now we might have a violation, for example, if we had a, say, 555 applying to Stanford for some major and some decision. If we have no 555 over in the student table, then this here would be considered a referential integrity violation. Similarly we might have 1 2 3, which is valid because we have a 1 2 3 student here, but if 1 2 3 is applying to Yale and we don't have Yale over here, then again we have a referential integrity violation. Now let me mention that referential integrity is directional, so we talk about, this SID here, referencing the SID in the student table. We could have referential integrity in the other direction, but that's saying something different. That would be saying that every student must apply somewhere, every value of her student ID must appear in the apply table. And this particular example, we probably would not have that be the case, we would probably be able to have students who hadn't yet applied anywhere, or colleges where no one had applied yet. And the most sensible direction for this referential integrity in this case is from the apply to the student and the apply to the college. But again, we could have it in both directions if we so wished. Now just a few more details of referential integrity constraints. The referencing attribute, in this case, the referencing attribute is A, is often called the foreign key. And in fact, referential integrity is often referred to as foreign key constraints. And even in the SQL syntax, we'll see that the term foreign key is used. Second, when we have a referenced attribute, in this case now we're talking about attribute B, so we have R.A to S.B, the referenced attribute is usually required to be the primary key for the table, for it's table, or at least specified as unique. And that's more about efficient implementation than anything else. But it is a requirement in the SQL standard, and in most systems. Third, it is possible to have foreign keys that consist of multiple attributes, just like keys themselves can be multiple attributes. Let's say for example that in our college relation, the college name together with the state, form a key, not the college name individually. If that were the case then our apply table would presumably have one more column that specified the state so we knew which college a student was applying to. And in this case we would have these two attributes together as the foreign key referencing the college name and state together in the college table. And we'll see an example of multikey, multiattribute foreign key constraints in our demo. Now let's talk about the enforcement of referential integrity constraints. First, let's think about what operations can occur to the database to violate a constraint. It's not every possible modification. So again, let's suppose we have R.A referencing S.B. So, for example, here we have apply referencing students and we have apply referencing college. So certainly if we insert a tuple into the referencing relation, so if we inserted a tuple into the apply relation that could potentially violate the referential integrity if the value, say an SID or C name, didn't have matching values in the reference relations. Sort of conversely, if we delete from a reference relation, say we delete a student, then that could cause a violation because say we have the 123 value here and the app apply tuple that was referring to 123 after this is gone would then have effectively a dangling pointer. And of course, if we updated the referencing value either of these columns that could cause a violation if the new value didn't exist in the reference table, or if we update the referenced values that could also cause a violation. In the SQL standard and in all implementations, if we have an insertion into the referencing table or an update to the referencing table that causes a violation to the integrity constraint, the referential integrity constraint, then an error is generated, and that modification is not allowed, just like the violation of other types of constraints. For the reference table, however, table S in our case table student and table college. If there are modifications that violate the constraint, if the referential integrity was defined initially with some special options then it's possible for the database system to automatically modify the referencing table so that the constraint is not violated. So let's talk about that in a little more detail. Let's start by talking about deletions from the referenced table. So let's says we have our student 123 here, maybe 123 has applied a couple of places, and then we have our student 123 here in the student table. So right now referential integrity is good. Everything's okay. But let's suppose that we delete the tuple with 123. So there's actually three possible things that can happen, depending on how we set up the referential integrity constraint in the first place. So the default behavior is what's called restrict, so restrict is actually a key word, but that's the default, and it says that if we do a deletion to the reference table and a constraint becomes violated then we generate a air just like I said before and the modification is disallowed. The two other options are a little bit more interesting. One of them is called set null. And what set null says is if we delete a tuple in a reference table, then we don't generate an error. Rather, we take the referencing tuples, in this case these two tuples with 123, and we take their SIDs and we replace those SIDs with NULL. And this is considered acceptable from a referential integrity point of view to have nulls in the foreign key column, so that will occur automatically. The third possibility is what's called cascade, so let's set up a little more data let's say we have 234, who's applied to Stanford, and we have Stanford over here, and now let's say that we again delete this tuple, so that would leave us with a referential integrity constraint violation here with the Stanford value. So what cascade says for the on delete case, is that if we delete this tuple then we'll simply delete any tuple that has a referencing value, so this tuple will be deleted as well. Now the reason is called Cascade is because sometimes you can actually set up a whole chain of referential integrity constraints, so we have apply referencing college here but maybe we've had some other referencing ply and maybe even another table referencing that one, and if we say deleted a tuple from college that caused us to delete a tuple from a ply, if there was a tuple up here referencing that we might get a delete there and then a further delete, and then so on. Typically, cascading will only go one step, but we'll see an example in our demo, where we'll set it up where a cascade will go some distance. Now, updates have a similar three options. Let's erase all this stuff here. Let's again set up some example data. So let's say our student 123 was applied to Stanford and we have 1 "23" over here. If we tried to update, say this "123" to be the value "456", there strict command would say that's not allowed because that would leave us with a steg when pointer and will generate an error. The set null command will similar to the delete if this is changed to four, five, six Set any 123 values to null. So in this case we change 123 to 456 in the student. And then we would change 123 over here to null. Probably the most interesting case is the cascade case for the update. Cascade says that if we update a reference value then we'll make the same update to the referencing value. So, let's say we have Stanford over here in fact we have to if we have the value the referential integrity constraint being correct, and now let's say that somebody comes along and says, "I think Stanford is spelled wrong. It's actually Stanford, well, that's actually a common mispelling for Stanford. The first one was correct. But if someone makes this change, if we have the cascade option for the referential integrity constraints between apply C name and college C name then that update will be propagated to any referencing values. So in that case automatically this Stanford and any other Stanfords in the apply table will be updated automatically to Stanford. Now let's take a look at referential [xx] integrity in action. Let's create our three tables. We create the college table with the college name as primary key and the student table with the student ID as primary key. That allows us to have referential integrity constraints in the apply table that reference those two attributes. When we create the table apply, now we're going to specify the attributes, and thhe key word "references" says that we're setting up a referential integrity constraint from attribute "Student ID" to the "Student ID" attribute of the student table. And similarly, we extend the declaration of the "College Name" attribute to have a referential integrity constraint to the "College Name" attribute of the college table. We'll go ahead and create those tables. Now, let's try to populate our tables. If we make a mistake, we try to put our apply values in first, but there's no data in the student table or the college table. So, when we try to insert, for example, student123 applying to Stanford, we'll get a referential integrity violation because there's no student123, and there's no college Standford, and similarly for student 234 applying to Berkley. So, we see the errors and what we need to do is first insert the tuples for the students and the colleges and then insert the applied tuples afterwards. So let's insert the two students, 1, 2, 3 and two, three, four, and the two colleges Stanford and Berkeley, no problem doing that. And now we'll go ahead, and again, insert the apply tuples, and this time everything should work just fine, and it does. In addition to inserts into the referencing table, we also have to worry about updates. So, as a reminder let's take a look at the students that we have. We have students 123 and 234 and in the apply we have students 123 and 234 each applying to one college. Now we're going to update our applied tuple that has student ID 123. Our first update tries to set the student ID to 345, but we'll get a referential integrity violation because there's no student with ID 345. Our second update will be more successful. It will update 123's application to have student ID 234. We'll go ahead and execute the update. And we see that the second one did succeed. So far, we've looked at modifications to the referencing table but we also have to worry about modifications for the referenced tables. In our case that's the student table and the college table. For example, let's suppose we try to delete from the college table where the college name is Stanford. If we try to delete that couple we'll get an error because we do have an applied couple that has a value Stamford and is therefore referencing the couple we're trying to delete. Similarly tried to delete some student couples. Let's go back and look at our apply relation, and we now see that both of the student IDs in apply are "234", so it should be ok to delete the student couple with 123, nothing is referencing it. But it should not be okay to delete the couple with student ID 234 and indeed when we run the command we see that the first one generated an error and the second one succeeded. How about updating a referenced table. Let's say that we decide we'd rather have Berkley called Bezerkly So we try to update that college name, but when we run the command, we get an error because we do have an apply tuple that's referencing the value as Berkley. And finally, although we've been talking about data level modifications, referential integrity constraints also place restrictions on dropping tables. For example, if we try tried to drop the student table, we would again get a referential integrity constraint because that would leave data in the apply table referencing non-existent data in what would now be a non-existing table. You can see that the errors says that you cannot drop a table student because other objects are currently depending on it. So when we have referential integrity constraints, if we wanted to drop the tables, we'd have to first drop the applied table and then drop the table that it's referencing. Now we're going to set up the apply table to experiment with some of the automatic mechanisms for handling referential integrity violations. Specifically, we still have the same referential integrity constraints from student id to the student table and from college names to the college table, but for the student ID referential integrity constraint we're and we're going to specify that if a student is deleted, then we're going to set any referencing values to "null", and we do that with the keywords "on delete" which tells us what to do when there's a delete to the referenced table, we use the set null option. For the college name reference we're going to specify that if the college is updated in the college table and that says on update we'll use the cascade option. As a reminder what that does is that if we change the college name then we'll propagate that change to any college names that reference it in the apply table. Now I could have specified two more options. I could have specified an on update option for the student ID and an on delete option for the college name, so there could be four all together. Because I left those out those both will use the default which is the restrict option which says if you perform a modification that generates a referential integrity violation then an error will be generated and the modification will not be performed, just as we saw in the previous examples. So let's go ahead and let's create the table and then let's experiment with some modifications. Let's start by adding a couple more students to our student table. So then, if we take a look, our applied table is currently empty because we just created it, our college table has Stanford and Berkely, and our student table now has three tuples, student IDs "123", "234", and "235". We'll insert five tuples into the apply table, and all of them are going to be valid with respect to referential integrity. We're only going to insert students that are "123", "234", or "345", and they're only going to apply to Stanford or Berkley. So, we've inserted those values, and now we'll perform some modifications to see what happens. We're going to delete, from the student table, all students whose ID is greater than 200. Going back and looking at that table, we'll see that student "234" and "345" are going to be deleted. Now remember, we specified on delete set null for the apply referential integrity constraints. Specifically, when we look at our apply table the references to the students that are about to be deleted should be automatically set to null when we run the delete command. So, we'll go ahead. We'll perform the deletion. We'll take a look at the apply table when we We see that those values have indeed been set to null, and if we take a look at the student table, we'll see that we only have student123 left. Now, let's test our cascaded update. As a reminder, when we set up the college name referential integrity constraint, we said that if we update the college name in the college table, we should propogate those updates using cascade to any references in the applied table. So, we're once again going to attempt to change the name of Berkely to Bezerkly. This time it should allow us to do it, and it should change any applications to Berkely to now specify Berzerkly. So, we'll go ahead, and run the command, and we'll look at the apply, and we will see, once we refresh, that indeed Berkley has now been changed to Bezerkly . This example is a doozy. It's going to show a whole bunch of features that we haven't seen in previous examples. We're gonna use a simple table T with just three attributes A, B and C, and we're going to say that A and B together are a primary key for the table. In the example, we're going to demonstrate referential integrity within a single table, so intra-table referential integrity. We're going to demonstrate referential integrity involving multiple we'll attribute foreign keys and primary keys and we're going to demonstrate a real cascading where we're going to have a cascaded delete that's going to propagate across numerous couples. So typically one thinks of referential integrity as having a referencing table and then the referenced value exists in a different table but there's nothing wrong with having referential integrity within a single table. For example, in the one attribute case I might have attribute B where every value in B must also appear in value A and that would be a referential integrity constraint within the table. In this case things, are slightly more complicated because I'm using pairs of attributes for my referential integrity constraint. Specifically, I'm going to have attributes B and C together reference attributes A and B together. Syntactically, to declare a multi attribute referential integrity constraint in a table definition, I have to put it at the end because I can't attach it to a single attribute. Just like I do when I have, say, multi-attribute keys which are also demonstrated here. The syntax is at the end of the table definition. I'd say that I'm going to have foreign key constraint and attributes B and C together have a referential integrity constraint to attributes A and B of the same table Then, in addition, I'm going to specify, "on delete cascade". That means, if I delete an attribute, if I delete a tuple, then any other tuple that's referencing the AB values of that tuple with it's BC values will automatically be deleted. So, let's run the table creation, insert our data, and then, let's take a look at the table and predict what's going to happen when we actually delete a tuple. So, here's the contents of table T after the insertions. So, we see that A and B together do form a key. All of the AB pairs are unique, and furthermore, every BC pair has a corresponding AB value in the same table. So, every tuple except the first, the BC pair, references the AB pair of the preceding tuple. So, we have '1-1' here referencing '1-1' in tuple one. our two one here references the two one in table two and so on. So our referential integrity constraints are intact for this table. What we're going to delete the first tupple by running a command that says delete the tupple whose A value is one. When we delete tupple one because we have the cascaded delete set-up we will need to delete any tuple whose reference values are one one. So that will be couple two. So after deleting couple one the referential integrity constraint enforcement will delete couple two. When couple two is deleted the two one value will be gone So tuple 3 which references tuple 2 will be deleted, then tuple 4 which references 3 will be deleted, and so on until the entire table is empty. So here's our delete command to delete the first tuple. We run the command. We go back and we look at the table, and when we refresh, we see that the table is indeed empty. That concludes our demonstration of referential integrity constraints. Referential integrity is actually extremely common in deployments of relational databases. The natural way to design a relational schema often has values in columns of one table referring to values of columns of another, and by setting up referential integrity constraints the system itself will monitor the database and make sure that it always remains consistent.