Now that we've learned about functional dependencies, let's talk about how they're used to create relations that are in Boyce Codd Normal Form. Very quick reminder about relational design by decomposition. The database designer creates mega relations that contain all the information to be captured and specifies properties of the data to be captured. The system uses the properties to decompose the relations into smaller ones and those final decomposed relations satisfy what is known as a Normal Form. They don't have anomalies and they don't lose information. Functional dependencies are used to create relations in Boys Codd Normal Form and multi-doubt value dependencies are used to create relations in Fourth Normal Form. This video talks about the process of using functional dependencies to create relations in Boyce Codd Normal Form. Let's start by defining what it means to do a decomposition of a relational schema. Let's suppose we have a relation R with a set of attributes A-1 through A-N. We can decompose R into two relations. We'll call them R-1 and R-2 such that R-1 has... I'll just label them B-1 through B-K and C-1 through C and let's say, let me use the notation for the list of attributes A, B and C that I've been using in other videos. So R1 and R2 are a decomposition of R. First of all if the attributes that capture B union C are equal to the set of attributes we started with A. In other words recovering all of the attributes, and furthermore, this is the tricky part, R1 natural join R2 equals R. So, let me draw this pictorially. So here's our relation R and all of our attributes together are the A attributes. And then we're going to decompose R into R1 and R2. So, let's say this first set of attributes here are the B attributes and the second bunch of attributes here with some overlap are the C attributes. So, now R-1 consists of this portion of R and the purple part here now is R2. So, clearly the B and C attributes are equal to the original attributes and then is the join of R-1 and R-2 giving us R. Now remember, all of this is logical. We don't have R itself and we don't have the data and we don't have R-1 and R-2, so everything is being done at the schema level. And we will explore later how we can guarantee that this join does return R and not something else. Now just using a little bit more relational algebra here, let me mention that R-1 can be defined as the projection on the B attributes of R and then in purple, R-2 is the projection of the C attributes of R. So again all of this is logical, but the idea is that when we do the projection, if there are duplicates that are present simply because we have say, different values in the remaining attributes, those duplicates don't have to be retained in the projection. We saw in some of our examples in other videos where we had redundancy because we were capturing information multiple times that we didn't need to. And we're going to see that what Boyce Codd Normal Form really does is separate the relations so that we capture each piece of information exactly once. I know that's very abstract now, but when we see examples, we'll see how that works. So let's look at two possible decompositions of the student relation and see which ones are correct. So let's start with a decomposition where we take... we're gonna decompose student into S-1 and S-2, and in S-1 we'll put the social security number, name, address, let me abbreviate a little bit here and the high school code but no more high school information, the GPA and the priority. And then in relation 2, we'll put the high school code and we'll put the high school name and the high school city. So you can see what I've done done here is I've separated out the high school information into a separate relation. So first of all, is this a correct decomposition in the sense that A union B equals C. Certainly all of the attributes are still present and furthermore, if you think about it and we'll formalize this concept later S1 join S2 and that's going to occur by the way based on this highest school code value. S1 join S2 in this case will, for the data that you would expect, equal student. Again, we'll formalize that momentarily. Now let's look at a second possible decomposition of the student relation again into two relations S1 and S2. In the first one we'll put the first bunch of attributes. So we'll put the social security number, the student's name, their address, their high school code. Let's say high school name and high school city. And then in the second relation, we'll put again the student name. We'll put the high school name and we'll put, say the GPA and lastly the priority. So again, is this a decomposition? Well, certainly again we have the case that the A union B equals C, in other words we've captured all of the attributes of the student relation in our decomposed relation and do we think it's the case that if we join S1 and S2 then we'll get the student relation back and I'll put a question mark here and you know, of course, the answer is going to be no. When we join back we'll be joining in this case on the student name here and the high school name and likely these are not unique values so when we join back, we may be getting information together that doesn't really belong together. And, again, we'll be formalizing that and seeing additional examples momentarily. So now let's dig a little further into the actual process of decomposition. So, first of all we definitely want good decomposition. So, as we saw a good decomposition must capture all of the attributes of course. But the more important property is that this reassembly by the join produces the original relation. Sometimes that's called, by the way, a lossless join property. But the second thing that we want is not only that we have a good decomposition but that the relations that we decompose into are good relations. And those relations are going to be the ones that are in Boyce-Codd Normal Form. So let me first define formally Boyce-Codd Normal Form and then we'll go back to figure out an algorithm for automatically decomposing relations using good decompositions into decomposed relations that are in Boyce-Codd Normal Form. So here's the formal definition of when a relation is in Boyce-Codd Normal Form, usually abbreviated B, C and F. A relation R with functional dependencies is in Boyce-Codd Normal Form if every functional dependencies is such that it's left-hand side is a key, ok. Let's see what happens when it's not the case that the left hand side of a functional dependency is not the key and we'll see why that's a bad design. So here's our relation R and here's a set attributes A on the left side of the functional dependency, attribute B and the rest. And let's just put in some values. So let's suppose that we have two tuples here with the same A value. Then by our functional dependency, we're going to have the same B value and the rest can be anything. What has happened here is that we've captured the piece of information the connection between A and B twice. And the reason that's allowed to happen is because A is not a key. if A were a key, we would not be allowed to have the situation where we have these two tuples both present in the relation. So this relation is not in voice cod normal form. And this functional dependency here is what we would call a B C and F violation. That violation is causing us to have redundancy in our relation and that also give us as we've seen the update anomalies and deletion anomalies. Let me clarify a little bit the requirement that the left-hand side of functional dependencies have to be key, so that's what tells us we're in Boyce-Codd normal form. Now I'm not saying that the left-hand side of every functional dependency has to be declared as the primary key for a relation, only that it is, in fact, a key. And, as you might recall, the definition of a key is an attribute that determines all other attributes, if you're thinking about functional dependencies, or if you don't have any duplicates in your relation, then a key is a value that is never duplicated across tacets. So if you think about it for a second, you'll realize that whenever a set of attributes is a key, so is any superset of those attributes. So if "A" is a key, then so is ac and so is a,c,d and so on. So sometimes you'll see in the definition of Boyce Codd normal form this wording not is a key but will be contains a key which in fact is exactly the same thing or sometimes it will even say is a super key, and a super key is a key or a super set of a key. Again, all of those are saying exactly the same thing, but I just wanted to clarify because different wording and sometimes different notation is used for that concept. So far, things have been pretty abstract. Let's try to get a bit more concrete here. Let's look at two examples and determine if those examples are in "B", "C", and "F". Remember to determine is something is in B, C, and F we need the relational schema and a set of functional dependencies. So here we have our student relation and this is a set of functional dependencies we had in earlier examples, where the social security number is determining the name, address, and GPA. That means that if there's two tuples with the same social security number they will have the same name, address, and GPA. That's the same student and they only live in one place. The GPA determines the priority, so any two students with the same GPA will have the same priority and, finally, the high school code determines the high school name and city. So the high school is a unique identifier for a particular high school in a city. So those are our three functional dependencies in order to test whether this is relation is in normal form with respect to the functional dependencies we need to know what the key of the relation is or the set of keys of the relation and we worked on this in an earlier video using the closure idea, so I'll just remind you now, that for this relation, this set of functional dependencies, there's one key or one minimal key and that's the social security number together with the high school code, those two attributes do functionally determine all other attributes in the relation and, therefore, they are together, forming a key. So now, to check if we're in Boyce-Codd Normal Form, we have to ask the question, "Does every functional dependency have a key on its left-hand side?" and the answer, of course, is no, not all. In fact, the reality is that no functional dependency, in this case, has the key on the left hand side. We have three left hand sides and no of them have or contain our one key. If you've given any thought at all to this database design, you will see that it's not a good one. It's combining too much information in one place, which is our basic idea, that we start with a mega relation and break it down. And so what we're going to do is use these functional dependencies, and specifically the fact that those are BCNF or Boyce Codd Normal Form violations, to break this relation down into one that is a better design. Now let's look at a second example, our apply relation to see if this one is in Boyce Codd Normal Form. So in this case as a reminder, we have social security number, college, state, date and major. So the date is the date of application; the major is major the student is applying for at that particular college and we'll have one functional dependency which effectively says in English that each student may apply to each college only once and for one major. Now let's compute the key for this relation, or keys for this relation based on the functional dependency. Well, it's pretty straightforward that these three attributes form a key because they determine the other attributes in the relation and therefore they determine all the attributes of the relation. Furthermore, we can see that our one and only functional dependency obviously has a key on its left hand side and so so this relation is in fact already in Boyce-Codd normal form and we'll see there's no way to decompose this relation further into a better design. So here we are again, talking about our decomposition process, so now we know what good relations are. They are in BCNF. And we saw earlier what good decompositions are, so now we're going to give an algorithm that's going to perform good decompositions and those decompositions are going to yield decomposed relations that are in Boyce Codd normal form. So here's the algorithm. The input is a relation r and the set of functional dependencies for that relation, and the output is going to be our good decomposition into good relations. So let's just go through it step by step. The first thing we're going to do is compute keys from r and were going to do that using functional dependencies as we have seen, and we saw in the actual algorithm for doing that in a previous video. Then we are going to start doing the decomposition process and we're going to repeat the decomposition until all of the relations are in BCNF. though we're going to take r and break it up into smaller relations and we might further break those into smaller relations, and so on, until every relation is good. So the breaking down process says pick a relation that's bad. So we're going to pick a relation, r prime in our current set, and again, we're starting with only r in our set. And we're going define a situation where that relation has a functional dependency and I guess, technically speaking, this would be with lines on top, that violates Boyce-Codd Normal Form and that violation is what's going to guide us to do the decomposition into better relations. So our decomposition then,in the second step, is going take our prime and and it's going to put the attributes involved in the functional dependency into one relation, and then it's going to keep the left-hand side of the functional dependency and put the rest of the attributes along with that left-hand side into a separate relation. So let's draw this as a picture. So here's our relation, r prime and of course the first time through our loop, that relation would have to be r self, and then because we have a functional dependency from A to B and A is not a key - that means it's BCNF violation - we're going to decompose this into R1 and R2. So R1 will contain the attributes in the functional dependency. R2 will contain the left-hand side of the functional dependency and the rest. We can see clearly that this is a decomposition that keeps all attributes, and what we'll see soon is that this is also a good one in that logically the join of these two relations is guaranteed to give us back what we had originally. Now the remaining two lines of the algorithm compute after the decomposition the set of functional dependencies for the decomposed relations and then the keys for those. I'm going to come back to this particular line here after doing an example. This is our same example, although squished to give me more space to write. And, as a reminder, in this example, we've computed a couple of the times that the key, given the functional dependencies, is the combination of the social security number and the high school code. So our goal is to take the student relation and iteratively break it down into smaller relations until all of the relations are in Boyce-Codd normal form. So let's start the iterative process. We pick some functional dependency that violates Boyce-Codd normal form and we use it guide our decomposition. So all three of these functional dependencies actually violate Boyce-Codd normal form because none of them have a key on the left-hand side. So let's start with the high school code one. So, to do the decomposition, based on this violating functional dependency, we're going to create two relations. The first one is going to contain just the three attributes that are in the functional dependency itself, so it's high school code, high school name and high school city. And the second one is going to have all remaining attributes in the relation, plus the left hand side, so we'll have the social security number, the name, the address. We will have the high school code because it's on the left-hand side of the functional dependency we're using, but we won't have the name and city from the right side hand side and I'll have a GPA and the priority. Now at this point our algorithm computes the functional dependencies for the decomposed those relations. For this particular example, we can just see what they are, they're the same functional dependencies that we had for the non-decomposed relation. Sometimes there's a little bit of computation you have to do and I'm going to talk about that in a bit. But, in this case, we can see, for example, for our relation S1, the only functional dependency is this functional dependency here. That tells us that high school code here is a key for S1. So our only functional dependency has a key on the left-hand side and that tells us that this relation is in Boyce-Codd normal form. So we're done with that one, but we're not done with S2. So for S2 the key is still the social security number and high school code together, so we still have these two functional dependencies that are Boyce-Codd normal form violations. So let's take the GPA priority one and let's guide that to decompose S3 further. We'll decompose S2 into S3 and S4. S3 will contain the GPA and priority from the functional dependency we're using, and then S4 will take the remaining attributes in S-2 together with the left hand side of the GPA. So, we'll keep our social security number, name, address, high school code and GPA, but we don't keep the priority. So at this point, S-2 is completely gone and let's take a look at the remaining relations. S-3 now just has one functional dependency that applies and the left-hand side is now a key. And so now we're done with S-2. It's in Boyce Codd Normal Form, but we're not done, I'm sorry, we're done with S-3, but we're not done yet with S-4. S-4 still has social security and high school code as its key, and so we still have a violating functional dependency, so let's decompose S-4 further. We decompose into S-5 and S-6. S-5 contains the attributes in the functional dependency that we're using now, so it's the social security number, name, address and GPA and then as 6 contains the remaining attributes plus the left hand side, so that's the social security number and the high school code. And I will just tell you right now because you might be getting bored with this example, we're done with S-4 S5 and S6 are now both in Boyce Codd normal form. So this is our final schema. It contains 4 relations, S1, with the information about high schools, S3 with the information about GPA's and priorities, S5 has student with their name, address and GPA and S6, a student with the high school they went to. And, if you think about it, this really is a good schema design and it's what's produced automatically by the BCNF decomposition algorithm, using our functional dependencies. Let me mention a few more things about the algorithm. First of all, I left this step kind of mysterious, which is how we compute the functional dependencies for the decomposed relations. We can't just take the functional dependencies we already have for the bigger relation, and throw away the ones that don't apply exclusively to one or the other of the decomposed, we actually have to compute the functional dependencies that are implied and that apply to these relations. So in the video on functional dependencies, we learned all of implied functional dependencies, and we would use the closure as we did in that video to determine the implied functional dependencies. Now the reality is, for many examples, it's pretty obvious, we saw in the previous example it is. And, the other thing is, that this is being done by a computer so, we don't actually have to worry about it except when we're doing exercises for our class. Second, let me mention that there is little nondeterminism in this algorithm. It says "pick any of our relations with a violating functional dependency and use that to guide the next decomposition step". So, the fact is, that you can get a different answer, depending on which one you choose at this point in time. All of the results will be BCNF decomposition but they might not be the same one. And, in fact, if you go back to the example that I did, and you pick the functional dependencies in a different order, you might get a different final schema. But, again, it will be in BCNF. And lastly, some presentations of the BCNF decomposition algorithm actually have another step here, which is to extend the functional dependency that is used for the decomposition. So we're using A to B but if we have A to B we also have A to B and we can add more attributes, those in the closure of A. If you remember the closure and that's also a correct functional dependency. By doing this extension, we will still get a correct BCNF answer, but we'll tend to get relations that are larger than the ones we get if we don't do the extension first. In some cases, larger relations are better because you don't need to join them back when you're doing queries, but that can depend on the query load on the data base. So to conclude, does BCNF guarantee a good decomposition? Well of course the answer is yes or I wouldn't have spent all this time teaching you about it. Does it remove the anomalies that we looked at in our first example in another video of bad relational design? Yes, it does remove anomalies. When we have multiple instances of the same piece of information being captured, that's what's squeezed out by the decomposition into BCNF, and that's fairly easy to see through the examples that we've done already. It's a little less obvious seeing why BCNF composition does allow us to have a breakdown of a relation that logically reconstructs the original relation. So let's look at that a little bit more. So we're taking a relation in R, we're producing R1 and R2, and we want to guarantee that when we join R1 and R2, we get R back. We don't get too few tuples and we don't get too many tuples. Too few is pretty easy to see. If we break R into R1 and R2 and their projections of R then when we join them back, certainly all the data is still present. If they're too many tuples, that's a little bit more complicated to see. Let's just use a simple abstract example. So here's a relation R with three attributes. Let's have two tuples: 1, 2, 3 and 4, 2, 5. Let's suppose that we decompose R into R1 and R2. R1 is going to contain AB, and R2 is going to contain BC. So let's fill in the data: in R1 we have 1, 2, 4, 2. And in R2 we have 2, 3, 2, 5. Now let's see what happens when we join R1 and R2 back together. When we do that, you might see what's going to happen. We're actually going to get four tuples. We're going to get "1,2,3", "1,2,5", "4,2,3" and "4,2,5". That is not same as our original relation, so what happened? Well, what happened is we didn't decompose based on a functional dependency. The only way we would decompose with "B" as the shared attribute were if have a functional dependency from B to A or B to C. And we don't. In both cases, there's two values of B that are the same where here the A values are not the same and here the C values are not the same. So neither of these functional dependencies hold N, B, C, and F would not perform this decompostion. So in fact B, C, and F only performs decompositions when we will get the property that they're joined back. Again that's called a lossless join. So, B, C, and F seems great, we just list all our attributes a few functional dependencies and the systems churns on and out comes a wonderful schema. And in general that actually is what happens. In our student example, that worked very well. There are, however, shortcomings of B C and F and we will discuss those in a later video