In this sequence of videos, we'll
learn about designing good schemas for relational databases.
So let's suppose we're building a
database for an application or
set of applications and we
have to figure out what schema we want to store our data.
Usually there are many different
possible schema designs for a
database, and databases do tend to get quite complicated.
And some designs are much better than others.
So how do we choose what design to use?
Now the reality is that people
often use higher level tools
to design relational databases and
don't design the schemas directly themselves.
But some designers do go
straight to relations, and furthermore,
it's useful to understand why
the relations that are produced
by design tools are what
they are.
Furthermore, from an academic point of view,
it turns out there's a very
nice theory for relational data base design.
So let's consider the process of
designing the schema for our
database about students applying to colleges.
Specifically, for a given
student, let's suppose we have
their social security number and
their name, the colleges that
student is applying to, the
high schools they attended and
what city those high schools were in, and the student's hobbies.
So if that's what we want we
can create a single relation
called apply, that has one
attribute for each of those pieces of information.
Now let's take a look at how that database would be populated.
Let's suppose that we have
a student, Anne, with Social Security
number 123, she went to 2
different high schools in Palo
Alto, she plays tennis and
the trumpet, and she's applying
to Stanford, Berkeley, and MIT.
So let's look at some of
the tuples that we would be
having in the apply relation
to represent this information about Anne.
So we'll have 1,2,3, Anne,
her name, she's applying to
Stanford, she went to
Palo Alto High School, and
that's in Palo Alto, and one of her hobbies is tennis.
And then we also have 1
2 3 and she applied to
Berkeley and went to
Palo Alto High School in
Palo Alto and tennis there as well.
Of course she also has
a tuple representing the fact
that she's applying to Berkeley and
and we'll stick with Palo
Alto High School, and she played the trumpet.
And as you can see we'll
have more tuples, we'll have
various Stanford and Berkeleys, we'll
have some for her other high
schools called Gunn High School
also in Palo Alto, and so on.
So if we think about
it we will need a total
of 12 tuples to represent
this information about Ann.
Now do we think that's a good design?
I'm going to argue no, it's not a good design.
There are several types of anomalies in that design.
First of all, we capture information
multiple times in that
design, and I'll give some examples of that.
For example how many times
do we capture the fact that
1 2 3 the Social Security
number is associated with a student named Ann?
We capture that twelve times in our twelve tuples.
How many times do we
capture that Anne went to Palo Alto High School?
We're going to capture that six times.
And we're going to capture
the fact that she plays tennis six times.
And we're going to capture the fact
that she went to apply to
MIT four times, so for
each piece of information, in fact,
we're capturing it many, many times.
So that doesn't seem like a good feature of the design.
The second type is an
update anomaly, and that's really a direct effect of redundancy.
What update anomalies say
is that you can update facts
in some places but not
all all or differently in different places.
So let's take the fact for
example that Ann plays the trumpet.
I might decide to call
that the coronet instead but I
can go ahead and I
can modify, say, three of
the incidences where we captured
the fact about her playing the
trumpet and not the fourth
one and then we end up
with what's effectively an inconsistent database.
And the third type of
anomaly is called a deletion anomaly, and
there's a case where we could inadvertently
completely do a complete
deletion of somebody in the database.
Let's say for example that we
decide that surfing is an
unacceptable hobby for our
college applicants, and we go
ahead and we delete the tuples about surfing.
If we have students who have
surfing as their only hobby, then those students will be deleted completely.
Now you may argue that's the right
thing to do, but probably that isn't what was intended.
So now let's take a look at a very different design for the same data.
Here we have five different
relations, one with the
information about students and their
names, one where they've applied
to colleges, one where they
went to high school, where their
high schools are located and what hobbies the students has.
In this case we have no anomalies.
If we go back and look at
the three different types, they don't occur in this design.
We don't have redundant information, we
don't have the update anomaly or the deletion anomaly.
Furthermore, we can reconstruct all
of the original data from our
first design, so we haven't
lost any information by breaking it up this way.
So in fact this looks like a much better design.
Now let me mention a couple
of modifications to this design that might occur.
Let's suppose, for example, that
the high school name alone is not a key.
So when we break up the
high school name and high school
city, we no longer can identify the high school.
In that case, the preferred
design would be to move
the high school up here so
we'll have that together with
the high school name and then we don't need this relation here.
And actually that's a fine design.
It does not introduce any anomalies,
that's just based on the
fact that we need the name
of the high school together with the city to identify it.
As another example, suppose a
student doesn't want all of
their hobbies revealed to all
of the colleges that they are applying to.
For example, maybe they don't want Stanford to know about their surfing.
If that's the case then we
can modify the design again,
and in that case we would
put the hobby up here with where they're applying to college.
And so that would include the hobbies
that they want to reveal
to those particular colleges, and we'll take away this one.
So it looked like we were
taking our nice, small relations,
and moving back to a design that had bigger relations.
But in this case it was very well motivated.
We needed these attributes together to
identify the high school and
we want it to have our hobbies specific to the colleges.
So what that shows is
that the best design, for an
application for relational databases
depend not only on constructing
the relations well, but also
in what the data is representing in the real world.
So the basic of idea of
what we're going to do is
design by decomposition, specifically,
we're going to do what we
did at the very beginning of this
example, which is start
by creating mega-relations that just
contain attributes for everything
that we want to represent in our
database, then we're going
to decompose those mega relations
into smaller ones that are
better, but still capture the same information.
And most importantly we can do this decomposition automatically.
So how does automatic decomposition work?
In addition to the mega
relations, we're going to specify
formally, properties of the data.The
system is going to use
the properties to decompose the
relations, and then it's
going to guarantee that the final
set of relations satisfy what's called a normal form.
And we'll be formalizing all of this.
But the basic idea behind normal
forms is that they don't
have any of those anomalies that
I showed and they don't lose any information.
So specifically for specification of
properties, we're going to begin
by looking at something called functional dependencies.
And once we specify functional dependencies,
the system will generate
relations that are in
what's called Boyse Codd normal form.
And Boyse and Codd by the
way were two early pioneers in relational databases in general.
Then we're going to
look at another type of
specification called multi valued
dependencies which will add to
functional dependencies and when we
have both functional and multi
valued dependencies, then we
can have what's called fourth
normal form, and again, that
would be relations that are generated
by the system that satisfy the normal form.
Boyce-Codd normal form is stricter than fourth normal form.
Specifically if we make
a big Venn diagram here of
all the relational designs
that satisfied Boyce-Codd Normal Form,
which by the way is very
often abbreviated BCNF, then that
contains all of the
relations that satisfy fourth normal form,
normally abbreviated 4NF.
So every relation that's in
fourth normal form is also
in Boyce-Codd normal form, but not vice versa.
You might be wondering what happened
to first, second and third, normal forms.
So first normal form is
pretty much just a specification
that relations are real
relations with atomic values in each cell.
Second normal form is specifying
something about the way relations are structured with respect to their keys.
Neither of those is discussed very much anymore.
Third normal form is a
slight weakening of Boyce-Codd
normal form and sometimes people
do like to talk about third normal form.
So you can think of third normal form as a little bit of a even bigger circle here.
We're not going to cover third normal
form in this video because
Boyce-Codd normal form is the
most common normal form used
if we have functional dependencies only, and
fourth normal form if we
have functional and multivalued dependencies.
So what's going to happen next is
I'm going to give some examples
to motivate these four concepts:
functional dependencies, Boyce-Codd normal form,
multivalued dependencies normal form,
and then later videos will
go into each one in much greater depth.
So let me just give
the general idea of functional dependencies
and Boyce-Codd Normal Form.
And we'll use a very
simple for example, an abbreviated version
of our apply relation that has
students' social security numbers, the
student's name and their colleges
that the student is applying to.
Even this small relation actually
has redundancy and update and deletion anomalies.
Specifically, let's say that our
student, 123Ann, applies to 7 colleges.
Then there will be 7 tuples and
there will be 7 instances where we
know that a student with the
social security number 123 is named Ann.
Specifically, we're going to store
for every student the name
and social security number pair once
for each college that they apply to.
So now let me explain what
a functional dependency is and then
we'll see how functional dependencies are
used to recognize when we
have a bad design like this
one, and to see how we can fix it.
A functional dependency, in this
case from social security number
to name, and we're saying
social security number functionally determines
the student name says that
the same social security number always has the same name.
In other words, every time we see 123, we're going to see Ann.
Now it doesn't necessarily go in the other direction.
It might not be that whenever
we see Ann, it's 123,
but whenever we see 123, it is Ann.
And so what we'd like to
do is store that relationship just one time.
One time say that for 123, the name is Ann.
Now what Boyce Codd Normal Form
says is that whenever we have
one of these functional dependencies, then
the left hand side of that functional dependency must be a key.
And think about what that's saying.
Remember a key says
that we have just one tupple with each value for that attribute.
So if we have say
social security number to name
as a functional dependency and we
satisfy Boyce-Codd Normal Form,
then we're going to say that
social security number has to
be a key in our relation,
and we'll only have one
tupple for each social security number.
Specifically, we can go back to our original relation.
We have this functional dependency social
security number here is not a key, right?
So then we know that this is not in Boyce-Codd Normal Form.
So we're going to use functional
dependencies to help us
decompose our relation so
that the decomposed relations are in Boyce-Codd Normal Form.
And here's what would happen in this example.
Our functional dependency would tell
us to pull out the social
security number and student name
into its own relation where
the social security number is a
key and then we have
just one time for each
social security number that students
name, and then separately we'll
have the information about the
students and which colleges they applied to.
Again, we'll completely formalize this
whole idea, the definition of
functional dependencies, their properties, the
normal form, and how we
do the decomposition in a later video.
Now, let's similarly motivate the
concept of multi-value dependencies, and fourth normal form.
It is actually a little bit more complicated, but it follows the same rough outline.
Now let's look at a different
portion of the information
about applying and let's suppose,
for now, that we're just concerned
about students, what colleges they're
applying to, and what high schools they went to.
We still have redundancy and update and deletion anomalies.
For example, a student who
applies to Stanford is going
to have that fact captured once
for every high school that they went to.
A student who went to
Palo Alto high School will have
that fact captured once for every college they apply to.
In addition, we get a kind of multiplicative effect here.
Because let's say a student
applies to C colleges
and they went to H
high schools; I know students don't
go to a lot of high schools
but let's suppose that this is one that had moved a lot.
In that case, we're going to have C times H, tuples.
What we'd really like to have
is something more on the
order of C plus H,
because then we'd be capturing each piece of information just once.
Now the interesting thing is that
the badness of this particular
design is not addressed by
Boyce-Codd Normal Form, in fact
this relation is in Boyce-Codd
Normal Form, because it has no functional dependencies.
It's not the case that every
instance of a social security
number is associated with a
single college name or a single high school.
As we will see later, if there
are no functional dependencies, then the
relation is automatically in Boyce-Codd
Normal Form, but it's not
in and fourth normal form.
So fourth normal form is associated with what are called multi-value dependencies.
When we specify a multi-value
dependency as we've done here
with the double arrow, what this
is saying is that if we
take a particular social security
number in the relation,
we will have every combination
of college names that are
associated with that social security
number with every high
school that's associated with that social security number.
We'll actually see that when
we have this multi-value dependency,
we automatically have this one, too.
I know it seems a bit complicated,
and we will formalize it completely,
but for now now just think
about the English statement that multi-valued dependency
is saying that we are going to
have every combination of those
two attributes and values
in those attributes for a given social security number.
In other words, those values
are really independent of each other.
So if we have that situation,
then what we should really do
is store each college name
and each high school for each
social security number one time,
and that's what fourth normal form will do for us.
Fourth normal form, similarly to
Boyce-Codd normal form, says if
we have a dependency, then the left hand side must be a key.
In this case, it's a multi-value dependency
we're looking at, so it's really
saying something different but the
basic idea is the same
which is that we want
only one tuple that has
each value that's appears on
the left hand side of a
multi-value dependency So let's
see what would happen, in this
example, if we use
our multi-value dependencies to decompose
the relation, based on the idea of fourth Normal Form.
Well it is the intuitive thing that happens.
We separate the information about
the college names that a student
applies to from the information
about the high schools themselves, and
then we'll see that that
we only store each fact
once and we do get
the behavior of having C
plus H tuples instead of having C times H tuples.
Like with functional dependencies and Boyce-Codd
Normal Form we'll be completely
formalizing all of this
reasoning and the definitions in later videos.
To summarize, we're going to do relational design by decomposition.
We're going to start by specifying mega
relations that contain all the
information that we want to capture,
as well as specifying properties of
the data usually reflecting the real world in some fashion.
The system can automatically decompose
the mega-relations into smaller relations
based on the properties we specify,
and guarantee that the final
set of relations have certain
good properties captured in a normal form.
They will have no anomalies, and they'll be guaranteed not to lose information.
We'll start by specifying properties as
functional dependencies and from
there the system will guarantee Boyce-Codd
Normal Form and then we'll
add to that properties specified as
multi-value dependencies, and from
there the system will guarantee fourth
normal form, which is even
stronger than Boyce-Codd Normal Form and
is generally thought to be good relational design.