This is the first of two
videos where we learn about relational algebra.
Relational Algebra is a formal language.
It's an algebra that forms
the underpinnings of implemented languages like SQL.
In this video we're going to
learn the basics of the
Relational Algebra Query Language and a few of the most popular operators.
In the second video we'll learn
some additional operators and some
alternate notations for relational algebra.
Now, let's just review first from
our previous video on relational
querying that queries over
relational databases operate on
relations and they also produce relations as a result.
So if we write a query
that operates say on the three
relations depicted here, the
result of that query is going to be a new relation.
And, in fact, we can
post queries on that
new relation or combine that
new relation with our previous relations.
So let's start out with Relational Algebra.
For the examples in
this video we're going to be
using a simple college admission relations database with three relations.
The first relation, the college
relation, contains information about the
college name, state, and enrollment of the college.
The second relation, the student
relation, contains an ID
for each student, the student's name,
GPA and the size of the high school they attended.
And, finally, the third relation contains
information about students applying to colleges.
Specifically, the student's ID, the
college name where they're
applying, the major they're
applying for and the decision of that application.
I've underlined the keys for these three relations.
As a reminder, a key is
an attribute or a set of
attributes whose value is
guaranteed to be unique.
So, for example, we're going
to assume the college names are unique,
student IDs are unique and that
students will only apply to
each college for a particular major one time.
So, we're going to have a
picture of these three relations at
the bottom of the slides throughout the video.
The simplest query in relational
algebra is a query
that is simply the name of a relation.
So, for example, we can
write a query, "student" and that's
a valid expression in relational algebra.
If we run that query on
our database we'll get as
a result a copy of the student relation.
Pretty straightforward .
Now what happens next
is that we're going to use
operators of the relational algebra
to filter relations, slice relations, and combine relations.
So, let's through those operators.
The first operator is the select operator.
So, the select operator is used
to pick certain rows out of a relation.
The select operator is denoted by
a Sigma with a subscript--that's
the condition that's used to
filter the rows that we extract from the relations.
So, we're just going through three examples here.
The first example says that we
want to find the students
whose GPA is greater than 3.7.
So to write that
expression in relational algebra, we write
the sigma which is the
selection operator as a
subscript the condition that we're
filtering for--GPA greater than
3.7--and the relation over which we're finding that selection predicate.
So, this expression will return
a subset of the student
table containing those rows
where the GPA is greater 3.7.
If we want to filter for two
conditions, we just do
an "and" of the conditions in the subscript of the sigma.
So if we want, say, students
whose GPA is greater than 3.7
and whose high school size
is less than a thousand, we'll
write select GPA greater than 3.7.
We used a logical
and operator--a caret, high school
size is less than a
thousand, and again we'll apply that to the student relation.
And once again, the result of
that will be a subset of
the student relation containing the rows that satisfy the condition.
If we want to find
the applications to Stanford for
a CS major, then we'll be
applying a selection condition to the apply relation.
Again, we write the sigma
and now the subscript is
going to say that the college
name is Stanford and the major is CS.
Again, the and operator, and
that will be applied
to the apply relation and
it will return as a
result, a subset of the apply relation.
So the general case of the
select operator is that we have the sigma.
We have a condition as a
subscript and then we have a relation name.
And we return as a result the subset of the relation.
Our next operator is the Project Operator.
So the select operator picks certain
rows, and the project operator picks certain columns.
So let's say we're
interested in the applications, but all
we wanted to know was the
list of ID's and the decisions for those applications.
The project operator is written
using the Greek pi symbol,
and now the subscript is
a list of the column names that we would like to extract.
So we write ID, sorry,
student ID and decision, and
we apply that to the apply relation again.
And now what we'll get
back is a relation that has just two rows.
It's going to have all the
tuples of apply, but it's
only going to have the
student ID and the decision columns.
So the general case of
a project operator is the
projection, and then a
list of attributes, can be
any number, and then a relation name.
Now, what if we're interested in
picking both rows and columns at the same time.
So we want only some of
the rows, and we want only some of the columns.
Now we're going to compose operators.
Remember that relational queries produce relations .
So we can write a
query, say, with the
select operator of the
students whose GPA is greater than 3.7.
And this is how we do that.
And now, we can take
that whole expression which produces a
relation, and we can
apply the project operator to that, and
we can get out the student
ID and the student name.
Okay.
So, what we actually see
now is that the general
case of the selection and
projection operators weren't quite what I told you at first.
I was deceiving you slightly.
When we write the select
operator, it's a select
with the condition on any
expression of the relational
algebra, and if it's
a big one we might want
to put parenthesis on it,
and similarly the project operator is
a list of attributes from
any expression of the relational algebra.
And we can compose these as much as we want.
We can have select over project, over
select, select, project, and so on.
Now let's talk about
duplicate values in the results of relational algebra queries.
Let's suppose we ask
for a list of the majors that
people have applied for and the decision for those majors.
So we write that as the
project of the major
and the decision on the applied relation.
You might think that when we get
the results of this query, we're going to have a lot of duplicate values.
So we'll have CS yes, CS
yes, CS no, EE yes, EE no, and so on.
You can imagine in a
large realistic database of applications,
there's going to be hundreds of
people applying for majors and having a yes or a no decision.
The semantics of relational
algebra says that duplicates are always eliminated.
So if you run a query
that would logically have a
lot of duplicate values, you just
get one value for each result.
That's actually a bit of a difference with the SQL language.
So, SQL is based on
what's known as multi-sets or
bags and that means
that we don't eliminate duplicates, whereas
relational algebra is based
on sets themselves and duplicates are eliminated.
There is a multi-set or
bad relational algebra defined as
well but we'll be fine by
just considering the set relational algebra in these videos.
Our first operator that combines two
relations is the cross-product operator,
also known as the Cartesian product.
What this operator does, is it
takes two relations and kinda
glues them together so that
their schema of the result
is the union of the
schemas of the two relations and
the contents of the result are every
combination of tuples from those relations.
This is in fact the normal
set cross product that you
might have learned way back in the elementary school.
So let's talk about, say,
doing the cross products of students and apply.
So if we do this
cross products, just to
save drawing, I'm just gonna glue
these two relations together here.
So if we do the
cross product we'll get at the
result a big relation, here, which
is going to have eight attributes.
The eight attributes across the student
and apply now the only
small little trick is that
when we glue two relations together
sometimes they'll have the same
attribute and we can see we have SID on both sides.
So just as a notational convention,
when cross-product is done
and there's two attributes that are
named, they're prefaced with the name of the relation they came from.
So this one would be referred
to in the cross-product as
the student dot SID where this
one over here would be referred
to as the apply dot SID.
So, again, we glue together in
the Cartesian product the two
relations with four attributes each,
we get a result with eight attributes.
Now let's talk about the contents of these.
So let's suppose that the student
relation had s-tuples in it
and that's how many tuples, while
the apply had 8 tuples in
it, the result of the
Cartesian products is gonna
have S times A tuples,
is going to have one tuple
for every combination of tuples
from the student relation and the apply relation.
Now, the cross-product seems
like it might not be that
helpful, but what is
interesting is when we use
the cross-product together with other operators.
And let's see a big example of that.
Let's suppose that we want
to get the names and GPAs of
students with a high school
size greater than a thousand who
applied to CS and were rejected.
Okay, so let's take a look.
We're going to have to access
the students and the apply
records in order to run this query.
So what we'll do is we'll
take student cross apply as our starting point.
So now we have
a big relation that contains
eight attributes and all of those tuples that we described previously.
But now we're going to
start making things more interesting, because
what we're going to do is
a big selection over this relation.
And that selection is first
of all going to make sure
that it only combines student and
apply tuples that are referring to the same student.
So to do that, we write
student dot SID equals
apply dot SID.
So now we've filtered the result
of that cross-product to only
include combinations of student
and apply by couples that make sets.
Now we have to do a little bit of additional filtering.
We said that we want the
high school size to be
greater than a thousand, so
we do an "and" operator in the high school.
We want them to have applied
to CS so that's and major equals CS.
We're getting a nice big query here.
And finally we want them to
have been rejected, so "and
decision" equals, we'll just be using R for reject.
So now, we've got that gigantic query.
But that gets us exactly what
we want except for one more thing,
which is, as I said, all we want is their names and GPAs.
So finally we take a
big parentheses around here and
we apply to that the
projection operator, getting the
student name and the GPA.
And that is the relational
algebra expression that produces
the query that we have written in English.
Now we have seen how the
cross product allows us to
combine tuples and then
apply selection conditions to
get meaningful combinations of tuples.
It turns out that relational algebra
includes an operator called
the natural join that is
used pretty much for the exact purpose.
What the natural join does is
it performs a cross-product
but then it enforces equality
on all of the attributes with the same name.
So if we set up our
schema properly, for example,
we have student ID and student
ID here, meaning the same
thing, and when the cross
product is created, it's only
going to combine tuples where the student ID is the same.
And furthermore, if we add
college in, we can
see that we have the college name here and the college name here.
If we combine college and apply
tuples, we'll only combine tuples that are talking about the same college.
Now in addition, one more
thing that it does is it
gets rid of these pesky attributes that have the same names.
So since when we combine,
for example, student and apply
with the natural join, we're only
combining tuples where the
student SID is the same as the apply SID.
Then we don't need to keep two
copies of that
column because the values are always going to be equal.
So the natural join operator
is written using a bow
tie, that's just the convention.
You will find that in your text editing programs if you look carefully.
So let's do some examples now.
Let's go back to our same
query where we were finding
the names and GPAs of students
from large high schools who applied to CS and were rejected.
So now, instead of using
the cross-product we're gonna
use the natural join, which, as I said, was written with a bow tie.
What that allows us to
do, once we do that natural
join, is we don't have
to write that condition, that enforced
equality on those two
attributes, because it's going to do it itself.
And once we have done that then
all we need to do is
apply the rest of our conditions,
which were that the high school
is greater than a thousand and the
major is CS and the decision
is reject, again we'll
call that R. And then,
since we're only getting the names
and GPAs, we write the
student name and the GPA.
Okay.
And that's the result of the query using a natural join.
So, as you can see that's a
little bit simpler than the original
with the cross-product and by setting
up schemas correctly, natural join can be very useful.
Now let's add one more complication to our query.
Let's suppose that we're only interested
in applications to colleges where the enrollment is greater than 20,000.
So, so far in
our expression we refer to the
student relation and the apply
relation, but we haven't used the college relation.
But if we want to have a
filter on enrollment, we're going to have
to bring the college relation into the picture.
This turns out to perhaps be easier than you think.
Let's just erase a couple
of our parentheses here, and what
we're going to do is we're
going to join in the
college relation, with the two relations we have already.
Now, technically, the natural
join is the binary operator, people
often use it without parentheses
because it's associative, but if
we get pedantic about it we could add that and then we're in good shape.
Now we've joined all three relations together.
And remember, automatically the natural
join enforces equality on the shared attributes.
Very specifically, the college
name here is going to
be set equal to the apply college name as well.
Now once we've done that, we've got all the information we need.
We just need to add one
more filtering condition, which is
that the college enrollment is greater than 20,000.
And with that, we've solved our query.
So to summarize the
natural join, the natural join combines relations.
It automatically sets values equal
when attribute names are the
same and then it removes the duplicate columns.
The natural join actually does not
add any expressive power to relational algebra.
We can rewrite the natural join without it using the cross-product.
So let me just show that rewrite here.
If we have, and now I'm
going to use the general case of two expressions.
One expression, natural join
with another expression, that is
actually equivalent to doing
a projection on the
schema of the first
expression - I'll just
call it E1 now - union
the schema of the second expression.
That's a real union, so that
means if we have two copies we
just keep one of them. Over
the selection of. Now we're
going to set all the shared attributes
of the first expression to
be equal to the shared attributes of the second.
So I'll just write E1, A1
equals E2, A1
and E1, A2 equals E2 dot A2.
Now these are the cases where,
again, the attributes have the same names, and so on.
So we're setting all those equal,
and that is applied over
expression one cross-product expression two.
So again, the natural
join is not giving us
additional expressive power, but it is very convenient notationally.
The last operator that I'm
going to cover in this video is the theta join operator.
Like natural join, theta join is
actually an abbreviation that doesn't add expressive power to the language.
Let me just write it.
The theta join operator takes
two expressions and combines them
with the bow tie looking
operator, but with a subscript theta.
That theta is a condition.
It's a condition in the style
of the condition in the selection operator.
And what this actually says
- it's pretty simple -
is it's equivalent to applying
the theta condition to the
cross-product of the two expressions.
So you might wonder why
I even mention the theta
join operator, and the reason I
mention it is that most
database management systems implement the
theta join as their basic
operation for combining relations.
So the basic operation is
take two relations, combine all tuples,
but then only keep the combinations
that pass the theta condition.
Often when you talk to
people who build database systems or
use databases, when they
use the word join, they really mean the theta join.
So, in conclusion, relational algebra is a formal language.
It operates on sets of
relations and produces relations as a result.
The simplest query is just
the name of a relation and
then operators are used
to filter relations, slice them, and combine them.
So far, we've learned the
select operator for selecting rows;
the project operator for selecting
columns; the cross-product
operator for combining every possible
pair of tuples from two
relations; and then two
abbreviations, the natural join,
which a very useful way to combine
relations by enforcing a equality
on certain columns; and the theta join operator.
In the next video, we'll learn
some additional operators of relational
algebra and also some alternative
notations for relational algebra expressions.