This is the first of two videos where we learn about relational algebra. Relational Algebra is a formal language. It's an algebra that forms the underpinnings of implemented languages like SQL. In this video we're going to learn the basics of the Relational Algebra Query Language and a few of the most popular operators. In the second video we'll learn some additional operators and some alternate notations for relational algebra. Now, let's just review first from our previous video on relational querying that queries over relational databases operate on relations and they also produce relations as a result. So if we write a query that operates say on the three relations depicted here, the result of that query is going to be a new relation. And, in fact, we can post queries on that new relation or combine that new relation with our previous relations. So let's start out with Relational Algebra. For the examples in this video we're going to be using a simple college admission relations database with three relations. The first relation, the college relation, contains information about the college name, state, and enrollment of the college. The second relation, the student relation, contains an ID for each student, the student's name, GPA and the size of the high school they attended. And, finally, the third relation contains information about students applying to colleges. Specifically, the student's ID, the college name where they're applying, the major they're applying for and the decision of that application. I've underlined the keys for these three relations. As a reminder, a key is an attribute or a set of attributes whose value is guaranteed to be unique. So, for example, we're going to assume the college names are unique, student IDs are unique and that students will only apply to each college for a particular major one time. So, we're going to have a picture of these three relations at the bottom of the slides throughout the video. The simplest query in relational algebra is a query that is simply the name of a relation. So, for example, we can write a query, "student" and that's a valid expression in relational algebra. If we run that query on our database we'll get as a result a copy of the student relation. Pretty straightforward . Now what happens next is that we're going to use operators of the relational algebra to filter relations, slice relations, and combine relations. So, let's through those operators. The first operator is the select operator. So, the select operator is used to pick certain rows out of a relation. The select operator is denoted by a Sigma with a subscript--that's the condition that's used to filter the rows that we extract from the relations. So, we're just going through three examples here. The first example says that we want to find the students whose GPA is greater than 3.7. So to write that expression in relational algebra, we write the sigma which is the selection operator as a subscript the condition that we're filtering for--GPA greater than 3.7--and the relation over which we're finding that selection predicate. So, this expression will return a subset of the student table containing those rows where the GPA is greater 3.7. If we want to filter for two conditions, we just do an "and" of the conditions in the subscript of the sigma. So if we want, say, students whose GPA is greater than 3.7 and whose high school size is less than a thousand, we'll write select GPA greater than 3.7. We used a logical and operator--a caret, high school size is less than a thousand, and again we'll apply that to the student relation. And once again, the result of that will be a subset of the student relation containing the rows that satisfy the condition. If we want to find the applications to Stanford for a CS major, then we'll be applying a selection condition to the apply relation. Again, we write the sigma and now the subscript is going to say that the college name is Stanford and the major is CS. Again, the and operator, and that will be applied to the apply relation and it will return as a result, a subset of the apply relation. So the general case of the select operator is that we have the sigma. We have a condition as a subscript and then we have a relation name. And we return as a result the subset of the relation. Our next operator is the Project Operator. So the select operator picks certain rows, and the project operator picks certain columns. So let's say we're interested in the applications, but all we wanted to know was the list of ID's and the decisions for those applications. The project operator is written using the Greek pi symbol, and now the subscript is a list of the column names that we would like to extract. So we write ID, sorry, student ID and decision, and we apply that to the apply relation again. And now what we'll get back is a relation that has just two rows. It's going to have all the tuples of apply, but it's only going to have the student ID and the decision columns. So the general case of a project operator is the projection, and then a list of attributes, can be any number, and then a relation name. Now, what if we're interested in picking both rows and columns at the same time. So we want only some of the rows, and we want only some of the columns. Now we're going to compose operators. Remember that relational queries produce relations . So we can write a query, say, with the select operator of the students whose GPA is greater than 3.7. And this is how we do that. And now, we can take that whole expression which produces a relation, and we can apply the project operator to that, and we can get out the student ID and the student name. Okay. So, what we actually see now is that the general case of the selection and projection operators weren't quite what I told you at first. I was deceiving you slightly. When we write the select operator, it's a select with the condition on any expression of the relational algebra, and if it's a big one we might want to put parenthesis on it, and similarly the project operator is a list of attributes from any expression of the relational algebra. And we can compose these as much as we want. We can have select over project, over select, select, project, and so on. Now let's talk about duplicate values in the results of relational algebra queries. Let's suppose we ask for a list of the majors that people have applied for and the decision for those majors. So we write that as the project of the major and the decision on the applied relation. You might think that when we get the results of this query, we're going to have a lot of duplicate values. So we'll have CS yes, CS yes, CS no, EE yes, EE no, and so on. You can imagine in a large realistic database of applications, there's going to be hundreds of people applying for majors and having a yes or a no decision. The semantics of relational algebra says that duplicates are always eliminated. So if you run a query that would logically have a lot of duplicate values, you just get one value for each result. That's actually a bit of a difference with the SQL language. So, SQL is based on what's known as multi-sets or bags and that means that we don't eliminate duplicates, whereas relational algebra is based on sets themselves and duplicates are eliminated. There is a multi-set or bad relational algebra defined as well but we'll be fine by just considering the set relational algebra in these videos. Our first operator that combines two relations is the cross-product operator, also known as the Cartesian product. What this operator does, is it takes two relations and kinda glues them together so that their schema of the result is the union of the schemas of the two relations and the contents of the result are every combination of tuples from those relations. This is in fact the normal set cross product that you might have learned way back in the elementary school. So let's talk about, say, doing the cross products of students and apply. So if we do this cross products, just to save drawing, I'm just gonna glue these two relations together here. So if we do the cross product we'll get at the result a big relation, here, which is going to have eight attributes. The eight attributes across the student and apply now the only small little trick is that when we glue two relations together sometimes they'll have the same attribute and we can see we have SID on both sides. So just as a notational convention, when cross-product is done and there's two attributes that are named, they're prefaced with the name of the relation they came from. So this one would be referred to in the cross-product as the student dot SID where this one over here would be referred to as the apply dot SID. So, again, we glue together in the Cartesian product the two relations with four attributes each, we get a result with eight attributes. Now let's talk about the contents of these. So let's suppose that the student relation had s-tuples in it and that's how many tuples, while the apply had 8 tuples in it, the result of the Cartesian products is gonna have S times A tuples, is going to have one tuple for every combination of tuples from the student relation and the apply relation. Now, the cross-product seems like it might not be that helpful, but what is interesting is when we use the cross-product together with other operators. And let's see a big example of that. Let's suppose that we want to get the names and GPAs of students with a high school size greater than a thousand who applied to CS and were rejected. Okay, so let's take a look. We're going to have to access the students and the apply records in order to run this query. So what we'll do is we'll take student cross apply as our starting point. So now we have a big relation that contains eight attributes and all of those tuples that we described previously. But now we're going to start making things more interesting, because what we're going to do is a big selection over this relation. And that selection is first of all going to make sure that it only combines student and apply tuples that are referring to the same student. So to do that, we write student dot SID equals apply dot SID. So now we've filtered the result of that cross-product to only include combinations of student and apply by couples that make sets. Now we have to do a little bit of additional filtering. We said that we want the high school size to be greater than a thousand, so we do an "and" operator in the high school. We want them to have applied to CS so that's and major equals CS. We're getting a nice big query here. And finally we want them to have been rejected, so "and decision" equals, we'll just be using R for reject. So now, we've got that gigantic query. But that gets us exactly what we want except for one more thing, which is, as I said, all we want is their names and GPAs. So finally we take a big parentheses around here and we apply to that the projection operator, getting the student name and the GPA. And that is the relational algebra expression that produces the query that we have written in English. Now we have seen how the cross product allows us to combine tuples and then apply selection conditions to get meaningful combinations of tuples. It turns out that relational algebra includes an operator called the natural join that is used pretty much for the exact purpose. What the natural join does is it performs a cross-product but then it enforces equality on all of the attributes with the same name. So if we set up our schema properly, for example, we have student ID and student ID here, meaning the same thing, and when the cross product is created, it's only going to combine tuples where the student ID is the same. And furthermore, if we add college in, we can see that we have the college name here and the college name here. If we combine college and apply tuples, we'll only combine tuples that are talking about the same college. Now in addition, one more thing that it does is it gets rid of these pesky attributes that have the same names. So since when we combine, for example, student and apply with the natural join, we're only combining tuples where the student SID is the same as the apply SID. Then we don't need to keep two copies of that column because the values are always going to be equal. So the natural join operator is written using a bow tie, that's just the convention. You will find that in your text editing programs if you look carefully. So let's do some examples now. Let's go back to our same query where we were finding the names and GPAs of students from large high schools who applied to CS and were rejected. So now, instead of using the cross-product we're gonna use the natural join, which, as I said, was written with a bow tie. What that allows us to do, once we do that natural join, is we don't have to write that condition, that enforced equality on those two attributes, because it's going to do it itself. And once we have done that then all we need to do is apply the rest of our conditions, which were that the high school is greater than a thousand and the major is CS and the decision is reject, again we'll call that R. And then, since we're only getting the names and GPAs, we write the student name and the GPA. Okay. And that's the result of the query using a natural join. So, as you can see that's a little bit simpler than the original with the cross-product and by setting up schemas correctly, natural join can be very useful. Now let's add one more complication to our query. Let's suppose that we're only interested in applications to colleges where the enrollment is greater than 20,000. So, so far in our expression we refer to the student relation and the apply relation, but we haven't used the college relation. But if we want to have a filter on enrollment, we're going to have to bring the college relation into the picture. This turns out to perhaps be easier than you think. Let's just erase a couple of our parentheses here, and what we're going to do is we're going to join in the college relation, with the two relations we have already. Now, technically, the natural join is the binary operator, people often use it without parentheses because it's associative, but if we get pedantic about it we could add that and then we're in good shape. Now we've joined all three relations together. And remember, automatically the natural join enforces equality on the shared attributes. Very specifically, the college name here is going to be set equal to the apply college name as well. Now once we've done that, we've got all the information we need. We just need to add one more filtering condition, which is that the college enrollment is greater than 20,000. And with that, we've solved our query. So to summarize the natural join, the natural join combines relations. It automatically sets values equal when attribute names are the same and then it removes the duplicate columns. The natural join actually does not add any expressive power to relational algebra. We can rewrite the natural join without it using the cross-product. So let me just show that rewrite here. If we have, and now I'm going to use the general case of two expressions. One expression, natural join with another expression, that is actually equivalent to doing a projection on the schema of the first expression - I'll just call it E1 now - union the schema of the second expression. That's a real union, so that means if we have two copies we just keep one of them. Over the selection of. Now we're going to set all the shared attributes of the first expression to be equal to the shared attributes of the second. So I'll just write E1, A1 equals E2, A1 and E1, A2 equals E2 dot A2. Now these are the cases where, again, the attributes have the same names, and so on. So we're setting all those equal, and that is applied over expression one cross-product expression two. So again, the natural join is not giving us additional expressive power, but it is very convenient notationally. The last operator that I'm going to cover in this video is the theta join operator. Like natural join, theta join is actually an abbreviation that doesn't add expressive power to the language. Let me just write it. The theta join operator takes two expressions and combines them with the bow tie looking operator, but with a subscript theta. That theta is a condition. It's a condition in the style of the condition in the selection operator. And what this actually says - it's pretty simple - is it's equivalent to applying the theta condition to the cross-product of the two expressions. So you might wonder why I even mention the theta join operator, and the reason I mention it is that most database management systems implement the theta join as their basic operation for combining relations. So the basic operation is take two relations, combine all tuples, but then only keep the combinations that pass the theta condition. Often when you talk to people who build database systems or use databases, when they use the word join, they really mean the theta join. So, in conclusion, relational algebra is a formal language. It operates on sets of relations and produces relations as a result. The simplest query is just the name of a relation and then operators are used to filter relations, slice them, and combine them. So far, we've learned the select operator for selecting rows; the project operator for selecting columns; the cross-product operator for combining every possible pair of tuples from two relations; and then two abbreviations, the natural join, which a very useful way to combine relations by enforcing a equality on certain columns; and the theta join operator. In the next video, we'll learn some additional operators of relational algebra and also some alternative notations for relational algebra expressions.