1 00:00:00,480 --> 00:00:01,370 This is the first of two 2 00:00:01,620 --> 00:00:03,560 videos where we learn about relational algebra. 3 00:00:04,860 --> 00:00:06,570 Relational Algebra is a formal language. 4 00:00:07,340 --> 00:00:08,920 It's an algebra that forms 5 00:00:09,010 --> 00:00:11,680 the underpinnings of implemented languages like SQL. 6 00:00:12,600 --> 00:00:13,900 In this video we're going to 7 00:00:14,010 --> 00:00:15,020 learn the basics of the 8 00:00:15,110 --> 00:00:18,510 Relational Algebra Query Language and a few of the most popular operators. 9 00:00:19,430 --> 00:00:20,620 In the second video we'll learn 10 00:00:20,670 --> 00:00:22,220 some additional operators and some 11 00:00:22,390 --> 00:00:24,820 alternate notations for relational algebra. 12 00:00:26,360 --> 00:00:27,890 Now, let's just review first from 13 00:00:28,130 --> 00:00:29,470 our previous video on relational 14 00:00:29,860 --> 00:00:31,440 querying that queries over 15 00:00:31,620 --> 00:00:33,450 relational databases operate on 16 00:00:33,660 --> 00:00:36,340 relations and they also produce relations as a result. 17 00:00:36,970 --> 00:00:37,880 So if we write a query 18 00:00:38,500 --> 00:00:39,500 that operates say on the three 19 00:00:39,800 --> 00:00:41,620 relations depicted here, the 20 00:00:41,710 --> 00:00:44,080 result of that query is going to be a new relation. 21 00:00:44,730 --> 00:00:45,960 And, in fact, we can 22 00:00:46,120 --> 00:00:47,590 post queries on that 23 00:00:47,750 --> 00:00:49,240 new relation or combine that 24 00:00:49,360 --> 00:00:50,850 new relation with our previous relations. 25 00:00:52,350 --> 00:00:53,940 So let's start out with Relational Algebra. 26 00:00:54,510 --> 00:00:56,100 For the examples in 27 00:00:56,170 --> 00:00:57,170 this video we're going to be 28 00:00:57,290 --> 00:01:00,040 using a simple college admission relations database with three relations. 29 00:01:00,930 --> 00:01:01,760 The first relation, the college 30 00:01:02,130 --> 00:01:03,800 relation, contains information about the 31 00:01:03,860 --> 00:01:05,910 college name, state, and enrollment of the college. 32 00:01:06,860 --> 00:01:07,760 The second relation, the student 33 00:01:08,020 --> 00:01:09,720 relation, contains an ID 34 00:01:10,050 --> 00:01:11,300 for each student, the student's name, 35 00:01:11,710 --> 00:01:13,550 GPA and the size of the high school they attended. 36 00:01:14,530 --> 00:01:15,860 And, finally, the third relation contains 37 00:01:16,200 --> 00:01:17,890 information about students applying to colleges. 38 00:01:18,790 --> 00:01:20,420 Specifically, the student's ID, the 39 00:01:20,500 --> 00:01:21,330 college name where they're 40 00:01:21,650 --> 00:01:22,780 applying, the major they're 41 00:01:22,880 --> 00:01:24,860 applying for and the decision of that application. 42 00:01:25,890 --> 00:01:28,120 I've underlined the keys for these three relations. 43 00:01:28,690 --> 00:01:30,400 As a reminder, a key is 44 00:01:30,730 --> 00:01:31,720 an attribute or a set of 45 00:01:31,790 --> 00:01:33,150 attributes whose value is 46 00:01:33,310 --> 00:01:34,970 guaranteed to be unique. 47 00:01:35,560 --> 00:01:36,590 So, for example, we're going 48 00:01:36,840 --> 00:01:37,840 to assume the college names are unique, 49 00:01:38,370 --> 00:01:40,180 student IDs are unique and that 50 00:01:40,330 --> 00:01:41,500 students will only apply to 51 00:01:41,910 --> 00:01:44,290 each college for a particular major one time. 52 00:01:46,220 --> 00:01:47,110 So, we're going to have a 53 00:01:47,280 --> 00:01:48,620 picture of these three relations at 54 00:01:48,700 --> 00:01:50,220 the bottom of the slides throughout the video. 55 00:01:52,380 --> 00:01:53,670 The simplest query in relational 56 00:01:54,170 --> 00:01:55,270 algebra is a query 57 00:01:55,510 --> 00:01:57,030 that is simply the name of a relation. 58 00:01:57,620 --> 00:01:58,710 So, for example, we can 59 00:01:58,850 --> 00:02:01,040 write a query, "student" and that's 60 00:02:01,210 --> 00:02:03,310 a valid expression in relational algebra. 61 00:02:04,220 --> 00:02:05,210 If we run that query on 62 00:02:05,360 --> 00:02:06,860 our database we'll get as 63 00:02:07,020 --> 00:02:08,400 a result a copy of the student relation. 64 00:02:09,580 --> 00:02:10,340 Pretty straightforward . 65 00:02:11,030 --> 00:02:12,460 Now what happens next 66 00:02:12,680 --> 00:02:13,650 is that we're going to use 67 00:02:13,850 --> 00:02:15,140 operators of the relational algebra 68 00:02:15,740 --> 00:02:18,790 to filter relations, slice relations, and combine relations. 69 00:02:19,920 --> 00:02:20,780 So, let's through those operators. 70 00:02:23,410 --> 00:02:25,180 The first operator is the select operator. 71 00:02:26,060 --> 00:02:27,430 So, the select operator is used 72 00:02:27,770 --> 00:02:29,430 to pick certain rows out of a relation. 73 00:02:30,080 --> 00:02:31,530 The select operator is denoted by 74 00:02:31,670 --> 00:02:33,460 a Sigma with a subscript--that's 75 00:02:33,670 --> 00:02:34,960 the condition that's used to 76 00:02:35,120 --> 00:02:36,560 filter the rows that we extract from the relations. 77 00:02:37,400 --> 00:02:39,590 So, we're just going through three examples here. 78 00:02:40,280 --> 00:02:41,370 The first example says that we 79 00:02:41,500 --> 00:02:42,130 want to find the students 80 00:02:42,660 --> 00:02:43,840 whose GPA is greater than 3.7. 81 00:02:44,190 --> 00:02:45,680 So to write that 82 00:02:45,960 --> 00:02:47,350 expression in relational algebra, we write 83 00:02:47,590 --> 00:02:48,620 the sigma which is the 84 00:02:48,780 --> 00:02:50,240 selection operator as a 85 00:02:50,590 --> 00:02:51,560 subscript the condition that we're 86 00:02:51,770 --> 00:02:54,370 filtering for--GPA greater than 87 00:02:54,530 --> 00:02:57,600 3.7--and the relation over which we're finding that selection predicate. 88 00:02:58,540 --> 00:02:59,780 So, this expression will return 89 00:03:00,690 --> 00:03:01,570 a subset of the student 90 00:03:02,010 --> 00:03:03,570 table containing those rows 91 00:03:04,010 --> 00:03:05,590 where the GPA is greater 3.7. 92 00:03:07,030 --> 00:03:08,120 If we want to filter for two 93 00:03:08,300 --> 00:03:09,450 conditions, we just do 94 00:03:09,570 --> 00:03:11,630 an "and" of the conditions in the subscript of the sigma. 95 00:03:12,430 --> 00:03:13,420 So if we want, say, students 96 00:03:13,980 --> 00:03:15,270 whose GPA is greater than 3.7 97 00:03:15,290 --> 00:03:16,620 and whose high school size 98 00:03:16,940 --> 00:03:18,250 is less than a thousand, we'll 99 00:03:18,510 --> 00:03:22,980 write select GPA greater than 3.7. 100 00:03:23,070 --> 00:03:24,670 We used a logical 101 00:03:24,960 --> 00:03:26,750 and operator--a caret, high school 102 00:03:27,070 --> 00:03:28,010 size is less than a 103 00:03:28,300 --> 00:03:30,930 thousand, and again we'll apply that to the student relation. 104 00:03:32,210 --> 00:03:33,350 And once again, the result of 105 00:03:33,510 --> 00:03:34,610 that will be a subset of 106 00:03:34,870 --> 00:03:36,960 the student relation containing the rows that satisfy the condition. 107 00:03:37,350 --> 00:03:39,110 If we want to find 108 00:03:39,420 --> 00:03:40,990 the applications to Stanford for 109 00:03:41,290 --> 00:03:42,200 a CS major, then we'll be 110 00:03:42,700 --> 00:03:44,430 applying a selection condition to the apply relation. 111 00:03:45,520 --> 00:03:46,280 Again, we write the sigma 112 00:03:46,870 --> 00:03:48,530 and now the subscript is 113 00:03:48,660 --> 00:03:49,370 going to say that the college 114 00:03:49,790 --> 00:03:54,060 name is Stanford and the major is CS. 115 00:03:54,770 --> 00:03:57,530 Again, the and operator, and 116 00:03:57,600 --> 00:03:58,660 that will be applied 117 00:03:59,390 --> 00:04:01,000 to the apply relation and 118 00:04:01,330 --> 00:04:02,600 it will return as a 119 00:04:02,810 --> 00:04:04,320 result, a subset of the apply relation. 120 00:04:06,630 --> 00:04:07,720 So the general case of the 121 00:04:07,810 --> 00:04:09,640 select operator is that we have the sigma. 122 00:04:10,580 --> 00:04:12,100 We have a condition as a 123 00:04:12,280 --> 00:04:14,060 subscript and then we have a relation name. 124 00:04:14,560 --> 00:04:16,780 And we return as a result the subset of the relation. 125 00:04:17,190 --> 00:04:20,740 Our next operator is the Project Operator. 126 00:04:21,490 --> 00:04:22,900 So the select operator picks certain 127 00:04:23,210 --> 00:04:25,780 rows, and the project operator picks certain columns. 128 00:04:26,710 --> 00:04:28,010 So let's say we're 129 00:04:28,110 --> 00:04:29,530 interested in the applications, but all 130 00:04:29,880 --> 00:04:30,760 we wanted to know was the 131 00:04:30,840 --> 00:04:33,000 list of ID's and the decisions for those applications. 132 00:04:34,270 --> 00:04:35,440 The project operator is written 133 00:04:35,720 --> 00:04:37,400 using the Greek pi symbol, 134 00:04:37,840 --> 00:04:39,320 and now the subscript is 135 00:04:39,420 --> 00:04:41,580 a list of the column names that we would like to extract. 136 00:04:42,150 --> 00:04:43,600 So we write ID, sorry, 137 00:04:43,910 --> 00:04:46,510 student ID and decision, and 138 00:04:46,750 --> 00:04:48,860 we apply that to the apply relation again. 139 00:04:49,770 --> 00:04:50,890 And now what we'll get 140 00:04:51,060 --> 00:04:53,450 back is a relation that has just two rows. 141 00:04:54,270 --> 00:04:55,180 It's going to have all the 142 00:04:55,410 --> 00:04:56,550 tuples of apply, but it's 143 00:04:56,670 --> 00:04:57,640 only going to have the 144 00:04:57,980 --> 00:04:59,720 student ID and the decision columns. 145 00:05:00,710 --> 00:05:02,070 So the general case of 146 00:05:02,320 --> 00:05:03,980 a project operator is the 147 00:05:04,050 --> 00:05:05,740 projection, and then a 148 00:05:05,910 --> 00:05:07,420 list of attributes, can be 149 00:05:07,610 --> 00:05:09,500 any number, and then a relation name. 150 00:05:13,200 --> 00:05:14,520 Now, what if we're interested in 151 00:05:14,690 --> 00:05:16,630 picking both rows and columns at the same time. 152 00:05:16,920 --> 00:05:18,070 So we want only some of 153 00:05:18,150 --> 00:05:19,980 the rows, and we want only some of the columns. 154 00:05:20,950 --> 00:05:22,360 Now we're going to compose operators. 155 00:05:23,080 --> 00:05:25,680 Remember that relational queries produce relations . 156 00:05:26,170 --> 00:05:27,500 So we can write a 157 00:05:27,650 --> 00:05:29,200 query, say, with the 158 00:05:29,310 --> 00:05:30,650 select operator of the 159 00:05:31,120 --> 00:05:32,960 students whose GPA is greater than 3.7. 160 00:05:33,110 --> 00:05:35,070 And this is how we do that. 161 00:05:35,990 --> 00:05:37,250 And now, we can take 162 00:05:37,440 --> 00:05:39,100 that whole expression which produces a 163 00:05:39,140 --> 00:05:40,310 relation, and we can 164 00:05:40,470 --> 00:05:41,940 apply the project operator to that, and 165 00:05:42,660 --> 00:05:43,500 we can get out the student 166 00:05:43,960 --> 00:05:45,080 ID and the student name. 167 00:05:49,760 --> 00:05:49,760 Okay. 168 00:05:50,170 --> 00:05:51,730 So, what we actually see 169 00:05:51,880 --> 00:05:53,860 now is that the general 170 00:05:54,310 --> 00:05:55,660 case of the selection and 171 00:05:55,750 --> 00:05:57,810 projection operators weren't quite what I told you at first. 172 00:05:58,080 --> 00:05:58,940 I was deceiving you slightly. 173 00:05:59,580 --> 00:06:00,840 When we write the select 174 00:06:01,210 --> 00:06:02,510 operator, it's a select 175 00:06:03,750 --> 00:06:05,650 with the condition on any 176 00:06:06,070 --> 00:06:07,030 expression of the relational 177 00:06:07,560 --> 00:06:08,790 algebra, and if it's 178 00:06:08,950 --> 00:06:09,750 a big one we might want 179 00:06:09,930 --> 00:06:11,610 to put parenthesis on it, 180 00:06:11,800 --> 00:06:13,610 and similarly the project operator is 181 00:06:13,980 --> 00:06:15,790 a list of attributes from 182 00:06:16,210 --> 00:06:17,740 any expression of the relational algebra. 183 00:06:18,610 --> 00:06:20,330 And we can compose these as much as we want. 184 00:06:20,610 --> 00:06:22,030 We can have select over project, over 185 00:06:22,240 --> 00:06:24,100 select, select, project, and so on. 186 00:06:24,370 --> 00:06:27,520 Now let's talk about 187 00:06:27,720 --> 00:06:30,180 duplicate values in the results of relational algebra queries. 188 00:06:31,440 --> 00:06:33,040 Let's suppose we ask 189 00:06:33,070 --> 00:06:34,250 for a list of the majors that 190 00:06:34,360 --> 00:06:36,400 people have applied for and the decision for those majors. 191 00:06:37,070 --> 00:06:37,850 So we write that as the 192 00:06:37,920 --> 00:06:39,320 project of the major 193 00:06:40,630 --> 00:06:42,920 and the decision on the applied relation. 194 00:06:45,580 --> 00:06:46,450 You might think that when we get 195 00:06:46,700 --> 00:06:48,990 the results of this query, we're going to have a lot of duplicate values. 196 00:06:49,530 --> 00:06:51,490 So we'll have CS yes, CS 197 00:06:51,820 --> 00:06:54,390 yes, CS no, EE yes, EE no, and so on. 198 00:06:54,570 --> 00:06:55,540 You can imagine in a 199 00:06:55,720 --> 00:06:56,850 large realistic database of applications, 200 00:06:57,560 --> 00:06:58,810 there's going to be hundreds of 201 00:06:59,070 --> 00:07:01,870 people applying for majors and having a yes or a no decision. 202 00:07:02,380 --> 00:07:03,820 The semantics of relational 203 00:07:04,300 --> 00:07:07,280 algebra says that duplicates are always eliminated. 204 00:07:08,330 --> 00:07:09,300 So if you run a query 205 00:07:09,460 --> 00:07:10,580 that would logically have a 206 00:07:10,680 --> 00:07:11,750 lot of duplicate values, you just 207 00:07:11,980 --> 00:07:13,710 get one value for each result. 208 00:07:14,490 --> 00:07:16,210 That's actually a bit of a difference with the SQL language. 209 00:07:16,630 --> 00:07:18,190 So, SQL is based on 210 00:07:18,340 --> 00:07:20,260 what's known as multi-sets or 211 00:07:20,480 --> 00:07:21,680 bags and that means 212 00:07:21,930 --> 00:07:23,990 that we don't eliminate duplicates, whereas 213 00:07:24,300 --> 00:07:25,630 relational algebra is based 214 00:07:26,170 --> 00:07:28,280 on sets themselves and duplicates are eliminated. 215 00:07:29,190 --> 00:07:30,620 There is a multi-set or 216 00:07:30,710 --> 00:07:32,560 bad relational algebra defined as 217 00:07:32,670 --> 00:07:33,690 well but we'll be fine by 218 00:07:33,820 --> 00:07:36,340 just considering the set relational algebra in these videos. 219 00:07:38,000 --> 00:07:39,580 Our first operator that combines two 220 00:07:39,860 --> 00:07:41,200 relations is the cross-product operator, 221 00:07:41,750 --> 00:07:43,020 also known as the Cartesian product. 222 00:07:43,980 --> 00:07:45,090 What this operator does, is it 223 00:07:45,220 --> 00:07:46,400 takes two relations and kinda 224 00:07:46,540 --> 00:07:47,850 glues them together so that 225 00:07:48,060 --> 00:07:49,220 their schema of the result 226 00:07:49,820 --> 00:07:50,680 is the union of the 227 00:07:50,820 --> 00:07:52,460 schemas of the two relations and 228 00:07:52,650 --> 00:07:53,930 the contents of the result are every 229 00:07:54,050 --> 00:07:55,710 combination of tuples from those relations. 230 00:07:56,450 --> 00:07:57,360 This is in fact the normal 231 00:07:57,950 --> 00:07:59,000 set cross product that you 232 00:07:59,100 --> 00:08:00,550 might have learned way back in the elementary school. 233 00:08:01,250 --> 00:08:02,810 So let's talk about, say, 234 00:08:03,100 --> 00:08:05,610 doing the cross products of students and apply. 235 00:08:06,860 --> 00:08:07,810 So if we do this 236 00:08:08,040 --> 00:08:09,640 cross products, just to 237 00:08:09,800 --> 00:08:11,110 save drawing, I'm just gonna glue 238 00:08:11,620 --> 00:08:13,350 these two relations together here. 239 00:08:14,040 --> 00:08:14,840 So if we do the 240 00:08:15,000 --> 00:08:16,130 cross product we'll get at the 241 00:08:16,350 --> 00:08:18,210 result a big relation, here, which 242 00:08:18,380 --> 00:08:20,120 is going to have eight attributes. 243 00:08:20,280 --> 00:08:21,620 The eight attributes across the student 244 00:08:21,910 --> 00:08:23,130 and apply now the only 245 00:08:24,120 --> 00:08:25,160 small little trick is that 246 00:08:25,460 --> 00:08:26,760 when we glue two relations together 247 00:08:27,490 --> 00:08:28,300 sometimes they'll have the same 248 00:08:28,700 --> 00:08:30,740 attribute and we can see we have SID on both sides. 249 00:08:31,120 --> 00:08:32,520 So just as a notational convention, 250 00:08:33,490 --> 00:08:34,540 when cross-product is done 251 00:08:35,140 --> 00:08:36,520 and there's two attributes that are 252 00:08:36,610 --> 00:08:39,230 named, they're prefaced with the name of the relation they came from. 253 00:08:39,420 --> 00:08:40,230 So this one would be referred 254 00:08:40,710 --> 00:08:42,040 to in the cross-product as 255 00:08:42,180 --> 00:08:44,060 the student dot SID where this 256 00:08:44,260 --> 00:08:45,160 one over here would be referred 257 00:08:45,520 --> 00:08:46,760 to as the apply dot SID. 258 00:08:48,450 --> 00:08:49,800 So, again, we glue together in 259 00:08:49,920 --> 00:08:51,090 the Cartesian product the two 260 00:08:51,330 --> 00:08:52,690 relations with four attributes each, 261 00:08:53,110 --> 00:08:55,020 we get a result with eight attributes. 262 00:08:55,560 --> 00:08:57,050 Now let's talk about the contents of these. 263 00:08:57,690 --> 00:08:58,860 So let's suppose that the student 264 00:09:00,110 --> 00:09:02,020 relation had s-tuples in it 265 00:09:02,370 --> 00:09:04,020 and that's how many tuples, while 266 00:09:04,170 --> 00:09:05,650 the apply had 8 tuples in 267 00:09:05,800 --> 00:09:07,410 it, the result of the 268 00:09:07,480 --> 00:09:08,640 Cartesian products is gonna 269 00:09:08,790 --> 00:09:10,210 have S times A tuples, 270 00:09:10,710 --> 00:09:12,040 is going to have one tuple 271 00:09:12,740 --> 00:09:14,380 for every combination of tuples 272 00:09:14,680 --> 00:09:16,470 from the student relation and the apply relation. 273 00:09:19,040 --> 00:09:20,520 Now, the cross-product seems 274 00:09:20,750 --> 00:09:21,570 like it might not be that 275 00:09:21,810 --> 00:09:22,990 helpful, but what is 276 00:09:23,120 --> 00:09:24,290 interesting is when we use 277 00:09:24,520 --> 00:09:26,110 the cross-product together with other operators. 278 00:09:26,920 --> 00:09:28,310 And let's see a big example of that. 279 00:09:28,970 --> 00:09:30,340 Let's suppose that we want 280 00:09:30,610 --> 00:09:32,390 to get the names and GPAs of 281 00:09:32,650 --> 00:09:33,730 students with a high school 282 00:09:33,990 --> 00:09:35,270 size greater than a thousand who 283 00:09:35,350 --> 00:09:36,880 applied to CS and were rejected. 284 00:09:37,990 --> 00:09:39,380 Okay, so let's take a look. 285 00:09:40,010 --> 00:09:41,190 We're going to have to access 286 00:09:41,620 --> 00:09:42,740 the students and the apply 287 00:09:43,150 --> 00:09:44,580 records in order to run this query. 288 00:09:45,380 --> 00:09:46,460 So what we'll do is we'll 289 00:09:46,600 --> 00:09:49,310 take student cross apply as our starting point. 290 00:09:49,980 --> 00:09:51,290 So now we have 291 00:09:51,650 --> 00:09:52,860 a big relation that contains 292 00:09:53,510 --> 00:09:56,310 eight attributes and all of those tuples that we described previously. 293 00:09:57,320 --> 00:09:58,270 But now we're going to 294 00:09:58,360 --> 00:09:59,770 start making things more interesting, because 295 00:10:00,050 --> 00:10:00,910 what we're going to do is 296 00:10:01,420 --> 00:10:03,030 a big selection over this relation. 297 00:10:03,720 --> 00:10:04,860 And that selection is first 298 00:10:04,950 --> 00:10:05,850 of all going to make sure 299 00:10:06,050 --> 00:10:07,820 that it only combines student and 300 00:10:08,000 --> 00:10:10,170 apply tuples that are referring to the same student. 301 00:10:11,030 --> 00:10:12,060 So to do that, we write 302 00:10:12,300 --> 00:10:14,030 student dot SID equals 303 00:10:15,690 --> 00:10:16,480 apply dot SID. 304 00:10:17,090 --> 00:10:18,470 So now we've filtered the result 305 00:10:18,950 --> 00:10:20,500 of that cross-product to only 306 00:10:20,900 --> 00:10:22,110 include combinations of student 307 00:10:22,600 --> 00:10:23,630 and apply by couples that make sets. 308 00:10:24,540 --> 00:10:26,460 Now we have to do a little bit of additional filtering. 309 00:10:26,850 --> 00:10:27,890 We said that we want the 310 00:10:28,040 --> 00:10:29,670 high school size to be 311 00:10:29,800 --> 00:10:30,970 greater than a thousand, so 312 00:10:31,040 --> 00:10:33,110 we do an "and" operator in the high school. 313 00:10:33,830 --> 00:10:34,770 We want them to have applied 314 00:10:35,340 --> 00:10:37,770 to CS so that's and major equals CS. 315 00:10:37,800 --> 00:10:39,370 We're getting a nice big query here. 316 00:10:40,130 --> 00:10:41,180 And finally we want them to 317 00:10:41,260 --> 00:10:42,550 have been rejected, so "and 318 00:10:42,990 --> 00:10:46,070 decision" equals, we'll just be using R for reject. 319 00:10:47,320 --> 00:10:49,500 So now, we've got that gigantic query. 320 00:10:50,090 --> 00:10:51,850 But that gets us exactly what 321 00:10:51,930 --> 00:10:52,980 we want except for one more thing, 322 00:10:53,170 --> 00:10:55,030 which is, as I said, all we want is their names and GPAs. 323 00:10:56,150 --> 00:10:57,420 So finally we take a 324 00:10:57,800 --> 00:10:59,070 big parentheses around here and 325 00:10:59,310 --> 00:11:00,820 we apply to that the 326 00:11:00,900 --> 00:11:02,770 projection operator, getting the 327 00:11:02,910 --> 00:11:05,040 student name and the GPA. 328 00:11:06,330 --> 00:11:07,180 And that is the relational 329 00:11:07,740 --> 00:11:09,280 algebra expression that produces 330 00:11:09,730 --> 00:11:10,940 the query that we have written in English. 331 00:11:12,660 --> 00:11:13,390 Now we have seen how the 332 00:11:13,490 --> 00:11:14,770 cross product allows us to 333 00:11:14,850 --> 00:11:16,230 combine tuples and then 334 00:11:16,940 --> 00:11:18,600 apply selection conditions to 335 00:11:18,700 --> 00:11:20,270 get meaningful combinations of tuples. 336 00:11:21,230 --> 00:11:22,650 It turns out that relational algebra 337 00:11:22,760 --> 00:11:24,100 includes an operator called 338 00:11:24,300 --> 00:11:25,740 the natural join that is 339 00:11:25,900 --> 00:11:27,270 used pretty much for the exact purpose. 340 00:11:27,760 --> 00:11:29,200 What the natural join does is 341 00:11:29,360 --> 00:11:30,570 it performs a cross-product 342 00:11:31,010 --> 00:11:33,190 but then it enforces equality 343 00:11:33,820 --> 00:11:35,370 on all of the attributes with the same name. 344 00:11:35,990 --> 00:11:36,820 So if we set up our 345 00:11:36,880 --> 00:11:38,230 schema properly, for example, 346 00:11:38,770 --> 00:11:40,580 we have student ID and student 347 00:11:40,660 --> 00:11:41,870 ID here, meaning the same 348 00:11:42,240 --> 00:11:43,280 thing, and when the cross 349 00:11:43,500 --> 00:11:45,230 product is created, it's only 350 00:11:45,480 --> 00:11:47,730 going to combine tuples where the student ID is the same. 351 00:11:48,490 --> 00:11:49,640 And furthermore, if we add 352 00:11:49,890 --> 00:11:51,030 college in, we can 353 00:11:51,210 --> 00:11:53,500 see that we have the college name here and the college name here. 354 00:11:54,150 --> 00:11:55,950 If we combine college and apply 355 00:11:56,250 --> 00:11:58,800 tuples, we'll only combine tuples that are talking about the same college. 356 00:11:59,960 --> 00:12:01,000 Now in addition, one more 357 00:12:01,320 --> 00:12:02,320 thing that it does is it 358 00:12:02,430 --> 00:12:05,690 gets rid of these pesky attributes that have the same names. 359 00:12:06,500 --> 00:12:07,350 So since when we combine, 360 00:12:07,910 --> 00:12:08,940 for example, student and apply 361 00:12:09,340 --> 00:12:11,050 with the natural join, we're only 362 00:12:11,340 --> 00:12:13,510 combining tuples where the 363 00:12:13,590 --> 00:12:16,020 student SID is the same as the apply SID. 364 00:12:16,840 --> 00:12:17,750 Then we don't need to keep two 365 00:12:17,890 --> 00:12:19,110 copies of that 366 00:12:19,290 --> 00:12:21,500 column because the values are always going to be equal. 367 00:12:23,210 --> 00:12:25,230 So the natural join operator 368 00:12:25,580 --> 00:12:27,850 is written using a bow 369 00:12:28,120 --> 00:12:29,490 tie, that's just the convention. 370 00:12:30,660 --> 00:12:33,260 You will find that in your text editing programs if you look carefully. 371 00:12:34,040 --> 00:12:35,900 So let's do some examples now. 372 00:12:37,050 --> 00:12:38,190 Let's go back to our same 373 00:12:38,660 --> 00:12:39,640 query where we were finding 374 00:12:39,960 --> 00:12:40,850 the names and GPAs of students 375 00:12:42,040 --> 00:12:44,260 from large high schools who applied to CS and were rejected. 376 00:12:45,390 --> 00:12:46,580 So now, instead of using 377 00:12:46,780 --> 00:12:47,770 the cross-product we're gonna 378 00:12:47,890 --> 00:12:50,470 use the natural join, which, as I said, was written with a bow tie. 379 00:12:51,950 --> 00:12:53,170 What that allows us to 380 00:12:53,380 --> 00:12:54,470 do, once we do that natural 381 00:12:54,780 --> 00:12:55,960 join, is we don't have 382 00:12:56,100 --> 00:12:57,180 to write that condition, that enforced 383 00:12:57,640 --> 00:12:58,930 equality on those two 384 00:12:59,230 --> 00:13:00,620 attributes, because it's going to do it itself. 385 00:13:01,930 --> 00:13:03,010 And once we have done that then 386 00:13:03,150 --> 00:13:03,940 all we need to do is 387 00:13:04,040 --> 00:13:04,900 apply the rest of our conditions, 388 00:13:05,500 --> 00:13:06,310 which were that the high school 389 00:13:06,610 --> 00:13:08,650 is greater than a thousand and the 390 00:13:08,760 --> 00:13:11,060 major is CS and the decision 391 00:13:11,800 --> 00:13:13,740 is reject, again we'll 392 00:13:14,110 --> 00:13:15,520 call that R. And then, 393 00:13:16,170 --> 00:13:17,020 since we're only getting the names 394 00:13:17,360 --> 00:13:19,260 and GPAs, we write the 395 00:13:19,980 --> 00:13:21,690 student name and the GPA. 396 00:13:24,000 --> 00:13:24,000 Okay. 397 00:13:24,270 --> 00:13:26,350 And that's the result of the query using a natural join. 398 00:13:26,630 --> 00:13:27,590 So, as you can see that's a 399 00:13:27,770 --> 00:13:28,770 little bit simpler than the original 400 00:13:29,090 --> 00:13:30,350 with the cross-product and by setting 401 00:13:30,650 --> 00:13:33,050 up schemas correctly, natural join can be very useful. 402 00:13:33,990 --> 00:13:36,280 Now let's add one more complication to our query. 403 00:13:37,010 --> 00:13:38,220 Let's suppose that we're only interested 404 00:13:38,640 --> 00:13:41,460 in applications to colleges where the enrollment is greater than 20,000. 405 00:13:41,770 --> 00:13:43,410 So, so far in 406 00:13:43,480 --> 00:13:44,620 our expression we refer to the 407 00:13:44,700 --> 00:13:45,690 student relation and the apply 408 00:13:46,110 --> 00:13:47,810 relation, but we haven't used the college relation. 409 00:13:48,660 --> 00:13:50,150 But if we want to have a 410 00:13:50,340 --> 00:13:51,380 filter on enrollment, we're going to have 411 00:13:51,610 --> 00:13:54,110 to bring the college relation into the picture. 412 00:13:55,160 --> 00:13:57,190 This turns out to perhaps be easier than you think. 413 00:13:57,890 --> 00:13:59,230 Let's just erase a couple 414 00:13:59,550 --> 00:14:01,420 of our parentheses here, and what 415 00:14:01,650 --> 00:14:02,580 we're going to do is we're 416 00:14:02,710 --> 00:14:04,140 going to join in the 417 00:14:04,210 --> 00:14:06,670 college relation, with the two relations we have already. 418 00:14:07,470 --> 00:14:11,090 Now, technically, the natural 419 00:14:11,490 --> 00:14:13,560 join is the binary operator, people 420 00:14:13,930 --> 00:14:14,940 often use it without parentheses 421 00:14:15,510 --> 00:14:16,700 because it's associative, but if 422 00:14:16,800 --> 00:14:19,400 we get pedantic about it we could add that and then we're in good shape. 423 00:14:20,120 --> 00:14:21,910 Now we've joined all three relations together. 424 00:14:22,430 --> 00:14:23,830 And remember, automatically the natural 425 00:14:24,310 --> 00:14:26,570 join enforces equality on the shared attributes. 426 00:14:27,350 --> 00:14:28,880 Very specifically, the college 427 00:14:29,350 --> 00:14:30,430 name here is going to 428 00:14:30,490 --> 00:14:32,780 be set equal to the apply college name as well. 429 00:14:33,800 --> 00:14:36,030 Now once we've done that, we've got all the information we need. 430 00:14:36,460 --> 00:14:37,370 We just need to add one 431 00:14:37,610 --> 00:14:39,060 more filtering condition, which is 432 00:14:39,200 --> 00:14:42,130 that the college enrollment is greater than 20,000. 433 00:14:42,290 --> 00:14:45,460 And with that, we've solved our query. 434 00:14:47,590 --> 00:14:48,860 So to summarize the 435 00:14:48,930 --> 00:14:51,670 natural join, the natural join combines relations. 436 00:14:52,250 --> 00:14:54,490 It automatically sets values equal 437 00:14:54,780 --> 00:14:55,630 when attribute names are the 438 00:14:55,980 --> 00:14:58,040 same and then it removes the duplicate columns. 439 00:14:58,980 --> 00:15:00,660 The natural join actually does not 440 00:15:01,110 --> 00:15:03,980 add any expressive power to relational algebra. 441 00:15:04,700 --> 00:15:07,690 We can rewrite the natural join without it using the cross-product. 442 00:15:08,690 --> 00:15:10,410 So let me just show that rewrite here. 443 00:15:10,910 --> 00:15:12,080 If we have, and now I'm 444 00:15:12,250 --> 00:15:14,160 going to use the general case of two expressions. 445 00:15:14,490 --> 00:15:17,050 One expression, natural join 446 00:15:17,330 --> 00:15:18,750 with another expression, that is 447 00:15:19,070 --> 00:15:21,470 actually equivalent to doing 448 00:15:21,750 --> 00:15:23,620 a projection on the 449 00:15:23,900 --> 00:15:25,610 schema of the first 450 00:15:25,860 --> 00:15:27,170 expression - I'll just 451 00:15:27,310 --> 00:15:29,030 call it E1 now - union 452 00:15:29,500 --> 00:15:30,490 the schema of the second expression. 453 00:15:30,920 --> 00:15:32,100 That's a real union, so that 454 00:15:32,280 --> 00:15:33,440 means if we have two copies we 455 00:15:33,530 --> 00:15:35,470 just keep one of them. Over 456 00:15:36,090 --> 00:15:38,870 the selection of. Now we're 457 00:15:39,040 --> 00:15:40,090 going to set all the shared attributes 458 00:15:40,720 --> 00:15:42,310 of the first expression to 459 00:15:42,400 --> 00:15:43,840 be equal to the shared attributes of the second. 460 00:15:44,130 --> 00:15:45,400 So I'll just write E1, A1 461 00:15:45,560 --> 00:15:48,210 equals E2, A1 462 00:15:48,280 --> 00:15:53,900 and E1, A2 equals E2 dot A2. 463 00:15:54,720 --> 00:15:55,940 Now these are the cases where, 464 00:15:57,050 --> 00:15:59,500 again, the attributes have the same names, and so on. 465 00:16:00,750 --> 00:16:01,870 So we're setting all those equal, 466 00:16:02,530 --> 00:16:04,040 and that is applied over 467 00:16:04,840 --> 00:16:07,750 expression one cross-product expression two. 468 00:16:07,830 --> 00:16:10,120 So again, the natural 469 00:16:10,700 --> 00:16:11,990 join is not giving us 470 00:16:12,130 --> 00:16:15,180 additional expressive power, but it is very convenient notationally. 471 00:16:18,980 --> 00:16:20,200 The last operator that I'm 472 00:16:20,260 --> 00:16:22,180 going to cover in this video is the theta join operator. 473 00:16:23,200 --> 00:16:24,370 Like natural join, theta join is 474 00:16:24,500 --> 00:16:27,040 actually an abbreviation that doesn't add expressive power to the language. 475 00:16:28,080 --> 00:16:28,680 Let me just write it. 476 00:16:28,830 --> 00:16:30,160 The theta join operator takes 477 00:16:30,500 --> 00:16:32,120 two expressions and combines them 478 00:16:32,670 --> 00:16:34,580 with the bow tie looking 479 00:16:34,870 --> 00:16:36,960 operator, but with a subscript theta. 480 00:16:37,100 --> 00:16:38,500 That theta is a condition. 481 00:16:39,050 --> 00:16:40,100 It's a condition in the style 482 00:16:40,720 --> 00:16:42,590 of the condition in the selection operator. 483 00:16:43,960 --> 00:16:45,310 And what this actually says 484 00:16:45,720 --> 00:16:47,250 - it's pretty simple - 485 00:16:47,310 --> 00:16:49,010 is it's equivalent to applying 486 00:16:49,530 --> 00:16:51,180 the theta condition to the 487 00:16:51,360 --> 00:16:52,860 cross-product of the two expressions. 488 00:16:53,670 --> 00:16:54,830 So you might wonder why 489 00:16:55,000 --> 00:16:56,000 I even mention the theta 490 00:16:56,330 --> 00:16:57,810 join operator, and the reason I 491 00:16:57,940 --> 00:16:59,140 mention it is that most 492 00:16:59,560 --> 00:17:01,630 database management systems implement the 493 00:17:01,770 --> 00:17:02,650 theta join as their basic 494 00:17:03,070 --> 00:17:04,330 operation for combining relations. 495 00:17:05,220 --> 00:17:06,620 So the basic operation is 496 00:17:07,060 --> 00:17:08,540 take two relations, combine all tuples, 497 00:17:08,870 --> 00:17:09,880 but then only keep the combinations 498 00:17:10,580 --> 00:17:11,610 that pass the theta condition. 499 00:17:12,570 --> 00:17:13,440 Often when you talk to 500 00:17:13,630 --> 00:17:15,010 people who build database systems or 501 00:17:15,190 --> 00:17:16,510 use databases, when they 502 00:17:16,640 --> 00:17:18,770 use the word join, they really mean the theta join. 503 00:17:21,190 --> 00:17:24,500 So, in conclusion, relational algebra is a formal language. 504 00:17:25,370 --> 00:17:26,970 It operates on sets of 505 00:17:27,040 --> 00:17:28,690 relations and produces relations as a result. 506 00:17:29,480 --> 00:17:30,670 The simplest query is just 507 00:17:30,930 --> 00:17:32,000 the name of a relation and 508 00:17:32,170 --> 00:17:33,320 then operators are used 509 00:17:33,670 --> 00:17:36,080 to filter relations, slice them, and combine them. 510 00:17:36,730 --> 00:17:37,920 So far, we've learned the 511 00:17:37,980 --> 00:17:39,590 select operator for selecting rows; 512 00:17:40,000 --> 00:17:41,340 the project operator for selecting 513 00:17:41,780 --> 00:17:43,150 columns; the cross-product 514 00:17:43,580 --> 00:17:45,290 operator for combining every possible 515 00:17:45,840 --> 00:17:46,800 pair of tuples from two 516 00:17:46,860 --> 00:17:48,080 relations; and then two 517 00:17:48,290 --> 00:17:49,700 abbreviations, the natural join, 518 00:17:50,200 --> 00:17:51,150 which a very useful way to combine 519 00:17:51,710 --> 00:17:53,280 relations by enforcing a equality 520 00:17:53,960 --> 00:17:55,730 on certain columns; and the theta join operator. 521 00:17:56,970 --> 00:17:57,910 In the next video, we'll learn 522 00:17:58,190 --> 00:17:59,370 some additional operators of relational 523 00:18:00,050 --> 00:18:01,040 algebra and also some alternative 524 00:18:01,580 --> 00:18:03,380 notations for relational algebra expressions.