1 00:00:00,100 --> 00:00:01,230 This is the first of seven 2 00:00:01,550 --> 00:00:03,150 videos where we're going to learn the SQL language. 3 00:00:04,000 --> 00:00:05,240 The videos are largely going to 4 00:00:05,380 --> 00:00:06,890 be live demos of SQL queries 5 00:00:07,340 --> 00:00:08,850 and updates running on an actual database. 6 00:00:09,710 --> 00:00:10,740 The first video is going 7 00:00:10,870 --> 00:00:12,740 to focus on the basics of the SELECT statement. 8 00:00:13,580 --> 00:00:14,770 As a reminder, the SELECT statement 9 00:00:15,240 --> 00:00:16,710 selects a set of attributes from 10 00:00:16,890 --> 00:00:19,560 a set of relations satisfying a particular condition. 11 00:00:20,460 --> 00:00:21,490 We will see in the demo that 12 00:00:21,680 --> 00:00:22,860 even with the these three clauses, 13 00:00:23,710 --> 00:00:25,130 we can write quite powerful queries. 14 00:00:26,170 --> 00:00:27,490 All of the seven demos are 15 00:00:27,620 --> 00:00:28,630 going to be using the simple 16 00:00:29,060 --> 00:00:30,790 college admissions database that we 17 00:00:31,030 --> 00:00:32,860 learned about in the relational algebra videos. 18 00:00:33,670 --> 00:00:35,130 As a reminder, we have three relations. 19 00:00:36,090 --> 00:00:37,750 We have the college relation: college 20 00:00:38,070 --> 00:00:39,430 relation contains information about the 21 00:00:39,490 --> 00:00:41,980 name of the colleges, the state, and the enrollment of those colleges. 22 00:00:43,140 --> 00:00:44,390 We have the student relation, which 23 00:00:44,560 --> 00:00:46,150 contains student IDs, their names, 24 00:00:46,530 --> 00:00:48,790 their GPA, and the size of the high school that they come from. 25 00:00:49,590 --> 00:00:51,760 And finally, the application information, that 26 00:00:51,890 --> 00:00:53,040 tells us that a particular student 27 00:00:53,870 --> 00:00:54,740 applied to a particular college 28 00:00:55,400 --> 00:00:56,630 for a particular major and there 29 00:00:56,700 --> 00:00:57,840 was a decision of that application 30 00:00:58,770 --> 00:01:00,190 Now as a reminder, in the 31 00:01:00,260 --> 00:01:01,370 relational model, when we underline 32 00:01:01,980 --> 00:01:04,040 attributes, that means we're designating a key for the relation. 33 00:01:04,610 --> 00:01:06,040 So, the underlying attributes in 34 00:01:06,120 --> 00:01:07,130 our example say that the 35 00:01:07,410 --> 00:01:09,360 knowledge name is going to be unique within the college relation. 36 00:01:10,020 --> 00:01:11,290 The student's idea is unique within 37 00:01:11,550 --> 00:01:13,260 the student relation and in 38 00:01:13,390 --> 00:01:16,150 the applied relation, the combination of these three attributes is unique. 39 00:01:17,280 --> 00:01:18,470 That means that student can, if 40 00:01:18,640 --> 00:01:19,820 he or she wishes, apply to a 41 00:01:20,020 --> 00:01:21,070 college many times, or apply 42 00:01:21,400 --> 00:01:22,800 for a major many times, but 43 00:01:22,920 --> 00:01:24,130 can only apply to a 44 00:01:24,430 --> 00:01:25,740 college for a particular major once. 45 00:01:26,830 --> 00:01:27,610 Let's turn to the demo. 46 00:01:28,890 --> 00:01:31,730 Let's start by looking at the actual data that we're going to be querying over. 47 00:01:32,470 --> 00:01:33,540 We have a set of four 48 00:01:33,740 --> 00:01:35,790 colleges: Stanford, Berkeley, MIT and Cornell. 49 00:01:36,940 --> 00:01:38,120 We have a bunch of students. 50 00:01:39,230 --> 00:01:40,290 And a reminder, each student has an 51 00:01:40,330 --> 00:01:42,780 ID, a name, a GPA, and a size of high school. 52 00:01:43,650 --> 00:01:45,200 And finally, we have a set 53 00:01:45,560 --> 00:01:46,990 of application records where a 54 00:01:47,090 --> 00:01:48,570 student with a particular ID 55 00:01:48,820 --> 00:01:49,880 applies to a college for a 56 00:01:50,110 --> 00:01:51,000 particular major, and there's a 57 00:01:51,190 --> 00:01:52,670 yes or no decision on that application. 58 00:01:53,650 --> 00:01:55,270 So let's go to our first SQL query. 59 00:01:56,230 --> 00:01:57,220 This query is going to find 60 00:01:57,640 --> 00:01:59,070 the ID, name, and GPA of 61 00:01:59,130 --> 00:02:00,880 students whose GPA is greater than 3.6. 62 00:02:01,440 --> 00:02:03,000 So, very simple, it's the 63 00:02:03,070 --> 00:02:04,700 basic SELECT FROM WHERE structure. 64 00:02:05,570 --> 00:02:06,540 The SELECT gives our table name, 65 00:02:06,940 --> 00:02:08,220 the WHERE gives our filtering condition 66 00:02:08,380 --> 00:02:10,750 and the SELECT tells us what we want to get out of the query. 67 00:02:11,640 --> 00:02:13,010 We'll execute that query and 68 00:02:13,120 --> 00:02:14,350 we will find here all of 69 00:02:14,430 --> 00:02:17,160 our students with a GPA greater than 3.6. 70 00:02:17,720 --> 00:02:19,050 Now, it's not necessary 71 00:02:19,410 --> 00:02:20,540 to include the GPA in 72 00:02:20,830 --> 00:02:23,470 the result to the query even if we filter on the GPA. 73 00:02:24,120 --> 00:02:25,330 So, I could just take GPA away 74 00:02:25,630 --> 00:02:27,190 from the SELECT clause, run the 75 00:02:27,260 --> 00:02:28,570 query again and now, we 76 00:02:28,680 --> 00:02:30,410 see the same result but without the GPA. 77 00:02:32,260 --> 00:02:32,260 Okay. 78 00:02:33,220 --> 00:02:34,460 Let's go to our second query. 79 00:02:35,980 --> 00:02:38,430 Our second query is going to combine two relations. 80 00:02:39,350 --> 00:02:40,310 In this query, we're going to 81 00:02:40,460 --> 00:02:41,270 find the names of the students 82 00:02:41,880 --> 00:02:43,300 and the majors for which they've applied. 83 00:02:44,210 --> 00:02:45,600 So, now, we're involving both the 84 00:02:45,660 --> 00:02:46,750 student table and the 85 00:02:46,840 --> 00:02:48,320 apply table and the 86 00:02:48,550 --> 00:02:49,440 condition we see here is the 87 00:02:49,620 --> 00:02:50,840 join condition that tells us 88 00:02:51,000 --> 00:02:52,450 we want to combine students with 89 00:02:52,610 --> 00:02:54,870 apply records that have the same student ID. 90 00:02:55,200 --> 00:02:57,030 This is what would happen automatically in 91 00:02:57,130 --> 00:02:58,400 a natural join of the 92 00:02:58,520 --> 00:02:59,760 relational algebra, but in SQL 93 00:03:00,010 --> 00:03:00,880 we need to always write the 94 00:03:00,970 --> 00:03:03,170 join condition explicitly, and finally 95 00:03:03,670 --> 00:03:05,010 we get the student name and the major. 96 00:03:05,480 --> 00:03:06,620 And if we execute the 97 00:03:06,750 --> 00:03:08,330 query, we get, expectedly, a 98 00:03:08,520 --> 00:03:10,560 bunch of students and the majors that they've applied for. 99 00:03:11,320 --> 00:03:13,810 Now, we do notice here that we have several duplicate values. 100 00:03:14,310 --> 00:03:15,260 We have two copies of Amy 101 00:03:15,570 --> 00:03:18,030 applying to CS and two copies of Craig applying to Bio-Engineering. 102 00:03:19,550 --> 00:03:20,270 As we discussed in the relational 103 00:03:20,800 --> 00:03:22,410 algebra video, in relational algebra 104 00:03:22,830 --> 00:03:24,440 which underlies SQL, it's by 105 00:03:24,590 --> 00:03:26,290 default the set model; we don't have duplicates. 106 00:03:26,890 --> 00:03:28,090 But in the SQL language we 107 00:03:28,220 --> 00:03:30,560 do have duplicates, it's based on a multi-set model. 108 00:03:31,450 --> 00:03:32,650 If we don't like the duplicates in 109 00:03:32,740 --> 00:03:35,350 our results SQL provides us a convenient way to get rid of them. 110 00:03:35,890 --> 00:03:36,980 We simply add the keyword, 111 00:03:37,220 --> 00:03:39,150 "distinct", to our query after 112 00:03:39,430 --> 00:03:41,100 the word, "select", we execute, and 113 00:03:41,300 --> 00:03:42,200 now we get the same result 114 00:03:42,750 --> 00:03:44,070 but with the duplicate values eliminated. 115 00:03:47,920 --> 00:03:49,010 Our next query is going 116 00:03:49,380 --> 00:03:50,610 to be a little more complicated; it's 117 00:03:50,810 --> 00:03:51,800 going to find the names 118 00:03:52,230 --> 00:03:53,890 and GPAs of students whose 119 00:03:54,110 --> 00:03:54,910 size high school is less 120 00:03:55,060 --> 00:03:56,200 than a thousand, they've applied to 121 00:03:56,420 --> 00:03:57,490 CS at Stanford, and we're going 122 00:03:57,680 --> 00:03:59,360 to get the decision associated with that. 123 00:03:59,630 --> 00:04:00,970 So again we have two 124 00:04:01,450 --> 00:04:03,950 relations, two tables involved, the student and the apply. 125 00:04:04,580 --> 00:04:06,050 We have the join condition, making 126 00:04:06,300 --> 00:04:07,350 sure we're talking about the same 127 00:04:07,600 --> 00:04:09,430 student and the student and apply tuples. 128 00:04:10,040 --> 00:04:11,160 Very important to remember that one. 129 00:04:11,830 --> 00:04:12,680 We are going to filter the result 130 00:04:13,050 --> 00:04:14,410 based on size high school, major, 131 00:04:15,390 --> 00:04:17,150 and the college to which they're applying. 132 00:04:18,060 --> 00:04:19,790 So let's run this query and 133 00:04:20,330 --> 00:04:21,640 we will see the result that 134 00:04:21,790 --> 00:04:22,890 we have two students who 135 00:04:22,970 --> 00:04:25,490 have applied to CS at Stanford from a small high school. 136 00:04:26,720 --> 00:04:29,890 Our next query is again a join of two relations. 137 00:04:30,260 --> 00:04:31,580 This time we're going to find all large 138 00:04:32,040 --> 00:04:34,620 campuses that have someone applying to that campus in CS. 139 00:04:35,450 --> 00:04:37,840 So this time we're going to join the college table and the apply table. 140 00:04:38,270 --> 00:04:39,340 And again, we need to 141 00:04:39,400 --> 00:04:40,250 be careful to make sure we 142 00:04:40,380 --> 00:04:42,260 only join tuples that are talking about the same college. 143 00:04:42,630 --> 00:04:43,050 So we have college.cname 144 00:04:43,750 --> 00:04:44,470 equals apply.cname. 145 00:04:46,020 --> 00:04:47,090 We have an enrollment that's greater 146 00:04:47,310 --> 00:04:49,380 than 20,000 and a major that equals CS. 147 00:04:49,930 --> 00:04:50,670 Let's run this query. 148 00:04:51,870 --> 00:04:52,590 Oops, we got an error! 149 00:04:53,270 --> 00:04:54,130 Well, actually I knew that was 150 00:04:54,280 --> 00:04:56,300 coming, but I wanted to show you what happens here. 151 00:04:57,040 --> 00:04:58,050 So the error is that we 152 00:04:58,170 --> 00:04:59,640 have an ambiguous column name, 153 00:04:59,900 --> 00:05:01,480 and that's the one right here, the C name. 154 00:05:02,170 --> 00:05:03,710 So I haven't pointed it 155 00:05:03,760 --> 00:05:05,890 out explicitly, but whenever I've 156 00:05:06,150 --> 00:05:06,790 referred to attributes where there's 157 00:05:06,850 --> 00:05:08,030 an attribute from both of 158 00:05:08,130 --> 00:05:09,540 the relations we're querying, I prefaced 159 00:05:10,230 --> 00:05:11,280 it with the name of 160 00:05:11,340 --> 00:05:13,530 the relation that we cared about, the college here in the apply. 161 00:05:14,480 --> 00:05:15,980 So the attribute name here 162 00:05:16,340 --> 00:05:17,440 in the select clause is actually 163 00:05:17,900 --> 00:05:19,540 ambiguous because there's a 164 00:05:19,630 --> 00:05:21,670 C name attribute in college and there's one there in apply. 165 00:05:22,480 --> 00:05:23,530 Now we happen to set those equal, 166 00:05:23,860 --> 00:05:24,860 but in order for the query to 167 00:05:25,110 --> 00:05:25,870 actually run we have to choose 168 00:05:26,390 --> 00:05:27,750 So let's just say we're 169 00:05:27,870 --> 00:05:29,420 going to take that C name from college. 170 00:05:30,390 --> 00:05:32,690 Now, everything should be fine, and here we go. 171 00:05:33,130 --> 00:05:34,340 So those are the colleges where we 172 00:05:34,470 --> 00:05:36,010 have at least one 173 00:05:36,230 --> 00:05:38,500 CS major and their enrollment is greater than 20,000. 174 00:05:38,590 --> 00:05:40,270 Again, we see duplicates 175 00:05:41,230 --> 00:05:42,180 so if we don't like 176 00:05:42,430 --> 00:05:43,690 the two copies of Berkeley, we 177 00:05:43,800 --> 00:05:46,370 simply add distinct and we run the query again. 178 00:05:46,580 --> 00:05:48,490 And now we have Berkeley and Cornell. 179 00:05:48,850 --> 00:05:51,700 Now, let's do a query with a bigger result. 180 00:05:52,230 --> 00:05:54,400 This time we're finally going to join all three of our relations. 181 00:05:55,110 --> 00:05:55,900 Student, college and apply. 182 00:05:56,580 --> 00:05:57,480 And we're going to apply the 183 00:05:57,570 --> 00:05:59,000 joint conditions that ensure that 184 00:05:59,130 --> 00:06:01,690 we're talking about the same student and the same college. 185 00:06:03,030 --> 00:06:04,230 And then from the result 186 00:06:04,860 --> 00:06:06,190 of that big cross-product, that 187 00:06:06,350 --> 00:06:07,500 big join, we're going to 188 00:06:07,620 --> 00:06:08,620 get the student ID, their name, 189 00:06:08,950 --> 00:06:10,460 their GPA, the college that 190 00:06:10,580 --> 00:06:12,170 they're applying to and the enrollment of that college. 191 00:06:12,670 --> 00:06:13,500 So just a whole bunch of 192 00:06:13,580 --> 00:06:16,510 information associated with this students' applications. 193 00:06:17,400 --> 00:06:18,960 And we execute this and here 194 00:06:19,080 --> 00:06:20,960 we get the result with all the attributes that we asked for. 195 00:06:21,060 --> 00:06:22,730 Now, one thing I haven't 196 00:06:23,000 --> 00:06:24,070 mentioned yet is the order 197 00:06:24,420 --> 00:06:26,400 of the results that we get when we run SQL queries. 198 00:06:27,280 --> 00:06:29,860 SO SQL is, at its heart, an unordered model. 199 00:06:30,360 --> 00:06:31,100 That means that we can get 200 00:06:31,270 --> 00:06:32,320 the results of our queries in 201 00:06:32,610 --> 00:06:33,910 any order, and in fact, 202 00:06:34,010 --> 00:06:35,010 we could run a query today 203 00:06:35,810 --> 00:06:37,100 and get our results in a particular order. 204 00:06:37,580 --> 00:06:39,440 And then run the query tomorrow and get a different order. 205 00:06:39,840 --> 00:06:41,550 And that's permitted with the 206 00:06:41,630 --> 00:06:43,670 specification of SQL on relational databases. 207 00:06:44,910 --> 00:06:45,950 If we care about the order 208 00:06:46,220 --> 00:06:47,900 of our result SQL provides a 209 00:06:48,060 --> 00:06:48,770 clause that we can ask for a 210 00:06:49,550 --> 00:06:50,750 result to be sorted by 211 00:06:50,920 --> 00:06:55,370 a particular attribute or set of attributes. So 212 00:06:55,470 --> 00:06:56,640 let's say we want our application information here 213 00:06:56,990 --> 00:06:58,280 sorted by descending GPA. 214 00:06:59,320 --> 00:07:01,920 Then we add another clause called the order by clause. 215 00:07:02,960 --> 00:07:04,110 We tell the attribute we'd like 216 00:07:04,300 --> 00:07:05,510 to be ordering by and then 217 00:07:05,820 --> 00:07:07,930 if we want it to be descending we write DESC. 218 00:07:08,930 --> 00:07:10,980 The default behavior is actually ascending. 219 00:07:12,000 --> 00:07:12,860 So if we run this query 220 00:07:13,590 --> 00:07:15,510 now we get our results by 221 00:07:15,660 --> 00:07:18,920 descending the GPA we 222 00:07:19,130 --> 00:07:20,480 see all the 3.9's, 3.8, 3.7, and so forth. 223 00:07:21,170 --> 00:07:22,140 Now we might still want 224 00:07:22,360 --> 00:07:23,550 to further sort within all the 225 00:07:23,840 --> 00:07:25,050 3.9s if we want 226 00:07:25,310 --> 00:07:26,740 to do that we can specify another 227 00:07:27,200 --> 00:07:28,470 attribute to sort each group by. 228 00:07:29,100 --> 00:07:30,330 So, for example, if we 229 00:07:30,440 --> 00:07:31,610 decide from that we 230 00:07:31,820 --> 00:07:33,310 want to sort by enrollment 231 00:07:35,250 --> 00:07:36,390 and ascending, we won't put 232 00:07:36,720 --> 00:07:37,960 anything because ascending is the default. 233 00:07:38,660 --> 00:07:38,910 And we execute. 234 00:07:39,800 --> 00:07:41,310 Now we still have GPA 235 00:07:41,750 --> 00:07:42,840 as descending as our primary 236 00:07:43,450 --> 00:07:44,510 sort order and then within each 237 00:07:44,770 --> 00:07:47,300 of those will be sorting by ascending enrollment. 238 00:07:47,640 --> 00:07:51,010 This query introduces the like predicate. 239 00:07:51,810 --> 00:07:52,770 Like is a built-in operator 240 00:07:53,080 --> 00:07:54,350 in SQL that allows us 241 00:07:54,470 --> 00:07:56,320 to do simple string matching on attribute values. 242 00:07:57,390 --> 00:07:58,530 Let's suppose, for example, that we 243 00:07:58,620 --> 00:07:59,500 wanted to find all students 244 00:07:59,950 --> 00:08:01,850 who were applying for a major that had to do with bio. 245 00:08:02,230 --> 00:08:03,710 Instead of listing all the 246 00:08:03,780 --> 00:08:04,890 biology majors we can 247 00:08:05,410 --> 00:08:06,270 simply pattern match the major 248 00:08:06,810 --> 00:08:08,170 against the special string here 249 00:08:08,740 --> 00:08:10,120 which says, match any major 250 00:08:10,890 --> 00:08:11,970 where there's some set of characters, 251 00:08:12,400 --> 00:08:14,110 followed by bio, followed by 252 00:08:14,270 --> 00:08:15,950 some set of characters we execute 253 00:08:16,420 --> 00:08:17,800 the query, and we'll find the 254 00:08:18,020 --> 00:08:18,890 students who have applied for various 255 00:08:19,590 --> 00:08:20,530 bio type majors. 256 00:08:21,750 --> 00:08:23,340 Now, I want to introduce another construct. 257 00:08:23,850 --> 00:08:24,870 I'm going to use the same query to 258 00:08:25,000 --> 00:08:27,270 do it, which is the construct select star. 259 00:08:28,200 --> 00:08:29,590 So far, we've always listed 260 00:08:30,350 --> 00:08:31,610 explicitly the attributes that we 261 00:08:31,750 --> 00:08:33,150 want to get in the result of a query. 262 00:08:33,850 --> 00:08:34,730 But if we simply want to get 263 00:08:34,930 --> 00:08:37,600 all attributes, then we can just write select star. 264 00:08:38,140 --> 00:08:39,300 And when we do that, we 265 00:08:39,490 --> 00:08:40,980 don't project away any attributes, 266 00:08:41,400 --> 00:08:44,250 but we get all the attributes in the result of the from and where expression. 267 00:08:45,730 --> 00:08:47,870 While we're at it, let's do a gigantic query. 268 00:08:48,120 --> 00:08:49,100 We'll just do the cross-product 269 00:08:49,440 --> 00:08:50,770 and student college without any 270 00:08:51,000 --> 00:08:52,290 combination, and we'll do 271 00:08:52,390 --> 00:08:53,890 select star to get all the attributes out. 272 00:08:54,780 --> 00:08:56,020 So, here goes, and you can 273 00:08:56,180 --> 00:08:57,340 see, we get all the attributes 274 00:08:57,830 --> 00:08:59,680 and we get a whole lot of tuples as well. 275 00:09:00,460 --> 00:09:01,370 Our last query is going to 276 00:09:01,450 --> 00:09:02,750 demonstrate the ability to use 277 00:09:03,090 --> 00:09:04,700 arithmetic within SQL clauses. 278 00:09:05,750 --> 00:09:06,720 So we see here a query 279 00:09:07,270 --> 00:09:08,380 that selects all the information 280 00:09:08,790 --> 00:09:10,570 from the student relation but adds 281 00:09:10,700 --> 00:09:12,980 to it a scaled GPA where 282 00:09:13,120 --> 00:09:13,960 we're going to boost the student's 283 00:09:14,410 --> 00:09:15,300 GPA if they're from a big 284 00:09:15,550 --> 00:09:17,490 high school and reduce it if they're from a small one. 285 00:09:18,170 --> 00:09:19,590 Specifically, we'll take their GPA, multiply 286 00:09:20,080 --> 00:09:21,770 it by the size high school divided by a thousand. 287 00:09:21,910 --> 00:09:23,180 So, let's run this 288 00:09:23,380 --> 00:09:24,220 query and you can see 289 00:09:24,420 --> 00:09:25,480 that we have the whole student table 290 00:09:25,850 --> 00:09:27,760 here with an additional column that 291 00:09:28,370 --> 00:09:30,910 has scaled their GPA based on the size of their high school. 292 00:09:31,810 --> 00:09:33,050 Now, if we don't like the 293 00:09:33,200 --> 00:09:34,660 label on this column, we 294 00:09:34,750 --> 00:09:35,780 could change it and so 295 00:09:36,200 --> 00:09:37,210 I'll use this query as an 296 00:09:37,290 --> 00:09:38,410 example to demonstrate the 'as' 297 00:09:38,790 --> 00:09:40,060 feature which allows us 298 00:09:40,200 --> 00:09:42,600 to change the labeling of the schema in a query result. 299 00:09:43,570 --> 00:09:44,880 Let's say as scaled GPA, 300 00:09:45,410 --> 00:09:46,220 and we should get the same 301 00:09:46,500 --> 00:09:48,490 result with a more nicely labeled attribute. 302 00:09:49,900 --> 00:09:51,820 That concludes our video introducing the basic select statement. 303 00:09:52,380 --> 00:09:53,850 We'll see many other features in 304 00:09:53,940 --> 00:09:55,530 the upcoming six videos on SQL.