1 00:00:00,540 --> 00:00:02,380 This SQL video, which as 2 00:00:02,450 --> 00:00:03,920 usual will consist mostly of 3 00:00:04,060 --> 00:00:07,310 running live queries, introduces sub-queries in the where clause. 4 00:00:08,760 --> 00:00:09,940 As usual, we start with our 5 00:00:10,100 --> 00:00:11,850 basic select from where expression, 6 00:00:13,080 --> 00:00:14,010 and we're going to add to it 7 00:00:14,110 --> 00:00:15,460 now the ability in the 8 00:00:15,520 --> 00:00:16,660 condition part of the select 9 00:00:17,020 --> 00:00:18,790 from where expression to include sub-queries. 10 00:00:19,820 --> 00:00:21,360 Sub-queries are nested, select 11 00:00:21,780 --> 00:00:22,900 statements within the condition, 12 00:00:23,300 --> 00:00:27,030 and we'll see they're actually quite powerful. 13 00:00:27,390 --> 00:00:28,960 As usual, we'll be using a sample demonstration database that 14 00:00:29,070 --> 00:00:32,010 consists of colleges, students, and students applying to colleges. 15 00:00:33,120 --> 00:00:34,070 As always, we'll have our 16 00:00:34,290 --> 00:00:36,480 four colleges, a bunch of 17 00:00:36,560 --> 00:00:37,890 students, and a bunch of 18 00:00:38,360 --> 00:00:40,400 records that show students applying to colleges. 19 00:00:40,860 --> 00:00:42,270 So let's launch right 20 00:00:42,440 --> 00:00:43,900 into our first query that shows 21 00:00:44,240 --> 00:00:45,440 a sub-query in the where clause. 22 00:00:46,960 --> 00:00:48,870 What this query finds is the 23 00:00:49,250 --> 00:00:50,520 ID's and names of all students 24 00:00:51,220 --> 00:00:53,740 who have applied to major in CS to some college. 25 00:00:54,940 --> 00:00:56,620 So here's the sub-query in the where clause. 26 00:00:57,230 --> 00:00:58,340 This expression right here in 27 00:00:58,440 --> 00:01:00,410 the where clause finds the 28 00:01:00,580 --> 00:01:01,650 ID's of all students who 29 00:01:01,720 --> 00:01:04,020 have applied to major in CS. 30 00:01:05,290 --> 00:01:05,670 So now we have that set of ID's, 31 00:01:06,060 --> 00:01:07,440 our outer query says let's 32 00:01:07,790 --> 00:01:09,280 take from the students those 33 00:01:09,490 --> 00:01:10,940 students whose ID is in 34 00:01:11,340 --> 00:01:12,710 that set, and let's select 35 00:01:13,220 --> 00:01:14,360 their ID and their name. 36 00:01:14,990 --> 00:01:16,030 So we go ahead and execute 37 00:01:16,250 --> 00:01:17,540 the query and we find out 38 00:01:17,760 --> 00:01:20,180 that five students have applied to major in CS. 39 00:01:22,110 --> 00:01:23,590 Now we actually can do 40 00:01:23,650 --> 00:01:25,590 this query without a sub-query in the where clause. 41 00:01:26,220 --> 00:01:27,030 So let's take a look. 42 00:01:29,060 --> 00:01:30,230 We can do it instead by joining 43 00:01:30,750 --> 00:01:32,360 the student relation with the apply relation. 44 00:01:33,410 --> 00:01:35,440 So here we do what we learned in previous videos. 45 00:01:36,010 --> 00:01:37,920 We take student and apply, we 46 00:01:38,170 --> 00:01:39,380 write the joined condition to make 47 00:01:39,510 --> 00:01:40,460 sure we're talking about the same 48 00:01:40,730 --> 00:01:41,910 student, we make sure 49 00:01:42,120 --> 00:01:43,250 they're majoring in CS, and 50 00:01:43,640 --> 00:01:45,200 we get their ID and their name. 51 00:01:45,850 --> 00:01:46,450 Let's run the query. 52 00:01:48,680 --> 00:01:49,420 Whoops, an error. 53 00:01:49,870 --> 00:01:50,630 I knew that was coming. 54 00:01:51,370 --> 00:01:53,930 This is just to remind you about disambiguating attributes. 55 00:01:54,500 --> 00:01:56,070 The ID here in the 56 00:01:56,140 --> 00:01:57,370 select clause could have come 57 00:01:57,530 --> 00:01:59,090 from student or apply, and 58 00:01:59,180 --> 00:02:00,180 even though the value is equal, 59 00:02:00,500 --> 00:02:02,700 we do have to disambiguate by putting one of those. 60 00:02:03,040 --> 00:02:04,490 So let's put student dot SID 61 00:02:05,500 --> 00:02:06,160 and let's run the query. 62 00:02:07,380 --> 00:02:07,380 Okay. 63 00:02:07,760 --> 00:02:08,590 Now we see we got more 64 00:02:09,010 --> 00:02:10,390 students back than we 65 00:02:10,500 --> 00:02:12,380 got back when we ran the query using the sub-query. 66 00:02:12,990 --> 00:02:13,700 Let's go back and look. 67 00:02:13,950 --> 00:02:15,980 We got five results here and 68 00:02:16,100 --> 00:02:17,310 we got several more here, 69 00:02:17,650 --> 00:02:20,330 but the additional results are actually duplicate values. 70 00:02:21,380 --> 00:02:24,280 So we have two copies, for example, of 1-2-3 Amy. 71 00:02:25,240 --> 00:02:26,560 The reason for that is 72 00:02:26,710 --> 00:02:28,130 that Amy actually applied to 73 00:02:28,240 --> 00:02:30,030 major in CS at multiple colleges. 74 00:02:30,840 --> 00:02:31,820 So if we go back and we 75 00:02:31,960 --> 00:02:33,680 look at the apply data, we'll 76 00:02:33,860 --> 00:02:35,020 see that Amy who is 77 00:02:35,420 --> 00:02:36,600 1-2-3 applied to major in CS 78 00:02:36,840 --> 00:02:38,320 at Stanford as well as Berkeley. 79 00:02:38,910 --> 00:02:39,980 Let's hope she selects Stanford. 80 00:02:40,390 --> 00:02:42,120 In any case, that's 81 00:02:42,470 --> 00:02:43,840 why we got Amy twice in 82 00:02:43,990 --> 00:02:45,890 the join because she applied twice. 83 00:02:46,640 --> 00:02:47,470 Back here where we used 84 00:02:47,630 --> 00:02:48,680 the sub-query, we were just 85 00:02:48,750 --> 00:02:51,220 looking at students and whether their ID was in the set. 86 00:02:52,420 --> 00:02:53,260 Okay, so when we do 87 00:02:53,490 --> 00:02:54,890 the join we get basically the 88 00:02:55,110 --> 00:02:56,950 same answer, but we have some duplicate values. 89 00:02:57,890 --> 00:02:59,810 Of course, we can fix that by adding distinct. 90 00:03:01,750 --> 00:03:03,270 We run the query and now 91 00:03:03,860 --> 00:03:06,410 we have the same result that we got when we used the sub-query. 92 00:03:08,290 --> 00:03:09,320 Let's look at some other 93 00:03:09,590 --> 00:03:11,200 similar queries and let's 94 00:03:11,420 --> 00:03:12,220 focus again on the duplicates 95 00:03:12,750 --> 00:03:15,170 issue because it gets a little tricky, as we'll see. 96 00:03:16,240 --> 00:03:17,510 This query is very similar 97 00:03:17,870 --> 00:03:19,330 to the previous one, finding students 98 00:03:19,770 --> 00:03:20,810 who are applying to major in 99 00:03:21,060 --> 00:03:22,180 CS, but this time 100 00:03:22,290 --> 00:03:23,440 we're only getting the names of 101 00:03:23,570 --> 00:03:24,940 the students and not their ID's. 102 00:03:25,810 --> 00:03:28,540 So we run the query and we find our same five students. 103 00:03:29,530 --> 00:03:30,530 Just want to mention that these 104 00:03:30,960 --> 00:03:33,040 two Craigs are two different Craigs. 105 00:03:33,370 --> 00:03:34,350 If we go back to our 106 00:03:34,500 --> 00:03:36,260 original result, there's three-four-five Craig 107 00:03:36,920 --> 00:03:38,020 and five-four-three Craig. 108 00:03:38,300 --> 00:03:40,320 So coming back here, we 109 00:03:40,440 --> 00:03:41,220 find the names of the 110 00:03:41,300 --> 00:03:42,750 students who majored in CS. 111 00:03:43,820 --> 00:03:44,810 Now similarly to what we 112 00:03:44,890 --> 00:03:46,530 did previously, let's write this 113 00:03:46,710 --> 00:03:48,110 query using a join instead 114 00:03:48,490 --> 00:03:49,940 of using the sub-query in the where clause. 115 00:03:51,940 --> 00:03:52,580 So here's the join. 116 00:03:52,750 --> 00:03:53,960 We're joining student and apply 117 00:03:54,210 --> 00:03:56,300 on the student's ID majoring in CS, as always. 118 00:03:56,610 --> 00:03:58,890 The only difference is that we're just selecting the name. 119 00:03:59,860 --> 00:04:01,460 We run the query and we 120 00:04:01,610 --> 00:04:02,840 get again many more results 121 00:04:03,290 --> 00:04:05,180 than we got previously because we 122 00:04:05,310 --> 00:04:06,500 get two copies when a 123 00:04:06,540 --> 00:04:07,430 student has applied to to 124 00:04:07,470 --> 00:04:08,960 major in CS at two different places. 125 00:04:09,280 --> 00:04:11,060 And just as we 126 00:04:11,350 --> 00:04:13,530 did before, we can add distinct to get rid of the duplicates. 127 00:04:18,570 --> 00:04:20,330 Well, something different happened this time. 128 00:04:20,900 --> 00:04:21,800 This time, when we get 129 00:04:21,980 --> 00:04:23,050 rid of the duplicates, we only have 130 00:04:23,390 --> 00:04:25,720 four results where previously we had five. 131 00:04:26,540 --> 00:04:27,430 And the reason is that previously 132 00:04:27,970 --> 00:04:29,260 we included the student ID 133 00:04:29,360 --> 00:04:31,300 in the result, and so the 134 00:04:31,390 --> 00:04:32,870 two instances of Craig were 135 00:04:33,020 --> 00:04:34,440 two different Craigs and didn't 136 00:04:35,000 --> 00:04:35,950 cause duplicates to be eliminated. 137 00:04:36,780 --> 00:04:38,790 We can see that back here: Craig and Craig. 138 00:04:39,690 --> 00:04:40,940 But in this result, because we 139 00:04:41,030 --> 00:04:42,430 only kept the names, the two 140 00:04:42,910 --> 00:04:44,530 copies of Craig turned into one result. 141 00:04:45,870 --> 00:04:46,880 Now we might wonder why 142 00:04:47,330 --> 00:04:48,900 do we care about duplicates so much. 143 00:04:49,800 --> 00:04:52,800 Let's see an example where duplicates really do matter quite a bit. 144 00:04:53,900 --> 00:04:55,610 We're going to do exactly the same 145 00:04:55,780 --> 00:04:57,250 query again, finding students who 146 00:04:57,420 --> 00:04:58,480 have applied to major in CS, 147 00:04:59,140 --> 00:05:00,360 but now we're not retrieving the 148 00:05:00,570 --> 00:05:01,630 IDs or names, we're retrieving 149 00:05:02,160 --> 00:05:03,180 the GPAs of those students. 150 00:05:03,970 --> 00:05:05,190 Presumably what we're interested in 151 00:05:05,310 --> 00:05:06,750 doing is some analysis of the 152 00:05:06,850 --> 00:05:09,040 GPA's of students who choose to apply for CS. 153 00:05:10,100 --> 00:05:10,960 So let's run the query. 154 00:05:11,790 --> 00:05:12,860 As always, we get our five results. 155 00:05:13,440 --> 00:05:15,010 And here's the GPA's of the 156 00:05:15,180 --> 00:05:17,490 five students who have applied to major in CS. 157 00:05:18,860 --> 00:05:19,920 Once again, this will be 158 00:05:19,980 --> 00:05:21,350 the last time, I promise, we'll 159 00:05:21,620 --> 00:05:23,690 do this query using a join instead of the sub-query. 160 00:05:24,680 --> 00:05:25,040 So here we go. 161 00:05:25,800 --> 00:05:26,970 We've got student and apply join 162 00:05:27,260 --> 00:05:29,950 on SID, majoring in CS, and returning the GPA. 163 00:05:31,300 --> 00:05:32,640 Once again, because we 164 00:05:32,780 --> 00:05:33,940 have students who applied multiple 165 00:05:34,520 --> 00:05:36,570 times for CS, we're getting more than our five results. 166 00:05:37,420 --> 00:05:38,540 So we get a rather large number 167 00:05:38,950 --> 00:05:40,900 of results here and again we have duplicates. 168 00:05:42,270 --> 00:05:43,610 So here's where the problem comes in. 169 00:05:44,340 --> 00:05:45,970 If we use this result to 170 00:05:46,210 --> 00:05:47,480 compute the average GPA, we'll be 171 00:05:47,600 --> 00:05:49,160 counting some students multiple times. 172 00:05:49,640 --> 00:05:51,310 And presumably, that's not what we want to do. 173 00:05:51,990 --> 00:05:52,870 Presumably, we want to count 174 00:05:53,110 --> 00:05:54,400 each student who's applied to 175 00:05:54,610 --> 00:05:55,790 CS once in the computation 176 00:05:56,480 --> 00:05:57,390 of, say, the average GPA. 177 00:05:58,550 --> 00:05:59,830 That worked in the previous query 178 00:06:00,390 --> 00:06:02,710 when we got the five results for the five students who applied to CS. 179 00:06:02,830 --> 00:06:05,220 When we do the join, we get too many results. 180 00:06:06,120 --> 00:06:07,430 But this time, again, we're 181 00:06:07,560 --> 00:06:08,570 going to have a problem when we 182 00:06:08,790 --> 00:06:10,240 do select distinct, because some 183 00:06:10,330 --> 00:06:11,800 of these students have the same GPA. 184 00:06:12,750 --> 00:06:13,680 And now we only have 185 00:06:13,850 --> 00:06:16,300 four GPAs instead of the five that we should have. 186 00:06:16,720 --> 00:06:17,930 And if we compute the 187 00:06:18,140 --> 00:06:19,550 average GPA now, then we're 188 00:06:19,700 --> 00:06:21,530 not factoring in one of the student's GPAs. 189 00:06:22,470 --> 00:06:23,590 So in this case, neither the 190 00:06:23,690 --> 00:06:25,050 version with distinct nor the 191 00:06:25,220 --> 00:06:26,710 version without distinct gives us 192 00:06:26,900 --> 00:06:28,590 the right number of GPAs. 193 00:06:29,820 --> 00:06:31,340 Neither of those will give us the correct average. 194 00:06:32,190 --> 00:06:33,140 The only way to get the 195 00:06:33,200 --> 00:06:34,560 correct number of duplicates is 196 00:06:34,760 --> 00:06:35,640 to use the version of the 197 00:06:35,710 --> 00:06:38,100 query that has the sub-query in the where clause. 198 00:06:39,590 --> 00:06:40,720 Now let's move to some different 199 00:06:41,100 --> 00:06:43,500 examples that also use subqueries in the where clause. 200 00:06:44,850 --> 00:06:45,910 You might remember from 201 00:06:46,050 --> 00:06:47,210 the previous video when we 202 00:06:47,330 --> 00:06:49,190 were learning the difference operator that 203 00:06:49,340 --> 00:06:50,340 we had a query that we 204 00:06:50,460 --> 00:06:51,710 could write using the difference operator, 205 00:06:52,560 --> 00:06:53,660 which in SQL is called accept, 206 00:06:54,180 --> 00:06:55,410 but we were unable to 207 00:06:55,510 --> 00:06:57,050 write that query without the difference operator. 208 00:06:57,980 --> 00:06:59,000 And the query we were trying to 209 00:06:59,180 --> 00:07:00,340 write is to find students 210 00:07:00,840 --> 00:07:01,970 who have applied to major in 211 00:07:02,220 --> 00:07:04,490 CS, but have not applied to major in EE. 212 00:07:05,440 --> 00:07:06,490 Now that we have sub-queries in 213 00:07:06,570 --> 00:07:07,890 the where clause, we can write 214 00:07:08,180 --> 00:07:10,380 that query without using the except operator. 215 00:07:11,100 --> 00:07:12,220 And here it is. 216 00:07:12,300 --> 00:07:13,870 The query looks for students where 217 00:07:14,040 --> 00:07:15,500 their ID is among the 218 00:07:15,640 --> 00:07:16,670 set of ID's of students 219 00:07:17,100 --> 00:07:18,800 who have applied to CS, but 220 00:07:18,970 --> 00:07:20,400 their ID is not among 221 00:07:20,670 --> 00:07:21,890 the set of ID's of students 222 00:07:22,250 --> 00:07:24,540 who are applying to major in EE. 223 00:07:25,690 --> 00:07:26,420 So let's run the query. 224 00:07:26,780 --> 00:07:28,450 And we discover that 225 00:07:28,660 --> 00:07:29,560 there are three students who have 226 00:07:29,640 --> 00:07:30,620 applied to major in CS, 227 00:07:30,970 --> 00:07:32,680 but not applied anywhere in EE. 228 00:07:33,580 --> 00:07:34,400 By the way, let me just show 229 00:07:34,580 --> 00:07:37,380 you a slightly different way to write exactly the same query. 230 00:07:37,920 --> 00:07:38,970 You can see that we use this 231 00:07:39,360 --> 00:07:41,100 key word combination "not in" 232 00:07:41,980 --> 00:07:44,250 to specify that the idea is not in this set. 233 00:07:44,930 --> 00:07:46,600 We can actually write it by 234 00:07:46,760 --> 00:07:48,480 writing the SID is 235 00:07:48,680 --> 00:07:50,890 in the set and then applying a "not" to that result. 236 00:07:52,010 --> 00:07:53,650 We'll execute and we'll get exactly the same result. 237 00:07:54,950 --> 00:07:56,060 It's fairly common in SQL 238 00:07:56,310 --> 00:07:57,630 for there to be multiple ways to 239 00:07:57,770 --> 00:07:58,830 write the same query, and we'll 240 00:07:58,970 --> 00:08:01,190 see more examples of that later in this video. 241 00:08:01,540 --> 00:08:03,480 So far, our examples 242 00:08:04,110 --> 00:08:05,630 of sub-queries have used "in" 243 00:08:05,670 --> 00:08:07,110 and "not in" for testing 244 00:08:07,460 --> 00:08:09,320 membership in the sets that are produced by sub-queries. 245 00:08:10,210 --> 00:08:11,740 Our next examples are going 246 00:08:11,940 --> 00:08:13,040 to show that we can apply 247 00:08:13,260 --> 00:08:14,820 the exists operator to sub-queries 248 00:08:15,600 --> 00:08:17,370 just to test whether they're empty or not empty. 249 00:08:18,570 --> 00:08:18,920 Here's the query. 250 00:08:19,700 --> 00:08:21,980 This query uses exists to check 251 00:08:22,260 --> 00:08:23,590 whether a subquery is empty 252 00:08:23,910 --> 00:08:25,390 or not empty rather than checking 253 00:08:25,680 --> 00:08:26,990 whether values are in the subquery. 254 00:08:28,040 --> 00:08:29,340 The other new construct that's 255 00:08:29,500 --> 00:08:32,140 begin to introduce here's what's known as a correlated reference. 256 00:08:33,030 --> 00:08:34,360 But inside the subquery we're going 257 00:08:34,560 --> 00:08:35,510 to refer to a value, 258 00:08:36,290 --> 00:08:38,250 C1, that comes from outside the subquery. 259 00:08:39,270 --> 00:08:41,600 So let's talk about exactly how this query works. 260 00:08:42,340 --> 00:08:44,900 First let me tell you what the query is trying to return. 261 00:08:45,710 --> 00:08:46,650 The query is going to find 262 00:08:47,090 --> 00:08:48,640 all colleges, such that there's 263 00:08:48,880 --> 00:08:51,520 some other college that is in the same state. 264 00:08:52,650 --> 00:08:52,650 Okay? 265 00:08:52,900 --> 00:08:53,980 So, in our example, just a 266 00:08:54,200 --> 00:08:56,150 reminder, we have Stanford, Berkley, MIT and Cornell. 267 00:08:56,690 --> 00:08:57,710 So, the two Colleges we should 268 00:08:57,870 --> 00:08:59,030 get back are Stanford and Berkeley 269 00:08:59,450 --> 00:09:00,470 because in each case there's another 270 00:09:00,750 --> 00:09:02,040 college in the same state. 271 00:09:02,870 --> 00:09:03,750 So how does this query work? 272 00:09:04,200 --> 00:09:05,220 It says we're gonna to take 273 00:09:05,380 --> 00:09:07,060 the colleges, and for 274 00:09:07,240 --> 00:09:08,430 each college we're going to 275 00:09:08,480 --> 00:09:10,210 check whether their exists another 276 00:09:10,600 --> 00:09:11,770 college and we're going to 277 00:09:11,860 --> 00:09:13,740 call that one C2, where the 278 00:09:13,820 --> 00:09:16,020 state of C2 is the same as the state of C1. 279 00:09:16,170 --> 00:09:17,690 This is sort of similar 280 00:09:17,920 --> 00:09:18,920 to what we saw when were 281 00:09:19,050 --> 00:09:20,630 doing self-joins with table variables, 282 00:09:21,150 --> 00:09:22,330 but now, the variables are 283 00:09:22,470 --> 00:09:24,180 appearing in the outer query and the inner query. 284 00:09:25,020 --> 00:09:25,020 Okay. 285 00:09:25,230 --> 00:09:26,220 So, let's run the query and 286 00:09:26,910 --> 00:09:28,580 we see that we get the wrong answer. 287 00:09:29,340 --> 00:09:30,010 That was again intentional. 288 00:09:31,140 --> 00:09:31,700 Well, here's the problem. 289 00:09:32,740 --> 00:09:34,250 When we were in this query, C1 290 00:09:34,550 --> 00:09:37,240 and C2 could be bound to the same college. 291 00:09:38,140 --> 00:09:39,500 So every college is in 292 00:09:39,910 --> 00:09:40,990 the same state as another college 293 00:09:41,380 --> 00:09:42,740 of the other college could be the same college. 294 00:09:43,580 --> 00:09:44,580 What we need to do is 295 00:09:44,880 --> 00:09:46,500 add inside the sub-query a 296 00:09:46,710 --> 00:09:49,480 further condition that says that C1 and C2 are different colleges. 297 00:09:50,180 --> 00:09:51,110 Specifically, C1.Cname 298 00:09:53,010 --> 00:09:54,240 is not equal to C2.Cname. 299 00:09:54,910 --> 00:09:58,090 Let's run the query 300 00:09:58,580 --> 00:09:59,980 and now, we get the correct answer. 301 00:10:02,330 --> 00:10:03,090 Now, let's look at some 302 00:10:03,320 --> 00:10:04,370 other uses of the exists, 303 00:10:05,190 --> 00:10:05,940 construct and sub-query. 304 00:10:08,200 --> 00:10:08,950 If you happen to already know SQL 305 00:10:09,060 --> 00:10:10,460 a little bit and someone 306 00:10:10,570 --> 00:10:11,320 asks you to write a query 307 00:10:11,710 --> 00:10:12,280 where you were going going to 308 00:10:12,400 --> 00:10:14,160 get a largest value of some type, 309 00:10:14,860 --> 00:10:15,690 the first thing you would think of 310 00:10:15,800 --> 00:10:17,320 is probably using the max operator, 311 00:10:18,170 --> 00:10:20,010 but we don't know max yet, we'll be learning that later videos. 312 00:10:21,300 --> 00:10:22,630 And as it happens, a number of 313 00:10:22,840 --> 00:10:23,970 queries that are effectively computing 314 00:10:24,500 --> 00:10:26,350 a max can be written using sub-queries. 315 00:10:27,130 --> 00:10:27,840 And here's our first example. 316 00:10:29,170 --> 00:10:32,120 This example is finding the college that has the largest enrollment. 317 00:10:32,390 --> 00:10:33,670 And we're going to 318 00:10:33,770 --> 00:10:36,130 do that with the sub-query and with the not-exists operator. 319 00:10:36,980 --> 00:10:38,010 Specifically, we're going to 320 00:10:38,100 --> 00:10:40,100 find all colleges where there 321 00:10:40,220 --> 00:10:41,880 does not exist another college 322 00:10:42,400 --> 00:10:44,530 whose enrollment is higher than the first college. 323 00:10:45,680 --> 00:10:46,680 So let's go ahead and run the query. 324 00:10:47,100 --> 00:10:49,810 And, not surprisingly, we get Berkeley as a result. 325 00:10:51,470 --> 00:10:52,600 So this is a form of 326 00:10:52,770 --> 00:10:53,700 query that we can write any 327 00:10:53,880 --> 00:10:55,830 time we're looking for something that's the largest or the smallest. 328 00:10:56,740 --> 00:10:58,100 Let's, for example, look for 329 00:10:58,450 --> 00:10:59,930 the student with the highest GPA. 330 00:11:00,860 --> 00:11:01,830 So we'll change it to 331 00:11:02,080 --> 00:11:04,240 the student name, and we'll 332 00:11:04,470 --> 00:11:06,500 be looking instead of colleges at students. 333 00:11:07,320 --> 00:11:10,180 Otherwise this form of this query is going to stay very similar. 334 00:11:11,980 --> 00:11:13,340 We're going to look for students here as well. 335 00:11:13,920 --> 00:11:15,890 And finally, we're going to have GPA instead of enrollment. 336 00:11:18,750 --> 00:11:19,830 So, the way this query 337 00:11:20,240 --> 00:11:21,270 works is it says "I want 338 00:11:21,510 --> 00:11:24,120 to find all students, such that 339 00:11:24,270 --> 00:11:25,810 there does not exist another student 340 00:11:26,670 --> 00:11:27,410 who's GPA is higher". 341 00:11:28,460 --> 00:11:30,480 We run the query and we get four results. 342 00:11:31,260 --> 00:11:31,740 Seems odd. 343 00:11:32,890 --> 00:11:33,800 Actually it's that odd. 344 00:11:34,780 --> 00:11:37,770 Let's just add the GPA to our query and we'll see exactly what happened. 345 00:11:39,650 --> 00:11:40,570 We can see that these four 346 00:11:40,810 --> 00:11:42,180 students all have a GPA 347 00:11:42,950 --> 00:11:44,470 of 3.9, so they're all tied for 348 00:11:44,580 --> 00:11:46,310 having the highest GPA and very 349 00:11:46,740 --> 00:11:48,100 specifically it faithfully runs the 350 00:11:48,190 --> 00:11:49,000 query that there does not 351 00:11:49,380 --> 00:11:50,740 exist another student whose 352 00:11:51,120 --> 00:11:52,380 GPA is higher than these students. 353 00:11:52,970 --> 00:11:54,860 Now, let's see if 354 00:11:54,980 --> 00:11:58,020 we can write this same query without using a sub query. 355 00:11:59,610 --> 00:12:00,590 So as usual, if we 356 00:12:00,670 --> 00:12:01,680 want to write a query without 357 00:12:02,050 --> 00:12:03,960 a sub query, we'll need to do some type of joint. 358 00:12:04,800 --> 00:12:06,530 So we're looking for students with the highest GPAs. 359 00:12:07,720 --> 00:12:08,810 So, we need to join two 360 00:12:09,090 --> 00:12:10,210 instances of the student relation 361 00:12:10,670 --> 00:12:12,230 as we've done here and then 362 00:12:12,400 --> 00:12:14,070 we'll apply a condition that 363 00:12:14,240 --> 00:12:15,280 the GPA of the first one 364 00:12:15,700 --> 00:12:17,480 is greater than the GPA of the second one. 365 00:12:18,300 --> 00:12:18,970 We run the query. 366 00:12:19,520 --> 00:12:21,590 Wow, we got a lot of answers there. 367 00:12:23,060 --> 00:12:24,130 Maybe our problem is just 368 00:12:24,220 --> 00:12:25,830 one of duplicates. So first thing 369 00:12:25,910 --> 00:12:26,680 we do when it looks too big 370 00:12:27,030 --> 00:12:28,920 we add select distinct. Nope, 371 00:12:29,590 --> 00:12:30,760 that doesn't solve the problem either. 372 00:12:31,740 --> 00:12:33,320 Actually this query is fundamentally wrong 373 00:12:34,060 --> 00:12:35,060 and we cannot write the query 374 00:12:35,360 --> 00:12:36,350 that finds the student with the 375 00:12:36,550 --> 00:12:37,900 highest GPA just by using joins. 376 00:12:38,800 --> 00:12:40,730 What this query actually does is 377 00:12:40,850 --> 00:12:42,960 it finds all students such 378 00:12:43,260 --> 00:12:46,110 that there is some other student whose GPA is lower. 379 00:12:47,040 --> 00:12:48,340 In other words, it's finding all 380 00:12:48,610 --> 00:12:50,490 students except those who have the lowest GPA. 381 00:12:50,520 --> 00:12:54,180 Now let's see another 382 00:12:54,470 --> 00:12:55,600 new construct we can use 383 00:12:56,160 --> 00:12:57,050 with sub-queries in the where 384 00:12:57,280 --> 00:12:58,450 clause and we'll continue with 385 00:12:58,590 --> 00:13:00,810 the same query of finding the student with the highest GPA. 386 00:13:02,210 --> 00:13:03,650 This query uses the all keyword. 387 00:13:04,860 --> 00:13:05,900 What all tells us is that 388 00:13:05,980 --> 00:13:07,270 instead of checking whether a 389 00:13:07,300 --> 00:13:08,440 value is either in or 390 00:13:08,510 --> 00:13:09,520 not in the result of a 391 00:13:09,550 --> 00:13:10,920 sub-query, we're going to 392 00:13:11,000 --> 00:13:12,050 check whether the value has a 393 00:13:12,190 --> 00:13:14,380 certain relationship with all the results of the sub-query. 394 00:13:14,830 --> 00:13:16,230 And here, in this case, we're 395 00:13:16,390 --> 00:13:17,790 checking to see if the 396 00:13:18,080 --> 00:13:19,240 GPA is greater than or 397 00:13:19,380 --> 00:13:20,740 equal to all elements 398 00:13:21,820 --> 00:13:24,090 of the sub-query that returns the GPA's of all students. 399 00:13:25,010 --> 00:13:26,120 If the student's GPA is indeed 400 00:13:26,670 --> 00:13:27,610 greater than or equal to all 401 00:13:27,800 --> 00:13:30,510 GPA's then the student has the highest GPA in the database. 402 00:13:31,500 --> 00:13:32,970 We'll run the query, and we'll 403 00:13:33,260 --> 00:13:35,520 get the same four students with the 3.9 GPA. 404 00:13:36,110 --> 00:13:38,080 Now, let's try writing 405 00:13:38,460 --> 00:13:39,670 this in a slightly different fashion 406 00:13:40,010 --> 00:13:41,590 just for fun to illustrate some concepts. 407 00:13:43,510 --> 00:13:44,510 Once again, we're going to 408 00:13:44,600 --> 00:13:46,710 try to find the students whose GPA is highest in the database. 409 00:13:47,810 --> 00:13:48,820 Here's how we're going to do it this time. 410 00:13:49,170 --> 00:13:50,650 Instead of using greater than or 411 00:13:50,830 --> 00:13:52,890 equal to all, we're going to use greater than all. 412 00:13:53,730 --> 00:13:54,680 So, we're going to find all students 413 00:13:55,110 --> 00:13:57,190 where their GPA is higher than every other student 414 00:13:57,910 --> 00:13:59,290 by saying GPA is greater than 415 00:13:59,480 --> 00:14:01,180 all GPA's of students 416 00:14:01,310 --> 00:14:03,520 in the database who are not the student we're looking at. 417 00:14:04,050 --> 00:14:05,490 By saying that the ID's are not equal. 418 00:14:06,600 --> 00:14:07,210 Let's run the query. 419 00:14:08,350 --> 00:14:09,250 Well, we got an empty result. 420 00:14:10,380 --> 00:14:11,460 Let's think about it for a second. 421 00:14:11,740 --> 00:14:13,780 That is actually the correct result to the query. 422 00:14:14,070 --> 00:14:15,040 The query itself was incorrect. 423 00:14:16,080 --> 00:14:17,190 The query is looking for all 424 00:14:17,450 --> 00:14:19,280 students where nobody else 425 00:14:19,620 --> 00:14:21,070 has the same GPA as that student. 426 00:14:21,480 --> 00:14:22,890 Everybody else's GPA is lower. Well, 427 00:14:23,800 --> 00:14:24,930 remember, we had four students 428 00:14:25,330 --> 00:14:27,050 with a 3.9 GPA, so 429 00:14:27,180 --> 00:14:28,080 none of those students are going 430 00:14:28,150 --> 00:14:30,330 to satisfy the query and nor will any other student. 431 00:14:30,860 --> 00:14:33,800 So this is an incorrect formulation of the query that we wanted. 432 00:14:34,890 --> 00:14:36,200 Now this query would be 433 00:14:36,310 --> 00:14:37,290 correct if we knew 434 00:14:37,420 --> 00:14:39,800 that every student had the same GP...I'm sorry. 435 00:14:40,340 --> 00:14:41,530 This query would be correct if 436 00:14:41,650 --> 00:14:43,520 we knew that every student's GPA was 437 00:14:43,720 --> 00:14:44,860 unique, because there would be 438 00:14:45,000 --> 00:14:46,670 then one student who had the highest GPA. 439 00:14:47,650 --> 00:14:48,820 So let's modify the query 440 00:14:49,160 --> 00:14:50,710 to instead look for the 441 00:14:50,890 --> 00:14:52,320 colleges that have the highest enrollment. 442 00:14:52,890 --> 00:14:55,850 Because it turns out in our database every college has a unique enrollment. 443 00:14:56,520 --> 00:14:57,750 So we'll change it to 444 00:14:58,180 --> 00:15:01,070 getting the C name of college instead of student. 445 00:15:04,140 --> 00:15:07,910 And we'll want the enrollment to 446 00:15:08,010 --> 00:15:10,250 be greater than all other enrollments. 447 00:15:10,720 --> 00:15:14,450 So we'll...enrollment, college, almost done here. 448 00:15:15,100 --> 00:15:16,390 Grab a cup of coffee if you want. 449 00:15:17,090 --> 00:15:18,440 I just have to make this 450 00:15:18,840 --> 00:15:22,370 be C name and this 451 00:15:22,550 --> 00:15:23,930 one be C name and we're all set. 452 00:15:24,280 --> 00:15:25,560 So what are we asking for here? 453 00:15:26,320 --> 00:15:27,910 We're asking for all colleges whose 454 00:15:28,070 --> 00:15:29,990 enrollment is greater than all 455 00:15:30,240 --> 00:15:32,260 of the enrollments of colleges that 456 00:15:32,410 --> 00:15:34,370 are different than the one we're looking at. 457 00:15:34,510 --> 00:15:35,580 We run the query and we get 458 00:15:35,780 --> 00:15:36,740 Berkeley as a result 459 00:15:37,190 --> 00:15:38,260 which is exactly what we'd expect. 460 00:15:39,910 --> 00:15:41,010 So far we've seen the keyword 461 00:15:41,700 --> 00:15:43,190 all for checking whether a 462 00:15:43,470 --> 00:15:46,140 value has relationship with all of the results of a sub query. 463 00:15:46,710 --> 00:15:47,900 There's also a key word 464 00:15:48,340 --> 00:15:49,700 any that is a 465 00:15:49,740 --> 00:15:50,940 companion but instead of 466 00:15:51,350 --> 00:15:52,480 having to satisfy a condition 467 00:15:52,900 --> 00:15:53,990 with all of the elements of 468 00:15:54,080 --> 00:15:56,010 a set any says you 469 00:15:56,110 --> 00:15:57,320 must satisfy the condition with 470 00:15:57,540 --> 00:15:59,120 at least one element of the set. 471 00:15:59,990 --> 00:16:00,720 So what we're going to do 472 00:16:00,770 --> 00:16:01,480 now is going to do 473 00:16:01,620 --> 00:16:04,450 the same query but we're gonna write it a little differently. 474 00:16:04,970 --> 00:16:08,210 Let me just type it in and then explain. 475 00:16:09,040 --> 00:16:10,050 Here what we're going to 476 00:16:10,090 --> 00:16:11,150 say is get me all colleges 477 00:16:12,440 --> 00:16:13,470 where it's not the case 478 00:16:14,200 --> 00:16:15,290 that the enrollment is less 479 00:16:15,790 --> 00:16:17,770 than or equal to any other college. 480 00:16:18,980 --> 00:16:18,980 Okay? 481 00:16:19,600 --> 00:16:21,860 So in other words there's no other colleges have bigger enrollment. 482 00:16:22,260 --> 00:16:23,180 Just think about it for 483 00:16:23,400 --> 00:16:25,450 a second, this is if you happen to know predicate logic. 484 00:16:26,260 --> 00:16:28,220 This is an equivalence where when 485 00:16:28,360 --> 00:16:29,370 you say for all it's equivalent 486 00:16:30,120 --> 00:16:30,980 to not exists not. 487 00:16:31,820 --> 00:16:34,200 In any case if you didn't follow that don't let that bother you. 488 00:16:34,690 --> 00:16:35,940 Let's run the query and we 489 00:16:36,130 --> 00:16:37,290 see we again get the result 490 00:16:37,780 --> 00:16:39,410 Berkeley, so again all 491 00:16:39,830 --> 00:16:41,420 tests a condition against every 492 00:16:41,690 --> 00:16:42,710 element in the result of a 493 00:16:42,740 --> 00:16:44,220 sub query, and the condition 494 00:16:44,680 --> 00:16:45,790 is true if it's satisfied with 495 00:16:45,930 --> 00:16:48,260 every element; whereas any is 496 00:16:48,490 --> 00:16:50,490 true if the condition is satisfied with one or 497 00:16:50,820 --> 00:16:51,750 more elements of the sub-query. 498 00:16:53,620 --> 00:16:55,000 Let's use any for another query. 499 00:16:55,860 --> 00:16:57,290 This query finds all students 500 00:16:57,830 --> 00:17:00,260 who are not from the smallest high school in the database. 501 00:17:01,330 --> 00:17:02,280 So, what we are looking for 502 00:17:02,470 --> 00:17:04,170 here is all students, where the 503 00:17:04,260 --> 00:17:05,230 size of their high school 504 00:17:05,530 --> 00:17:07,760 is greater than any high school size. 505 00:17:08,460 --> 00:17:09,320 In other words, a student is 506 00:17:09,470 --> 00:17:10,950 going to be returned if there's 507 00:17:11,320 --> 00:17:12,740 some other student whose size 508 00:17:13,060 --> 00:17:14,600 high school is smaller than this student. 509 00:17:15,180 --> 00:17:17,000 We run the query, and there's the result. 510 00:17:17,600 --> 00:17:18,810 And you can verify if you 511 00:17:18,890 --> 00:17:20,110 look back at the data, but 512 00:17:20,240 --> 00:17:21,400 there is in fact, a few, 513 00:17:21,570 --> 00:17:22,720 there are students who are 514 00:17:22,900 --> 00:17:24,010 from high schools that have 200 515 00:17:24,510 --> 00:17:25,760 students, so here, we 516 00:17:25,910 --> 00:17:27,460 have in our result every 517 00:17:27,810 --> 00:17:29,530 student that comes from a high school bigger than that. 518 00:17:31,330 --> 00:17:33,030 Some systems, notably at 519 00:17:33,150 --> 00:17:34,890 the current time SQL Lite, do 520 00:17:35,000 --> 00:17:36,930 not support the any and the all operators. 521 00:17:38,370 --> 00:17:39,850 We do not lose expressive power, we 522 00:17:40,020 --> 00:17:40,870 just have to write those queries 523 00:17:41,260 --> 00:17:43,130 a little differently typically using exist 524 00:17:43,670 --> 00:17:44,970 or not exists, so let's 525 00:17:45,220 --> 00:17:48,060 look at the same query written without using any. 526 00:17:49,210 --> 00:17:49,720 Here's what we do. 527 00:17:50,640 --> 00:17:52,200 We look for students where there 528 00:17:52,750 --> 00:17:54,500 exists some other student, 529 00:17:56,270 --> 00:17:57,360 okay, whose high school is 530 00:17:57,660 --> 00:17:59,140 smaller than the student we returned. 531 00:18:00,210 --> 00:18:00,960 So this is going to give us 532 00:18:01,150 --> 00:18:02,720 exactly the same result and 533 00:18:02,820 --> 00:18:04,480 we will see we will get the same set of students. 534 00:18:05,030 --> 00:18:05,930 No students from a high 535 00:18:06,030 --> 00:18:07,100 school with two hundred but we 536 00:18:07,200 --> 00:18:08,070 get all the rest of the 537 00:18:08,140 --> 00:18:10,490 students without using any or all. 538 00:18:11,770 --> 00:18:13,750 And just to emphasize, any or 539 00:18:13,900 --> 00:18:16,460 all are very convenient for writing queries, but they aren't necessary. 540 00:18:17,410 --> 00:18:18,740 It turns out we can always write 541 00:18:18,950 --> 00:18:19,810 a query that would use 542 00:18:20,120 --> 00:18:21,480 any or all by using the 543 00:18:21,720 --> 00:18:23,630 exists operator or not exists instead. 544 00:18:25,790 --> 00:18:27,560 As a grand finale query, let's 545 00:18:27,840 --> 00:18:28,880 revisit the query that finds 546 00:18:29,220 --> 00:18:30,280 students who have applied to major 547 00:18:30,670 --> 00:18:32,960 CS and have not applied to major in EE. 548 00:18:33,210 --> 00:18:34,590 And now we're going to write 549 00:18:34,760 --> 00:18:36,040 that query using the any 550 00:18:36,190 --> 00:18:38,230 operator, two instances of it. 551 00:18:38,760 --> 00:18:39,770 Now let's go back and look 552 00:18:39,910 --> 00:18:41,230 when we wrote the query using in 553 00:18:41,630 --> 00:18:42,760 and not in, and we see 554 00:18:42,880 --> 00:18:44,800 that there are three results, Helen, Irene and Craig. 555 00:18:45,510 --> 00:18:46,590 And now let's look at 556 00:18:46,650 --> 00:18:48,740 our query using any and not equal to any. 557 00:18:49,320 --> 00:18:50,680 So we find students where their 558 00:18:50,960 --> 00:18:53,110 SID is in the set 559 00:18:53,340 --> 00:18:54,030 of SID's of students who have 560 00:18:54,120 --> 00:18:56,100 applied to major in CS and 561 00:18:56,290 --> 00:18:57,430 their SID is not equal to 562 00:18:57,650 --> 00:18:58,690 any of the students who 563 00:18:58,760 --> 00:19:01,270 have applied to major in EE and let's run the query. 564 00:19:03,340 --> 00:19:04,200 Well we got the wrong 565 00:19:05,280 --> 00:19:06,640 answer and that was once again intentional. 566 00:19:07,710 --> 00:19:10,100 Let's look very closely at what this query asks for. 567 00:19:10,850 --> 00:19:11,990 This query asks for students 568 00:19:12,370 --> 00:19:14,030 students where their ID is in 569 00:19:14,190 --> 00:19:15,240 the set of ID's that 570 00:19:15,430 --> 00:19:16,830 applied to CS, that's all fine 571 00:19:17,100 --> 00:19:19,400 and good but this is where things got tricky. 572 00:19:20,060 --> 00:19:21,320 What we're saying here is 573 00:19:22,890 --> 00:19:24,360 that we want 574 00:19:24,710 --> 00:19:26,360 the condition to check whether 575 00:19:27,040 --> 00:19:28,880 there's any element in the 576 00:19:28,970 --> 00:19:30,070 set of EE's that are 577 00:19:30,180 --> 00:19:31,570 not equal to this SID. 578 00:19:32,330 --> 00:19:33,730 So, in fact, this second condition 579 00:19:34,210 --> 00:19:35,460 is satisfied as long as 580 00:19:35,590 --> 00:19:37,000 there's anybody who applied to 581 00:19:37,320 --> 00:19:38,870 EE that's not equal to 582 00:19:38,980 --> 00:19:40,590 the student we're looking at and 583 00:19:40,710 --> 00:19:41,840 that of course is commonly the case, 584 00:19:42,230 --> 00:19:43,190 so we made a mistake 585 00:19:43,290 --> 00:19:44,540 here and it's actually quite tricky 586 00:19:45,400 --> 00:19:46,490 to use the any and 587 00:19:46,600 --> 00:19:48,650 all operators and always get the right answer. 588 00:19:49,340 --> 00:19:50,340 It's especially tricky when you 589 00:19:50,430 --> 00:19:51,360 can't just eyeball the answer 590 00:19:51,780 --> 00:19:52,880 and check but the correct 591 00:19:54,120 --> 00:19:55,540 formulation that we want 592 00:19:55,900 --> 00:19:57,310 here is that it's not 593 00:19:57,970 --> 00:19:59,120 the case that the SID 594 00:19:59,430 --> 00:20:01,770 is equal to any member of the set. 595 00:20:02,080 --> 00:20:03,170 In other words, for each member 596 00:20:03,530 --> 00:20:04,740 of the set of the EE 597 00:20:05,060 --> 00:20:07,090 application, that value is 598 00:20:07,240 --> 00:20:08,440 not equal to the SID 599 00:20:09,080 --> 00:20:10,040 of the student we're going to retrieve. 600 00:20:10,860 --> 00:20:12,780 And now we run the query and we get the right answer.