This SQL video, which as usual will consist mostly of running live queries, introduces sub-queries in the where clause. As usual, we start with our basic select from where expression, and we're going to add to it now the ability in the condition part of the select from where expression to include sub-queries. Sub-queries are nested, select statements within the condition, and we'll see they're actually quite powerful. As usual, we'll be using a sample demonstration database that consists of colleges, students, and students applying to colleges. As always, we'll have our four colleges, a bunch of students, and a bunch of records that show students applying to colleges. So let's launch right into our first query that shows a sub-query in the where clause. What this query finds is the ID's and names of all students who have applied to major in CS to some college. So here's the sub-query in the where clause. This expression right here in the where clause finds the ID's of all students who have applied to major in CS. So now we have that set of ID's, our outer query says let's take from the students those students whose ID is in that set, and let's select their ID and their name. So we go ahead and execute the query and we find out that five students have applied to major in CS. Now we actually can do this query without a sub-query in the where clause. So let's take a look. We can do it instead by joining the student relation with the apply relation. So here we do what we learned in previous videos. We take student and apply, we write the joined condition to make sure we're talking about the same student, we make sure they're majoring in CS, and we get their ID and their name. Let's run the query. Whoops, an error. I knew that was coming. This is just to remind you about disambiguating attributes. The ID here in the select clause could have come from student or apply, and even though the value is equal, we do have to disambiguate by putting one of those. So let's put student dot SID and let's run the query. Okay. Now we see we got more students back than we got back when we ran the query using the sub-query. Let's go back and look. We got five results here and we got several more here, but the additional results are actually duplicate values. So we have two copies, for example, of 1-2-3 Amy. The reason for that is that Amy actually applied to major in CS at multiple colleges. So if we go back and we look at the apply data, we'll see that Amy who is 1-2-3 applied to major in CS at Stanford as well as Berkeley. Let's hope she selects Stanford. In any case, that's why we got Amy twice in the join because she applied twice. Back here where we used the sub-query, we were just looking at students and whether their ID was in the set. Okay, so when we do the join we get basically the same answer, but we have some duplicate values. Of course, we can fix that by adding distinct. We run the query and now we have the same result that we got when we used the sub-query. Let's look at some other similar queries and let's focus again on the duplicates issue because it gets a little tricky, as we'll see. This query is very similar to the previous one, finding students who are applying to major in CS, but this time we're only getting the names of the students and not their ID's. So we run the query and we find our same five students. Just want to mention that these two Craigs are two different Craigs. If we go back to our original result, there's three-four-five Craig and five-four-three Craig. So coming back here, we find the names of the students who majored in CS. Now similarly to what we did previously, let's write this query using a join instead of using the sub-query in the where clause. So here's the join. We're joining student and apply on the student's ID majoring in CS, as always. The only difference is that we're just selecting the name. We run the query and we get again many more results than we got previously because we get two copies when a student has applied to to major in CS at two different places. And just as we did before, we can add distinct to get rid of the duplicates. Well, something different happened this time. This time, when we get rid of the duplicates, we only have four results where previously we had five. And the reason is that previously we included the student ID in the result, and so the two instances of Craig were two different Craigs and didn't cause duplicates to be eliminated. We can see that back here: Craig and Craig. But in this result, because we only kept the names, the two copies of Craig turned into one result. Now we might wonder why do we care about duplicates so much. Let's see an example where duplicates really do matter quite a bit. We're going to do exactly the same query again, finding students who have applied to major in CS, but now we're not retrieving the IDs or names, we're retrieving the GPAs of those students. Presumably what we're interested in doing is some analysis of the GPA's of students who choose to apply for CS. So let's run the query. As always, we get our five results. And here's the GPA's of the five students who have applied to major in CS. Once again, this will be the last time, I promise, we'll do this query using a join instead of the sub-query. So here we go. We've got student and apply join on SID, majoring in CS, and returning the GPA. Once again, because we have students who applied multiple times for CS, we're getting more than our five results. So we get a rather large number of results here and again we have duplicates. So here's where the problem comes in. If we use this result to compute the average GPA, we'll be counting some students multiple times. And presumably, that's not what we want to do. Presumably, we want to count each student who's applied to CS once in the computation of, say, the average GPA. That worked in the previous query when we got the five results for the five students who applied to CS. When we do the join, we get too many results. But this time, again, we're going to have a problem when we do select distinct, because some of these students have the same GPA. And now we only have four GPAs instead of the five that we should have. And if we compute the average GPA now, then we're not factoring in one of the student's GPAs. So in this case, neither the version with distinct nor the version without distinct gives us the right number of GPAs. Neither of those will give us the correct average. The only way to get the correct number of duplicates is to use the version of the query that has the sub-query in the where clause. Now let's move to some different examples that also use subqueries in the where clause. You might remember from the previous video when we were learning the difference operator that we had a query that we could write using the difference operator, which in SQL is called accept, but we were unable to write that query without the difference operator. And the query we were trying to write is to find students who have applied to major in CS, but have not applied to major in EE. Now that we have sub-queries in the where clause, we can write that query without using the except operator. And here it is. The query looks for students where their ID is among the set of ID's of students who have applied to CS, but their ID is not among the set of ID's of students who are applying to major in EE. So let's run the query. And we discover that there are three students who have applied to major in CS, but not applied anywhere in EE. By the way, let me just show you a slightly different way to write exactly the same query. You can see that we use this key word combination "not in" to specify that the idea is not in this set. We can actually write it by writing the SID is in the set and then applying a "not" to that result. We'll execute and we'll get exactly the same result. It's fairly common in SQL for there to be multiple ways to write the same query, and we'll see more examples of that later in this video. So far, our examples of sub-queries have used "in" and "not in" for testing membership in the sets that are produced by sub-queries. Our next examples are going to show that we can apply the exists operator to sub-queries just to test whether they're empty or not empty. Here's the query. This query uses exists to check whether a subquery is empty or not empty rather than checking whether values are in the subquery. The other new construct that's begin to introduce here's what's known as a correlated reference. But inside the subquery we're going to refer to a value, C1, that comes from outside the subquery. So let's talk about exactly how this query works. First let me tell you what the query is trying to return. The query is going to find all colleges, such that there's some other college that is in the same state. Okay? So, in our example, just a reminder, we have Stanford, Berkley, MIT and Cornell. So, the two Colleges we should get back are Stanford and Berkeley because in each case there's another college in the same state. So how does this query work? It says we're gonna to take the colleges, and for each college we're going to check whether their exists another college and we're going to call that one C2, where the state of C2 is the same as the state of C1. This is sort of similar to what we saw when were doing self-joins with table variables, but now, the variables are appearing in the outer query and the inner query. Okay. So, let's run the query and we see that we get the wrong answer. That was again intentional. Well, here's the problem. When we were in this query, C1 and C2 could be bound to the same college. So every college is in the same state as another college of the other college could be the same college. What we need to do is add inside the sub-query a further condition that says that C1 and C2 are different colleges. Specifically, C1.Cname is not equal to C2.Cname. Let's run the query and now, we get the correct answer. Now, let's look at some other uses of the exists, construct and sub-query. If you happen to already know SQL a little bit and someone asks you to write a query where you were going going to get a largest value of some type, the first thing you would think of is probably using the max operator, but we don't know max yet, we'll be learning that later videos. And as it happens, a number of queries that are effectively computing a max can be written using sub-queries. And here's our first example. This example is finding the college that has the largest enrollment. And we're going to do that with the sub-query and with the not-exists operator. Specifically, we're going to find all colleges where there does not exist another college whose enrollment is higher than the first college. So let's go ahead and run the query. And, not surprisingly, we get Berkeley as a result. So this is a form of query that we can write any time we're looking for something that's the largest or the smallest. Let's, for example, look for the student with the highest GPA. So we'll change it to the student name, and we'll be looking instead of colleges at students. Otherwise this form of this query is going to stay very similar. We're going to look for students here as well. And finally, we're going to have GPA instead of enrollment. So, the way this query works is it says "I want to find all students, such that there does not exist another student who's GPA is higher". We run the query and we get four results. Seems odd. Actually it's that odd. Let's just add the GPA to our query and we'll see exactly what happened. We can see that these four students all have a GPA of 3.9, so they're all tied for having the highest GPA and very specifically it faithfully runs the query that there does not exist another student whose GPA is higher than these students. Now, let's see if we can write this same query without using a sub query. So as usual, if we want to write a query without a sub query, we'll need to do some type of joint. So we're looking for students with the highest GPAs. So, we need to join two instances of the student relation as we've done here and then we'll apply a condition that the GPA of the first one is greater than the GPA of the second one. We run the query. Wow, we got a lot of answers there. Maybe our problem is just one of duplicates. So first thing we do when it looks too big we add select distinct. Nope, that doesn't solve the problem either. Actually this query is fundamentally wrong and we cannot write the query that finds the student with the highest GPA just by using joins. What this query actually does is it finds all students such that there is some other student whose GPA is lower. In other words, it's finding all students except those who have the lowest GPA. Now let's see another new construct we can use with sub-queries in the where clause and we'll continue with the same query of finding the student with the highest GPA. This query uses the all keyword. What all tells us is that instead of checking whether a value is either in or not in the result of a sub-query, we're going to check whether the value has a certain relationship with all the results of the sub-query. And here, in this case, we're checking to see if the GPA is greater than or equal to all elements of the sub-query that returns the GPA's of all students. If the student's GPA is indeed greater than or equal to all GPA's then the student has the highest GPA in the database. We'll run the query, and we'll get the same four students with the 3.9 GPA. Now, let's try writing this in a slightly different fashion just for fun to illustrate some concepts. Once again, we're going to try to find the students whose GPA is highest in the database. Here's how we're going to do it this time. Instead of using greater than or equal to all, we're going to use greater than all. So, we're going to find all students where their GPA is higher than every other student by saying GPA is greater than all GPA's of students in the database who are not the student we're looking at. By saying that the ID's are not equal. Let's run the query. Well, we got an empty result. Let's think about it for a second. That is actually the correct result to the query. The query itself was incorrect. The query is looking for all students where nobody else has the same GPA as that student. Everybody else's GPA is lower. Well, remember, we had four students with a 3.9 GPA, so none of those students are going to satisfy the query and nor will any other student. So this is an incorrect formulation of the query that we wanted. Now this query would be correct if we knew that every student had the same GP...I'm sorry. This query would be correct if we knew that every student's GPA was unique, because there would be then one student who had the highest GPA. So let's modify the query to instead look for the colleges that have the highest enrollment. Because it turns out in our database every college has a unique enrollment. So we'll change it to getting the C name of college instead of student. And we'll want the enrollment to be greater than all other enrollments. So we'll...enrollment, college, almost done here. Grab a cup of coffee if you want. I just have to make this be C name and this one be C name and we're all set. So what are we asking for here? We're asking for all colleges whose enrollment is greater than all of the enrollments of colleges that are different than the one we're looking at. We run the query and we get Berkeley as a result which is exactly what we'd expect. So far we've seen the keyword all for checking whether a value has relationship with all of the results of a sub query. There's also a key word any that is a companion but instead of having to satisfy a condition with all of the elements of a set any says you must satisfy the condition with at least one element of the set. So what we're going to do now is going to do the same query but we're gonna write it a little differently. Let me just type it in and then explain. Here what we're going to say is get me all colleges where it's not the case that the enrollment is less than or equal to any other college. Okay? So in other words there's no other colleges have bigger enrollment. Just think about it for a second, this is if you happen to know predicate logic. This is an equivalence where when you say for all it's equivalent to not exists not. In any case if you didn't follow that don't let that bother you. Let's run the query and we see we again get the result Berkeley, so again all tests a condition against every element in the result of a sub query, and the condition is true if it's satisfied with every element; whereas any is true if the condition is satisfied with one or more elements of the sub-query. Let's use any for another query. This query finds all students who are not from the smallest high school in the database. So, what we are looking for here is all students, where the size of their high school is greater than any high school size. In other words, a student is going to be returned if there's some other student whose size high school is smaller than this student. We run the query, and there's the result. And you can verify if you look back at the data, but there is in fact, a few, there are students who are from high schools that have 200 students, so here, we have in our result every student that comes from a high school bigger than that. Some systems, notably at the current time SQL Lite, do not support the any and the all operators. We do not lose expressive power, we just have to write those queries a little differently typically using exist or not exists, so let's look at the same query written without using any. Here's what we do. We look for students where there exists some other student, okay, whose high school is smaller than the student we returned. So this is going to give us exactly the same result and we will see we will get the same set of students. No students from a high school with two hundred but we get all the rest of the students without using any or all. And just to emphasize, any or all are very convenient for writing queries, but they aren't necessary. It turns out we can always write a query that would use any or all by using the exists operator or not exists instead. As a grand finale query, let's revisit the query that finds students who have applied to major CS and have not applied to major in EE. And now we're going to write that query using the any operator, two instances of it. Now let's go back and look when we wrote the query using in and not in, and we see that there are three results, Helen, Irene and Craig. And now let's look at our query using any and not equal to any. So we find students where their SID is in the set of SID's of students who have applied to major in CS and their SID is not equal to any of the students who have applied to major in EE and let's run the query. Well we got the wrong answer and that was once again intentional. Let's look very closely at what this query asks for. This query asks for students students where their ID is in the set of ID's that applied to CS, that's all fine and good but this is where things got tricky. What we're saying here is that we want the condition to check whether there's any element in the set of EE's that are not equal to this SID. So, in fact, this second condition is satisfied as long as there's anybody who applied to EE that's not equal to the student we're looking at and that of course is commonly the case, so we made a mistake here and it's actually quite tricky to use the any and all operators and always get the right answer. It's especially tricky when you can't just eyeball the answer and check but the correct formulation that we want here is that it's not the case that the SID is equal to any member of the set. In other words, for each member of the set of the EE application, that value is not equal to the SID of the student we're going to retrieve. And now we run the query and we get the right answer.