1 00:00:00,310 --> 00:00:01,220 In this video we'll be exploring 2 00:00:01,780 --> 00:00:04,780 some further issues involved in recursion in the SQL language. 3 00:00:06,690 --> 00:00:09,130 First a reminder of how SQL implements recursion. 4 00:00:09,840 --> 00:00:11,350 There's a with statement in sequel 5 00:00:12,050 --> 00:00:13,370 that can be specified to have 6 00:00:13,660 --> 00:00:15,430 recursively defined relations in it. 7 00:00:15,990 --> 00:00:17,890 We say with recursive, and 8 00:00:18,030 --> 00:00:19,310 then we define a set of 9 00:00:19,410 --> 00:00:21,430 relations, where the query 10 00:00:21,910 --> 00:00:23,280 to find relation could involve 11 00:00:23,800 --> 00:00:26,020 the relation itself, so that's where recursion pops in. 12 00:00:26,690 --> 00:00:27,990 And then, at the end the 13 00:00:28,350 --> 00:00:29,390 final result is a query 14 00:00:29,690 --> 00:00:31,140 that might involve those recursively defined 15 00:00:31,510 --> 00:00:33,580 relations as well as other tables in the database. 16 00:00:35,040 --> 00:00:36,130 As we saw in the previous 17 00:00:36,650 --> 00:00:38,080 video and demo, it's very 18 00:00:38,450 --> 00:00:40,040 common for recursively defined relations 19 00:00:40,660 --> 00:00:42,300 in the with statement to take 20 00:00:42,510 --> 00:00:43,580 the structure of having a 21 00:00:43,630 --> 00:00:44,930 base query that doesn't involve 22 00:00:45,560 --> 00:00:47,690 R, the recursively define relation, unioned 23 00:00:48,310 --> 00:00:51,280 with the recursive query and we saw many examples of that form. 24 00:00:52,540 --> 00:00:53,410 The first thing I want to talk 25 00:00:53,530 --> 00:00:55,760 about in this video is what's called linear recursion. 26 00:00:57,070 --> 00:00:59,560 Linear recursion specifies that in 27 00:00:59,770 --> 00:01:00,950 the recursive definition of R, 28 00:01:01,460 --> 00:01:02,550 and again let's assume it takes 29 00:01:02,810 --> 00:01:04,730 this form of the base query union and the recursive query. 30 00:01:05,130 --> 00:01:07,020 In the recursive query, There 31 00:01:07,190 --> 00:01:09,120 is only one reference to 32 00:01:09,380 --> 00:01:11,010 the recursively defined relation R. 33 00:01:11,770 --> 00:01:12,690 So let's take a look at 34 00:01:13,240 --> 00:01:14,550 an example to understand linear 35 00:01:14,890 --> 00:01:16,240 recursion and nonlinear recursion. 36 00:01:17,580 --> 00:01:18,460 the first example we used 37 00:01:18,790 --> 00:01:21,310 when we introduced recursion was finding 38 00:01:21,710 --> 00:01:23,640 ancestor relationships from a 39 00:01:23,820 --> 00:01:25,030 base table that just has 40 00:01:25,200 --> 00:01:26,900 parent child relationships so a 41 00:01:27,150 --> 00:01:28,060 basic transit of closure 42 00:01:28,170 --> 00:01:29,280 operation and the query 43 00:01:29,550 --> 00:01:31,960 we wanted to run was to find all of Mary's ancestors. 44 00:01:32,990 --> 00:01:34,520 And here's the query that we wrote. 45 00:01:35,100 --> 00:01:36,310 It does take the form of 46 00:01:36,640 --> 00:01:38,140 having a base query here. 47 00:01:38,810 --> 00:01:39,720 Which says if we have 48 00:01:39,830 --> 00:01:41,960 a parent relationship that's also an ancestor relationship. 49 00:01:43,050 --> 00:01:44,450 And then the recursion occurs in 50 00:01:44,740 --> 00:01:45,610 the second part of the union 51 00:01:46,270 --> 00:01:47,860 where we join the recursively defined 52 00:01:48,280 --> 00:01:50,630 ancestor relationship, ancestor relation, 53 00:01:51,500 --> 00:01:52,960 with parents so that we 54 00:01:53,190 --> 00:01:55,190 extend the ancestors with one more generation. 55 00:01:56,350 --> 00:01:58,770 Now this query does have 56 00:01:59,000 --> 00:02:00,440 linear recursion because we only 57 00:02:00,780 --> 00:02:02,490 have one instance here of 58 00:02:02,750 --> 00:02:04,490 the recursively defined relation, ancestor. 59 00:02:06,190 --> 00:02:07,460 So let's take a 60 00:02:07,510 --> 00:02:09,740 look at what happens underneath when this query is executed. 61 00:02:10,620 --> 00:02:11,910 We start with our parent table. 62 00:02:12,990 --> 00:02:13,910 And here it is with 63 00:02:14,080 --> 00:02:15,210 a parent and child and let's 64 00:02:15,450 --> 00:02:16,690 suppose we have say, Sue 65 00:02:17,280 --> 00:02:18,640 and John, and John 66 00:02:19,010 --> 00:02:21,560 and Mary for example, in our parent table. 67 00:02:22,670 --> 00:02:25,460 Then in what's effectively the first iteration, 68 00:02:26,560 --> 00:02:28,170 the base query here is 69 00:02:28,350 --> 00:02:31,040 run that copies the parent table to the ancestor table. 70 00:02:32,380 --> 00:02:33,630 So now we have Sue and 71 00:02:33,860 --> 00:02:34,870 John, and John and Mary, 72 00:02:35,100 --> 00:02:36,050 and anything else that we had 73 00:02:36,260 --> 00:02:38,330 in the parent table in the ancestor table. 74 00:02:39,040 --> 00:02:40,830 As the iteration continues we're 75 00:02:41,110 --> 00:02:42,700 effectively joining the parent 76 00:02:43,070 --> 00:02:44,270 table and the ancestor table 77 00:02:44,850 --> 00:02:47,240 to get additional tuples in the ancestor table. 78 00:02:48,090 --> 00:02:49,550 For example, we see that 79 00:02:50,210 --> 00:02:51,780 Sue and John, the 80 00:02:51,910 --> 00:02:53,410 Sue and John tuple here, would 81 00:02:53,620 --> 00:02:54,700 join with the John and Mary 82 00:02:55,090 --> 00:02:56,200 tuple, and that would give 83 00:02:56,420 --> 00:02:58,530 us Sue and Mary in the ancestor table. 84 00:02:59,550 --> 00:03:00,960 The iteration continues until there 85 00:03:01,160 --> 00:03:03,260 are no new tuples to add to the ancestor table. 86 00:03:04,310 --> 00:03:05,420 And then we're done with our 87 00:03:05,610 --> 00:03:07,210 recursively defined relation, and we 88 00:03:07,390 --> 00:03:10,170 can go ahead and execute the final query in the with statement. 89 00:03:11,530 --> 00:03:14,380 And again, often when I say we, I really mean we the system. 90 00:03:14,950 --> 00:03:15,750 All of this, is of course, being 91 00:03:16,250 --> 00:03:17,420 performed by the system as it 92 00:03:17,510 --> 00:03:19,370 executes the recursively defined with statement. 93 00:03:20,490 --> 00:03:21,810 Now, let's take a look 94 00:03:22,140 --> 00:03:25,020 at a non-linear expression of the same query. 95 00:03:25,710 --> 00:03:26,250 And here it is. 96 00:03:26,920 --> 00:03:28,400 What we see here is that 97 00:03:28,560 --> 00:03:31,060 the primary change is right in here. 98 00:03:31,800 --> 00:03:33,140 Instead of joining the parent 99 00:03:33,410 --> 00:03:34,920 with the ancestor in the 100 00:03:35,020 --> 00:03:36,300 recursive half, we're going 101 00:03:36,340 --> 00:03:39,360 to join two instances of the ancestor relation. 102 00:03:40,400 --> 00:03:41,800 And let's see what happens during 103 00:03:42,120 --> 00:03:44,430 execution when this is how we express our recursion. 104 00:03:45,930 --> 00:03:47,490 So we again start by 105 00:03:47,710 --> 00:03:48,970 copying the contents of the 106 00:03:49,050 --> 00:03:50,200 parent table into the ancestor 107 00:03:50,710 --> 00:03:52,770 table as part of the base query. 108 00:03:53,250 --> 00:03:54,540 And I've already shown that here. 109 00:03:55,240 --> 00:03:56,610 But, now, instead of during 110 00:03:56,980 --> 00:04:00,170 iteration joining the parent table with the ancestor table. 111 00:04:00,810 --> 00:04:01,730 We're actually going to join 112 00:04:02,260 --> 00:04:05,040 the ancestor table with itself to generate new tuples. 113 00:04:06,060 --> 00:04:07,630 For example, we will join 114 00:04:08,180 --> 00:04:09,640 the first two tuples in ancestor 115 00:04:10,260 --> 00:04:11,300 with each other, Sue-John and John-Mary, 116 00:04:11,490 --> 00:04:13,540 in order to obtain what 117 00:04:13,780 --> 00:04:15,010 was the same tuple we obtained with 118 00:04:15,140 --> 00:04:17,740 the linear recursion, which would be the tuple with Sue and Mary. 119 00:04:18,720 --> 00:04:19,860 Just a quick reminder, I intended 120 00:04:20,310 --> 00:04:21,710 to say this earlier, but it's 121 00:04:21,890 --> 00:04:22,770 the fact that we have these 122 00:04:23,000 --> 00:04:24,850 two references to ancestor in 123 00:04:25,120 --> 00:04:26,690 the recursion here, that makes it non-linear. 124 00:04:28,580 --> 00:04:30,470 OK so what's the deal with these two queries? 125 00:04:30,830 --> 00:04:33,770 Why might we prefer one form of the query over the other? 126 00:04:34,130 --> 00:04:34,940 And take my word for it, 127 00:04:35,020 --> 00:04:35,890 by the way, we do get 128 00:04:36,110 --> 00:04:37,420 equivalent results to the 129 00:04:37,530 --> 00:04:39,720 query in its linear and non-linear versions. 130 00:04:41,230 --> 00:04:44,180 Well, here's some pros and cons to non-linear versus linear. 131 00:04:44,960 --> 00:04:46,380 For this particular query and actually 132 00:04:46,720 --> 00:04:49,490 in general, when we can express a query both ways. 133 00:04:49,970 --> 00:04:51,350 First of all there's some 134 00:04:51,630 --> 00:04:54,460 pluses to the non-linear so the query looks cleaner. 135 00:04:54,740 --> 00:04:55,560 If you go back and look at 136 00:04:55,630 --> 00:04:57,570 the 2 queries the non-linear version 137 00:04:57,920 --> 00:04:59,350 is sort of more symmetric, a 138 00:04:59,500 --> 00:05:02,700 little shorter even to express, than the linear version. 139 00:05:03,880 --> 00:05:04,820 Second of all, the nonlinear 140 00:05:05,680 --> 00:05:08,310 version actually converges faster to 141 00:05:08,570 --> 00:05:09,890 the fixed point, to the final state, 142 00:05:10,590 --> 00:05:11,270 than the linear version. 143 00:05:11,780 --> 00:05:13,300 And I'm going to show that a 144 00:05:13,420 --> 00:05:15,610 little bit abstractly because it is actually fairly important. 145 00:05:16,790 --> 00:05:17,580 So I'm going to create this 146 00:05:18,100 --> 00:05:19,960 abstract example, parent-child 147 00:05:20,360 --> 00:05:22,220 relation, which is going 148 00:05:22,470 --> 00:05:24,440 to be completely linear, just for illustrative purposes. 149 00:05:24,930 --> 00:05:26,170 So we have this person here 150 00:05:26,600 --> 00:05:27,400 who's the parent of the person 151 00:05:27,740 --> 00:05:29,050 here, who's the parent of 152 00:05:29,160 --> 00:05:30,620 a person here, and so on. 153 00:05:30,910 --> 00:05:32,000 We're going to make it eight levels deep. 154 00:05:33,140 --> 00:05:34,710 So this is an abstraction of 155 00:05:34,840 --> 00:05:36,340 our "a parent" table, and 156 00:05:36,460 --> 00:05:38,250 now let's see how ancestors are computed. 157 00:05:39,190 --> 00:05:40,180 So in the first step 158 00:05:40,570 --> 00:05:42,590 we'll add one ancestor tuple 159 00:05:42,990 --> 00:05:44,190 for each tuple in the 160 00:05:44,260 --> 00:05:46,140 parent relation, so the 161 00:05:46,370 --> 00:05:48,220 purple are the tuples that are added to ancestor. 162 00:05:49,210 --> 00:05:50,700 Then in the second iteration 163 00:05:51,030 --> 00:05:52,780 we're going to join those with themselves. 164 00:05:53,780 --> 00:05:55,670 I'm sorry, we're going to 165 00:05:55,790 --> 00:05:56,850 join the ancestor tuples with parent tuples 166 00:05:57,740 --> 00:05:59,500 so each ancestor tuple could 167 00:05:59,790 --> 00:06:01,020 be extended by one. 168 00:06:01,620 --> 00:06:04,020 So that's going to give us all pairs of tuples. 169 00:06:05,700 --> 00:06:08,330 I'm sorry, it's already getting a bit crowded here, but I think that you will get the idea. 170 00:06:09,090 --> 00:06:10,530 On the next iteration we're going 171 00:06:10,760 --> 00:06:11,800 to again take our ancestor 172 00:06:12,280 --> 00:06:14,870 tuple and extend them by one, by joining them with parent. 173 00:06:15,460 --> 00:06:16,670 So after the second we'll 174 00:06:16,830 --> 00:06:20,190 have all triples here, 175 00:06:20,550 --> 00:06:23,160 so all great-grandparent relationships. 176 00:06:24,380 --> 00:06:25,310 OK, and that's a big mess. 177 00:06:25,450 --> 00:06:26,490 But you can really see what's going on. 178 00:06:26,730 --> 00:06:28,020 Each time we iterate, we get 179 00:06:28,200 --> 00:06:30,690 one more generation added into the ancestors. 180 00:06:32,200 --> 00:06:34,850 And now let's think about what happens when we use the non-linear version. 181 00:06:35,600 --> 00:06:36,670 Where after the first step, we 182 00:06:36,780 --> 00:06:39,740 join ancestor with itself instead of ancestor with parent. 183 00:06:40,470 --> 00:06:41,510 So as before on the 184 00:06:41,710 --> 00:06:42,540 first step, and now I'm 185 00:06:42,640 --> 00:06:44,010 going to make these red, the ancestor 186 00:06:44,830 --> 00:06:47,270 relation will contain exactly the same as the parents. 187 00:06:47,820 --> 00:06:48,830 And the second step is 188 00:06:48,930 --> 00:06:49,970 the same as well, we're going 189 00:06:50,020 --> 00:06:51,710 to join ancestor with itself but, 190 00:06:51,880 --> 00:06:52,730 since each one of ancestor 191 00:06:53,370 --> 00:06:55,190 is only the parent relationship, we're 192 00:06:55,340 --> 00:06:56,420 again, going to get all pairs 193 00:06:56,940 --> 00:06:58,190 in the second step of the iteration. 194 00:06:59,470 --> 00:07:01,120 The difference begins in the third step. 195 00:07:01,680 --> 00:07:03,570 Now we're joining ancestor with itself. 196 00:07:04,590 --> 00:07:05,630 So we will be joining these 197 00:07:05,870 --> 00:07:07,260 two-step ancestors with the 198 00:07:07,330 --> 00:07:09,780 single ones, just like before, to get all the threes. 199 00:07:10,770 --> 00:07:13,520 But we will also be joining twos with twos. 200 00:07:13,820 --> 00:07:14,810 In other words we will joining 201 00:07:15,040 --> 00:07:17,310 grandparent relationships with grandparent relationships. 202 00:07:17,680 --> 00:07:20,090 And we will be getting in that same iteration the fours. 203 00:07:21,500 --> 00:07:22,920 So as you can see, the 204 00:07:23,240 --> 00:07:24,880 nonlinear version does converge faster. 205 00:07:25,510 --> 00:07:26,630 Now this example is very 206 00:07:26,890 --> 00:07:28,560 small so it's not as blatantly obvious. 207 00:07:28,950 --> 00:07:30,580 But the linear version 208 00:07:31,030 --> 00:07:32,010 is going to take a linear number 209 00:07:32,330 --> 00:07:33,800 of iterations in order to 210 00:07:33,910 --> 00:07:35,650 converge to the final recursively 211 00:07:36,340 --> 00:07:37,470 defined relation contents. 212 00:07:38,550 --> 00:07:40,990 Whereas when we use the non-linear version, it's actually logarithmic. 213 00:07:41,650 --> 00:07:44,360 So for a large database it can be considerably faster. 214 00:07:45,720 --> 00:07:47,640 So what about the downsides of non-linear recursion? 215 00:07:48,550 --> 00:07:50,110 Well, the major downside is 216 00:07:50,510 --> 00:07:51,950 that it's harder to implement or 217 00:07:52,370 --> 00:07:53,590 certainly harder to implement efficiently. 218 00:07:54,960 --> 00:07:55,900 And as a result of that actually 219 00:07:56,420 --> 00:07:58,560 the sequel standard only requires linear recursion. 220 00:07:59,210 --> 00:08:00,750 And the postgres system that 221 00:08:00,900 --> 00:08:03,770 we've been using also only supports linear recursion. 222 00:08:05,080 --> 00:08:06,060 So back to the basic 223 00:08:06,450 --> 00:08:07,660 form of our with recursive statement 224 00:08:08,160 --> 00:08:09,360 in order to introduce a different 225 00:08:09,680 --> 00:08:11,450 topic, which is the topic of mutual recursion. 226 00:08:12,710 --> 00:08:14,010 Mutual recursion, as I 227 00:08:14,130 --> 00:08:15,190 alluded to in the previous video, 228 00:08:15,890 --> 00:08:17,600 is the case where one of 229 00:08:17,660 --> 00:08:19,270 our recursively defined relations 230 00:08:19,800 --> 00:08:21,230 does not refer to itself but 231 00:08:21,530 --> 00:08:23,590 rather to a different recursively defined relation. 232 00:08:24,410 --> 00:08:26,330 And that one refers back to the first one. 233 00:08:26,490 --> 00:08:28,580 Or we could even have a loop of three or four or more. 234 00:08:29,220 --> 00:08:30,180 So, the idea is that 235 00:08:30,300 --> 00:08:31,670 we can't look at these individually to 236 00:08:31,880 --> 00:08:33,380 see the recursion but together they 237 00:08:33,600 --> 00:08:35,490 are recursive and they have 238 00:08:35,740 --> 00:08:37,140 to kind of be computed in tandem. 239 00:08:38,130 --> 00:08:39,250 So, the example I'm going to 240 00:08:39,310 --> 00:08:40,900 use here is what's known 241 00:08:41,050 --> 00:08:43,220 as hubs and authorities. 242 00:08:43,750 --> 00:08:47,250 Hubs and authorities was an algorithm for web searching actually. 243 00:08:47,620 --> 00:08:50,450 For annotating web nodes for the purposes of searching. 244 00:08:51,030 --> 00:08:52,070 Was developed around the same 245 00:08:52,240 --> 00:08:55,360 time as Google's page rank, I guess we can see which one won out. 246 00:08:55,930 --> 00:08:57,530 But hubs and authorities actually quite 247 00:08:57,950 --> 00:09:00,140 interesting just in what it does. 248 00:09:00,670 --> 00:09:01,420 Let me go ahead and 249 00:09:01,740 --> 00:09:02,650 define the meaning of Hudson 250 00:09:02,980 --> 00:09:03,980 authorities, and then show how 251 00:09:04,130 --> 00:09:05,330 mutual recursion in SQL 252 00:09:05,820 --> 00:09:06,860 can be used to compute the 253 00:09:07,080 --> 00:09:08,310 Hudson authorities in a database 254 00:09:08,530 --> 00:09:11,220 that contains a link, a structure, a graph basically. 255 00:09:12,660 --> 00:09:14,360 So here's a little graph and 256 00:09:14,460 --> 00:09:15,840 we're going to assume that each 257 00:09:16,120 --> 00:09:17,390 node has a number, say, 258 00:09:17,710 --> 00:09:19,200 associated with it, and that 259 00:09:19,360 --> 00:09:21,290 we have a relation called 260 00:09:21,590 --> 00:09:22,850 "link", that just tells 261 00:09:23,140 --> 00:09:24,480 us the edges of 262 00:09:24,570 --> 00:09:27,120 the graph, so the source node and the destination node. 263 00:09:28,070 --> 00:09:28,990 So in a graph, we're going 264 00:09:29,050 --> 00:09:30,570 to designate some of the 265 00:09:30,700 --> 00:09:32,050 nodes as hub nodes 266 00:09:32,550 --> 00:09:34,520 and some of the nodes as authority nodes. 267 00:09:35,480 --> 00:09:36,210 And we are going to define 268 00:09:36,830 --> 00:09:38,330 a hub node to be 269 00:09:38,470 --> 00:09:40,360 a node that points to 270 00:09:41,510 --> 00:09:42,870 at least some number, let's 271 00:09:43,080 --> 00:09:44,540 say three authority nodes. 272 00:09:46,130 --> 00:09:47,190 And similarly, we're going to 273 00:09:47,290 --> 00:09:48,570 say an authority node is a 274 00:09:48,850 --> 00:09:51,360 node that's pointed to by, 275 00:09:52,200 --> 00:09:54,590 let's say at least three again, hub notes. 276 00:09:54,960 --> 00:09:57,670 And by the way this numbers three and three don't have to be the same. 277 00:09:58,490 --> 00:09:59,860 And another thing I 278 00:09:59,970 --> 00:10:00,810 wanted to mention is in a 279 00:10:00,910 --> 00:10:02,010 graph say representing the web, 280 00:10:02,450 --> 00:10:03,920 we wouldn't expect a large fraction 281 00:10:04,450 --> 00:10:06,780 of the notes to be hubs in authorities, many would be normal notes. 282 00:10:07,090 --> 00:10:08,200 But, again, this just for 283 00:10:08,270 --> 00:10:09,590 illustrative purposes, but it also 284 00:10:09,790 --> 00:10:10,910 serves to teach you about the 285 00:10:11,200 --> 00:10:12,900 hubs in authorities concept, which is kind of interesting. 286 00:10:14,030 --> 00:10:15,290 Now you can see already how mutual 287 00:10:15,880 --> 00:10:17,320 recursion is going to fit into the picture. 288 00:10:18,100 --> 00:10:19,200 But how are we gonna get started? 289 00:10:19,820 --> 00:10:20,740 The only way you can 290 00:10:20,970 --> 00:10:22,160 actually get started is to 291 00:10:22,700 --> 00:10:23,730 have some nodes that are predesignated 292 00:10:25,450 --> 00:10:26,010 as hubs and authorities. 293 00:10:27,130 --> 00:10:28,550 For example, if we predesignated 294 00:10:29,810 --> 00:10:31,350 as authorities these three middle 295 00:10:31,660 --> 00:10:33,470 nodes here, then we could 296 00:10:33,840 --> 00:10:36,200 compute the fact that node one is a hub. 297 00:10:37,180 --> 00:10:38,430 So, we'll also assume that we 298 00:10:38,630 --> 00:10:40,240 have two more relations in our database. 299 00:10:41,110 --> 00:10:42,110 One of them gives us 300 00:10:42,260 --> 00:10:43,830 a set of notes predesignated as 301 00:10:44,000 --> 00:10:45,040 hugs and the other a 302 00:10:45,190 --> 00:10:46,610 set of notes predesignated as authorities. 303 00:10:47,560 --> 00:10:48,570 And our job is to 304 00:10:48,710 --> 00:10:50,300 write a query that computes 305 00:10:50,770 --> 00:10:51,760 all of the hub and authority 306 00:10:52,230 --> 00:10:55,270 nodes based on this mutually recursive definition here. 307 00:10:56,650 --> 00:10:57,850 So here is the query that does it. 308 00:10:58,070 --> 00:10:58,930 By the way, you have 309 00:10:59,120 --> 00:11:00,100 certainly noticed that I'm not doing 310 00:11:00,370 --> 00:11:01,500 a live demo of the 311 00:11:01,580 --> 00:11:03,450 recursive queries in this particular video. 312 00:11:04,420 --> 00:11:06,310 Nonlinear recursion is not 313 00:11:06,600 --> 00:11:07,760 supported in the Postgres system, 314 00:11:09,030 --> 00:11:10,930 and it's also not part of thAQL standard. 315 00:11:11,710 --> 00:11:13,340 Mutual recursion in limited 316 00:11:13,510 --> 00:11:14,630 forms is part of the 317 00:11:14,700 --> 00:11:15,730 SQL standard but it's also 318 00:11:16,180 --> 00:11:17,990 currently not supported in Postgres, 319 00:11:18,500 --> 00:11:19,840 so I've used the nice 320 00:11:20,820 --> 00:11:22,710 interface is here to get the coloring of our queries. 321 00:11:22,880 --> 00:11:24,460 But these queries currently don't run 322 00:11:25,060 --> 00:11:26,120 on the systems that we're using. 323 00:11:27,080 --> 00:11:29,600 OK, so back to our actual query here. 324 00:11:30,070 --> 00:11:30,980 So, this is a query to 325 00:11:31,140 --> 00:11:32,210 compute hubs and authorities 326 00:11:32,750 --> 00:11:33,670 given that we have a starting 327 00:11:33,920 --> 00:11:36,230 set of hub nodes and a starting set of authority nodes. 328 00:11:37,010 --> 00:11:39,980 And, then we have the link relation that gives us the structure of our graph. 329 00:11:41,040 --> 00:11:42,640 So we're going to compute to relations. 330 00:11:43,270 --> 00:11:45,010 The hub relations with the nodes that are hubs. 331 00:11:45,400 --> 00:11:46,590 The authorities relation of the 332 00:11:46,720 --> 00:11:48,370 nodes that are authorities, and they 333 00:11:48,790 --> 00:11:51,330 are going to have mutual recursion between them. 334 00:11:51,860 --> 00:11:52,710 So, let's take a look first 335 00:11:53,080 --> 00:11:54,320 at the hubs and we'll see 336 00:11:54,560 --> 00:11:55,360 that the structure of the queries 337 00:11:55,800 --> 00:11:57,640 for hubs and authorities is very, very similar. 338 00:11:58,470 --> 00:11:59,710 So the base case for the 339 00:11:59,810 --> 00:12:00,930 hubs is that the nodes 340 00:12:01,410 --> 00:12:02,350 that are in the hub start relation 341 00:12:03,220 --> 00:12:04,730 are in the hub relation, of course. 342 00:12:05,790 --> 00:12:09,370 And then the recursive query here is a little bit complex. 343 00:12:10,360 --> 00:12:11,120 So what we're going to find 344 00:12:11,730 --> 00:12:13,620 is links, elements in 345 00:12:13,720 --> 00:12:15,950 our link relation, where the 346 00:12:16,080 --> 00:12:17,490 destination is an authority 347 00:12:18,480 --> 00:12:19,650 and so we're going to find all 348 00:12:19,890 --> 00:12:21,640 of the sources that point to an authority. 349 00:12:22,070 --> 00:12:23,090 We're going to group by the 350 00:12:23,180 --> 00:12:25,880 source, so we consider each node, one at a time. 351 00:12:26,570 --> 00:12:27,820 And then we count how many 352 00:12:28,190 --> 00:12:30,420 times it appears in the link pointing to an authority. 353 00:12:31,380 --> 00:12:32,510 So, this is going to give us 354 00:12:32,870 --> 00:12:34,430 a nodes that point to greater than 355 00:12:34,770 --> 00:12:37,670 or equal to three authorities which was our definition of hubs. 356 00:12:38,800 --> 00:12:39,780 Now here of course we're referring 357 00:12:40,450 --> 00:12:43,780 to authority which itself is a recursively defined relation. 358 00:12:44,880 --> 00:12:46,930 The authority relation is very 359 00:12:47,300 --> 00:12:48,450 similar, as I said we start 360 00:12:49,170 --> 00:12:50,520 with our base case of adding 361 00:12:50,930 --> 00:12:52,800 nodes that are in the authority start relation. 362 00:12:53,890 --> 00:12:55,510 And, then we consider destinations 363 00:12:56,510 --> 00:12:58,050 instead of sources in our 364 00:12:58,210 --> 00:12:59,670 link relation such that there 365 00:13:00,090 --> 00:13:02,000 are at least three sources that 366 00:13:02,180 --> 00:13:03,810 are hubs and that's what we've got down here. 367 00:13:04,550 --> 00:13:05,460 So, this is going to give 368 00:13:06,100 --> 00:13:08,090 us elements that are 369 00:13:08,270 --> 00:13:10,800 pointed to by greater than or equal to three hubs. 370 00:13:11,800 --> 00:13:12,810 And here of course we 371 00:13:12,830 --> 00:13:15,380 are using hub, which is also a recursively defined relation. 372 00:13:16,130 --> 00:13:18,380 So you can think of these two as working in tandem. 373 00:13:18,660 --> 00:13:19,360 You can think of the system 374 00:13:19,900 --> 00:13:21,880 as sort of iteratively adding to 375 00:13:22,110 --> 00:13:24,000 the hubs and the authorities until 376 00:13:24,380 --> 00:13:25,990 there's nothing more to add to either one. 377 00:13:27,160 --> 00:13:28,160 Now one thing that this 378 00:13:28,560 --> 00:13:29,610 definition of hubs of authorities 379 00:13:30,150 --> 00:13:32,260 and this computation allows, is for 380 00:13:32,380 --> 00:13:34,090 a node to be both a hub and an authority. 381 00:13:34,560 --> 00:13:35,850 And there is nothing wrong with that 382 00:13:36,020 --> 00:13:38,100 if the structure of the graph yields that result. 383 00:13:39,160 --> 00:13:40,310 But, let's suppose we don't want 384 00:13:40,510 --> 00:13:42,170 nodes to be allowed to be both hubs and authorities. 385 00:13:42,610 --> 00:13:44,520 We want every node to be either one or the other. 386 00:13:45,500 --> 00:13:47,730 That will require us to modify our query. 387 00:13:48,490 --> 00:13:51,600 To not add nodes to become hubs if they're already authorities. 388 00:13:52,210 --> 00:13:54,300 Or to have nodes become authorities if they're already hubs. 389 00:13:55,130 --> 00:13:56,190 So let's go ahead and 390 00:13:56,490 --> 00:13:58,730 modify the query to incorporate that additional constraint. 391 00:13:59,960 --> 00:14:01,420 So, here's the query and, by 392 00:14:01,600 --> 00:14:02,860 the way, just a reminder that you 393 00:14:02,920 --> 00:14:04,450 can download these queries from 394 00:14:04,860 --> 00:14:06,100 our website even though 395 00:14:06,180 --> 00:14:07,770 you can't run them at this point in time. 396 00:14:08,380 --> 00:14:09,710 The difference in this query from 397 00:14:09,910 --> 00:14:11,350 the previous one is one 398 00:14:11,530 --> 00:14:12,520 additional condition in the definition 399 00:14:13,150 --> 00:14:14,640 of hubs right here, saying 400 00:14:14,890 --> 00:14:16,680 I'm not going to add a 401 00:14:16,730 --> 00:14:18,430 node, a source node to 402 00:14:18,670 --> 00:14:19,960 the hubs, if it's already in 403 00:14:20,040 --> 00:14:22,040 the authorities and similarly I've 404 00:14:22,440 --> 00:14:23,600 added one more condition here 405 00:14:23,830 --> 00:14:25,040 in authorities that I'm not going 406 00:14:25,300 --> 00:14:27,220 to add a node to authorities if it's already a hub. 407 00:14:28,120 --> 00:14:29,880 Now let's suppose we have the following graph structure. 408 00:14:30,340 --> 00:14:31,710 We have a node here that 409 00:14:32,080 --> 00:14:33,940 hasn't been labeled as a hub or authority yet. 410 00:14:34,810 --> 00:14:36,540 And let's suppose that this node 411 00:14:36,840 --> 00:14:38,420 is pointed to by three 412 00:14:38,740 --> 00:14:40,470 nodes that have already been 413 00:14:41,150 --> 00:14:42,830 designated as hub nodes. 414 00:14:43,720 --> 00:14:45,390 And furthermore, this node points 415 00:14:45,990 --> 00:14:48,010 to three nodes that have 416 00:14:48,350 --> 00:14:50,120 already been designated as authorities. 417 00:14:51,570 --> 00:14:52,860 So by our definition, this node 418 00:14:53,160 --> 00:14:54,290 could be a hub because it 419 00:14:54,450 --> 00:14:55,770 points to three authorities and it 420 00:14:55,940 --> 00:14:56,900 could also be an authority 421 00:14:57,390 --> 00:14:58,760 because it is pointed to by three hubs. 422 00:15:00,180 --> 00:15:01,170 But the query we've given now 423 00:15:01,660 --> 00:15:02,720 is not going to allow us 424 00:15:02,940 --> 00:15:04,730 to label this node as both a hub and authority. 425 00:15:05,570 --> 00:15:06,800 And just to be clear in the 426 00:15:07,050 --> 00:15:08,430 previous query we would 427 00:15:08,800 --> 00:15:10,160 have put this node in 428 00:15:10,290 --> 00:15:11,660 both the hub relation and the 429 00:15:11,820 --> 00:15:13,110 authority relation but, now, 430 00:15:13,390 --> 00:15:14,070 we're not going to be able to 431 00:15:14,140 --> 00:15:16,600 do that because of these conditions right down here. 432 00:15:17,370 --> 00:15:18,750 So, actually, whether this node 433 00:15:19,700 --> 00:15:20,590 ends up as a hub 434 00:15:21,230 --> 00:15:23,350 or an authority depends on effectively 435 00:15:24,040 --> 00:15:25,110 which one of these arms 436 00:15:26,070 --> 00:15:28,220 of our with statement gets executed first. 437 00:15:28,910 --> 00:15:30,170 If we first consider the possibility 438 00:15:30,900 --> 00:15:32,050 of the node being a hub, then 439 00:15:32,240 --> 00:15:34,140 it will be put in the hub relation and 440 00:15:34,750 --> 00:15:37,370 then it won't be allowed to be put in the authority relation. 441 00:15:38,050 --> 00:15:39,310 On the other hand, if we 442 00:15:39,500 --> 00:15:41,680 first make an authority, then when 443 00:15:41,820 --> 00:15:43,040 we look for computing 444 00:15:43,440 --> 00:15:45,280 the hub relation, it wouldn't be allowed to be a hub. 445 00:15:46,060 --> 00:15:46,990 So you can think of this 446 00:15:47,500 --> 00:15:49,180 as a sort of non-deterministic behavior 447 00:15:49,990 --> 00:15:51,190 or if you're into theory there's a 448 00:15:51,240 --> 00:15:52,890 non-unique fixed point of 449 00:15:53,020 --> 00:15:55,910 the recursion, and this is considered as a not good thing. 450 00:15:56,190 --> 00:15:57,680 Generally, database people, when 451 00:15:57,850 --> 00:15:58,950 they run queries, would like to 452 00:15:59,060 --> 00:16:01,110 have one answer all the time. 453 00:16:01,400 --> 00:16:02,980 They like to have deterministic answers for 454 00:16:03,080 --> 00:16:04,640 their queries, so actually this 455 00:16:05,000 --> 00:16:06,500 type of mutual recursion 456 00:16:07,550 --> 00:16:08,870 is not allowed in the 457 00:16:08,930 --> 00:16:10,320 SQL standard and the 458 00:16:10,440 --> 00:16:11,320 real crux of the problem 459 00:16:11,770 --> 00:16:13,560 here is that one recursively 460 00:16:14,070 --> 00:16:15,940 defined relation is depending 461 00:16:16,350 --> 00:16:17,710 negatively on another one. 462 00:16:19,220 --> 00:16:21,790 So this negative dependence is what causes the problem. 463 00:16:22,540 --> 00:16:23,450 And actually we can have 464 00:16:23,800 --> 00:16:26,500 a negative dependence even without mutual recursion. 465 00:16:27,030 --> 00:16:28,210 We could define a relation 466 00:16:28,990 --> 00:16:30,750 that sort of depends negatively on 467 00:16:30,900 --> 00:16:33,170 itself in a sub-query and that wouldn't be allowed either. 468 00:16:33,510 --> 00:16:35,020 So that completes the example 469 00:16:35,340 --> 00:16:36,710 of hubs authorities and again, 470 00:16:36,970 --> 00:16:37,850 what we're trying to show first 471 00:16:38,090 --> 00:16:40,400 of all, is mutual recursion which can be quite powerful. 472 00:16:41,300 --> 00:16:42,220 And, second of all, the restriction 473 00:16:42,300 --> 00:16:43,850 that we can't have 474 00:16:44,110 --> 00:16:46,800 negative subqueries across recursively defined relations. 475 00:16:48,390 --> 00:16:49,290 The last thing that I wanted to 476 00:16:49,470 --> 00:16:50,580 mention in this video, it's not 477 00:16:50,870 --> 00:16:51,770 in the title of the video 478 00:16:51,950 --> 00:16:54,340 since we are focusing mostly on nonlinear and mutual recursion 479 00:16:55,290 --> 00:16:56,260 is recursion with aggregation. 480 00:16:57,120 --> 00:16:59,350 And let me just show a simple abstract example. 481 00:17:00,600 --> 00:17:01,940 So we have a relation P 482 00:17:02,170 --> 00:17:04,710 that just contains one attribute we can assume that it's integers. 483 00:17:05,910 --> 00:17:07,000 And we're going to try in 484 00:17:07,130 --> 00:17:08,900 our with recursive statement to computer 485 00:17:09,360 --> 00:17:10,800 recursively define relation called R 486 00:17:11,440 --> 00:17:12,740 that contains the tuples in 487 00:17:12,900 --> 00:17:15,690 P, together with the 488 00:17:15,940 --> 00:17:17,450 sum of the values in 489 00:17:17,730 --> 00:17:18,600 the attribute of P, I will 490 00:17:18,700 --> 00:17:19,550 just write that as sum of P. 491 00:17:20,500 --> 00:17:21,650 So here's how we write it in SQL. 492 00:17:22,690 --> 00:17:23,870 We have our base case 493 00:17:24,210 --> 00:17:25,670 which is that the tuples in 494 00:17:25,880 --> 00:17:26,880 P are also in R. 495 00:17:27,400 --> 00:17:28,330 And then we do our 496 00:17:28,470 --> 00:17:29,880 UNION of the recursive part 497 00:17:30,320 --> 00:17:31,800 which says, and also in 498 00:17:31,920 --> 00:17:32,920 R I want to have the 499 00:17:32,980 --> 00:17:33,910 sum of the tuples in R. 500 00:17:35,250 --> 00:17:36,470 So let's say that P starts 501 00:17:36,820 --> 00:17:39,480 out with two tuples, the values one and two. 502 00:17:40,200 --> 00:17:43,160 So what does the query compute for R? 503 00:17:43,770 --> 00:17:44,870 Well certainly one and two 504 00:17:45,220 --> 00:17:47,410 are in R based on the first part here. 505 00:17:48,550 --> 00:17:49,780 And then based on the 506 00:17:49,850 --> 00:17:51,530 second part then in 507 00:17:51,630 --> 00:17:52,860 the first iteration R should 508 00:17:53,030 --> 00:17:55,920 also contain the sum of R, which is 3. 509 00:17:56,840 --> 00:17:57,990 Except as soon as we 510 00:17:58,250 --> 00:17:59,160 put three in the sum 511 00:17:59,490 --> 00:18:01,100 of R isn't three anymore, the 512 00:18:01,210 --> 00:18:02,920 sum of R is six. 513 00:18:03,090 --> 00:18:05,810 So shall we cross out the 3 in and put six there? 514 00:18:06,040 --> 00:18:07,180 But then now the sum 515 00:18:08,050 --> 00:18:10,770 of R has become six, seven, eight, nine. 516 00:18:11,640 --> 00:18:12,330 You can see the problem. 517 00:18:12,940 --> 00:18:15,270 There's no good definition for 518 00:18:15,390 --> 00:18:17,760 what R should contain based on this recursion. 519 00:18:18,220 --> 00:18:19,910 And for that reason actually, 520 00:18:20,270 --> 00:18:22,010 recursion with aggregation is disallowed 521 00:18:22,710 --> 00:18:24,840 in the SQL standard and isn't supported by any system. 522 00:18:26,220 --> 00:18:27,630 So, to summarize about both of 523 00:18:27,710 --> 00:18:29,440 our videos about recursion, SQL has 524 00:18:30,020 --> 00:18:31,460 introduced recursion into the standard 525 00:18:32,430 --> 00:18:33,480 as part of the WITH statement. 526 00:18:34,320 --> 00:18:35,820 Whether the keyword RECURSIVE goes 527 00:18:36,100 --> 00:18:36,980 with the WITH, or with recursively 528 00:18:37,610 --> 00:18:38,920 defined relations is a 529 00:18:38,950 --> 00:18:42,040 bit inconsistent, but in any case, the basic idea is the same. 530 00:18:42,490 --> 00:18:44,250 When we have this statement, we 531 00:18:44,410 --> 00:18:45,670 can write queries that refer 532 00:18:46,150 --> 00:18:47,220 to the relation that's being defined. 533 00:18:48,070 --> 00:18:49,330 And we can also have mutual recursions 534 00:18:50,310 --> 00:18:51,250 between the queries that are 535 00:18:51,320 --> 00:18:52,030 defined in the with statement, 536 00:18:52,580 --> 00:18:53,840 and finally the result 537 00:18:54,180 --> 00:18:55,440 is a running of the 538 00:18:55,550 --> 00:18:57,790 final query which might involve the recursively defined relations. 539 00:18:59,000 --> 00:19:00,310 Adding recursion to SQL does 540 00:19:00,680 --> 00:19:02,030 strictly extend it's expressiveness. 541 00:19:03,200 --> 00:19:05,250 There are queries that can't be written without recursion. 542 00:19:05,790 --> 00:19:07,260 They usually involve some type 543 00:19:07,480 --> 00:19:09,340 of unbounded computation, for example, 544 00:19:10,420 --> 00:19:12,980 computing any number of flights or any depths of ancestors. 545 00:19:14,360 --> 00:19:16,810 Usually there's a transitive closure flavor to those queries. 546 00:19:17,760 --> 00:19:20,110 Without recursion the iteration 547 00:19:20,620 --> 00:19:21,970 involved in computing recursively 548 00:19:22,500 --> 00:19:25,120 defined relation has to be written outside of the database, 549 00:19:25,520 --> 00:19:27,180 has to be written in code in some fashion. 550 00:19:28,290 --> 00:19:29,290 Now we saw that the basic 551 00:19:29,800 --> 00:19:31,830 functionality of SQL recursion 552 00:19:32,690 --> 00:19:34,230 is linear recursion, where we 553 00:19:34,360 --> 00:19:35,490 only have one instance of 554 00:19:35,640 --> 00:19:38,480 the recursively defined relation in the query defining the relation. 555 00:19:39,410 --> 00:19:41,040 We can write a lot with linear recursion. 556 00:19:41,470 --> 00:19:43,060 It's very expressive and can 557 00:19:43,650 --> 00:19:44,640 express most of the natural 558 00:19:45,390 --> 00:19:46,380 queries we might want to do 559 00:19:46,760 --> 00:19:48,370 in recursive SQL. 560 00:19:48,530 --> 00:19:50,970 But, there is extended functionality, there's non-linear recursion. 561 00:19:51,490 --> 00:19:53,110 We saw that non-linear recursion can lead 562 00:19:53,400 --> 00:19:55,230 to nicer looking queries and 563 00:19:55,420 --> 00:19:57,000 can converge faster but is 564 00:19:57,100 --> 00:19:59,010 actually more difficult to implement efficiently. 565 00:20:00,030 --> 00:20:01,460 And then there's mutual recursion where 566 00:20:01,940 --> 00:20:03,230 R1 here might be defined in 567 00:20:03,310 --> 00:20:04,380 terms of R2 which itself is defined 568 00:20:04,760 --> 00:20:07,680 in terms of R1 and we 569 00:20:07,850 --> 00:20:09,810 saw one interesting example where 570 00:20:09,930 --> 00:20:11,070 we'd like to use mutual recursion 571 00:20:11,530 --> 00:20:14,810 where it was appropriate. 572 00:20:16,220 --> 00:20:16,870 Finally in terms of what was disallowed 573 00:20:16,980 --> 00:20:17,880 recursive sub queries; by that I 574 00:20:17,970 --> 00:20:20,090 mean referencing recursively defined 575 00:20:20,450 --> 00:20:23,280 relation in sub query, is 576 00:20:23,370 --> 00:20:24,710 actually in the SQL standard not 577 00:20:24,970 --> 00:20:26,570 supported by the postgres system that we were using. 578 00:20:27,440 --> 00:20:29,260 When a reference in 579 00:20:29,540 --> 00:20:30,320 a sub query to a 580 00:20:30,470 --> 00:20:31,970 recursively defined relation is 581 00:20:32,140 --> 00:20:33,110 negative, sort of like 582 00:20:33,200 --> 00:20:34,350 a not exist or not and 583 00:20:34,540 --> 00:20:35,500 that is disallowed by the 584 00:20:35,590 --> 00:20:36,750 SQL standard and we saw 585 00:20:37,070 --> 00:20:38,210 that that can lead to 586 00:20:38,720 --> 00:20:42,340 sort of, non-obvious behavior, non-deterministic final results. 587 00:20:43,400 --> 00:20:45,100 And finally, aggregation causes complication 588 00:20:45,710 --> 00:20:47,960 as well in recursion and is disallowed, too. 589 00:20:48,700 --> 00:20:49,840 The features that are disallowed really 590 00:20:50,160 --> 00:20:51,730 don't come up that often 591 00:20:52,130 --> 00:20:53,790 naturally and once again, 592 00:20:54,180 --> 00:20:55,790 and let me just emphasize that the 593 00:20:55,930 --> 00:20:57,120 basic functionality of linear 594 00:20:57,400 --> 00:20:58,500 recursion does allow one to 595 00:20:58,610 --> 00:20:59,790 express a lot of really 596 00:21:00,050 --> 00:21:01,490 nice queries and does extend 597 00:21:02,050 --> 00:21:03,340 the expressiveness of the SQL language.