In this video we'll be exploring
some further issues involved in recursion in the SQL language.
First a reminder of how SQL implements recursion.
There's a with statement in sequel
that can be specified to have
recursively defined relations in it.
We say with recursive, and
then we define a set of
relations, where the query
to find relation could involve
the relation itself, so that's where recursion pops in.
And then, at the end the
final result is a query
that might involve those recursively defined
relations as well as other tables in the database.
As we saw in the previous
video and demo, it's very
common for recursively defined relations
in the with statement to take
the structure of having a
base query that doesn't involve
R, the recursively define relation, unioned
with the recursive query and we saw many examples of that form.
The first thing I want to talk
about in this video is what's called linear recursion.
Linear recursion specifies that in
the recursive definition of R,
and again let's assume it takes
this form of the base query union and the recursive query.
In the recursive query, There
is only one reference to
the recursively defined relation R.
So let's take a look at
an example to understand linear
recursion and nonlinear recursion.
the first example we used
when we introduced recursion was finding
ancestor relationships from a
base table that just has
parent child relationships so a
basic transit of closure
operation and the query
we wanted to run was to find all of Mary's ancestors.
And here's the query that we wrote.
It does take the form of
having a base query here.
Which says if we have
a parent relationship that's also an ancestor relationship.
And then the recursion occurs in
the second part of the union
where we join the recursively defined
ancestor relationship, ancestor relation,
with parents so that we
extend the ancestors with one more generation.
Now this query does have
linear recursion because we only
have one instance here of
the recursively defined relation, ancestor.
So let's take a
look at what happens underneath when this query is executed.
We start with our parent table.
And here it is with
a parent and  child and let's
suppose we have say, Sue
and John, and John
and Mary for example, in our parent table.
Then in what's effectively the first iteration,
the base query here is
run that copies the parent table to the ancestor table.
So now we have Sue and
John, and John and Mary,
and anything else that we had
in the parent table in the ancestor table.
As the iteration continues we're
effectively joining the parent
table and the ancestor table
to get additional tuples in the ancestor table.
For example, we see that
Sue and John, the
Sue and John tuple here, would
join with the John and Mary
tuple, and that would give
us Sue and Mary in the ancestor table.
The iteration continues until there
are no new tuples to add to the ancestor table.
And then we're done with our
recursively defined relation, and we
can go ahead and execute the final query in the with statement.
And again, often when I say we, I really mean we the system.
All of this, is of course, being
performed by the system as it
executes the recursively defined with statement.
Now, let's take a look
at a non-linear expression of the same query.
And here it is.
What we see here is that
the primary change is right in here.
Instead of joining the parent
with the ancestor in the
recursive half, we're going
to join two instances of the ancestor relation.
And let's see what happens during
execution when this is how we express our recursion.
So we again start by
copying the contents of the
parent table into the ancestor
table as part of the base query.
And I've already shown that here.
But, now, instead of during
iteration joining the parent table with the ancestor table.
We're actually going to join
the ancestor table with itself to generate new tuples.
For example, we will join
the first two tuples in ancestor
with each other, Sue-John and John-Mary,
in order to obtain what
was the same tuple we obtained with
the linear recursion, which would be the tuple with Sue and Mary.
Just a quick reminder, I intended
to say this earlier, but it's
the fact that we have these
two references to ancestor in
the recursion here, that makes it non-linear.
OK so what's the deal with these two queries?
Why might we prefer one form of the query over the other?
And take my word for it,
by the way, we do get
equivalent results to the
query in its linear and non-linear versions.
Well, here's some pros and cons to non-linear versus linear.
For this particular query and actually
in general, when we can express a query both ways.
First of all there's some
pluses to the non-linear so the query looks cleaner.
If you go back and look at
the 2 queries the non-linear version
is sort of more symmetric, a
little shorter even to express, than the linear version.
Second of all, the nonlinear
version actually converges faster to
the fixed point, to the final state,
than the linear version.
And I'm going to show that a
little bit abstractly because it is actually fairly important.
So I'm going to create this
abstract example, parent-child
relation, which is going
to be completely linear, just for illustrative purposes.
So we have this person here
who's the parent of the person
here, who's the parent of
a person here, and so on.
We're going to make it eight levels deep.
So this is an abstraction of
our "a parent" table, and
now let's see how ancestors are computed.
So in the first step
we'll add one ancestor tuple
for each tuple in the
parent relation, so the
purple are the tuples that are added to ancestor.
Then in the second iteration
we're going to join those with themselves.
I'm sorry, we're going to
join the ancestor tuples with parent tuples
so each ancestor tuple could
be extended by one.
So that's going to give us all pairs of tuples.
I'm sorry, it's already getting a bit crowded here, but I think that you will get the idea.
On the next iteration we're going
to again take our ancestor
tuple and extend them by one, by joining them with parent.
So after the second we'll
have all triples here,
so all great-grandparent relationships.
OK, and that's a big mess.
But you can really see what's going on.
Each time we iterate, we get
one more generation added into the ancestors.
And now let's think about what happens when we use the non-linear version.
Where after the first step, we
join ancestor with itself instead of ancestor with parent.
So as before on the
first step, and now I'm
going to make these red, the ancestor
relation will contain exactly the same as the parents.
And the second step is
the same as well, we're going
to join ancestor with itself but,
since each one of ancestor
is only the parent relationship, we're
again, going to get all pairs
in the second step of the iteration.
The difference begins in the third step.
Now we're joining ancestor with itself.
So we will be joining these
two-step ancestors with the
single ones, just like before, to get all the threes.
But we will also be joining twos with twos.
In other words we will joining
grandparent relationships with grandparent relationships.
And we will be getting in that same iteration the fours.
So as you can see, the
nonlinear version does converge faster.
Now this example is very
small so it's not as blatantly obvious.
But the linear version
is going to take a linear number
of iterations in order to
converge to the final recursively
defined relation contents.
Whereas when we use the non-linear version, it's actually logarithmic.
So for a large database it can be considerably faster.
So what about the downsides of non-linear recursion?
Well, the major downside is
that it's harder to implement or
certainly harder to implement efficiently.
And as a result of that actually
the sequel standard only requires linear recursion.
And the postgres system that
we've been using also only supports linear recursion.
So back to the basic
form of our with recursive statement
in order to introduce a different
topic, which is the topic of mutual recursion.
Mutual recursion, as I
alluded to in the previous video,
is the case where one of
our recursively defined relations
does not refer to itself but
rather to a different recursively defined relation.
And that one refers back to the first one.
Or we could even have a loop of three or four or more.
So, the idea is that
we can't look at these individually to
see the recursion but together they
are recursive and they have
to kind of be computed in tandem.
So, the example I'm going to
use here is what's known
as hubs and authorities.
Hubs and authorities was an algorithm for web searching actually.
For annotating web nodes for the purposes of searching.
Was developed around the same
time as Google's page rank, I guess we can see which one won out.
But hubs and authorities actually quite
interesting just in what it does.
Let me go ahead and
define the meaning of Hudson
authorities, and then show how
mutual recursion in SQL
can be used to compute the
Hudson authorities in a database
that contains a link, a structure, a graph basically.
So here's a little graph and
we're going to assume that each
node has a number, say,
associated with it, and that
we have a relation called
"link", that just tells
us the edges of
the graph, so the source node and the destination node.
So in a graph, we're going
to designate some of the
nodes as hub nodes
and some of the nodes as authority nodes.
And we are going to define
a hub node to be
a node that points to
at least some number, let's
say three authority nodes.
And similarly, we're going to
say an authority node is a
node that's pointed to by,
let's say at least three again, hub notes.
And by the way this numbers three and three don't have to be the same.
And another thing I
wanted to mention is in a
graph say representing the web,
we wouldn't expect a large fraction
of the notes to be hubs in authorities, many would be normal notes.
But, again, this just for
illustrative purposes, but it also
serves to teach you about the
hubs in authorities concept, which is kind of interesting.
Now you can see already how mutual
recursion is going to fit into the picture.
But how are we gonna get started?
The only way you can
actually get started is to
have some nodes that are predesignated
as hubs and authorities.
For example, if we predesignated
as authorities these three middle
nodes here, then we could
compute the fact that node one is a hub.
So, we'll also assume that we
have two more relations in our database.
One of them gives us
a set of notes predesignated as
hugs and the other a
set of notes predesignated as authorities.
And our job is to
write a query that computes
all of the hub and authority
nodes based on this mutually recursive definition here.
So here is the query that does it.
By the way, you have
certainly noticed that I'm not doing
a live demo of the
recursive queries in this particular video.
Nonlinear recursion is not
supported in the Postgres system,
and it's also not part of thAQL standard.
Mutual recursion in limited
forms is part of the
SQL standard but it's also
currently not supported in Postgres,
so I've used the nice
interface is here to get the coloring of our queries.
But these queries currently don't run
on the systems that we're using.
OK, so back to our actual query here.
So, this is a query to
compute hubs and authorities
given that we have a starting
set of hub nodes and a starting set of authority nodes.
And, then we have the link relation that gives us the structure of our graph.
So we're going to compute to relations.
The hub relations with the nodes that are hubs.
The authorities relation of the
nodes that are authorities, and they
are going to have mutual recursion between them.
So, let's take a look first
at the hubs and we'll see
that the structure of the queries
for hubs and authorities is very, very similar.
So the base case for the
hubs is that the nodes
that are in the hub start relation
are in the hub relation, of course.
And then the recursive query here is a little bit complex.
So what we're going to find
is links, elements in
our link relation, where the
destination is an authority
and so we're going to find all
of the sources that point to an authority.
We're going to group by the
source, so we consider each node, one at a time.
And then we count how many
times it appears in the link pointing to an authority.
So, this is going to give us
a nodes that point to greater than
or equal to three authorities which was our definition of hubs.
Now here of course we're referring
to authority which itself is a recursively defined relation.
The authority relation is very
similar, as I said we start
with our base case of adding
nodes that are in the authority start relation.
And, then we consider destinations
instead of sources in our
link relation such that there
are at least three sources that
are hubs and that's what we've got down here.
So, this is going to give
us elements that are
pointed to by greater than or equal to three hubs.
And here of course we
are using hub, which is also a recursively defined relation.
So you can think of these two as working in tandem.
You can think of the system
as sort of iteratively adding to
the hubs and the authorities until
there's nothing more to add to either one.
Now one thing that this
definition of hubs of authorities
and this computation allows, is for
a node to be both a hub and an authority.
And there is nothing wrong with that
if the structure of the graph yields that result.
But, let's suppose we don't want
nodes to be allowed to be both hubs and authorities.
We want every node to be either one or the other.
That will require us to modify our query.
To not add nodes to become hubs if they're already authorities.
Or to have nodes become authorities if they're already hubs.
So let's go ahead and
modify the query to incorporate that additional constraint.
So, here's the query and, by
the way, just a reminder that you
can download these queries from
our website even though
you can't run them at this point in time.
The difference in this query from
the previous one is one
additional condition in the definition
of hubs right here, saying
I'm not going to add a
node, a source node to
the hubs, if it's already in
the authorities and similarly I've
added one more condition here
in authorities that I'm not going
to add a node to authorities if it's already a hub.
Now let's suppose we have the following graph structure.
We have a node here that
hasn't been labeled as a hub or authority yet.
And let's suppose that this node
is pointed to by three
nodes that have already been
designated as hub nodes.
And furthermore, this node points
to three nodes that have
already been designated as authorities.
So by our definition, this node
could be a hub because it
points to three authorities and it
could also be an authority
because it is pointed to by three hubs.
But the query we've given now
is not going to allow us
to label this node as both a hub and authority.
And just to be clear in the
previous query we would
have put this node in
both the hub relation and the
authority relation but, now,
we're not going to be able to
do that because of these conditions right down here.
So, actually, whether this node
ends up as a hub
or an authority depends on effectively
which one of these arms
of our with statement gets executed first.
If we first consider the possibility
of the node being a hub, then
it will be put in the hub relation and
then it won't be allowed to be put in the authority relation.
On the other hand, if we
first make an authority, then when
we look for computing
the hub relation, it wouldn't be allowed to be a hub.
So you can think of this
as a sort of non-deterministic behavior
or if you're into theory there's a
non-unique fixed point of
the recursion, and this is considered as a not good thing.
Generally, database people, when
they run queries, would like to
have one answer all the time.
They like to have deterministic answers for
their queries, so actually this
type of mutual recursion
is not allowed in the
SQL standard and the
real crux of the problem
here is that one recursively
defined relation is depending
negatively on another one.
So this negative dependence is what causes the problem.
And actually we can have
a negative dependence even without mutual recursion.
We could define a relation
that sort of depends negatively on
itself in a sub-query and that wouldn't be allowed either.
So that completes the example
of hubs authorities and again,
what we're trying to show first
of all, is mutual recursion which can be quite powerful.
And, second of all, the restriction
that we can't have
negative subqueries across recursively defined relations.
The last thing that I wanted to
mention in this video, it's not
in the title of the video
since we are focusing mostly on nonlinear and mutual recursion
is recursion with aggregation.
And let me just show a simple abstract example.
So we have a relation P
that just contains one attribute we can assume that it's integers.
And we're going to try in
our with recursive statement to computer
recursively define relation called R
that contains the tuples in
P, together with the
sum of the values in
the attribute of P, I will
just write that as sum of P.
So here's how we write it in SQL.
We have our base case
which is that the tuples in
P are also in R.
And then we do our
UNION of the recursive part
which says, and also in
R I want to have the
sum of the tuples in R.
So let's say that P starts
out with two tuples, the values one and two.
So what does the query compute for R?
Well certainly one and two
are in R based on the first part here.
And then based on the
second part then in
the first iteration R should
also contain the sum of R, which is 3.
Except as soon as we
put three in the sum
of R isn't three anymore, the
sum of R is six.
So shall we cross out the 3 in and put six there?
But then now the sum
of R has become six, seven, eight, nine.
You can see the problem.
There's no good definition for
what R should contain based on this recursion.
And for that reason actually,
recursion with aggregation is disallowed
in the SQL standard and isn't supported by any system.
So, to summarize about both of
our videos about recursion, SQL has
introduced recursion into the standard
as part of the WITH statement.
Whether the keyword RECURSIVE goes
with the WITH, or with recursively
defined relations is a
bit inconsistent, but in any case, the basic idea is the same.
When we have this statement, we
can write queries that refer
to the relation that's being defined.
And we can also have mutual recursions
between the queries that are
defined in the with statement,
and finally the result
is a running of the
final query which might involve the recursively defined relations.
Adding recursion to SQL does
strictly extend it's expressiveness.
There are queries that can't be written without recursion.
They usually involve some type
of unbounded computation, for example,
computing any number of flights or any depths of ancestors.
Usually there's a transitive closure flavor to those queries.
Without recursion the iteration
involved in computing recursively
defined relation has to be written outside of the database,
has to be written in code in some fashion.
Now we saw that the basic
functionality of SQL recursion
is linear recursion, where we
only have one instance of
the recursively defined relation in the query defining the relation.
We can write a lot with linear recursion.
It's very expressive and can
express most of the natural
queries we might want to do
in recursive SQL.
But, there is extended functionality, there's non-linear recursion.
We saw that non-linear recursion can lead
to nicer looking queries and
can converge faster but is
actually more difficult to implement efficiently.
And then there's mutual recursion where
R1 here might be defined in
terms of R2 which itself is defined
in terms of R1 and we
saw one interesting example where
we'd like to use mutual recursion
where it was appropriate.
Finally in terms of what was disallowed
recursive sub queries; by that I
mean referencing recursively defined
relation in sub query, is
actually in the SQL standard not
supported by the postgres system that we were using.
When a reference in
a sub query to a
recursively defined relation is
negative, sort of like
a not exist or not and
that is disallowed by the
SQL standard and we saw
that that can lead to
sort of, non-obvious behavior, non-deterministic final results.
And finally, aggregation causes complication
as well in recursion and is disallowed, too.
The features that are disallowed really
don't come up that often
naturally and once again,
and let me just emphasize that the
basic functionality of linear
recursion does allow one to
express a lot of really
nice queries and does extend
the expressiveness of the SQL language.