A parse tree is a tree that shows how a particular string was derived using some context-free grammar. These trees have a direct connection to leftmost and rightmost derivations, which we shall explain now. One important use of the tree representation of derivations is that it lets us express the important concept of ambiguity in grammars. That is, unambiguity, the ability of a grammar to provide a unique tree structure for every string in its language, is vital. For example, if a grammar for a programming language is ambiguous, then there are programs with two or more distinct meanings. The same is true of a grammar for a natural language: if ambiguous, it would fail to provide a unique intended meaning for every valid sentence of the language. Parse trees for a grammar G are trees whose nodes are each labeled by a symbol of G. I'm going to draw you a little tree here. Okay. Leaves are always labeled by either a terminal or by epsilon, so we might draw, for example, a terminal a on that one, a terminal b on that one, and perhaps epsilon on that one. Interior nodes are labeled by variables. Here are some examples. The important property that makes a tree a parse tree is that for each interior node there is a production of the grammar with the label of the node in question as its head and the labels of its children, read left to right, as its body. Okay? For example, if you look at this parse tree, we would infer that A goes to BC is a production, that B goes to ab is a production, and that C goes to epsilon is a production. The root must be labeled by the start symbol. In our example tree here, I'm going to have to guess that if this is the root, then A is in fact the start symbol. Here's a parse tree using the grammar for balanced parentheses that we discussed earlier. Notice how each interior node is labeled by a variable: here, here, here and here. It must be S, because that's the only variable we have. Each leaf is labeled by a terminal, either a left or a right parenthesis; these are here, here, here, here, here and here. Okay, the production used at the root is S goes to SS; you can sort of see that by looking at the root and its two children. Here's another interior node, and the labels of its children are left paren, S, right paren, and you know that S goes to left paren S right paren is another production of the grammar; that's right here. And here we have an example of an interior node labeled S whose children are labeled left paren and right paren; that corresponds to this production body, left paren right paren, right here. We have the same exact use of that production here. Okay. The yield of the parse tree is the string of labels at the leaves, read from the left. This order of leaves is the order you visit them in during a preorder traversal, whose path would look something like this. That is, the preorder traversal path goes around the tree counterclockwise. The order in which you visit the leaves is the order in which their labels appear in the yield. So here the yield is left paren, left paren, right paren, right paren, left paren, right paren. That's what we said here. In what follows, we're also going to talk about parse trees with a root A that may not be the start symbol. These trees follow all the other requirements of a parse tree: the leaves are labeled by terminals or epsilon, and an interior node with its children forms a production of the grammar. We're going to show how to convert parse trees to leftmost or rightmost derivations and vice versa.
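To make these definitions concrete, here is a minimal sketch in Python of a parse tree, its yield, and a validity check against a grammar. The names Node, tree_yield, and is_parse_tree are illustrative, not from the lecture, and the grammar is assumed to be given as a set of (head, body) productions, with body a tuple of symbols and the empty tuple standing for epsilon.

```python
EPSILON = ""  # represent epsilon as the empty string

class Node:
    def __init__(self, label, children=None):
        self.label = label              # a variable, a terminal, or EPSILON
        self.children = children or []  # an empty list means this node is a leaf

def tree_yield(node):
    """Concatenate leaf labels from the left (a preorder visit of the leaves)."""
    if not node.children:
        return node.label  # an epsilon leaf contributes nothing, since it is ""
    return "".join(tree_yield(c) for c in node.children)

def is_parse_tree(node, productions, start=None):
    """Each interior node with its children must correspond to a production."""
    if start is not None and node.label != start:
        return False
    if not node.children:
        return True
    body = tuple(c.label for c in node.children)
    if body == (EPSILON,):
        body = ()  # an epsilon production is stored with an empty body
    if (node.label, body) not in productions:
        return False
    return all(is_parse_tree(c, productions) for c in node.children)

# The tree from the slide for (())(), using the grammar S -> SS | (S) | ():
prods = {("S", ("S", "S")), ("S", ("(", "S", ")")), ("S", ("(", ")"))}
tree = Node("S", [
    Node("S", [Node("("), Node("S", [Node("("), Node(")")]), Node(")")]),
    Node("S", [Node("("), Node(")")]),
])
assert tree_yield(tree) == "(())()"
assert is_parse_tree(tree, prods, start="S")
```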
As is often the case, a proof is made easier by proving something more general than what we really need. In this case, we'll talk about generalized parse trees that can have any root label, say A, and we'll talk about leftmost and rightmost derivations from any variable A. We'll prove two statements. One is that if there is a parse tree with root A and yield w, then there is a leftmost derivation of w from A. The second statement is the converse: for every leftmost derivation of w from A there is a generalized parse tree with root A and yield w. The matter of rightmost derivations is completely analogous. We will start with point one by showing how to convert parse trees to leftmost derivations. The proof is an induction on the height of the tree. The basis is height one, that is, a tree consisting of a root labeled by some variable A and one or more leaves. There must be a production with head A and body the labels of the leaves from left to right; that's a1 through an in this picture. Then A derives the yield of the tree by the one-step leftmost derivation in which this production is used. For the induction, we'll assume statement one for heights less than h and prove it for h. That is, we assume that for every parse tree with root A and height up to h minus one there is a leftmost derivation of its yield from its root. Now let's look at a parse tree of height h. The case h = 1 is the basis, so h is at least two. Thus the production used at the root has at least one variable in its body. Say this tree has root A and the children of the root labeled X1 through Xn. If Xi is a variable, then it is the root of some subtree of height at most h minus one with some yield, say wi. If Xi is a terminal or epsilon, then we imagine Xi as the root of a trivial tree, not a parse tree, that consists only of a node labeled Xi. The yield of this tree is just Xi itself, but for consistency we shall refer to it as the string of terminals wi. The yield of the entire tree is w, which is w1 through wn. We can put together a leftmost derivation of w from A as follows. Start by applying the production at the root, so the second sentential form is X1 X2 up to Xn; that's this. Okay, now we need a leftmost derivation of w1 from X1. If X1 is a terminal, it equals w1, so we're already there, and this step is then really zero steps. On the other hand, if X1 is a variable, we apply the inductive hypothesis: the subtree with root X1 and yield w1 has height at most h minus one, so there is a leftmost derivation of w1 from X1. In that case, this X1 has been replaced in the leftmost derivation by w1, and this "derives" with a star represents all the steps that were necessary to replace X1 by w1. Either way, we can conclude that there is a leftmost derivation from A of w1 X2 through Xn; that's this. We can continue arguing this way for each Xi in turn: either it is a terminal, in which case nothing needs to be done, or it is a variable, in which case we can perform a leftmost derivation of wi from Xi. The needed sequence of leftmost derivations is a leftmost derivation of the entire yield w, that's w1 through wn, from A. Now we need to prove the other direction: that leftmost derivations can be turned into parse trees. The proof is an induction on the number of steps of the derivation. The basis is a one-step derivation. In this derivation, a variable A is replaced by a string of terminals, perhaps the empty string, using some production for A.
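The inductive construction just described translates directly into code. Here is a sketch, reusing the Node class and the example tree from the earlier block, that reads off the leftmost derivation of a parse tree by always expanding the leftmost unexpanded interior node; the function name leftmost_derivation is my own.

```python
def leftmost_derivation(root):
    """Return the sentential forms of the leftmost derivation A =>*lm w."""
    current = [root]                  # a sentential form, as a list of nodes
    forms = ["".join(n.label for n in current)]
    while True:
        # find the leftmost node that is an unexpanded interior node
        i = next((k for k, n in enumerate(current) if n.children), None)
        if i is None:
            return forms              # every node is a leaf: this form is the yield
        # replace that node by its children, i.e., apply its production
        current = current[:i] + current[i].children + current[i + 1:]
        forms.append("".join(n.label for n in current))

# On the (())() tree from the previous block:
# leftmost_derivation(tree) == ["S", "SS", "(S)S", "(())S", "(())()"]
```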
If the string is not empty, then there is a parse tree of height one with A at the root and the sequence of terminals in the body as the labels of the children: A here, and here are the children. If the body is epsilon, then there's again a parse tree; it has A at the root and one leaf labeled epsilon, so it just sort of looks like this. Now let's do the inductive step. We assume that leftmost derivations of fewer than k steps can be turned into parse trees with the proper yield. We will consider a k-step derivation of a string w from a variable A. The first step of this derivation uses a production with head A and body X1 through Xn, for some n. Thus the second sentential form in this derivation of w is X1 through Xn; that's that. Remember the important property of derivations: production bodies replace production heads at the position where the head is. Thus we can divide w into the concatenation of n strings, w1 through wn, in that order, where for all i, wi either is Xi, if Xi is a terminal, or Xi derives wi by a leftmost derivation of fewer than k steps. Since the whole derivation is leftmost, the derivation of w1 from X1 must happen first, then the derivation of w2 from X2, and so on. So we know that each Xi either is wi or derives wi in fewer than k steps by a leftmost derivation. In the second case, where Xi is a variable and the leftmost derivation takes one or more steps, the inductive hypothesis tells us that there is a parse tree with root Xi and yield wi. Thus we can build the parse tree shown; that is this. The root uses the first production of the derivation, A goes to X1 through Xn, and the i-th child either is wi, if Xi is a terminal, or is the root of a parse tree with root Xi deriving wi. The yield of this tree is w1 through wn, which is w. Okay, that is, this string w is w1 through wn; all these strings are derived left to right, in that order. This proves the inductive step, and we conclude that there is a leftmost derivation of w from A if and only if there is a parse tree with root A and yield w. It is also true that if there is a rightmost derivation of w from A, then there is a parse tree with root A and yield w, and conversely. But we are not going to prove that part; the proofs are essentially the same as what we have seen. And in fact, any derivation, even one that is neither leftmost nor rightmost, of a terminal string w from a variable A implies that there is a parse tree with root A and yield w. The first step of any derivation from A must use a production and replace A by some string of terminals and variables, say X1 through Xn. If w is the terminal string ultimately derived, then we can still break w into w1 through wn, where each wi either is Xi or is derived from Xi by fewer steps than the whole derivation. The only tricky part is that now the steps of the derivations from Xi and Xj may be intermingled, and we have to sort the derivation steps according to which Xi's descendant is being replaced at each step. We'll leave you to think through the details of this construction of parse trees. As we mentioned at the beginning, it is important in many applications, including parsing of programming languages in a compiler and parsing of natural-language sentences, that we use a context-free grammar that assigns a unique parse tree to each string of the language. We say a grammar is ambiguous if there is at least one string in its language that has two different parse trees. Otherwise it is unambiguous.
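Here is a sketch of this converse construction, again reusing Node and EPSILON from the first block: given the productions of a leftmost derivation in the order they are applied, we grow the tree by always expanding the leftmost pending variable. The function name tree_from_leftmost and the representation of the derivation are my own.

```python
def tree_from_leftmost(start, steps, variables):
    """Rebuild the parse tree from a leftmost derivation.

    steps: the (head, body) productions in the order the derivation applies
    them, with body a tuple of symbols; () means the body is epsilon."""
    root = Node(start)
    frontier = [root]       # unexpanded variable leaves, in left-to-right order
    for head, body in steps:
        node = frontier.pop(0)          # leftmost pending variable goes next
        assert node.label == head, "steps are not a leftmost derivation"
        node.children = [Node(sym) for sym in body] or [Node(EPSILON)]
        # new variables become the leftmost part of the frontier, in order
        frontier = [c for c in node.children if c.label in variables] + frontier
    return root

# Rebuilding the tree for S => SS => (S)S => (())S => (())():
steps = [("S", ("S", "S")), ("S", ("(", "S", ")")),
         ("S", ("(", ")")), ("S", ("(", ")"))]
assert tree_yield(tree_from_leftmost("S", steps, {"S"})) == "(())()"
```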
The grammar for balanced parentheses that we have been using is ambiguous, alas. I'll show you two parse trees for this balanced string, that is, left, right, left, right, left, right, on the next slide. Notice that each of these trees has the same yield, that is, three pairs of left and right parentheses. However, they're evidently not the same tree. The first replaces the first child of the root by SS, and the second does the same but with the second child. Notice that the two constructions we just gave, leftmost derivations from trees and trees from leftmost derivations, have the property that two different parse trees produce different leftmost derivations, and conversely. The same is true for the analogous transformations between rightmost derivations and trees. Thus we could also define a grammar to be ambiguous if it has a string with two different leftmost derivations or two different rightmost derivations. Fortunately, just because one grammar for a language is ambiguous does not mean that we can't find another grammar for the same language that is unambiguous. But as an aside, which we'll get to later, the opposite is not true either; that is, there are some ambiguous grammars for which no unambiguous equivalent exists. Anyway, here is a grammar for balanced parentheses that is unambiguous. Variable B, which is the start symbol, generates all balanced strings. But R, another variable, generates all strings that are balanced except for having one more right parenthesis than left. By balanced in this context I mean that no prefix of the string has more right parentheses than left. Examples would include, well, a single right paren; left, right, right, which would be generated by R; and also something like this: left, left, right, left, right, right, right. Here's the unambiguous grammar for balanced parentheses again. Suppose we're given a string of parentheses, which we'll call the input, and we want to find its leftmost derivation or derivations. We claim that for every input symbol and for either variable, B or R, there is only one choice of production that could possibly lead to a leftmost derivation of the input. That is, no string of parentheses has two distinct leftmost derivations, and therefore the grammar is unambiguous. Imagine we are constructing the left sentential form as we scan the input. As we go, the terminals to the left of the leftmost variable must match the input, or we'll never derive that input string of terminals. So we can check off input symbols as we match them. Notice that the only place B can ever appear in a left sentential form is at the right end. That is because the only production with B in the body has B at the right end of the body, and it also has head B; that is this one, B goes to left paren R B. A B can only come from a string that has a B in it, and the only way a B can appear is if it was either the start symbol or the result of replacing the B at the end by left paren R B in that production. An easy induction on the number of times this production is applied shows that the B is always at the right end. Now suppose B is the leftmost variable of a left sentential form. If there's a left parenthesis as the next unmatched input, then we have to expand the B using this production, B goes to left paren R B, because if we use epsilon, then the left sentential form has no more variables and we can never generate the unmatched left paren. The only time we can use epsilon to replace B, using this production, is when we have matched the entire input, and then we have found the unique leftmost derivation.
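The one-symbol-lookahead choice just described is easy to turn into code. Here is a minimal predictive-parser sketch for this grammar, B goes to (RB or epsilon and R goes to ) or (RR; it returns the productions of the unique leftmost derivation, or raises an error if the input is not balanced. The function name parse_balanced and the error messages are my own.

```python
def parse_balanced(s):
    steps = []               # productions used, in leftmost-derivation order
    stack = ["B"]            # grammar symbols not yet matched, leftmost first
    i = 0                    # position of the next unmatched input symbol
    while stack:
        top = stack.pop(0)
        look = s[i] if i < len(s) else None   # one symbol of lookahead
        if top == "B":
            if look == "(":                   # must use B -> ( R B
                steps.append(("B", "(RB"))
                stack = ["(", "R", "B"] + stack
            else:                             # must use B -> epsilon
                steps.append(("B", "eps"))
        elif top == "R":
            if look == ")":                   # must use R -> )
                steps.append(("R", ")"))
                stack = [")"] + stack
            elif look == "(":                 # must use R -> ( R R
                steps.append(("R", "(RR"))
                stack = ["(", "R", "R"] + stack
            else:
                raise ValueError("ran out of input while expecting R")
        elif look == top:                     # a terminal: match it, advance
            i += 1
        else:
            raise ValueError("mismatch at position %d" % i)
    if i != len(s):
        raise ValueError("unmatched input remains")
    return steps
```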
If R is the leftmost variable, then the next input symbol forces us to use one of the two R productions. That is, if the next input symbol is a right paren, you must use R goes to right paren; that's here. And if the next input is a left paren, you must use R goes to left paren followed by two R's. There's never an option, or else the left sentential form you're deriving could never match the input string. Here is an example. Suppose we want to find the unique leftmost derivation for this string of parentheses; that's this string right here. We set the string up as an input, with a pointer to the next symbol that must be matched in the leftmost derivation. Okay. We start off with the left sentential form that is the start symbol B alone. The next input to match is a left paren, and we must therefore expand this B to left paren R B, that is, using that production. Here we've made the expansion. The first left paren has been removed from the input, since it has been matched; that is, it appears to the left of the leftmost variable in the current left sentential form. In essence, this left paren is what used to be on the input there. Our next input symbol is another left paren, and we have to expand an R, the leftmost variable; that's this guy. Our only choice is to use the body left paren R R, that is, R goes to left paren R R. So we're going to do that. Okay, that's what happened here. The R was replaced by left paren R R, and the second left paren has been matched; it appears to the left of the leftmost variable, as here. Now the next input is a right paren, and we must expand the leftmost R. The only choice is the first production for R, R goes to right paren; that will enable us to match. Now the third input has been matched. We have an R to expand, and a right paren is the next input, so we again use R goes to right paren. That is, we expand that R, replacing it by a right paren. That's what's going to happen on the next slide. Now we have a left paren as the next input, and we must expand B; the choice is the first production for B, B goes to left paren R B. Now R is the leftmost variable, we are up here now, and the next input is a right paren, so we replace the R by a right paren, using the first production for R; that's that. And let's see, now there is no more input to match, and B must be expanded; that's B, the leftmost variable. The only choice is to use the second B production, which is that, and we effectively make the B disappear. And we're done. We have a leftmost derivation of the original input, which, of course, appears as the final step of the derivation. Had we deviated from the choices we made at any step, the result of the leftmost derivation would not have been this input string of terminals. There is a name for grammars such as our example, where it is always possible to choose the unique production to use in a leftmost derivation of any string in the language in the simple manner that we illustrated: at each step we looked only at the first unmatched symbol of the input, and we were able to make the unique choice correctly. Such a grammar is called LL(1), standing for a left-to-right scan of the input, that's the first L, leftmost derivation, that's the second L, and one symbol of lookahead. It is normal for a programming language to have an LL(1) grammar, probably because, as this theory became widely understood by designers of programming languages, they saw the advantages of keeping the language parsable in this simple way and made choices to preserve this ability.
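Running the parse_balanced sketch from the previous block on the lecture's input reproduces exactly this unique sequence of choices. The trace below shows what the call would return, under the block's assumed representation of productions.

```python
# >>> parse_balanced("(())()")
# [('B', '(RB'), ('R', '(RR'), ('R', ')'), ('R', ')'),
#  ('B', '(RB'), ('R', ')'), ('B', 'eps')]
#
# which is the leftmost derivation
# B => (RB => ((RRB => (()RB => (())B => (())(RB => (())()B => (())()
```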
The same argument we gave for our example grammar tells us that all LL(1) grammars are unambiguous. And remember, unambiguity is vital for a programming language, as we must assign a unique meaning to any legal program of the language. For balanced parens, we found that the first, simplest grammar we wrote down was ambiguous, but we were able to redesign the grammar to make it unambiguous. One might hope for an algorithm that would do that for any ambiguous grammar. But, alas, there cannot be such an algorithm. There are certain context-free languages for which every grammar is ambiguous. Such languages are called inherently ambiguous. We're not going to get into the proofs of what I just said, but I'll give you an example of an inherently ambiguous language: the set of strings of the form some number of 0's, followed by some number of 1's, followed by some number of 2's, such that either the numbers of 0's and 1's are the same, that would be this condition i = j, or the numbers of 1's and 2's are the same, that would be j = k, or both, of course. The problem is that any grammar for this language must generate at least some of the strings with equal numbers of all three symbols in two different ways. In the first derivation, or parse tree, the grammar forces the numbers of 0's and 1's to be the same, using the same trick we used to generate the set of strings of the form 0 to the n, 1 to the n. The grammar can generate any number of 2's to go along, but it may happen to generate the same number of 2's as 0's and 1's. The second derivation, or parse tree, makes sure the numbers of 1's and 2's are the same, but may accidentally generate the same number of 0's as well. Here's an example grammar. It is ambiguous, but that doesn't prove the language is inherently ambiguous. As I said, we're not going to give that proof; it's very complicated. However, you might wish to play around with grammars for the same language and see how you are forced to do something like this grammar in order to generate exactly the language you want. It should be easy, and familiar, to observe that A will generate all strings with one or more 0's followed by exactly the same number of 1's. We've seen essentially this pair of productions as a complete grammar before. And it should also be obvious that B generates any string of one or more 2's and nothing else. Likewise, C generates the strings of one or more 0's, and D generates one or more 1's followed by exactly the same number of 2's. Now, all derivations start with S, and the first production replaces it by either AB or CD. If we go with AB, then we get strings where the 0's and 1's match, while if we go with CD, then the 1's and 2's match. As a result, strings like 012, with the same number of each symbol, will have two leftmost derivations: one starting with S goes to AB, that's this, and the other starting with S goes to CD. Here are the two derivations for 012.
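To see the two derivations concretely, here is a small sketch that replays each sequence of productions against the leftmost variable of the current sentential form. The grammar is the one on the slide, S goes to AB or CD, A goes to 0A1 or 01, B goes to 2B or 2, C goes to 0C or 0, D goes to 1D2 or 12; the helper name replay is my own.

```python
def replay(start, steps, variables):
    """Apply each production to the leftmost variable; return the result."""
    form = start
    for head, body in steps:
        i = min(form.index(v) for v in variables if v in form)
        assert form[i] == head, "steps are not a leftmost derivation"
        form = form[:i] + body + form[i + 1:]
    return form

V = {"S", "A", "B", "C", "D"}
via_AB = [("S", "AB"), ("A", "01"), ("B", "2")]   # forces #0's = #1's
via_CD = [("S", "CD"), ("C", "0"), ("D", "12")]   # forces #1's = #2's
# Two distinct leftmost derivations of the same string:
assert replay("S", via_AB, V) == replay("S", via_CD, V) == "012"
```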