A parse tree is a tree that shows how a particular string was derived using some context-free grammar. These trees have a direct connection to leftmost and rightmost derivations, which we shall explain now. One important use of the tree representation of derivations is that it lets us express the important concept of ambiguity in grammars. That is, unambiguity, the ability of a grammar to provide a unique tree structure for every string in its language, is vital. For example, if a grammar for a programming language is ambiguous, then there are programs with two or more distinct meanings. The same is true of a grammar for a natural language: if ambiguous, it would fail to provide a unique intended meaning for every valid sentence of the language. Parse trees for a grammar G are trees whose nodes are each labeled by a symbol of G. I'm going to draw you a little tree here. Okay. Leaves are always labeled by either a terminal or by epsilon, so we might draw, for example, a terminal a on that one, a terminal b on that one, and perhaps epsilon on that one. Interior nodes are labeled by variables. Here are some examples. The important property that makes a tree a parse tree is that for each interior node there is a production of the grammar with the label of the node in question as its head and the labels of its children, read left to right, as its body. Okay? For example, if you look at this parse tree, we would infer that A goes to BC is a production, that B goes to ab is a production, and that C goes to epsilon is a production. The root must be labeled by the start symbol. In our example tree here, I'm going to have to guess that if this is the root, then A is in fact the start symbol. Here's a parse tree using the grammar for balanced parentheses that we discussed earlier. Notice how each interior node is labeled by a variable: here, here, here and here. It must be S, because that's the only variable we have. Each leaf is labeled by a terminal, either a left or a right parenthesis; these are here, here, here, here, here and here. Okay, the production used at the root is S goes to SS; you can sort of see that by looking at the root and its two children. Here's another interior node, and the labels of its children are left paren, S, right paren, and you know that S goes to left paren S right paren is another production of the grammar; that's right here. And here we have an example of an interior node labeled S whose children are labeled left paren and right paren; that corresponds to this production body, left paren right paren, right here. We have the same exact use of that production here. Okay. The yield of the parse tree is the string of labels at the leaves, read from the left. This order of leaves is the order you visit them in during a preorder traversal, whose path would look something like this. That is, the preorder traversal path goes around the tree counterclockwise. The order in which you visit the leaves is the order in which their labels appear in the yield. So here the yield is left paren, left paren, right paren, right paren, left paren, right paren. That's what we said here. In what follows, we're also going to talk about parse trees with a root A that may not be the start symbol. These trees follow all the other requirements of a parse tree: the leaves are labeled by terminals or epsilon, and an interior node with its children forms a production of the grammar. We're going to show how to convert parse trees to leftmost or rightmost derivations and vice versa.
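To make these definitions concrete, here is a minimal sketch in Python of a parse tree, its yield, and a validity check against a grammar. The names Node, tree_yield, and is_parse_tree are illustrative, not from the lecture, and the grammar is assumed to be given as a set of (head, body) productions, with body a tuple of symbols and the empty tuple standing for epsilon.

```python
EPSILON = ""  # represent epsilon as the empty string

class Node:
    def __init__(self, label, children=None):
        self.label = label              # a variable, a terminal, or EPSILON
        self.children = children or []  # an empty list means this node is a leaf

def tree_yield(node):
    """Concatenate leaf labels from the left (a preorder visit of the leaves)."""
    if not node.children:
        return node.label  # an epsilon leaf contributes nothing, since it is ""
    return "".join(tree_yield(c) for c in node.children)

def is_parse_tree(node, productions, start=None):
    """Each interior node with its children must correspond to a production."""
    if start is not None and node.label != start:
        return False
    if not node.children:
        return True
    body = tuple(c.label for c in node.children)
    if body == (EPSILON,):
        body = ()  # an epsilon production is stored with an empty body
    if (node.label, body) not in productions:
        return False
    return all(is_parse_tree(c, productions) for c in node.children)

# The tree from the slide for (())(), using the grammar S -> SS | (S) | ():
prods = {("S", ("S", "S")), ("S", ("(", "S", ")")), ("S", ("(", ")"))}
tree = Node("S", [
    Node("S", [Node("("), Node("S", [Node("("), Node(")")]), Node(")")]),
    Node("S", [Node("("), Node(")")]),
])
assert tree_yield(tree) == "(())()"
assert is_parse_tree(tree, prods, start="S")
```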
As is often the case, a proof is made easier by proving something more general than what we really need. In this case, we'll talk about generalized parse trees that can have any root label, say A, and we'll talk about leftmost and rightmost derivations from any variable A. We'll prove two statements. One is that if there is a parse tree with root A and yield w, then there is a leftmost derivation of w from A. The second statement is the converse: for every leftmost derivation of w from A there is a generalized parse tree with root A and yield w. The matter of rightmost derivations is completely analogous. We will start with point one by showing how to convert parse trees to leftmost derivations. The proof is an induction on the height of the tree. The basis is height one, that is, a tree consisting of a root labeled by some variable A and one or more leaves. There must be a production with head A and body the labels of the leaves from left to right; that's a1 through an in this picture. Then A derives the yield of the tree by the one-step leftmost derivation in which this production is used. For the induction, we'll assume statement one for heights less than h and prove it for h. That is, we assume that for every parse tree with root A and height up to h minus one there is a leftmost derivation of its yield from its root. Now let's look at a parse tree of height h. The case h = 1 is the basis, so h is at least two. Thus the production used at the root has at least one variable in its body. Say this tree has root A and the children of the root labeled X1 through Xn. If Xi is a variable, then it is the root of some subtree of height at most h minus one with some yield, say wi. If Xi is a terminal or epsilon, then we imagine Xi as the root of a trivial tree, not a parse tree, that consists only of a node labeled Xi. The yield of this tree is just Xi itself, but for consistency we shall refer to it as the string of terminals wi. The yield of the entire tree is w, which is w1 through wn. We can put together a leftmost derivation of w from A as follows. Start by applying the production at the root, so the second sentential form is X1 X2 up to Xn; that's this. Okay, now we need a leftmost derivation of w1 from X1. If X1 is a terminal, it equals w1, so we're already there, and this step is then really zero steps. On the other hand, if X1 is a variable, we apply the inductive hypothesis: the subtree with root X1 and yield w1 has height at most h minus one, so there is a leftmost derivation of w1 from X1. In that case, this X1 has been replaced in the leftmost derivation by w1, and this "derives" with a star represents all the steps that were necessary to replace X1 by w1. Either way, we can conclude that there is a leftmost derivation from A of w1 X2 through Xn; that's this. We can continue arguing this way for each Xi in turn: either it is a terminal, in which case nothing needs to be done, or it is a variable, in which case we can perform a leftmost derivation of wi from Xi. The needed sequence of leftmost derivations is a leftmost derivation of the entire yield w, that's w1 through wn, from A. Now we need to prove the other direction: that leftmost derivations can be turned into parse trees. The proof is an induction on the number of steps of the derivation. The basis is a one-step derivation. In this derivation, a variable A is replaced by a string of terminals, perhaps the empty string, using some production for A.
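The inductive construction just described translates directly into code. Here is a sketch, reusing the Node class and the example tree from the earlier block, that reads off the leftmost derivation of a parse tree by always expanding the leftmost unexpanded interior node; the function name leftmost_derivation is my own.

```python
def leftmost_derivation(root):
    """Return the sentential forms of the leftmost derivation A =>*lm w."""
    current = [root]                  # a sentential form, as a list of nodes
    forms = ["".join(n.label for n in current)]
    while True:
        # find the leftmost node that is an unexpanded interior node
        i = next((k for k, n in enumerate(current) if n.children), None)
        if i is None:
            return forms              # every node is a leaf: this form is the yield
        # replace that node by its children, i.e., apply its production
        current = current[:i] + current[i].children + current[i + 1:]
        forms.append("".join(n.label for n in current))

# On the (())() tree from the previous block:
# leftmost_derivation(tree) == ["S", "SS", "(S)S", "(())S", "(())()"]
```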
If the string is not empty, then there is a parse tree of height one with A at the root and the sequence of terminals in the body as the labels of the children: A here, and here are the children. If the body is epsilon, then there's again a parse tree; it has A at the root and one leaf labeled epsilon, so it just sort of looks like this. Now let's do the inductive step. We assume that leftmost derivations of fewer than k steps can be turned into parse trees with the proper yield. We will consider a k-step derivation of a string w from a variable A. The first step of this derivation uses a production with head A and body X1 through Xn, for some n. Thus the second sentential form in this derivation of w is X1 through Xn; that's that. Remember the important property of derivations: production bodies replace production heads at the position where the head is. Thus we can divide w into the concatenation of n strings, w1 through wn, in that order, where for all i, wi either is Xi, if Xi is a terminal, or Xi derives wi by a leftmost derivation of fewer than k steps. Since the whole derivation is leftmost, the derivation of w1 from X1 must happen first, then the derivation of w2 from X2, and so on. So we know that each Xi either is wi or derives wi in fewer than k steps by a leftmost derivation. In the second case, where Xi is a variable and the leftmost derivation takes one or more steps, the inductive hypothesis tells us that there is a parse tree with root Xi and yield wi. Thus we can build the parse tree shown; that is this. The root uses the first production of the derivation, A goes to X1 through Xn, and the i-th child either is wi, if Xi is a terminal, or is the root of a parse tree with root Xi deriving wi. The yield of this tree is w1 through wn, which is w. Okay, that is, this string w is w1 through wn; all these strings are derived left to right, in that order. This proves the inductive step, and we conclude that there is a leftmost derivation of w from A if and only if there is a parse tree with root A and yield w. It is also true that if there is a rightmost derivation of w from A, then there is a parse tree with root A and yield w, and conversely. But we are not going to prove that part; the proofs are essentially the same as what we have seen. And in fact, any derivation, even one that is neither leftmost nor rightmost, of a terminal string w from a variable A implies that there is a parse tree with root A and yield w. The first step of any derivation from A must use a production and replace A by some string of terminals and variables, say X1 through Xn. If w is the terminal string ultimately derived, then we can still break w into w1 through wn, where each wi either is Xi or is derived from Xi by fewer steps than the whole derivation. The only tricky part is that now the steps of the derivations from Xi and Xj may be intermingled, and we have to sort the derivation steps according to which Xi's descendant is being replaced at each step. We'll leave you to think through the details of this construction of parse trees. As we mentioned at the beginning, it is important in many applications, including parsing of programming languages in a compiler and parsing of natural-language sentences, that we use a context-free grammar that assigns a unique parse tree to each string of the language. We say a grammar is ambiguous if there is at least one string in its language that has two different parse trees. Otherwise it is unambiguous.
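Here is a sketch of this converse construction, again reusing Node and EPSILON from the first block: given the productions of a leftmost derivation in the order they are applied, we grow the tree by always expanding the leftmost pending variable. The function name tree_from_leftmost and the representation of the derivation are my own.

```python
def tree_from_leftmost(start, steps, variables):
    """Rebuild the parse tree from a leftmost derivation.

    steps: the (head, body) productions in the order the derivation applies
    them, with body a tuple of symbols; () means the body is epsilon."""
    root = Node(start)
    frontier = [root]       # unexpanded variable leaves, in left-to-right order
    for head, body in steps:
        node = frontier.pop(0)          # leftmost pending variable goes next
        assert node.label == head, "steps are not a leftmost derivation"
        node.children = [Node(sym) for sym in body] or [Node(EPSILON)]
        # new variables become the leftmost part of the frontier, in order
        frontier = [c for c in node.children if c.label in variables] + frontier
    return root

# Rebuilding the tree for S => SS => (S)S => (())S => (())():
steps = [("S", ("S", "S")), ("S", ("(", "S", ")")),
         ("S", ("(", ")")), ("S", ("(", ")"))]
assert tree_yield(tree_from_leftmost("S", steps, {"S"})) == "(())()"
```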
The grammar for balanced parentheses that we have been using is ambiguous, alas. I'll show you two parse trees for this balanced string, that is, left, right, left, right, left, right, on the next slide. Notice that each of these trees has the same yield, that is, three pairs of left and right parentheses. However, they're evidently not the same tree. The first replaces the first child of the root by SS, and the second does the same but with the second child. Notice that the two constructions we just gave, leftmost derivations from trees and trees from leftmost derivations, have the property that two different parse trees produce different leftmost derivations, and conversely. The same is true for the analogous transformations between rightmost derivations and trees. Thus we could also define a grammar to be ambiguous if it has a string with two different leftmost derivations or two different rightmost derivations. Fortunately, just because one grammar for a language is ambiguous does not mean that we can't find another grammar for the same language that is unambiguous. But as an aside, which we'll get to later, the opposite is not true either; that is, there are some ambiguous grammars for which no unambiguous equivalent exists. Anyway, here is a grammar for balanced parentheses that is unambiguous. Variable B, which is the start symbol, generates all balanced strings. But R, another variable, generates all strings that are balanced except for having one more right parenthesis than left. By balanced in this context I mean that no prefix of the string has more right parentheses than left. Examples would include, well, a single right paren; left, right, right, which would be generated by R; and also something like this: left, left, right, left, right, right, right. Here's the unambiguous grammar for balanced parentheses again. Suppose we're given a string of parentheses, which we'll call the input, and we want to find its leftmost derivation or derivations. We claim that for every input symbol and for either variable, B or R, there is only one choice of production that could possibly lead to a leftmost derivation of the input. That is, no string of parentheses has two distinct leftmost derivations, and therefore the grammar is unambiguous. Imagine we are constructing the left sentential form as we scan the input. As we go, the terminals to the left of the leftmost variable must match the input, or we'll never derive that input string of terminals. So we can check off input symbols as we match them. Notice that the only place B can ever appear in a left sentential form is at the right end. That is because the only production with B in the body has B at the right end of the body, and it also has head B; that is this one, B goes to left paren R B. A B can only come from a string that has a B in it, and the only way a B can appear is if it was either the start symbol or the result of replacing the B at the end by left paren R B in that production. An easy induction on the number of times this production is applied shows that the B is always at the right end. Now suppose B is the leftmost variable of a left sentential form. If there's a left parenthesis as the next unmatched input, then we have to expand the B using this production, B goes to left paren R B, because if we use epsilon, then the left sentential form has no more variables and we can never generate the unmatched left paren. The only time we can use epsilon to replace B, using this production, is when we have matched the entire input, and then we have found the unique leftmost derivation.
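The one-symbol-lookahead choice just described is easy to turn into code. Here is a minimal predictive-parser sketch for this grammar, B goes to (RB or epsilon and R goes to ) or (RR; it returns the productions of the unique leftmost derivation, or raises an error if the input is not balanced. The function name parse_balanced and the error messages are my own.

```python
def parse_balanced(s):
    steps = []               # productions used, in leftmost-derivation order
    stack = ["B"]            # grammar symbols not yet matched, leftmost first
    i = 0                    # position of the next unmatched input symbol
    while stack:
        top = stack.pop(0)
        look = s[i] if i < len(s) else None   # one symbol of lookahead
        if top == "B":
            if look == "(":                   # must use B -> ( R B
                steps.append(("B", "(RB"))
                stack = ["(", "R", "B"] + stack
            else:                             # must use B -> epsilon
                steps.append(("B", "eps"))
        elif top == "R":
            if look == ")":                   # must use R -> )
                steps.append(("R", ")"))
                stack = [")"] + stack
            elif look == "(":                 # must use R -> ( R R
                steps.append(("R", "(RR"))
                stack = ["(", "R", "R"] + stack
            else:
                raise ValueError("ran out of input while expecting R")
        elif look == top:                     # a terminal: match it, advance
            i += 1
        else:
            raise ValueError("mismatch at position %d" % i)
    if i != len(s):
        raise ValueError("unmatched input remains")
    return steps
```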
If R is the leftmost variable, then the next input symbol forces us to use one of the two R productions. That is, if the next input symbol is a right paren, you must use R goes to right paren; that's here. And if the next input is a left paren, you must use R goes to left paren followed by two R's. There's never an option, or else the left sentential form you're deriving could never match the input string. Here is an example. Suppose we want to find the unique leftmost derivation for this string of parentheses; that's this string right here. We set the string up as an input, with a pointer to the next symbol that must be matched in the leftmost derivation. Okay. We start off with the left sentential form that is the start symbol B alone. The next input to match is a left paren, and we must therefore expand this B to left paren R B, that is, using that production. Here we've made the expansion. The first left paren has been removed from the input, since it has been matched; that is, it appears to the left of the leftmost variable in the current left sentential form. In essence, this left paren is what used to be on the input there. Our next input symbol is another left paren, and we have to expand an R, the leftmost variable; that's this guy. Our only choice is to use the body left paren R R, that is, R goes to left paren R R. So we're going to do that. Okay, that's what happened here. The R was replaced by left paren R R, and the second left paren has been matched; it appears to the left of the leftmost variable, as here. Now the next input is a right paren, and we must expand the leftmost R. The only choice is the first production for R, R goes to right paren; that will enable us to match. Now the third input has been matched. We have an R to expand, and a right paren is the next input, so we again use R goes to right paren. That is, we expand that R, replacing it by a right paren. That's what's going to happen on the next slide. Now we have a left paren as the next input, and we must expand B; the choice is the first production for B, B goes to left paren R B. Now R is the leftmost variable, we are up here now, and the next input is a right paren, so we replace the R by a right paren, using the first production for R; that's that. And let's see, now there is no more input to match, and B must be expanded; that's B, the leftmost variable. The only choice is to use the second B production, which is that, and we effectively make the B disappear. And we're done. We have a leftmost derivation of the original input, which, of course, appears as the final step of the derivation. Had we deviated from the choices we made at any step, the result of the leftmost derivation would not have been this input string of terminals. There is a name for grammars such as our example, where it is always possible to choose the unique production to use in a leftmost derivation of any string in the language in the simple manner that we illustrated: at each step we looked only at the first unmatched symbol of the input, and we were able to make the unique choice correctly. Such a grammar is called LL(1), standing for a left-to-right scan of the input, that's the first L, leftmost derivation, that's the second L, and one symbol of lookahead. It is normal for a programming language to have an LL(1) grammar, probably because, as this theory became widely understood by designers of programming languages, they saw the advantages of keeping the language parsable in this simple way and made choices to preserve this ability.
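Running the parse_balanced sketch from the previous block on the lecture's input reproduces exactly this unique sequence of choices. The trace below shows what the call would return, under the block's assumed representation of productions.

```python
# >>> parse_balanced("(())()")
# [('B', '(RB'), ('R', '(RR'), ('R', ')'), ('R', ')'),
#  ('B', '(RB'), ('R', ')'), ('B', 'eps')]
#
# which is the leftmost derivation
# B => (RB => ((RRB => (()RB => (())B => (())(RB => (())()B => (())()
```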
The same argument we gave for our example grammar tells us that all LL(1) grammars are unambiguous. And remember, unambiguity is vital for a programming language, as we must assign a unique meaning to any legal program of the language. For balanced parens, we found that the first, simplest grammar we wrote down was ambiguous, but we were able to redesign the grammar to make it unambiguous. One might hope for an algorithm that would do that for any ambiguous grammar. But, alas, there cannot be such an algorithm. There are certain context-free languages for which every grammar is ambiguous. Such languages are called inherently ambiguous. We're not going to get into the proofs of what I just said, but I'll give you an example of an inherently ambiguous language: the set of strings of the form some number of 0's, followed by some number of 1's, followed by some number of 2's, such that either the numbers of 0's and 1's are the same, that would be this condition i = j, or the numbers of 1's and 2's are the same, that would be j = k, or both, of course. The problem is that any grammar for this language must generate at least some of the strings with equal numbers of all three symbols in two different ways. In the first derivation, or parse tree, the grammar forces the numbers of 0's and 1's to be the same, using the same trick we used to generate the set of strings of the form 0 to the n, 1 to the n. The grammar can generate any number of 2's to go along, but it may happen to generate the same number of 2's as 0's and 1's. The second derivation, or parse tree, makes sure the numbers of 1's and 2's are the same, but may accidentally generate the same number of 0's as well. Here's an example grammar. It is ambiguous, but that doesn't prove the language is inherently ambiguous. As I said, we're not going to give that proof; it's very complicated. However, you might wish to play around with grammars for the same language and see how you are forced to do something like this grammar in order to generate exactly the language you want. It should be easy, and familiar, to observe that A will generate all strings with one or more 0's followed by exactly the same number of 1's. We've seen essentially this pair of productions as a complete grammar before. And it should also be obvious that B generates any string of one or more 2's and nothing else. Likewise, C generates the strings of one or more 0's, and D generates one or more 1's followed by exactly the same number of 2's. Now, all derivations start with S, and the first production replaces it by either AB or CD. If we go with AB, then we get strings where the 0's and 1's match, while if we go with CD, then the 1's and 2's match. As a result, strings like 012, with the same number of each symbol, will have two leftmost derivations: one starting with S goes to AB, that's this, and the other starting with S goes to CD. Here are the two derivations for 012.
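To see the two derivations concretely, here is a small sketch that replays each sequence of productions against the leftmost variable of the current sentential form. The grammar is the one on the slide, S goes to AB or CD, A goes to 0A1 or 01, B goes to 2B or 2, C goes to 0C or 0, D goes to 1D2 or 12; the helper name replay is my own.

```python
def replay(start, steps, variables):
    """Apply each production to the leftmost variable; return the result."""
    form = start
    for head, body in steps:
        i = min(form.index(v) for v in variables if v in form)
        assert form[i] == head, "steps are not a leftmost derivation"
        form = form[:i] + body + form[i + 1:]
    return form

V = {"S", "A", "B", "C", "D"}
via_AB = [("S", "AB"), ("A", "01"), ("B", "2")]   # forces #0's = #1's
via_CD = [("S", "CD"), ("C", "0"), ("D", "12")]   # forces #1's = #2's
# Two distinct leftmost derivations of the same string:
assert replay("S", via_AB, V) == replay("S", via_CD, V) == "012"
```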