Context-free grammars can be badly designed. For example, they can have variables that play no role in the derivation of any terminal string and therefore shouldn't be there. These are analogous to states in an automaton that aren't reachable from the start state. There are also certain productions that, while they may be necessary, cause derivations to take many steps that can obviously be combined. These include productions whose bodies are the empty string, or unit productions, where the body is a single variable. We can get rid of these, and the way to do so is similar to the way we removed epsilon transitions from an NFA. Finally, we're going to introduce Chomsky Normal Form, where all the production bodies are either a single terminal or two variables. Incidentally, the Chomsky we refer to is Noam Chomsky. Back in 1956, he wrote the paper that introduced the idea of context-free grammars. At the time he was a linguist trying to provide some mathematics for the structure of language. Since then he has unfortunately become somewhat notorious for his political views. Oh well, back to context-free grammars. Here is an example of a really bad grammar. A is okay; it derives all strings of one or more a's using its two productions. However, B has only one production, and that production has B in its body. Thus, once a sentential form has a B in it, you can never get rid of that B. As a result, B derives no terminal strings. To make matters worse, S has only one production, so that must be the first one used in any derivation. But that production introduces a B into the sentential form, so it is impossible to derive any terminal string from S, and therefore the language of this grammar is empty. Almost all the algorithms we need to simplify grammars are based on the same principle, which I call discovery algorithms. These discover facts by an induction process. The basis is always certain facts that are obvious.
Then, based on what is already known, we discover more facts in repeated rounds. Finally, at some round, we can discover no more facts. There's no point in going on, since without new facts in the current round, we cannot discover more on the next round. We generally have to prove only that any true fact of the type we are trying to discover will be found this way. Here's the image of discovery algorithms we should keep in mind. We start with some facts we get from the basis. We expand the set of facts we know by using the basis facts; this is the first round. For the second round, we expand the set of known facts further by using both the basis facts and the facts discovered in round one. We keep going until at some round there are no more facts that can be discovered. So one of the first things we need to do when dealing with grammars is to detect and get rid of variables that can't derive any terminal string. We shall give a discovery algorithm to find, inductively, all variables that do derive at least one terminal string; any variable not discovered by this algorithm derives no terminal string and can be eliminated. For the basis, if a variable A has a production whose body has no variables, then A certainly derives a terminal string in one step. Note that this body could be the empty string; technically, the empty string is a string of terminals. For the induction, suppose A has a production with body alpha, and alpha consists only of terminals and variables that we have previously discovered derive a terminal string; then A derives a terminal string. Since the number of variables is finite, eventually this algorithm terminates, when it can find no more variables that derive terminal strings. It is easy to prove that whenever the algorithm says a variable derives a terminal string, it really does derive a terminal string; we are not going to prove that.
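The rounds of this discovery algorithm can be sketched in a few lines of Python. This is only an illustration, not from the lecture: it assumes a grammar represented as a dict mapping each variable to a list of bodies, where a body is a tuple of symbols and any symbol that is not a dict key counts as a terminal.

```python
def generating_variables(grammar):
    """Discover all variables that derive at least one terminal string."""
    discovered = set()
    changed = True
    while changed:                      # one pass of the loop is one "round"
        changed = False
        for var, bodies in grammar.items():
            if var in discovered:
                continue
            for body in bodies:
                # basis and induction together: every symbol in the body is
                # a terminal or an already-discovered variable
                if all(sym not in grammar or sym in discovered for sym in body):
                    discovered.add(var)
                    changed = True
                    break
    return discovered

# The bad grammar from the lecture: S -> AB, A -> aA | a, B -> bB
g = {"S": [("A", "B")],
     "A": [("a", "A"), ("a",)],
     "B": [("b", "B")]}
print(generating_variables(g))   # {'A'} — S and B derive no terminal string
```

Note that the loop stops exactly when a round adds nothing new, matching the termination argument in the text.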
The harder part is showing that the algorithm doesn't miss anything; that is, if variable A derives some terminal string, then the algorithm will eventually discover A. We will do that on the next slide. The proof is an induction on the minimum height of a parse tree with root A and a yield of terminals. The basis is a tree of height one, which consists of root A and one or more leaves labeled by terminals, or perhaps epsilon. Then the basis step of the algorithm discovers A, so we are okay there. For the induction, assume the statement is true for heights up to H minus one; that is, all variables that are the roots of parse trees with height up to H minus one and a yield of terminals are discovered by the algorithm. A parse tree of height H, which must of course be at least two, has children of the root labeled X1 through Xn; that's this. Any one of these Xi's that is a variable is the root of a subtree of height at most H minus one, and therefore it is discovered. Moreover, one of these variables is discovered last. At the round where the last of the variables Xi is discovered, we must surely do another round, since the set of discovered variables just changed. On that next round, A will be discovered, because it has a production, A goes to X1 through Xn, whose body consists only of terminals and discovered variables. The algorithm to eliminate variables that derive no terminal string is now simple. Use the algorithm we just described to find the variables that do derive terminal strings. Call the other variables useless. Then remove from the grammar all productions in which at least one useless variable appears. It doesn't matter if the variable appears in the head, the body, or both. Here's an example grammar, to which we first apply the algorithm to find the variables that derive terminal strings. For the basis step, we immediately discover A and C, because they have productions with terminals only. Here they are.
For the first round of the induction, S is discovered because there is an S production with body C, and C was previously discovered. However, at the next round we can discover no more variables. The only variable we have not yet found to derive a terminal string is B, and B has only one production, whose body is little b followed by big B. This body does not consist only of terminals and discovered variables, so we can never add B to the set of discovered variables. Thus B is useless, and we eliminate all traces of it. That includes not only B's own production but also the production S goes to AB. That leaves us, fortunately, with S goes to C, as well as the two A productions and the C production. In addition to eliminating the variables that don't derive anything, we need to eliminate variables that derive some terminal strings but cannot be derived from the start symbol. The algorithm to find symbols, both terminals and variables, that appear in a derivation from the start symbol is a variation of the discovery algorithm. For the basis, obviously the start symbol can be derived in zero steps from itself. For the induction, suppose we have discovered that we can reach variable A. Then for every production with head A and body alpha, we can also reach the symbols appearing in alpha, both the terminals and the variables that appear there. It is an easy pair of inductions to show, first, that if we discover a symbol by this algorithm, then it appears in a sentential form derived from the start symbol, and second, that if we do not discover a symbol, then there is no derivation from the start symbol in which it appears. We are not going to give the proofs here. But remember, our goal is to get rid of symbols that do not appear in derivations from S. So after discovering all the symbols that do appear in a derivation, delete from the grammar all the productions that contain a symbol, in the head or body or both, that does not appear in a derivation.
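The reachability discovery can be sketched the same way. Again this is only an illustration with the same assumed dict representation; the small grammar at the bottom is a hypothetical example, not the one on the slide.

```python
def reachable_symbols(grammar, start):
    """Discover all symbols (terminals and variables) that appear in
    some derivation from the start symbol."""
    reached = {start}                    # basis: start reaches itself
    frontier = [start]
    while frontier:
        var = frontier.pop()
        # induction: every symbol in a body of a reached variable is reached
        for body in grammar.get(var, []):
            for sym in body:
                if sym not in reached:
                    reached.add(sym)
                    if sym in grammar:   # only variables have productions
                        frontier.append(sym)
    return reached

# Hypothetical grammar: S -> Aa, A -> b, and an unreachable B -> c
g = {"S": [("A", "a")],
     "A": [("b",)],
     "B": [("c",)]}
print(reachable_symbols(g, "S"))   # B and c are never reached
```

Everything the function does not return can be deleted, along with every production mentioning it.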
Say a symbol is useful if it appears in a derivation of a terminal string from the start symbol, and call it useless otherwise. There are two reasons a symbol can be useless: either it derives no terminal string, or it appears in no derivation from the start symbol, or both. We have algorithms to eliminate symbols that are useless for each of these reasons, but we must apply them in the right order. First eliminate the symbols that fail to derive a terminal string, and then eliminate symbols that do not appear in any derivation from the start symbol. In this example grammar, if, as we should, we first eliminate variables that do not derive a terminal string, we eliminate only B. However, eliminating productions with B gets rid of the only S production. We then use the algorithm to find symbols unreachable from the start symbol, and we find that everything is unreachable; that is, all the productions are deleted. However, if we do things in the wrong order and first eliminate unreachable symbols, we find everything is reachable from S, so nothing is eliminated here. Then, when we look for the symbols that do not derive terminal strings, we eliminate only B. That leaves the A productions and C goes to little c, these guys, which should not be there, because A, C, and little c are useless. Here's why first eliminating variables that don't derive terminal strings is the right thing to do. After eliminating those variables, every remaining symbol is either a terminal or a variable that derives a terminal string. After removing symbols not reachable from the start symbol, all remaining symbols appear in some derivation from the start symbol of some sentential form. But the variables that appear in such a sentential form still derive a terminal string, because such a derivation can only involve symbols that are also reachable from the start symbol.
Epsilon productions are those that have body epsilon. They can be eliminated from a context-free grammar, and the only thing we lose is that we can no longer derive the empty string. If the empty string was not in the language to begin with, then we can eliminate epsilon productions and still have a grammar that derives the same language. However, if epsilon was in the language, then we lose it. The two cases can be summarized by the theorem on the slide: if L has a grammar, then L minus the set containing the empty string has a grammar with no epsilon productions. Notice that if epsilon was not in L, then L minus epsilon is just L anyway. To eliminate epsilon productions, we need yet another discovery algorithm, this one to find the variables that derive the empty string by one or more steps. We call them nullable symbols. The basis of the discovery algorithm is that if A has a production with an empty body, then A is surely nullable. The induction is that if A has a production with body alpha, and alpha consists only of variables that are nullable, then A is also nullable. Here is an example grammar for which we will discover the nullable symbols. For the basis, we know that A is nullable because of the production with epsilon body; it's that. In the first round of the induction, we find B is nullable because of the production B goes to A; that is, all symbols in the body, namely the A, are already known to be nullable. In the second round of the induction, we discover that S is nullable because of its body AB, both symbols of which are already known to be nullable. This algorithm finds all and only the nullable symbols. We are not going to give the proof, which consists of two simple inductions. To eliminate epsilon productions from our grammar, we need to turn each production, say A goes to X1 through Xn, into a possibly large number of productions.
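The nullable-symbol discovery follows the same round-by-round pattern. A sketch, again assuming the dict representation used above, with an empty tuple standing for an epsilon body; the grammar below is my reading of the slide's example (S goes to AB, A goes to epsilon or aA, B goes to A or b):

```python
def nullable_variables(grammar):
    """Discover all variables that derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for var, bodies in grammar.items():
            if var in nullable:
                continue
            for body in bodies:
                # vacuously true for an empty body (the basis), otherwise
                # true when every symbol is an already-known nullable variable
                if all(sym in nullable for sym in body):
                    nullable.add(var)
                    changed = True
                    break
    return nullable

g = {"S": [("A", "B")],
     "A": [(), ("a", "A")],
     "B": [("A",), ("b",)]}
print(nullable_variables(g))   # all of S, A, and B are nullable
```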
The idea is to guess which of the nullable symbols in the body of a production will derive epsilon in a particular derivation. Since we make all possible guesses, by creating many different productions, we always manage to guess right. More precisely, for each set of nullable Xi's, eliminate those from the body of the production and make a new A production. Note that if two of the Xi's are the same nullable symbol, then we have to consider the possibility that one occurrence derives epsilon and the other does not. That is, we form one production for each set of positions that hold nullable symbols, not just for each set of nullable symbols. However, in the special case that all the Xi's are nullable symbols, we do not consider the set of all positions, and we do not create a new production with an epsilon body. Here is an example grammar. Each of A, B, and C is nullable because they have epsilon productions. Thus, S is also nullable because of its production S goes to ABC. Let's construct the new grammar. For the S production, there are seven subsets of nullable positions that we must use. The set of all three positions holds nullable symbols too, of course, but eliminating all of A, B, and C would leave an empty body, which we don't allow. If we use the empty set of positions, we get body ABC. If we use the set of only the third position, we get AB, and so on. Now let's look at the A productions. We do not use A goes to epsilon, of course, but for the production A goes to little a big A, it's that. Only the second position is nullable, so we get two productions, one with the variable A present and the other not; that is what we have here. The situation for B is the same. However, for C, we cannot use the C goes to epsilon production, so in the new grammar C has no production. That means that in the new grammar, although not in the old grammar, C is useless and must be eliminated. That forces us to eliminate all productions with C in the body. And we are done.
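The subset-of-positions construction can be sketched directly. This is an illustration under the same assumed representation; it produces the new bodies but leaves the follow-up cleanup of newly useless variables, like C in the example, to the earlier algorithms.

```python
from itertools import combinations

def eliminate_epsilon(grammar, nullable):
    """Build new bodies by dropping every subset of nullable positions,
    never creating an epsilon body."""
    new_grammar = {}
    for var, bodies in grammar.items():
        new_bodies = set()
        for body in bodies:
            # positions, not symbols: repeated nullable symbols are handled
            positions = [i for i, sym in enumerate(body) if sym in nullable]
            for r in range(len(positions) + 1):
                for drop in combinations(positions, r):
                    candidate = tuple(sym for i, sym in enumerate(body)
                                      if i not in drop)
                    if candidate:        # skip the empty (epsilon) body
                        new_bodies.add(candidate)
        if new_bodies:                   # variables left with no bodies vanish
            new_grammar[var] = sorted(new_bodies)
    return new_grammar

# The slide's example: S -> ABC, A -> aA | eps, B -> bB | eps, C -> eps
g = {"S": [("A", "B", "C")],
     "A": [("a", "A"), ()],
     "B": [("b", "B"), ()],
     "C": [()]}
ng = eliminate_epsilon(g, {"S", "A", "B", "C"})
print(len(ng["S"]))      # 7 — the seven non-empty subsets of positions
print("C" in ng)         # False — C has no productions left
```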
The proof that the new grammar we construct generates the same strings, except for epsilon, as the old grammar generates is a little tricky, so we're going to give the details in one direction only. As is often the case, we need to prove something more general than we really need. In this case, we need two statements about every variable A. First, if W is a non-empty string that is derived from A in the old grammar, then A also derives W in the newly constructed grammar. Second, if A derives W in the new grammar, then first of all W is not empty, and second, A derives W in the old grammar. Once we have that for all A, we can apply the statements to the start symbol and thus prove that the language of the new grammar is the language of the old grammar, with epsilon removed if it was in that language. We're going to prove the first direction; it is an induction on the number of steps by which A derives W in the old grammar. For the basis, if there's a one-step derivation of W from A, then A goes to W must be a production. We assume in part one that W is not empty, so A goes to W is a production of the new grammar. We reach the desired conclusion that A derives W in the new grammar. Now let's do the induction. Assume the inductive hypothesis for derivations of fewer than K steps, and suppose W is derived from A in K steps of the old grammar. The first step of this derivation must be A replaced by the body of some A production. Assume this body consists of symbols X1 through Xn. Then we can break W into W1 through Wn, where each Wi is the portion of W that either is Xi, if Xi is a terminal, or is derived from Xi, if Xi is a variable. All these derivations from variables take fewer than K steps. If Xi is a variable and the piece Wi is not empty, then the inductive hypothesis tells us that Xi derives Wi in the new grammar. When we constructed the new grammar, we replaced the A production A goes to X1 through Xn
by a family of productions, and one of these eliminates from the body exactly those Xi's such that Wi is epsilon. We know that is the case, because if Wi is epsilon, then surely Xi is nullable. We also know that not all Wi's are epsilon, because W is not epsilon; that is, at least one Xi is either a terminal or a variable that we do not need to remove from the body. Thus, in the new grammar, the first step can replace A by a body consisting of all those Xi's such that Wi is not epsilon. We can continue the derivation in the new grammar by a derivation, from each Xi that remains, of the non-empty string Wi. We know this derivation in the new grammar exists by the inductive hypothesis. We also need to show part two: if W is derived from A in the new grammar, then it is non-empty and is also derived in the old grammar. We're going to skip this part. So we're now ready for a new simplification of grammars, the elimination of unit productions. A unit production is one whose body is a single variable. We can eliminate all such productions. The idea is to discover, using a discovery algorithm, all pairs of variables (A, B) such that A derives B by a sequence of unit productions. Eventually, a sequence of unit productions must end with the use of some other kind of production, so we can collapse the steps that use unit productions into the next step that does not use a unit production. That is, if B goes to alpha is a non-unit production, and A derives B by unit productions, then we add the production A goes to alpha. Finally, we drop all unit productions. At this point we do not need the unit productions and can drop them from the grammar. Here's the algorithm that discovers the pairs of variables (A, B) such that A derives B by unit productions. For the basis, surely A derives A by unit productions only; this is a sequence of zero steps. For the induction, suppose we have already discovered that A derives B by unit productions. Then if B goes to C is a unit production, we may add the pair (A, C).
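Both the pair discovery and the collapse step fit in a short sketch, again under the assumed dict representation and not taken from the lecture itself:

```python
def unit_pairs(grammar):
    """Discover all pairs (A, B) with A deriving B by unit productions."""
    pairs = {(a, a) for a in grammar}        # basis: zero-step derivations
    changed = True
    while changed:
        changed = False
        for (a, b) in list(pairs):
            for body in grammar.get(b, []):
                # a unit production: body is a single variable
                if len(body) == 1 and body[0] in grammar:
                    if (a, body[0]) not in pairs:
                        pairs.add((a, body[0]))
                        changed = True
    return pairs

def eliminate_units(grammar):
    """For each pair (A, B), copy B's non-unit bodies to A."""
    new_grammar = {a: [] for a in grammar}
    for (a, b) in unit_pairs(grammar):
        for body in grammar[b]:
            if not (len(body) == 1 and body[0] in grammar):
                if body not in new_grammar[a]:
                    new_grammar[a].append(body)
    return new_grammar

# S -> A, A -> B | a, B -> b: the chain of unit productions collapses
g = {"S": [("A",)], "A": [("B",), ("a",)], "B": [("b",)]}
print(sorted(eliminate_units(g)["S"]))   # S now derives a and b directly
```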
There are two proofs that we need, but we're not going to do them here. First, an induction on the number of rounds in the discovery process shows that the pairs we find are valid; that is, they really are pairs (A, B) such that A derives B by unit productions. Then, conversely, we can show by an induction on the number of steps of a derivation from A to B using unit productions that we will in fact discover the pair (A, B). Another proof that we're going to skip is the proof that the new grammar has the same language as the old. Again, we have to prove something more general: that A derives W in the new grammar if and only if it does so in the old grammar. Each production of the new grammar simulates one or more steps of a derivation in the old grammar, that is, some number of unit productions, perhaps zero, followed by a non-unit production. Conversely, every use of a production in the new grammar can be replaced by zero or more unit productions followed by a non-unit production in the old grammar. We can now combine the three simplifications to make a strong statement about grammars. If L is a context-free language, then L minus epsilon has a grammar with no useless symbols, no epsilon productions, and no unit productions. Another way to put it is that L minus epsilon has a grammar in which every production body is either a single terminal or has length at least two. We apply the constructions just learned, but we have to be careful about the order. We must start by eliminating epsilon productions. We have to do this step first because eliminating epsilon productions can produce unit productions that weren't there before, and, as we saw in an example, it can create useless symbols that were not useless before. That can only occur if the production was only used to derive the empty string, however. Second, eliminate the unit productions. Third, eliminate variables that derive no terminal strings.
We explained earlier why this step must precede the next step in our quest to eliminate all useless symbols. So finally, we do the fourth step of eliminating all the unreachable symbols. We've almost got our grammars into Chomsky Normal Form, or CNF. Such a grammar has only two kinds of production bodies: single terminals or pairs of variables. We're now going to give the construction of the CNF grammar for L minus epsilon, where L is any context-free language. Incidentally, one important use of putting grammars in CNF is that it gives us a relatively efficient algorithm for testing membership of a string in a context-free language. One might imagine that it would be easy to make such a test by looking at all derivations up to a certain limited length. But with epsilon productions and unit productions in the grammar, it is not obvious how long the derivation of even a short string of terminals has to be. Moreover, even if we could bound the length of the derivation, as we can, we would still be faced with an algorithm that takes an amount of time that is exponential in the length of the terminal string. Rather, by putting the grammar in CNF, we can make this test in time at most proportional to the cube of the length of the string. Our first step is to do the cleaning we just described. The result is a grammar that no longer generates the empty string, even if the old one did; but otherwise the languages are the same, and all production bodies are either a single terminal or have length at least two. Our second step is to turn those bodies that aren't a single terminal into bodies of all variables. The trick is simple: for each terminal little a, create a new variable that we'll call A sub a. The job of this variable is just to generate the terminal little a. So replace little a in all bodies of length two or more by A sub a, and add the production A sub a goes to little a. That is, of course, that.
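The terminal-to-variable step can be sketched as follows. This is only an illustration under the assumed dict representation; the `"A_" + terminal` naming is my own convention, standing in for A sub a, and it assumes those names don't clash with existing variables.

```python
def terminals_to_variables(grammar):
    """Replace each terminal in a body of length >= 2 by a fresh variable
    A_t with the single production A_t -> t."""
    new_grammar = {v: [] for v in grammar}
    term_var = {}                        # terminal -> its new variable
    for var, bodies in grammar.items():
        for body in bodies:
            if len(body) == 1:           # a single-terminal body stays as is
                new_grammar[var].append(body)
                continue
            new_body = []
            for sym in body:
                if sym in grammar:       # already a variable
                    new_body.append(sym)
                else:                    # terminal: swap in A sub sym
                    tv = term_var.setdefault(sym, "A_" + sym)
                    new_body.append(tv)
            new_grammar[var].append(tuple(new_body))
    for t, tv in term_var.items():       # add each A_t -> t production
        new_grammar[tv] = [(t,)]
    return new_grammar

# The upcoming example: A -> BcDe, with c and e terminals
g = {"A": [("B", "c", "D", "e")], "B": [("b",)], "D": [("d",)]}
print(terminals_to_variables(g)["A"])   # [('B', 'A_c', 'D', 'A_e')]
```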
Here's an example involving the production A goes to B little c D little e, where little c and little e are terminals. So we must have created variables A sub c and A sub e, with their productions A sub c goes to c and A sub e goes to e. Then we can replace A goes to BcDe by A goes to B, A sub c, D, A sub e. Now all production bodies that aren't a single terminal are strings of two or more variables. If exactly two, that's great, because that's what CNF requires. But if a body consists of three or more variables, we have to break the body into pieces using new variables that appear nowhere else. An example should make the idea clear. If we have a production A goes to BCDE, then we need two new variables, say F and G, and they are used only for this production; they may not appear in the new grammar except as we describe here. In general, if the body consists of N variables, we need N minus two new variables. The job of the first new variable F is to generate the whole body except for the first symbol, B in this case. That is, we replace this A production by A goes to BF. Now F needs to derive CDE, the rest of the body. But that's too long, so we use G to help. G derives only DE, that's this production G goes to DE, and the production for F becomes F goes to CG. Thus we've replaced A goes to BCDE by three productions that meet the CNF requirement: A goes to BF, F goes to CG, and G goes to DE. These three productions can simulate the effect of the original production, although they take three steps to do so. Thus, making this change surely allows the new grammar to generate anything the old one did. But the new grammar doesn't generate anything the old one didn't, so the languages are the same. The reason is that once we choose to replace A by BF, we are forced to replace F by CG and then G by DE, because these two variables each have only one production.
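The chaining of long bodies can be sketched like this. It is an illustration, not the lecture's code: fresh variables are named X1, X2, and so on (the slide's F and G correspond to X1 and X2 here), under the assumption that those names are not already in use.

```python
def split_long_bodies(grammar):
    """Replace each body of n >= 3 variables by a chain of n - 2 new
    variables, each with a two-variable body."""
    new_grammar = {}
    counter = [0]
    def fresh():
        counter[0] += 1
        return f"X{counter[0]}"          # assumed not to clash with existing names
    for var, bodies in grammar.items():
        new_grammar.setdefault(var, [])
        for body in bodies:
            if len(body) <= 2:           # already CNF-shaped: keep it
                new_grammar[var].append(body)
                continue
            head, rest = var, list(body)
            while len(rest) > 2:
                nv = fresh()             # nv generates all of rest but its first symbol
                new_grammar.setdefault(nv, [])
                new_grammar[head].append((rest[0], nv))
                head, rest = nv, rest[1:]
            new_grammar[head].append(tuple(rest))
    return new_grammar

# The slide's example: A -> BCDE becomes A -> B X1, X1 -> C X2, X2 -> D E
print(split_long_bodies({"A": [("B", "C", "D", "E")]}))
```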
Thus, using A goes to BF in the new grammar is equivalent to using A goes to BCDE in the old. We thus have an argument for why the transformation from clean grammars to CNF grammars does not change the language. One could do formal inductions on the length of derivations in these grammars, but I hope these less formal arguments are convincing.