Context-free grammars can be badly designed. For example, they can have variables that play no role in the derivation of any terminal string and therefore shouldn't be there. These are analogous to states in an automaton that aren't reachable from the start state. There are also certain productions that, while they may be necessary, cause derivations to take many steps that can obviously be combined. These include productions whose bodies are the empty string, or unit productions, where the body is a single variable. We can get rid of these, and the way to do so is similar to the way we removed epsilon transitions from an NFA. Finally, we're going to introduce Chomsky Normal Form, where all the production bodies are either a single terminal or two variables. Incidentally, the Chomsky we refer to is Noam Chomsky. Back in 1956, he wrote the paper that introduced the idea of context-free grammars. At the time he was a linguist trying to provide some mathematics for the structure of language. Since then he has unfortunately become somewhat notorious for his political views. Oh well, back to context-free grammars. Here is an example of a really bad grammar. A is okay; it derives all strings of one or more a's using its two productions. However, B has only one production, and that production has B in its body. Thus, once a sentential form has a B in it, you can never get rid of that B. As a result, B derives no terminal strings. To make matters worse, S has only one production, so that must be the first one used in any derivation. But that production introduces a B into the sentential form, so it is impossible to derive any terminal string from S, and therefore the language of this grammar is empty. Almost all the algorithms we need to simplify grammars are based on the same principle, which I call discovery algorithms. These discover facts by an induction process. The basis is always certain facts that are obvious.
Then, based on what is already known, we discover more facts in repeated rounds. Finally, at some round, we can discover no more facts. There's no point in going on, since without new facts in the current round, we cannot discover more on the next round. We generally have to prove only that any true fact of the type we are trying to discover will be found this way. Here's the image of discovery algorithms we should keep in mind. We start with some facts we get from the basis. We expand the set of facts we know by using the basis facts; this is the first round. For the second round, we expand the set of known facts further by using both the basis facts and the facts discovered in round one. We keep going until at some round there are no more facts that can be discovered. So one of the first things we need to do when dealing with grammars is to detect and get rid of variables that can't derive any terminal string. We shall give a discovery algorithm to find, inductively, all variables that do derive at least one terminal string; any variable not discovered by this algorithm derives no terminal string and can be eliminated. For the basis, if a variable A has a production whose body has no variables, then A certainly derives a terminal string in one step. Note that this body could be the empty string; technically, the empty string is a string of terminals. For the induction, suppose A has a production with body alpha, and alpha consists only of terminals and variables that we have previously discovered derive a terminal string; then A derives a terminal string. Since the number of variables is finite, eventually this algorithm terminates, when it can find no more variables that derive terminal strings. It is easy to prove that whenever the algorithm says a variable derives a terminal string, it really does derive a terminal string; we are not going to prove that.
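The rounds of this discovery algorithm can be sketched in a few lines of Python. This is only an illustration, not from the lecture: it assumes a grammar represented as a dict mapping each variable to a list of bodies, where a body is a tuple of symbols and any symbol that is not a dict key counts as a terminal.

```python
def generating_variables(grammar):
    """Discover all variables that derive at least one terminal string."""
    discovered = set()
    changed = True
    while changed:                      # one pass of the loop is one "round"
        changed = False
        for var, bodies in grammar.items():
            if var in discovered:
                continue
            for body in bodies:
                # basis and induction together: every symbol in the body is
                # a terminal or an already-discovered variable
                if all(sym not in grammar or sym in discovered for sym in body):
                    discovered.add(var)
                    changed = True
                    break
    return discovered

# The bad grammar from the lecture: S -> AB, A -> aA | a, B -> bB
g = {"S": [("A", "B")],
     "A": [("a", "A"), ("a",)],
     "B": [("b", "B")]}
print(generating_variables(g))   # {'A'} — S and B derive no terminal string
```

Note that the loop stops exactly when a round adds nothing new, matching the termination argument in the text.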
The harder part is showing that the algorithm doesn't miss anything; that is, if variable A derives some terminal string, then the algorithm will eventually discover A. We will do that on the next slide. The proof is an induction on the minimum height of a parse tree with root A and a yield of terminals. The basis is a tree of height one, which consists of root A and one or more leaves labeled by terminals, or perhaps epsilon. Then the basis step of the algorithm discovers A, so we are okay there. For the induction, assume the statement is true for heights up to H minus one; that is, all variables that are the roots of parse trees with height up to H minus one and a yield of terminals are discovered by the algorithm. A parse tree of height H, which must of course be at least two, has children of the root labeled X1 through Xn; that's this. Any one of these Xi's that is a variable is the root of a subtree of height at most H minus one, and therefore it is discovered. Moreover, one of these variables is discovered last. At the round where the last of the variables Xi is discovered, we must surely do another round, since the set of discovered variables just changed. On that next round, A will be discovered, because it has a production, A goes to X1 through Xn, whose body consists only of terminals and discovered variables. The algorithm to eliminate variables that derive no terminal string is now simple. Use the algorithm we just described to find the variables that do derive terminal strings. Call the other variables useless. Then remove from the grammar all productions in which at least one useless variable appears. It doesn't matter if the variable appears in the head, the body, or both. Here's an example grammar, to which we first apply the algorithm to find the variables that derive terminal strings. For the basis step, we immediately discover A and C, because they have productions with terminals only. Here they are.
For the first round of the induction, S is discovered because there is an S production with body C, and C was previously discovered. However, at the next round we can discover no more variables. The only variable we have not yet found to derive a terminal string is B, and B has only one production, whose body is little b followed by big B. This body does not consist only of terminals and discovered variables, so we can never add B to the set of discovered variables. Thus B is useless, and we eliminate all traces of it. That includes not only B's own production but also the production S goes to AB. That leaves us, fortunately, with S goes to C, as well as the two A productions and the C production. In addition to eliminating the variables that don't derive anything, we need to eliminate variables that derive some terminal strings but cannot be derived from the start symbol. The algorithm to find symbols, both terminals and variables, that appear in a derivation from the start symbol is a variation of the discovery algorithm. For the basis, obviously the start symbol can be derived in zero steps from itself. For the induction, suppose we have discovered that we can reach variable A. Then for every production with head A and body alpha, we can also reach the symbols appearing in alpha, both the terminals and the variables that appear there. It is an easy pair of inductions to show, first, that if we discover a symbol by this algorithm, then it appears in a sentential form derived from the start symbol, and second, that if we do not discover a symbol, then there is no derivation from the start symbol in which it appears. We are not going to give the proofs here. But remember, our goal is to get rid of symbols that do not appear in derivations from S. So after discovering all the symbols that do appear in a derivation, delete from the grammar all the productions that contain a symbol, in the head or body or both, that does not appear in a derivation.
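The reachability discovery can be sketched the same way. Again this is only an illustration with the same assumed dict representation; the small grammar at the bottom is a hypothetical example, not the one on the slide.

```python
def reachable_symbols(grammar, start):
    """Discover all symbols (terminals and variables) that appear in
    some derivation from the start symbol."""
    reached = {start}                    # basis: start reaches itself
    frontier = [start]
    while frontier:
        var = frontier.pop()
        # induction: every symbol in a body of a reached variable is reached
        for body in grammar.get(var, []):
            for sym in body:
                if sym not in reached:
                    reached.add(sym)
                    if sym in grammar:   # only variables have productions
                        frontier.append(sym)
    return reached

# Hypothetical grammar: S -> Aa, A -> b, and an unreachable B -> c
g = {"S": [("A", "a")],
     "A": [("b",)],
     "B": [("c",)]}
print(reachable_symbols(g, "S"))   # B and c are never reached
```

Everything the function does not return can be deleted, along with every production mentioning it.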
Say a symbol is useful if it appears in a derivation of a terminal string from the start symbol, and call it useless otherwise. There are two reasons a symbol can be useless: either it derives no terminal string, or it appears in no derivation from the start symbol, or both. We have algorithms to eliminate symbols that are useless for each of these reasons, but we must apply them in the right order. First eliminate the symbols that fail to derive a terminal string, and then eliminate symbols that do not appear in any derivation from the start symbol. In this example grammar, if, as we should, we first eliminate variables that do not derive a terminal string, we eliminate only B. However, eliminating productions with B gets rid of the only S production. We then use the algorithm to find symbols unreachable from the start symbol, and we find that everything is unreachable; that is, all the productions are deleted. However, if we do things in the wrong order and first eliminate unreachable symbols, we find everything is reachable from S, so nothing is eliminated here. Then, when we look for the symbols that do not derive terminal strings, we eliminate only B. That leaves the A productions and C goes to little c, these guys, which should not be there, because A, C, and little c are useless. Here's why first eliminating variables that don't derive terminal strings is the right thing to do. After eliminating those variables, every remaining symbol is either a terminal or a variable that derives a terminal string. After removing symbols not reachable from the start symbol, all remaining symbols appear in some derivation from the start symbol of some sentential form. But the variables that appear in such a sentential form still derive a terminal string, because such a derivation can only involve symbols that are also reachable from the start symbol.
Epsilon productions are those that have body epsilon. They can be eliminated from a context-free grammar, and the only thing we lose is that we can no longer derive the empty string. If the empty string was not in the language to begin with, then we can eliminate epsilon productions and still have a grammar that derives the same language. However, if epsilon was in the language, then we lose it. The two cases can be summarized by the theorem on the slide: if L has a grammar, then L minus the set containing the empty string has a grammar with no epsilon productions. Notice that if epsilon was not in L, then L minus epsilon is just L anyway. To eliminate epsilon productions, we need yet another discovery algorithm, this one to find the variables that derive the empty string by one or more steps. We call them nullable symbols. The basis of the discovery algorithm is that if A has a production with an empty body, then A is surely nullable. The induction is that if A has a production with body alpha, and alpha consists only of variables that are nullable, then A is also nullable. Here is an example grammar for which we will discover the nullable symbols. For the basis, we know that A is nullable because of the production with epsilon body; it's that. In the first round of the induction, we find B is nullable because of the production B goes to A; that is, all symbols in the body, namely the A, are already known to be nullable. In the second round of the induction, we discover that S is nullable because of its body AB, both symbols of which are already known to be nullable. This algorithm finds all and only the nullable symbols. We are not going to give the proof, which consists of two simple inductions. To eliminate epsilon productions from our grammar, we need to turn each production, say A goes to X1 through Xn, into a possibly large number of productions.
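The nullable-symbol discovery follows the same round-by-round pattern. A sketch, again assuming the dict representation used above, with an empty tuple standing for an epsilon body; the grammar below is my reading of the slide's example (S goes to AB, A goes to epsilon or aA, B goes to A or b):

```python
def nullable_variables(grammar):
    """Discover all variables that derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for var, bodies in grammar.items():
            if var in nullable:
                continue
            for body in bodies:
                # vacuously true for an empty body (the basis), otherwise
                # true when every symbol is an already-known nullable variable
                if all(sym in nullable for sym in body):
                    nullable.add(var)
                    changed = True
                    break
    return nullable

g = {"S": [("A", "B")],
     "A": [(), ("a", "A")],
     "B": [("A",), ("b",)]}
print(nullable_variables(g))   # all of S, A, and B are nullable
```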
The idea is to guess which of the nullable symbols in the body of a production will derive epsilon in a particular derivation. Since we make all possible guesses, by creating many different productions, we always manage to guess right. More precisely, for each set of nullable Xi's, eliminate those from the body of the production and make a new A production. Note that if two of the Xi's are the same nullable symbol, then we have to consider the possibility that one occurrence derives epsilon and the other does not. That is, we form one production for each set of positions that hold nullable symbols, not just for each set of nullable symbols. However, in the special case that all the Xi's are nullable symbols, we do not consider the set of all positions, and we do not create a new production with an epsilon body. Here is an example grammar. Each of A, B, and C is nullable because they have epsilon productions. Thus, S is also nullable because of its production S goes to ABC. Let's construct the new grammar. For the S production, there are seven subsets of nullable positions that we must use. The set of all three positions holds nullable symbols too, of course, but eliminating all of A, B, and C would leave an empty body, which we don't allow. If we use the empty set of positions, we get body ABC. If we use the set of only the third position, we get AB, and so on. Now let's look at the A productions. We do not use A goes to epsilon, of course, but for the production A goes to little a big A, it's that. Only the second position is nullable, so we get two productions, one with the variable A present and the other not; that is what we have here. The situation for B is the same. However, for C, we cannot use the C goes to epsilon production, so in the new grammar C has no production. That means that in the new grammar, although not in the old grammar, C is useless and must be eliminated. That forces us to eliminate all productions with C in the body. And we are done.
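The subset-of-positions construction can be sketched directly. This is an illustration under the same assumed representation; it produces the new bodies but leaves the follow-up cleanup of newly useless variables, like C in the example, to the earlier algorithms.

```python
from itertools import combinations

def eliminate_epsilon(grammar, nullable):
    """Build new bodies by dropping every subset of nullable positions,
    never creating an epsilon body."""
    new_grammar = {}
    for var, bodies in grammar.items():
        new_bodies = set()
        for body in bodies:
            # positions, not symbols: repeated nullable symbols are handled
            positions = [i for i, sym in enumerate(body) if sym in nullable]
            for r in range(len(positions) + 1):
                for drop in combinations(positions, r):
                    candidate = tuple(sym for i, sym in enumerate(body)
                                      if i not in drop)
                    if candidate:        # skip the empty (epsilon) body
                        new_bodies.add(candidate)
        if new_bodies:                   # variables left with no bodies vanish
            new_grammar[var] = sorted(new_bodies)
    return new_grammar

# The slide's example: S -> ABC, A -> aA | eps, B -> bB | eps, C -> eps
g = {"S": [("A", "B", "C")],
     "A": [("a", "A"), ()],
     "B": [("b", "B"), ()],
     "C": [()]}
ng = eliminate_epsilon(g, {"S", "A", "B", "C"})
print(len(ng["S"]))      # 7 — the seven non-empty subsets of positions
print("C" in ng)         # False — C has no productions left
```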
The proof that the new grammar we construct generates the same strings, except for epsilon, as the old grammar generates is a little tricky, so we're going to give the details in one direction only. As is often the case, we need to prove something more general than we really need. In this case, we need two statements about every variable A. First, if W is a non-empty string that is derived from A in the old grammar, then A also derives W in the newly constructed grammar. Second, if A derives W in the new grammar, then first of all W is not empty, and second, A derives W in the old grammar. Once we have that for all A, we can apply the statements to the start symbol and thus prove that the language of the new grammar is the language of the old grammar, with epsilon removed if it was in that language. We're going to prove the first direction; it is an induction on the number of steps by which A derives W in the old grammar. For the basis, if there's a one-step derivation of W from A, then A goes to W must be a production. We assume in part one that W is not empty, so A goes to W is a production of the new grammar. We reach the desired conclusion that A derives W in the new grammar. Now let's do the induction. Assume the inductive hypothesis for derivations of fewer than K steps, and suppose W is derived from A in K steps of the old grammar. The first step of this derivation must be A replaced by the body of some A production. Assume this body consists of symbols X1 through Xn. Then we can break W into W1 through Wn, where each Wi is the portion of W that either is Xi, if Xi is a terminal, or is derived from Xi, if Xi is a variable. All these derivations from variables take fewer than K steps. If Xi is a variable and the piece Wi is not empty, then the inductive hypothesis tells us that Xi derives Wi in the new grammar. When we constructed the new grammar, we replaced the A production A goes to X1 through Xn
by a family of productions, and one of these eliminates from the body exactly those Xi's such that Wi is epsilon. We know that is the case, because if Wi is epsilon, then surely Xi is nullable. We also know that not all Wi's are epsilon, because W is not epsilon; that is, at least one Xi is either a terminal or a variable that we do not need to remove from the body. Thus, in the new grammar, the first step can replace A by a body consisting of all those Xi's such that Wi is not epsilon. We can continue the derivation in the new grammar by a derivation, from each Xi that remains, of the non-empty string Wi. We know this derivation in the new grammar exists by the inductive hypothesis. We also need to show part two: if W is derived from A in the new grammar, then it is non-empty and is also derived in the old grammar. We're going to skip this part. So we're now ready for a new simplification of grammars, the elimination of unit productions. A unit production is one whose body is a single variable. We can eliminate all such productions. The idea is to discover, using a discovery algorithm, all pairs of variables (A, B) such that A derives B by a sequence of unit productions. Eventually, a sequence of unit productions must end with the use of some other kind of production, so we can collapse the steps that use unit productions into the next step that does not use a unit production. That is, if B goes to alpha is a non-unit production, and A derives B by unit productions, then we add the production A goes to alpha. Finally, we drop all unit productions. At this point we do not need the unit productions and can drop them from the grammar. Here's the algorithm that discovers the pairs of variables (A, B) such that A derives B by unit productions. For the basis, surely A derives A by unit productions only; this is a sequence of zero steps. For the induction, suppose we have already discovered that A derives B by unit productions. Then if B goes to C is a unit production, we may add the pair (A, C).
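Both the pair discovery and the collapse step fit in a short sketch, again under the assumed dict representation and not taken from the lecture itself:

```python
def unit_pairs(grammar):
    """Discover all pairs (A, B) with A deriving B by unit productions."""
    pairs = {(a, a) for a in grammar}        # basis: zero-step derivations
    changed = True
    while changed:
        changed = False
        for (a, b) in list(pairs):
            for body in grammar.get(b, []):
                # a unit production: body is a single variable
                if len(body) == 1 and body[0] in grammar:
                    if (a, body[0]) not in pairs:
                        pairs.add((a, body[0]))
                        changed = True
    return pairs

def eliminate_units(grammar):
    """For each pair (A, B), copy B's non-unit bodies to A."""
    new_grammar = {a: [] for a in grammar}
    for (a, b) in unit_pairs(grammar):
        for body in grammar[b]:
            if not (len(body) == 1 and body[0] in grammar):
                if body not in new_grammar[a]:
                    new_grammar[a].append(body)
    return new_grammar

# S -> A, A -> B | a, B -> b: the chain of unit productions collapses
g = {"S": [("A",)], "A": [("B",), ("a",)], "B": [("b",)]}
print(sorted(eliminate_units(g)["S"]))   # S now derives a and b directly
```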
There are two proofs that we need, but we're not going to do them here. First, an induction on the number of rounds in the discovery process shows that the pairs we find are valid; that is, they really are pairs (A, B) such that A derives B by unit productions. Then, conversely, we can show by an induction on the number of steps of a derivation from A to B using unit productions that we will in fact discover the pair (A, B). Another proof that we're going to skip is the proof that the new grammar has the same language as the old. Again, we have to prove something more general: that A derives W in the new grammar if and only if it does so in the old grammar. Each production of the new grammar simulates one or more steps of a derivation in the old grammar, that is, some number of unit productions, perhaps zero, followed by a non-unit production. Conversely, every use of a production in the new grammar can be replaced by zero or more unit productions followed by a non-unit production in the old grammar. We can now combine the three simplifications to make a strong statement about grammars. If L is a context-free language, then L minus epsilon has a grammar with no useless symbols, no epsilon productions, and no unit productions. Another way to put it is that L minus epsilon has a grammar in which every production body is either a single terminal or has length at least two. We apply the constructions just learned, but we have to be careful about the order. We must start by eliminating epsilon productions. We have to do this step first because eliminating epsilon productions can produce unit productions that weren't there before, and, as we saw in an example, it can create useless symbols that were not useless before. That can only occur if the production was only used to derive the empty string, however. Second, eliminate the unit productions. Third, eliminate variables that derive no terminal strings.
We explained earlier why this step must precede the next step in our quest to eliminate all useless symbols. So finally, we do the fourth step of eliminating all the unreachable symbols. We've almost got our grammars into Chomsky Normal Form, or CNF. Such a grammar has only two kinds of production bodies: single terminals or pairs of variables. We're now going to give the construction of the CNF grammar for L minus epsilon, where L is any context-free language. Incidentally, one important use of putting grammars in CNF is that it gives us a relatively efficient algorithm for testing membership of a string in a context-free language. One might imagine that it would be easy to make such a test by looking at all derivations up to a certain limited length. But with epsilon productions and unit productions in the grammar, it is not obvious how long the derivation of even a short string of terminals has to be. Moreover, even if we could bound the length of the derivation, as we can, we would still be faced with an algorithm that takes an amount of time that is exponential in the length of the terminal string. Rather, by putting the grammar in CNF, we can make this test in time at most proportional to the cube of the length of the string. Our first step is to do the cleaning we just described. The result is a grammar that no longer generates the empty string, even if the old one did; but otherwise the languages are the same, and all production bodies are either a single terminal or have length at least two. Our second step is to turn those bodies that aren't a single terminal into bodies of all variables. The trick is simple: for each terminal little a, create a new variable that we'll call A sub a. The job of this variable is just to generate the terminal little a. So replace little a in all bodies of length two or more by A sub a, and add the production A sub a goes to little a. That is, of course, that.
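The terminal-to-variable step can be sketched as follows. This is only an illustration under the assumed dict representation; the `"A_" + terminal` naming is my own convention, standing in for A sub a, and it assumes those names don't clash with existing variables.

```python
def terminals_to_variables(grammar):
    """Replace each terminal in a body of length >= 2 by a fresh variable
    A_t with the single production A_t -> t."""
    new_grammar = {v: [] for v in grammar}
    term_var = {}                        # terminal -> its new variable
    for var, bodies in grammar.items():
        for body in bodies:
            if len(body) == 1:           # a single-terminal body stays as is
                new_grammar[var].append(body)
                continue
            new_body = []
            for sym in body:
                if sym in grammar:       # already a variable
                    new_body.append(sym)
                else:                    # terminal: swap in A sub sym
                    tv = term_var.setdefault(sym, "A_" + sym)
                    new_body.append(tv)
            new_grammar[var].append(tuple(new_body))
    for t, tv in term_var.items():       # add each A_t -> t production
        new_grammar[tv] = [(t,)]
    return new_grammar

# The upcoming example: A -> BcDe, with c and e terminals
g = {"A": [("B", "c", "D", "e")], "B": [("b",)], "D": [("d",)]}
print(terminals_to_variables(g)["A"])   # [('B', 'A_c', 'D', 'A_e')]
```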
Here's an example involving the production A goes to B little c D little e, where little c and little e are terminals. So we must have created variables A sub c and A sub e, with their productions A sub c goes to c and A sub e goes to e. Then we can replace A goes to BcDe by A goes to B, A sub c, D, A sub e. Now all production bodies that aren't a single terminal are strings of two or more variables. If exactly two, that's great, because that's what CNF requires. But if a body consists of three or more variables, we have to break the body into pieces using new variables that appear nowhere else. An example should make the idea clear. If we have a production A goes to BCDE, then we need two new variables, say F and G, and they are used only for this production; they may not appear in the new grammar except as we describe here. In general, if the body consists of N variables, we need N minus two new variables. The job of the first new variable F is to generate the whole body except for the first symbol, B in this case. That is, we replace this A production by A goes to BF. Now F needs to derive CDE, the rest of the body. But that's too long, so we use G to help. G derives only DE, that's this production G goes to DE, and the production for F becomes F goes to CG. Thus we've replaced A goes to BCDE by three productions that meet the CNF requirement: A goes to BF, F goes to CG, and G goes to DE. These three productions can simulate the effect of the original production, although they take three steps to do so. Thus, making this change surely allows the new grammar to generate anything the old one did. But the new grammar doesn't generate anything the old one didn't, so the languages are the same. The reason is that once we choose to replace A by BF, we are forced to replace F by CG and then G by DE, because these two variables each have only one production.
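The chaining of long bodies can be sketched like this. It is an illustration, not the lecture's code: fresh variables are named X1, X2, and so on (the slide's F and G correspond to X1 and X2 here), under the assumption that those names are not already in use.

```python
def split_long_bodies(grammar):
    """Replace each body of n >= 3 variables by a chain of n - 2 new
    variables, each with a two-variable body."""
    new_grammar = {}
    counter = [0]
    def fresh():
        counter[0] += 1
        return f"X{counter[0]}"          # assumed not to clash with existing names
    for var, bodies in grammar.items():
        new_grammar.setdefault(var, [])
        for body in bodies:
            if len(body) <= 2:           # already CNF-shaped: keep it
                new_grammar[var].append(body)
                continue
            head, rest = var, list(body)
            while len(rest) > 2:
                nv = fresh()             # nv generates all of rest but its first symbol
                new_grammar.setdefault(nv, [])
                new_grammar[head].append((rest[0], nv))
                head, rest = nv, rest[1:]
            new_grammar[head].append(tuple(rest))
    return new_grammar

# The slide's example: A -> BCDE becomes A -> B X1, X1 -> C X2, X2 -> D E
print(split_long_bodies({"A": [("B", "C", "D", "E")]}))
```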
Thus, using A goes to BF in the new grammar is equivalent to using A goes to BCDE in the old. We thus have an argument for why the transformation from clean grammars to CNF grammars does not change the language. One could do formal inductions on the length of derivations in these grammars, but I hope these less formal arguments are convincing.