1
00:00:00,000 --> 00:00:04,137
Context-free grammars can be badly
designed. For example they can have

2
00:00:04,137 --> 00:00:08,452
variables that play no role in the
derivation of any terminal string and

3
00:00:08,452 --> 00:00:13,418
therefore shouldn't be there. As analogous
to states in the [inaudible] an automaton

4
00:00:13,418 --> 00:00:18,785
that aren't reachable from the start
state. There are also certain productions

5
00:00:18,785 --> 00:00:23,487
that, while they are necessary, cause
derivations to take many steps that can

6
00:00:23,487 --> 00:00:28,127
obviously be combined. These include
productions whose bodies are the empty

7
00:00:28,127 --> 00:00:33,138
string or unit productions where the body
is a single variable. We can get rid of

8
00:00:33,138 --> 00:00:38,087
these and the way to do so is similar to
the way we removed epsilon transitions

9
00:00:38,087 --> 00:00:42,702
from an NFA. Finally, we're going to
introduce [inaudible] normal form where

10
00:00:42,702 --> 00:00:47,128
all the production bodies are either a
single terminal or two variables.

11
00:00:47,128 --> 00:00:51,985
Incidentally, the Chomsky we refer to is
Noam Chomsky. Back in 1956, he wrote the

12
00:00:51,985 --> 00:00:56,288
paper that introduced the idea of
context-free grammars. Then he was a

13
00:00:56,288 --> 00:01:00,529
linguist trying to provide some
mathematics for the structure of the

14
00:01:00,529 --> 00:01:05,140
language. Since then he has unfortunately
became somewhat notorious for his

15
00:01:05,140 --> 00:01:09,771
political views. Oh well, back to
context-free grammars. Here is an example

16
00:01:09,771 --> 00:01:15,016
of a really bad grammar. A is okay, it
derives all strings of one or more A's

17
00:01:15,016 --> 00:01:20,538
using its two productions. However B has
only one production and that production

18
00:01:20,538 --> 00:01:25,438
has B in its body. Thus, once it's
sentential form has a B in it, you can

19
00:01:25,438 --> 00:01:30,779
never get rid of that B. And as a result,
B derives no terminal strings. To make

20
00:01:30,779 --> 00:01:35,810
matters worse, S is only one production.
So that must be the first used than any

21
00:01:35,810 --> 00:01:41,031
derivation. But that production introduces
a B into the cententional form so it is

22
00:01:41,031 --> 00:01:46,317
impossible to derive any terminal string
from S and therefore the language of this

23
00:01:46,317 --> 00:01:50,975
grammar is empty. Almost all the
algorithms we need to simplify grammars

24
00:01:50,975 --> 00:01:56,240
are based on the same principle, which I
call discovery algorithms. These discover

25
00:01:56,240 --> 00:02:01,533
facts by an induction process The basis is
always certain facts that are obvious.

26
00:02:01,533 --> 00:02:06,384
Then based on what is already known, we
discover facts and repeated rounds.

27
00:02:06,384 --> 00:02:11,628
Finally, at some round, they can discover
no more facts. There's no point in going

28
00:02:11,628 --> 00:02:17,134
on since without new facts in the current
round, we cannot discover more on the next

29
00:02:17,134 --> 00:02:22,116
round. We generally have to prove only
that any true fact of the type we are

30
00:02:22,116 --> 00:02:27,736
trying to discover will be found this way.
Here's the image of discovery algorithms

31
00:02:27,736 --> 00:02:33,972
we should keep in mind. We start with some
facts you get from the basis. We expand

32
00:02:33,972 --> 00:02:40,227
the set of facts you know by using the
basis fact, this is the first round. For

33
00:02:40,227 --> 00:02:44,937
the second round, we expand the set of
known facts further by using both the

34
00:02:44,937 --> 00:02:50,310
basic facts and the facts you discovered
around one. You keep going until some

35
00:02:50,310 --> 00:02:55,265
round you have no more facts that can be
discovered. So one of the first things we

36
00:02:55,265 --> 00:03:00,158
need to do when dealing with grammars is
to detect and get rid of variables that

37
00:03:00,158 --> 00:03:04,810
can't derive any terminal string. We shall
give a discovery algorithm to find

38
00:03:04,810 --> 00:03:09,825
inductively, all variables that do derive
at least one terminal string, any variable

39
00:03:09,825 --> 00:03:14,840
not discovered by this algorithm derives
no terminal strings and can be eliminated.

40
00:03:14,840 --> 00:03:20,090
For the basis of a variable A has a
production who's body has no variables,

41
00:03:20,090 --> 00:03:25,899
then A certainly derives a terminal string
in one step. Note that this body could be

42
00:03:25,899 --> 00:03:31,236
the empty string, technically string is a
string of terminals. Now suppose A has

43
00:03:31,236 --> 00:03:36,641
production with a body alpha and alpha
consists of only terminals and variables

44
00:03:36,641 --> 00:03:42,249
that we have previously discovered derives
from terminal string and A derives from

45
00:03:42,249 --> 00:03:46,377
terminal string. Since the number of
variables is finite eventually this

46
00:03:46,377 --> 00:03:50,731
algorithm terminates where it can't find
no more variables that derive terminal

47
00:03:50,731 --> 00:03:54,704
strings. Its is easy to prove that
whenever the algorithm says a variable

48
00:03:54,704 --> 00:03:59,167
derives a terminal string that it really
does derive a terminal string, we are not

49
00:03:59,167 --> 00:04:03,466
going to prove that. The harder part is
showing that the algorithm doesn't miss

50
00:04:03,466 --> 00:04:07,765
anything, that is if variable A derives
some terminal string, then the algorithm

51
00:04:07,765 --> 00:04:11,956
will eventually discover A, we will do
that on the next slide. The proof is in

52
00:04:11,956 --> 00:04:16,255
the induction of the minimum height of a
parse tree with root A and a yield of

53
00:04:16,255 --> 00:04:21,444
terminals. The basis is a tree of height
one which consists of root A and one or

54
00:04:21,444 --> 00:04:26,420
more leaves labeled by terminals or
perhaps epsilon. Then the basis step of

55
00:04:26,420 --> 00:04:31,463
the algorithm discovers A, so we are okay
there. For the induction, assume the

56
00:04:31,463 --> 00:04:36,837
statement is true for height up to H-1,
that is all variables at are the roots of

57
00:04:36,837 --> 00:04:42,411
parse trees with a height up to H-1 and a
yield of terminals that are discovered by

58
00:04:42,411 --> 00:04:48,028
the algorithm. The parse tree for height
H, which must be of course equal to or

59
00:04:48,028 --> 00:04:54,108
greater than two, has children of the root
labeled X1 through Xn, that's this. Anyone

60
00:04:54,108 --> 00:05:00,189
of these Xi's that is a variable is the
root of the sub tree of height at most H-1

61
00:05:00,189 --> 00:05:06,652
and therefore it is discovered. Moreover,
one of these variables is discovered last.

62
00:05:06,652 --> 00:05:11,419
At the round, with the last of the
variables AIs is discovered, we must

63
00:05:11,419 --> 00:05:17,003
surely do another round since the set of
discovered variables just changed. On the

64
00:05:17,003 --> 00:05:22,314
next round, A will be discovered because
it has a production, that is A goes to

65
00:05:22,314 --> 00:05:27,464
X1through XN. Who's body consists of only
bodies and discovered variables. The

66
00:05:27,464 --> 00:05:32,729
algorithms do eliminate variables that
derive no terminal string is no-, simple.

67
00:05:32,729 --> 00:05:37,927
Use the algorithm that we just described
to find the variables that do derive

68
00:05:37,927 --> 00:05:44,079
terminal strings. Call the other variables
useless. Then remove from the grammar all

69
00:05:44,079 --> 00:05:49,729
productions in which in least one useless
variable appears. It doesn't matter if the

70
00:05:49,729 --> 00:05:55,178
variable appears in the head, the body or
both. Here's an example grammar to which

71
00:05:55,178 --> 00:06:00,585
we first apply the algorithm that derive
terminal strings. For the basis step, we

72
00:06:00,585 --> 00:06:06,335
immediately discover A and C because they
have productions with terminals only. Here

73
00:06:06,335 --> 00:06:11,606
they are. For the first round of the
production, S is discovered because there

74
00:06:11,606 --> 00:06:17,238
is a S production with body C. And C was
previously discovered. However at the next

75
00:06:17,238 --> 00:06:22,949
round we can discover no more variables.
The only variable we have not yet found to

76
00:06:22,949 --> 00:06:28,660
derive a terminal string is B and B has
only one production body which is little B

77
00:06:28,660 --> 00:06:33,889
followed by big B. This body does not
exist only of terminals and discovered

78
00:06:33,889 --> 00:06:39,697
variables so we can never add B to the set
of discovered variables. Unless B is

79
00:06:39,697 --> 00:06:47,877
useless and we eliminate all traces of it.
That includes not only the production B

80
00:06:47,877 --> 00:06:55,957
goes to but the production S goes to AB,
leaving us of course, fortunately, S goes

81
00:06:55,957 --> 00:07:01,749
to C, still remains as well as the 2A
production and the C production. In

82
00:07:01,749 --> 00:07:06,272
addition to eliminating the variables that
don't derive anything, we need to

83
00:07:06,272 --> 00:07:10,795
eliminate variables that derive some
terminal strings but cannot be derived

84
00:07:10,795 --> 00:07:15,139
from the start symbol. The algorithm to
define symbols both terminals and

85
00:07:15,139 --> 00:07:19,900
variables that appear in variation from
the start symbol is a derivation of the

86
00:07:19,900 --> 00:07:24,423
discovery algorithm. For the basis,
obviously the start symbol can be derived

87
00:07:24,423 --> 00:07:29,469
from zero steps from itself. For the
induction, suppose we have discovered that

88
00:07:29,469 --> 00:07:35,146
we can reach variable A and then for every
production with body alpha and head A, we

89
00:07:35,146 --> 00:07:40,552
can also redraw the symbols appearing in
alpha. The terminals and variables that

90
00:07:40,552 --> 00:07:45,351
appear there. It is an easy pair of
inductions to show, first that if we

91
00:07:45,351 --> 00:07:50,622
discover an a symbol by this algorithm,
that it appears in the sentential form

92
00:07:50,622 --> 00:07:55,708
derived from the start symbol. Second,
that if we do not discover a symbol, then

93
00:07:55,708 --> 00:08:00,678
there is no derivation from the start
symbol from which it appears, we are not

94
00:08:00,678 --> 00:08:05,472
going to give the proofs here. But
remember our goal is to get rid of that do

95
00:08:05,472 --> 00:08:10,500
not appear on the derivations from S. So
after discovering all the symbols that do

96
00:08:10,500 --> 00:08:15,529
appear in a derivation, delete from the
grammar all the productions that contain a

97
00:08:15,529 --> 00:08:20,558
symbol, and now that head or body or both,
that does not appear in a derivation. Say

98
00:08:20,558 --> 00:08:25,402
a symbol is useful if it appears in a
derivation of a terminal string from the

99
00:08:25,402 --> 00:08:30,763
start symbol. And call it useless
otherwise. There are two reasons a symbol

100
00:08:30,763 --> 00:08:35,417
can be useless, either it derives no
terminal string or it appears in no

101
00:08:35,417 --> 00:08:40,654
derivation from the start symbol, or both.
We have algorithms to eliminate symbols

102
00:08:40,654 --> 00:08:46,020
that are useless fro each of the reasons
but we must apply them in the right order.

103
00:08:46,300 --> 00:08:51,791
First eliminate the symbols that fail to
derive a terminal string. And then

104
00:08:51,791 --> 00:08:57,578
eliminate symbols that do not appear from
any derivation from the start symbol. In

105
00:08:57,578 --> 00:09:03,775
this example grammar, if as we should, we
first eliminate variables that do not

106
00:09:03,775 --> 00:09:09,893
derive a terminal string. We eliminate
only B, however eliminating productions

107
00:09:09,893 --> 00:09:15,810
with B gets rids of the only S production.
We then use the algorithm to find symbol

108
00:09:15,810 --> 00:09:20,971
unreachable from the start symbol and we
find that everything is unreachable, that

109
00:09:20,971 --> 00:09:25,502
is all the productions are deleted.
However, if we do things in the wrong

110
00:09:25,502 --> 00:09:30,411
order and first eliminate unreachable
symbols, we find everything is reachable

111
00:09:30,411 --> 00:09:36,603
from S so nothing is eliminated here. Then
when we look from the symbols that do not

112
00:09:36,603 --> 00:09:42,362
derive terminal strings, we eliminate only
B. That leaves the productions A and C

113
00:09:42,362 --> 00:09:48,337
goes to little c, these guys, which should
not be there because A, C and little c are

114
00:09:48,337 --> 00:09:53,664
useless. Here's why first eliminating
variables that don't derive terminal

115
00:09:53,664 --> 00:09:59,316
strings is the right thing to do. After
eliminating those variables, every

116
00:09:59,316 --> 00:10:04,721
remaining symbol is either a terminal or
is a variable that derives a terminal

117
00:10:04,721 --> 00:10:09,736
string. After removing symbols not
reachable form the start symbol, all

118
00:10:09,736 --> 00:10:14,524
remaining symbols appear in some
derivation from the start symbol of some

119
00:10:14,524 --> 00:10:20,832
sentential form. But the variables that
appear in sentential form still derive at

120
00:10:20,832 --> 00:10:25,683
terminal string because such a derivation
can only involve symbols that are also

121
00:10:25,683 --> 00:10:30,295
reachable from the start symbol. Epsilon
productions are those that have body

122
00:10:30,295 --> 00:10:34,966
epsilon, they can be eliminated from a
context free grammar and the only thing

123
00:10:34,966 --> 00:10:40,037
that we lose is that we can no longer
derive the empty string. If the empty

124
00:10:40,037 --> 00:10:44,987
stream was not on the language to begin
with, then we can eliminate epsilon

125
00:10:44,987 --> 00:10:50,200
productions and still have a grammar that
derives the same language. However, if

126
00:10:50,200 --> 00:10:55,678
epsilon was in the language then we lose
it, the two cases can be summarized by the

127
00:10:55,678 --> 00:11:00,509
theorem on the slide, if L has the
grammar. Then L minus the set containing

128
00:11:00,509 --> 00:11:05,465
the empty string has a grammar with no
epsilon productions. Notice that if an

129
00:11:05,465 --> 00:11:10,614
epsilon was not an L then L minus epsilon
is just L anyway. To eliminate epsilon

130
00:11:10,614 --> 00:11:15,956
productions, we need yet another discovery
algorithm, this one to find the variables

131
00:11:15,956 --> 00:11:20,655
that derive the empty string by one or
more steps. We call them null-able

132
00:11:20,655 --> 00:11:25,883
symbols. The basis of the discovery
algorithm is that if A has a production

133
00:11:25,883 --> 00:11:31,864
with an empty body then it is surely
nullable. And the inductions is that if A

134
00:11:31,864 --> 00:11:37,207
has a production with body alpha and alpha
consists only of variables that are

135
00:11:37,207 --> 00:11:42,864
nullable then A is also nullable. Here is
an example grammar for which we will

136
00:11:42,864 --> 00:11:49,217
discover the nullable symbols. The basis
we know that A is nullable because of the

137
00:11:49,217 --> 00:11:55,093
production with epsilon body, its that. In
the first round of the induction we find B

138
00:11:55,093 --> 00:12:00,409
is nullabe because of the production B
goes to A. That is all symbols on the

139
00:12:00,409 --> 00:12:05,865
body, namely the A, are already known to
be nullable. In the second round of the

140
00:12:05,865 --> 00:12:11,251
basis we discover that S is nullable
because of it's body AB, both symbols of

141
00:12:11,251 --> 00:12:16,999
which are already known to be nullable.
This algorithm finds all and only the

142
00:12:16,999 --> 00:12:22,581
nullable symbols. We are not going to give
the proof, which consists of two simple

143
00:12:22,581 --> 00:12:28,186
inductions. To eliminate epsilon
productions from our grammar we need to we

144
00:12:28,186 --> 00:12:34,092
need to turn each production, say A goes
to epsilon through Xn, into a possibly

145
00:12:34,092 --> 00:12:40,302
large number of productions. The idea is
to guess which of the nullable symbols of

146
00:12:40,302 --> 00:12:46,208
the body of a production will derive
epsilon in a particular derivation, since

147
00:12:46,208 --> 00:12:52,342
we make all possible guess by creating
many different many productions, we always

148
00:12:52,342 --> 00:12:58,300
manage to guess right. More precisely, for
each set of nullable Xis. These from the

149
00:12:58,300 --> 00:13:03,169
body of the production make a new A
production. Note, that if two of the Xi's

150
00:13:03,169 --> 00:13:08,230
are the same nullable symbol then we have
to consider the possibility that one

151
00:13:08,230 --> 00:13:13,483
[inaudible] derives epsilon and the other
does not. That is we form one production

152
00:13:13,483 --> 00:13:18,480
for each set of positions that hold
nullable symbols, not just for each set of

153
00:13:18,480 --> 00:13:23,624
nullable symbols. However in the special
case that all the Xi's are nullable

154
00:13:23,624 --> 00:13:29,052
symbols, we do not consider the set of all
positions and we do not create a new

155
00:13:29,052 --> 00:13:40,102
production with the epsilon body. Here is
an example grammar. Each of A, B, and C

156
00:13:40,102 --> 00:13:46,863
are nullable because they have epsilon
productions. Thus, S is also nullable

157
00:13:46,863 --> 00:13:56,599
because of it's production S goes to A, B,
C. Lets construct the new grammar. For the

158
00:13:56,599 --> 00:14:02,470
S production there are seven sub sets of
nullable positions that we must use. The

159
00:14:02,470 --> 00:14:08,050
set of all three positions is also
nullable of course, but eliminating all of

160
00:14:08,050 --> 00:14:13,920
A, B, C would leave an empty body which we
don't allow. However if we use the empty

161
00:14:13,920 --> 00:14:21,716
set positions we get body A, B, C. If we
use the set of only the third position, we

162
00:14:21,716 --> 00:14:29,721
get AB, and so on. Now lets look at the A
productions. We do not use A goes to

163
00:14:29,721 --> 00:14:38,012
epsilon of course, but for production A
goes to little A big A, its that. Only the

164
00:14:38,012 --> 00:14:45,055
second position is nullable so we get two
productions, one with the variable A

165
00:14:45,055 --> 00:14:52,188
present and the other not, that is what we
have here. The situation for B is the

166
00:14:52,188 --> 00:15:00,851
same. However C we, for C we cannot use
the C goes to epsilon production. So in

167
00:15:00,851 --> 00:15:06,344
the new grammar, C has no production. That
means that in the new grammar, although

168
00:15:06,344 --> 00:15:11,631
not in the old grammar, C is useless and
must be eliminated. That forces us to

169
00:15:11,631 --> 00:15:17,517
eliminate all productions with C in the
body. And we are done. The proof that the

170
00:15:17,517 --> 00:15:22,427
new grammar we construct generates the
same strings except for epsilon as the old

171
00:15:22,427 --> 00:15:27,158
grammar generates is a little tricky. So
we're going to give the details in one

172
00:15:27,158 --> 00:15:31,888
direction. As is often the case, we need
to prove something more general than we

173
00:15:31,888 --> 00:15:36,835
really need. In this case, we need two
statements about every variable A. First,

174
00:15:36,835 --> 00:15:42,715
if W is a non-empty string that is derived
from A in the old grammar, then A also

175
00:15:42,715 --> 00:15:48,232
derives W in the newly constructed
grammar. Second, if A derives W in the new

176
00:15:48,232 --> 00:15:53,895
grammar, first of all W is not empty and
second A derives W in the old grammar.

177
00:15:53,895 --> 00:15:59,993
Once we have that for all A we can apply
the statement to the start symbol and thus

178
00:15:59,993 --> 00:16:06,018
prove that the language of the new grammar
is the language of the old grammar with

179
00:16:06,018 --> 00:16:12,272
epsilon removed if it was in that
language. We're going to prove the first

180
00:16:12,272 --> 00:16:17,312
direction and it is an induction on the
number of steps by which A derives W in

181
00:16:17,312 --> 00:16:24,662
the old grammar. For the basis, if there's
a one step derivation of W from A then A

182
00:16:24,662 --> 00:16:32,373
goes to W must be a production. We assume
in part one that W's not empty, so A goes

183
00:16:32,373 --> 00:16:38,887
to W is a production of the new grammar.
You make a desired conclusion that A

184
00:16:38,887 --> 00:16:46,378
derives W in the new grammar. Now let's do
the induction. Assume the inductive

185
00:16:46,378 --> 00:16:53,290
hypothesis for derivations at fewer than K
steps and suppose W is derived from A in K

186
00:16:53,290 --> 00:17:00,192
steps of the old grammar. The first step
of this derivation must be A replaced by

187
00:17:00,192 --> 00:17:05,620
the body of some A production. Assume this
body consists of symbols X1 through Xn.

188
00:17:06,500 --> 00:17:13,512
Then we can break W into W1 through Wn
where each WI is the portion of W that

189
00:17:13,512 --> 00:17:20,615
either is XI if XI is a terminal or it's
derived by XI if XI is a variable. All

190
00:17:20,615 --> 00:17:29,761
these derivations from variables are in
fewer than K steps. If XI is a variable

191
00:17:29,761 --> 00:17:36,237
and the piece WI is not empty, then the
inductive hypothesis tells us that XI

192
00:17:36,237 --> 00:17:45,341
derives WI in the new grammar. When we
construct the new grammar we replace the A

193
00:17:45,341 --> 00:17:50,840
production A goes to XI through Xn. By a
family of productions and one of these

194
00:17:50,840 --> 00:17:56,476
eliminates from the body exactly those
XI's such that WI is epsilon. We know that

195
00:17:56,476 --> 00:18:02,385
is the case because if WI is epsilon then
surely XI is nullable. We also know that

196
00:18:02,385 --> 00:18:08,438
not all WI's are epsilon because W is not
epsilon. That is at least one XI is either

197
00:18:08,438 --> 00:18:14,419
a terminal or it is a variable that we do
not need to remove from the body. Thus in

198
00:18:14,419 --> 00:18:19,679
the new grammar, the first step can
replace A by a body consisting of all

199
00:18:19,679 --> 00:18:25,479
those XI's such that WI is not epsilon. We
can continue the derivation in the new

200
00:18:25,479 --> 00:18:31,162
grammar by a derivation from each XI that
remains of the non-empty string WI. We

201
00:18:31,162 --> 00:18:36,428
know this derivation in the new grammar
exists by the inductive hypothesis. We

202
00:18:36,428 --> 00:18:41,626
also need to show part two, if W is
derived from A in the new grammar then it

203
00:18:41,626 --> 00:18:48,857
is non-empty and also derived in the old.
We're going to skip this part. So we're

204
00:18:48,857 --> 00:18:53,349
now ready for new simplification of
grammars the eliminations of unit

205
00:18:53,349 --> 00:19:02,229
productions. A unit production is one
whose body is a single variable. We can

206
00:19:02,229 --> 00:19:08,070
eliminate all such productions. The, the
idea is to discover using a discovery

207
00:19:08,070 --> 00:19:13,689
algorithm, all pairs of variables AB. Such
that A derives B by a sequence of unit

208
00:19:13,689 --> 00:19:19,238
productions. Eventually, a sequence of
unit productions must end with the use of

209
00:19:19,238 --> 00:19:24,365
some other kind of production. So we can
collapse the steps that use unit

210
00:19:24,365 --> 00:19:29,914
productions into the next one that does
not use a unit production. That is if B

211
00:19:29,914 --> 00:19:35,813
goes to alpha is a non-unit production and
A derives alpha by unit productions, then

212
00:19:35,813 --> 00:19:42,928
we'll add A goes to alpha. Finally, we
drop all unit productions. At this point

213
00:19:42,928 --> 00:19:48,890
we do not need the unit productions and
can drop them from the grammar. Here's the

214
00:19:48,890 --> 00:19:54,707
algorithm that discover the pairs of
variables AB such that A derives B by unit

215
00:19:54,707 --> 00:20:01,237
productions. For the basis, surely A
derives A by unit productions only. This

216
00:20:01,237 --> 00:20:07,107
is a sequence of zero steps. For the
induction, suppose we already discover

217
00:20:07,107 --> 00:20:13,533
that A derives B by unit productions, then
if B goes to C is a unit production, we

218
00:20:13,533 --> 00:20:20,102
may add the pair AC. There are two proofs
that we need but we're not going to do

219
00:20:20,102 --> 00:20:25,756
them here. First, an induction on the
number of rounds in the discovery process,

220
00:20:25,756 --> 00:20:31,009
let's just show that the pairs we find are
valid. That is, they really are pairs AB

221
00:20:31,009 --> 00:20:36,832
such that A derives B by unit productions.
Then conversely we can show by an

222
00:20:36,832 --> 00:20:41,602
induction on the number of steps of a
derivation from A to B, using unit

223
00:20:41,602 --> 00:20:47,835
productions, that we will in fact discover
the pair AB. Another proof that we're

224
00:20:47,835 --> 00:20:52,940
going to skip is the proof that in new
grammar has the same language as the old.

225
00:20:52,940 --> 00:20:57,390
Again, we have to prove something more
general, that A derives W in the new

226
00:20:57,390 --> 00:21:02,202
grammar if and only if it does so in the
old grammar. Each production of the new

227
00:21:02,202 --> 00:21:07,074
grammar simulates one or more steps of the
derivation of the old grammar. That is,

228
00:21:07,074 --> 00:21:11,826
some number of unit productions, perhaps
zero, followed by a non-unit production.

229
00:21:11,826 --> 00:21:16,818
Conversely, every use of a production in
the new grammar can be replaced by zero or

230
00:21:16,818 --> 00:21:21,552
more unit productions followed by a
non-unit production in the old grammar. We

231
00:21:21,552 --> 00:21:27,775
can now combine the three simplifications
to make a strong statement about grammars.

232
00:21:27,775 --> 00:21:33,332
If L is a context-free language, then L
minus epsilon has a grammar with no

233
00:21:33,332 --> 00:21:39,101
useless symbols, no epsilon productions
and no unit productions. Another way to

234
00:21:39,101 --> 00:21:44,901
put it is that L minus epsilon has a
grammar in which every production body is

235
00:21:44,901 --> 00:21:52,462
either a single terminal or has length at
least two. You plot the constructions just

236
00:21:52,462 --> 00:21:57,868
learned, but we have to be careful about
the order. You must start by eliminating

237
00:21:57,868 --> 00:22:06,459
epsilon productions. We have to do this
step first because eliminating epsilon

238
00:22:06,459 --> 00:22:12,252
productions can produce unit productions
that weren't there before and as we saw in

239
00:22:12,252 --> 00:22:17,700
an example, it can create useless symbols
that were not useless before. That can

240
00:22:17,700 --> 00:22:23,079
only occur if the production was only used
to derive the empty string, however.

241
00:22:23,079 --> 00:22:29,057
Second, eliminate the unit productions.
Third, eliminate variables that derive no

242
00:22:29,057 --> 00:22:34,092
terminal strings. We explained earlier why
this step must precede the next step in

243
00:22:34,092 --> 00:22:38,759
our quest to eliminate all useless
symbols. So finally we do the fourth step

244
00:22:38,759 --> 00:22:45,042
of eliminating all the unreachable
symbols. We've almost got our grammars

245
00:22:45,042 --> 00:22:50,587
into Chomsky Normal Form or CNF. Such a
grammar has only two kinds of production

246
00:22:50,587 --> 00:22:56,083
bodies: single terminals or two variables.
We're now going to give the construction

247
00:22:56,083 --> 00:23:01,110
of the CNF grammar for L minus epsilon,
where L is any context-free grammar.

248
00:23:01,110 --> 00:23:05,328
Incidentally, one important use of putting
grammars in CNF is it gives us a

249
00:23:05,328 --> 00:23:10,053
relatively efficient algorithm for testing
membership in a string of a context-free

250
00:23:10,053 --> 00:23:14,609
language. One might imagine that it was
easy to make such a test while looking at

251
00:23:14,609 --> 00:23:18,940
all derivations of a certain limited
length. But with epsilon productions and

252
00:23:18,940 --> 00:23:23,665
unit productions in the grammar, it is not
obvious how long the derivation of even a

253
00:23:23,665 --> 00:23:28,350
short string of terminals has to be.
Moreover even if we can bound the light

254
00:23:28,350 --> 00:23:33,166
for the derivation as we can Would still
be faced with an algorithm that took an

255
00:23:33,166 --> 00:23:37,719
amount of time that is exponential in the
length of the terminal string. Rather by

256
00:23:37,719 --> 00:23:41,994
putting the grammar in CNF we can make
this test and at most, we can make the

257
00:23:41,994 --> 00:23:49,596
length of the string. Our first step is to
the cleaning we just described. The result

258
00:23:49,596 --> 00:23:55,591
is the grammar no longer the empty string
even if the old one did. But otherwise the

259
00:23:55,591 --> 00:24:01,586
languages are the same. But all production
bodies are either single terminal or have

260
00:24:01,586 --> 00:24:07,367
length at least two. Our second step is to
turn those bodies that aren't a single

261
00:24:07,367 --> 00:24:12,934
terminal into bodies into all variables.
The trick is simple, for each terminal

262
00:24:12,934 --> 00:24:18,981
little A, create a new variable that we'll
call A sub A. To that. The job of this

263
00:24:18,981 --> 00:24:25,127
variable is just to generate the terminal
little A. So replace A in all bodies of

264
00:24:25,127 --> 00:24:31,349
length two or more by A sub A and add the
production A sub A goes to A. That is, of

265
00:24:31,349 --> 00:24:38,811
course, that. Here's an example involving
the production A goes to B little CB

266
00:24:38,811 --> 00:24:47,287
little E C and D are terminals so we must
have created variables A sub C and A sub E

267
00:24:47,287 --> 00:24:59,011
with their productions A sub C and A sub E
goes to E. Then we can replace A goes to

268
00:24:59,011 --> 00:25:10,796
BCDE by A goes to B A sub C D A sub E. Now
all production bodies that aren't a single

269
00:25:10,796 --> 00:25:16,081
terminal are strings of two or more
variables. If exactly two that's great

270
00:25:16,081 --> 00:25:22,009
because that's what CNF requires. But if a
body consists of three or more terminals,

271
00:25:22,009 --> 00:25:28,240
we have to break the body into pieces into
variables that appear nowhere else. An

272
00:25:28,240 --> 00:25:35,699
example should make the idea clear. If we
have a production AB goes to BCDE, Then we

273
00:25:35,699 --> 00:25:41,157
need two new variables say F and G and
they are used only for this production.

274
00:25:41,157 --> 00:25:46,755
They may not appear in the new grammar
except as we describe here. In general if

275
00:25:46,755 --> 00:25:52,493
the body consists of N variables, we need
N minus two new variables. The job of the

276
00:25:52,493 --> 00:25:58,161
first new variable F is to generate the
whole body except for the first symbol, B

277
00:25:58,161 --> 00:26:04,965
in this case. That is we'd replace this A
production by A goes to BF. Now F needs to

278
00:26:04,965 --> 00:26:14,016
derive CDE that's the rest of the body.
But that's too long so we use G to help. G

279
00:26:14,016 --> 00:26:23,369
derives only BE, that's this production,
using the production G goes to DE and the

280
00:26:23,369 --> 00:26:33,181
production for F becomes F goes to CG Plus
we've replaced A goes to BCDE by the three

281
00:26:33,181 --> 00:26:40,531
productions that meet the CNF requirement.
A goes to BF, F goes to CG and G goes to

282
00:26:40,531 --> 00:26:45,864
DE. These are the three productions can
simulate the effect of the original

283
00:26:45,864 --> 00:26:50,562
production although they take three steps
to do so. Thus, making this change surely

284
00:26:50,562 --> 00:26:55,089
allows the new grammar to generate
anything the old one did. But the new

285
00:26:55,089 --> 00:27:00,827
grammar doesn't generate anything the old
one didn't, so the languages are the same.

286
00:27:00,827 --> 00:27:06,357
The reason is that once we choose to
replace A by BF we are forced to replace F

287
00:27:06,357 --> 00:27:11,956
by CG and then G by DE because these two
variables have only one production. Thus

288
00:27:11,956 --> 00:27:17,763
using A goes to BF in the new grammar is
[inaudible] to using A goes to BCDE in the

289
00:27:17,763 --> 00:27:22,106
old We thus have an argument why the
transformation from clean grammars from

290
00:27:22,106 --> 00:27:26,396
clean grammars to CNF grammars did not
change the language. You can do formal

291
00:27:26,396 --> 00:27:30,853
inductions on the length of derivations in
these grammars but I hope these less

292
00:27:30,853 --> 00:27:32,636
formal arguments are convincing.