1
00:00:00,000 --> 00:00:05,581
The parse tree is a graph that shows how a
particular string was derived using some

2
00:00:05,581 --> 00:00:10,964
context free grammar. These trees have a
direct connection to left most and right

3
00:00:10,964 --> 00:00:15,881
most derivations which we shall explain
now. One important use of the tree

4
00:00:15,881 --> 00:00:21,065
representation of derivations is it let's
us express the important concept of

5
00:00:21,065 --> 00:00:25,783
ambiguity in grammars. That is,
unambiguity or the ability for a grammar

6
00:00:25,783 --> 00:00:32,237
to provide a unique tree structure for
every string in its language is vital. For

7
00:00:32,237 --> 00:00:37,314
example if a programming line which has
ambiguity, then there are programs with

8
00:00:37,314 --> 00:00:42,649
two or more distinct meanings. The same is
true of a grammar for an actual language,

9
00:00:42,649 --> 00:00:47,791
it would be insufficient to provide an
intended meaning for all valid sentences

10
00:00:47,791 --> 00:00:55,141
of the language. Parse trees for a
[inaudible] tree are trees whose nodes are

11
00:00:55,141 --> 00:01:01,360
each associated with a symbol of G. I'm
going to draw you a little tree here.

12
00:01:06,480 --> 00:01:15,192
Okay. Leaves are always labeled by either
a terminal or by epsilon, so we might draw

13
00:01:15,192 --> 00:01:23,800
for example an A on that one a terminal B
on that one, and perhaps epsilon on that.

14
00:01:29,580 --> 00:01:37,372
Interior nodes are labeled by a variable.
Here are some examples. The important

15
00:01:37,372 --> 00:01:43,857
property that makes a tree a parse tree is
that there is a production of the grammar

16
00:01:43,857 --> 00:01:50,114
with the label of the node in question as
its head and the labels of its children

17
00:01:50,114 --> 00:01:59,888
read left to right as its body. Okay? For
example, if you look at this parse tree we

18
00:01:59,888 --> 00:02:07,004
would infer that A goes to BC is a
production also B goes to little A, little

19
00:02:07,004 --> 00:02:15,024
B is a production and C goes to epsilon is
a production. The root must be labeled by

20
00:02:15,024 --> 00:02:21,716
the start symbol. In our example tree here
I'm going to have to guess that if this is

21
00:02:21,716 --> 00:02:29,472
the root, then "A" is in fact the start
symbol. Here's a parse tree using the

22
00:02:29,472 --> 00:02:35,563
grammar for the balanced parentheses that
we discussed earlier. Notice how each

23
00:02:35,563 --> 00:02:42,861
interior node is labeled by a variable.
Here, here, here and here. Must be asked

24
00:02:42,861 --> 00:02:49,393
because that's the only variable we have.
Each leaf is labeled by a terminal, either

25
00:02:49,393 --> 00:02:55,059
a left or right parenthesis, these are
here, here, here, here, here and here.

26
00:02:55,059 --> 00:03:01,433
Okay, the production of the root is S goes
to [sound]. You can sort of see that by

27
00:03:01,433 --> 00:03:11,495
looking at the root and its two children
here's another interior node and the

28
00:03:11,495 --> 00:03:18,212
labels of its children are left paren as
right paren and you know that left paren

29
00:03:18,212 --> 00:03:24,929
as right paren is another production of
the grammar that's right here. And here we

30
00:03:24,929 --> 00:03:31,892
have an example of a interior node labeled
S and its children labeled left paren and

31
00:03:31,892 --> 00:03:38,199
right paren that corresponds to this
production body right here. You have the

32
00:03:38,199 --> 00:03:44,933
same exact use of that production, here.
Okay. The yield of the par street is the

33
00:03:44,933 --> 00:03:51,521
string of labels and the leaves from the
left. This order of leaves is the order

34
00:03:51,521 --> 00:03:58,356
you visit them in during a pre-order
traversal, path which would look something

35
00:03:58,356 --> 00:04:05,447
like this. That's sort of a, pre order
transversal path goes around the tree,

36
00:04:05,705 --> 00:04:12,842
counter-clockwise. The order in which you
visit the leaves is the order in which

37
00:04:12,842 --> 00:04:19,894
their labels appear in the yield. So here
the yield is left paren, left paren, right

38
00:04:19,894 --> 00:04:26,091
paren, right paren, left paren, right
paren. That's what we said here. In what

39
00:04:26,091 --> 00:04:31,438
follows, we're also going to talk about
parse trees with a route A that may not be

40
00:04:31,438 --> 00:04:36,720
the start symbol. These trees follow all
the other requirements of the parse tree.

41
00:04:36,720 --> 00:04:41,741
The leaves are labeled by terminals or
epsilon and an interior node with its

42
00:04:41,741 --> 00:04:50,539
children form a production of the grammar.
We're going to show how to convert

43
00:04:50,539 --> 00:04:55,359
[inaudible] to leftmost to rightmost
derivations and vice versa. As is often

44
00:04:55,359 --> 00:05:00,368
the case, a [inaudible] made easier by
proving something more general than what

45
00:05:00,368 --> 00:05:06,818
we really need. In this case we'll talk
about generalized parts tree that can have

46
00:05:06,818 --> 00:05:11,880
any root labels, say "A". And we'll talk
about left most and right most irrevations

47
00:05:11,880 --> 00:05:17,454
for many variable "A". We'll prove two
statements... One is that there is a parse

48
00:05:17,454 --> 00:05:23,104
tree with root A and the yield W. And
there is a left most derivation of W from

49
00:05:23,104 --> 00:05:28,754
A [inaudible]. The second statement is the
converse, that for every left most

50
00:05:28,754 --> 00:05:34,618
derivation of W from A there is a general
bias parse tree with root A and yield W.

51
00:05:34,618 --> 00:05:40,340
The matter of right most derivations is
completely analogous. We will start with

52
00:05:40,340 --> 00:05:46,276
point one by showing how to convert parse
trees to left most derivations. The proof

53
00:05:46,276 --> 00:05:51,822
is an induction on the height of the tree.
The basis is height one. That is a tree

54
00:05:51,822 --> 00:05:56,646
consisting of a root label by some
variable, A, and one or more leaves. They

55
00:05:56,646 --> 00:06:01,927
must be a production with head A and body
being in the labels of the leaves from

56
00:06:01,927 --> 00:06:08,656
left to right. That's A one though A N in
this picture. Then, A derives the yield of

57
00:06:08,656 --> 00:06:15,071
the tree by the one step left most
[inaudible] in which this production is

58
00:06:15,071 --> 00:06:21,657
used. For the induction we'll assume
statement one for height less than H and

59
00:06:21,657 --> 00:06:27,906
prove it for H That is, we assume for
every parse tree with root A and height of

60
00:06:27,906 --> 00:06:33,998
up to H minus one there is a leftmost
derivation of its yield from its root. Now

61
00:06:33,998 --> 00:06:40,266
let's look at a parse tree of height H.
The case H=1 is the basis, so H is at

62
00:06:40,266 --> 00:06:46,407
least two. Thus the production used at the
root has at least one variable on its

63
00:06:46,407 --> 00:06:52,241
right side. Say this tree has root A and
the children of the root labeled X1

64
00:06:52,241 --> 00:06:58,306
through Xn. If Xi is variable, then it is
the root of some sub tree of height at

65
00:06:58,306 --> 00:07:03,995
most h minus one and some yield, say Wi.
If XI is a terminal or epsilon then we

66
00:07:03,995 --> 00:07:09,524
imagine XI is a root of a trivial tree,
not a parse tree, that consists only of a

67
00:07:09,524 --> 00:07:16,290
node labeled XI. The yield of this tree is
just Xi itself, but for consistency we

68
00:07:16,290 --> 00:07:22,659
shall refer to Xi as the stream of
terminals, Wi. The yield of the entire

69
00:07:22,659 --> 00:07:29,823
tree is W, which is W1 through Wn. We can
put together a left most derivation of W

70
00:07:29,823 --> 00:07:37,076
from A as follows. Start by applying the
production at the root, so the second step

71
00:07:37,076 --> 00:07:44,783
is X1, X2 up to Xn. That's this. Okay, now
we need the left most derivation of W one

72
00:07:44,783 --> 00:07:52,512
from X one. X one is the terminal that
equals W one so we're already there. And

73
00:07:52,512 --> 00:08:01,656
this step is then really zero step on the
other hand, if X1 is a variable and we

74
00:08:01,656 --> 00:08:08,920
apply the inductive hypothesis, the sub
tree with root X1 and yield W1 has height

75
00:08:08,920 --> 00:08:15,827
at most H minus one, so there is a left
most derivation at W1 from X1. In that

76
00:08:15,827 --> 00:08:23,181
case, this X1 has been replaced now left
most derivation by W1 and this, goes to

77
00:08:23,181 --> 00:08:30,427
star represents all the steps that were
necessary to replace X1 by W1. Either way

78
00:08:30,427 --> 00:08:37,466
we can conclude that there is a left most
derivation of W1 X2 through Xn, that's

79
00:08:37,466 --> 00:08:43,701
this. We can continue arguing this way For
each XI in turn, either it is a terminal,

80
00:08:43,701 --> 00:08:48,894
in which case nothing needs to be done, or
it's a variable in which case we can

81
00:08:48,894 --> 00:08:54,020
perform a leftmost derivation from WI from
XI. The needed sequence of leftmost

82
00:08:54,020 --> 00:08:59,278
derivations is a leftmost derivation of
the entire yield W. That's W1 through Wn.

83
00:08:59,475 --> 00:09:04,601
From A. Now we need to prove the other
direction. That leftmost derivations can

84
00:09:04,601 --> 00:09:10,057
be turned into parse trees. The proof is
an induction on the number of steps of the

85
00:09:10,057 --> 00:09:15,811
derivation. The bases of one step
derivation. In this derivation a variable

86
00:09:15,811 --> 00:09:20,538
A is replaced by a string of terminals,
perhaps the empty string, using some

87
00:09:20,538 --> 00:09:25,644
production for A. If the string is not
empty then there is a parse tree of height

88
00:09:25,644 --> 00:09:31,065
one with A at the root and the sequence of
terminals in the body as the labels of the

89
00:09:31,065 --> 00:09:36,623
children. This and there are the children.
If the body is epsilon then there's again

90
00:09:36,623 --> 00:09:42,501
a porous tree it has A at the root and one
leaf labeled epsilon so it just sort of

91
00:09:42,501 --> 00:09:47,584
looks like this. Now let's do the
inductive step. We assume that left most

92
00:09:47,584 --> 00:09:53,041
derivations of fewer than K steps can be
turned into parse trees with the proper

93
00:09:53,041 --> 00:09:57,960
yield. We will consider a K step
derivation from string W from variable A

94
00:10:00,420 --> 00:10:06,581
The first step of this derivation uses a
production with head A and body X1 through

95
00:10:06,581 --> 00:10:12,816
Xn for some n. Thus the second [inaudible]
form in this derivation of W is X1 through

96
00:10:12,816 --> 00:10:18,684
Xn, that's that. Remember the important
properties of derivations. That production

97
00:10:18,684 --> 00:10:24,405
bodies replace production heads at the
position where the head is. Thus we can

98
00:10:24,405 --> 00:10:29,760
divide W into the concatenation of n
strings, W1 through Wn, in that order.

99
00:10:31,240 --> 00:10:38,689
Where for all I, WI either is XI, if XI is
a terminal. Or XI derives WI by a leftmost

100
00:10:38,689 --> 00:10:45,511
derivation of fewer than K steps. Since
the whole derivation is leftmost, the

101
00:10:45,511 --> 00:10:52,781
derivation WI from XI must happen first.
Then the derivation of W2 through X2 and

102
00:10:52,781 --> 00:10:58,842
so on. So we know that each XI either is
WI or derives WI in fewer than K steps at

103
00:10:58,842 --> 00:11:04,220
the leftmost derivation. So the second
case, where XI is a variable and the

104
00:11:04,220 --> 00:11:09,888
leftmost derivation takes one or more
steps, the inductive hypothesis tells us

105
00:11:09,888 --> 00:11:18,172
that there is a parse tree with root XI
and yield WI. Plus we can build the parse

106
00:11:18,172 --> 00:11:25,999
tree shown... That is this. The root uses
the first production of the derivation as

107
00:11:25,999 --> 00:11:34,016
A goes to X1 through Xn and the Ith child
either is Wi if Xi is a terminal or it is

108
00:11:34,016 --> 00:11:41,843
the root of a parse tree from Xi deriving
Wi. The yield of this tree is W1 through

109
00:11:41,843 --> 00:11:49,577
Wn which is W. Okay, that is, this string
W is W1 through Wn. All these strings are

110
00:11:49,577 --> 00:11:57,052
derived left to right in that order. This
proves the inductive step and we conclude

111
00:11:57,052 --> 00:12:03,260
there is a leftmost derivation of W from
A. ... And there is a parts true with root

112
00:12:03,260 --> 00:12:08,136
"A" and a yield "W". It is also true that
if there is a right most irrevations of

113
00:12:08,136 --> 00:12:12,760
"W" from "A", then there is a parts true
of root "A" and yield "W" and conversely.

114
00:12:12,760 --> 00:12:17,763
But we are not going to prove that part,
the proofs are essentially the same as

115
00:12:17,763 --> 00:12:22,450
what we have seen. And in fact, any
derivation even the one that isn't left

116
00:12:22,450 --> 00:12:27,326
most or right most of a terminal strain
"W" from variable "A" implies that there

117
00:12:27,326 --> 00:12:33,094
is a parts true with root "A" and yield
"W". The first step of any derivation from

118
00:12:33,094 --> 00:12:38,519
A must use a production and replace A from
some string of terminals and variables.

119
00:12:38,519 --> 00:12:43,649
Say, X1 through Xn. If W's the terminal
string ultimately derived, then we can

120
00:12:43,649 --> 00:12:49,218
still break W into W1 through Wn, where
each WI either is XI or is derived from XI

121
00:12:49,218 --> 00:12:54,923
by fewer steps than the whole derivation.
The only tricky part is that now the steps

122
00:12:54,923 --> 00:13:00,153
of the derivation from XI and XJ may be
intermingled and we have to sort the

123
00:13:00,153 --> 00:13:05,858
derivation steps according to which XI is
descendant as being replaced at that step.

124
00:13:05,858 --> 00:13:10,952
We'll leave you to think through the
details of this construction of parse

125
00:13:10,952 --> 00:13:15,448
trees. As we mentioned at the beginning,
it is important in many applications,

126
00:13:15,448 --> 00:13:20,353
including parsing of programming languages
In a compiler and parsing of natural

127
00:13:20,353 --> 00:13:25,270
language sentences that we use a
context-free grammar that assigns a unique

128
00:13:25,270 --> 00:13:30,705
parse tree to each string of the language.
We say a grammar is ambiguous if there is

129
00:13:30,705 --> 00:13:35,945
at least one string in its language that
has two different parse trees. Otherwise

130
00:13:35,945 --> 00:13:41,489
it is unambiguous. The grammar for
balanced parentheses that we have been

131
00:13:41,489 --> 00:13:47,700
using is ambiguous, alas. I'll show you
two parse trees for this balance string,

132
00:13:47,700 --> 00:13:53,751
that is left, right, left, right, left,
right on the next slide. Notice that each

133
00:13:53,751 --> 00:14:00,440
of these trees has the same yield, that is
three pairs of left and right parentheses.

134
00:14:01,400 --> 00:14:06,342
However they're evidently not the same
tree. The first replaces the first child

135
00:14:06,342 --> 00:14:11,221
of the root by [sound]. And the second
does the same but with the second child.

136
00:14:11,221 --> 00:14:16,414
Notice that the two constructions we just
gave, leftmost derivations from trees and

137
00:14:16,414 --> 00:14:21,481
trees from leftmost derivations have the
property that two different parse trees

138
00:14:21,481 --> 00:14:26,681
produce different leftmost derivations and
conversely The same is true for the

139
00:14:26,681 --> 00:14:31,200
analogous transformations between
rightmost derivations in trees Thus we

140
00:14:31,200 --> 00:14:35,790
could also define a grammar to be
ambiguous if it has a string with two

141
00:14:35,790 --> 00:14:40,252
different leftmost derivations or two
different rightmost derivations.

142
00:14:40,252 --> 00:14:45,608
Fortunately just because one grammar for a
language is ambiguous does not mean that

143
00:14:45,608 --> 00:14:50,835
we can't find another grammar for the same
language that is unambiguous. But as an

144
00:14:50,835 --> 00:14:55,362
aside, which we'll get to later, the
opposite is not true either. That is,

145
00:14:55,362 --> 00:15:00,080
there are some ambiguous to which no
unambiguous equivalent exists. Anyway.

146
00:15:00,080 --> 00:15:06,050
Here is a grammar for balanced parenthesis
that is unambiguous. Variable D, which is

147
00:15:06,050 --> 00:15:12,203
the star symbol, generates all balanced
strain But R, another variable generates

148
00:15:12,203 --> 00:15:17,931
all strings that are balanced except for
having one more right parenthesis than

149
00:15:17,931 --> 00:15:25,090
left. By balance in this context I mean
that no prefix of the string has more

150
00:15:25,090 --> 00:15:33,023
right parentheses than left. And examples
would include, well, a single, right

151
00:15:33,023 --> 00:15:40,484
paren. Left, right, right that would be
generated by R and also something like

152
00:15:40,484 --> 00:15:48,001
this: left, left, right, left, right,
right, right. Here's the unambiguous

153
00:15:48,001 --> 00:15:52,946
grammar for balanced parentheses again.
Suppose we're given a string of

154
00:15:52,946 --> 00:15:58,030
parentheses which we'll call the input.
And we want to find this leftmost

155
00:15:58,030 --> 00:16:03,462
derivation or derivations. We claim that
for every input symbol and for either

156
00:16:03,462 --> 00:16:09,242
variable B or R there is only one choice
of production that could possibly lead to

157
00:16:09,242 --> 00:16:14,535
a leftmost derivation of the input. That
is, no string of parentheses has two

158
00:16:14,535 --> 00:16:19,480
distinct leftmost derivations and
therefore the grammar is unambiguous.

159
00:16:22,300 --> 00:16:27,074
Imagine we are constructing the left
sentential form as we scan the input. As

160
00:16:27,074 --> 00:16:32,159
we go, the terminals to the left of the
left most variable must match the input or

161
00:16:32,159 --> 00:16:38,543
we'll never derive that input string of
terminals. So we can check out input

162
00:16:38,543 --> 00:16:46,819
symbols as we match them Notice that the
only place B can ever appear in a left and

163
00:16:46,819 --> 00:16:54,415
central form is at the right end. That is
because the only production with D in the

164
00:16:54,415 --> 00:17:01,827
body, has lead [inaudible] and it also has
head B that is this B can only go to a

165
00:17:01,827 --> 00:17:09,423
string that has a B in it, and the only
way a B can come, can appear is if it was

166
00:17:09,423 --> 00:17:18,814
either the start symbol. Or it was the
result of replacing this B at the end by

167
00:17:18,814 --> 00:17:26,529
the, paren RB in that production. It does
an easy induction on the number of times

168
00:17:26,529 --> 00:17:32,346
this production is applied so that the B
is still at the right end Now suppose B is

169
00:17:32,346 --> 00:17:38,480
the leftmost variable of a left sentential
form. If there's a left parenthesis as the

170
00:17:38,480 --> 00:17:43,891
next unmatched input, and we have to
expand the B using this production the

171
00:17:43,891 --> 00:17:49,448
paren RB. Because if we use epsilon then
the left sentential form has no more

172
00:17:49,448 --> 00:17:55,365
variables and we can never generate the
unmatched left paren. The only time we can

173
00:17:55,365 --> 00:18:01,002
use epsilon to replace B is this
production. Is when we have matched the

174
00:18:01,002 --> 00:18:06,840
entire input and we have found the unique
left most derivation. If R is the left

175
00:18:06,840 --> 00:18:12,312
most variable then the next input symbol
forces us to use one of the two R

176
00:18:12,312 --> 00:18:18,317
productions, that is if the next input
symbol is right paren You must use R goes

177
00:18:18,317 --> 00:18:25,069
to right paren. That's here. And if the
next input is a left paren you must use R

178
00:18:25,069 --> 00:18:31,252
goes to left paren followed by two rights.
There's never an option on a left

179
00:18:31,252 --> 00:18:37,909
sentential form you're deriving can never
match the input string. Here is an

180
00:18:37,909 --> 00:18:43,148
example. Suppose we want to find the
unique left most derivation for this

181
00:18:43,148 --> 00:18:48,818
string of parentheses. That's this string
right here. We set the string up as an

182
00:18:48,818 --> 00:18:54,416
input with a pointer to the next symbol
that must be matched in the left most

183
00:18:54,416 --> 00:18:59,943
derivation. Okay. We start off with the
left sentential form that is the start

184
00:18:59,943 --> 00:19:07,255
symbol B alone. The next input to match is
the left [inaudible] and we must therefore

185
00:19:07,255 --> 00:19:17,573
expand this B, to left [inaudible] RB,
that is using that production. Here we've

186
00:19:17,573 --> 00:19:23,563
made the expansion. The first left paren
has been removed from the input, since it

187
00:19:23,563 --> 00:19:29,554
has been matched. That is it appears to
the left of the left most variable in the

188
00:19:29,554 --> 00:19:35,692
current left [inaudible] form. In essence,
this left paren is what used to be on the

189
00:19:35,692 --> 00:19:43,748
input there. Our next input symbol is
another left paren and we have to expand

190
00:19:43,748 --> 00:19:52,233
an r, the left most variable. That's this
guy. Our only choice is to use the body

191
00:19:52,233 --> 00:20:05,520
"R" goes to left [inaudible] "R", "R"
[sound]. So we're going to do that. Okay.

192
00:20:05,520 --> 00:20:11,988
That's what happened here. The R was
replaced by paren RR and the second left

193
00:20:11,988 --> 00:20:18,960
paren has been matched. It appears to the
left of the leftmost variable as here. Now

194
00:20:18,960 --> 00:20:26,016
next input is a right paren and we must
expand the leftmost R. The only, choice is

195
00:20:26,016 --> 00:20:33,492
the first production. That will enable us
to match. Now the third input has been

196
00:20:33,492 --> 00:20:40,164
matched. We have R to expand. And right
paren has the next input. So we again use

197
00:20:40,164 --> 00:20:46,499
R goes to the right paren. That is we
expand that R, replacing it by a right

198
00:20:46,499 --> 00:20:56,648
paren. That's what's going to happen. On
the next slide. You have the left paren as

199
00:20:56,648 --> 00:21:02,320
the next input and we must expand B, the
choice is the first production for B

200
00:21:02,780 --> 00:21:13,163
[inaudible] now, "R" is the left most
variable, we are up here now and the next

201
00:21:13,163 --> 00:21:17,980
input is the right [inaudible], so we
replace the "R" of our right [inaudible],

202
00:21:18,400 --> 00:21:31,127
[inaudible] using the first production for
R, that's that and lets see Know there is

203
00:21:31,127 --> 00:21:37,771
no more input to match and B must be
expanded, that's B, the left most

204
00:21:37,771 --> 00:21:45,744
variable. The only choice is to use the
second B production, which is that. And we

205
00:21:45,744 --> 00:21:51,592
effectively make the B disappear. And
we're done. We have a leftmost derivation

206
00:21:51,592 --> 00:21:56,915
of the original input, which of course,
appears as the final step of the

207
00:21:56,915 --> 00:22:02,902
derivation. Had we deviated from the
choices we made at any step the, the

208
00:22:02,902 --> 00:22:08,785
result of the left most derivation would
not have been this input strain of

209
00:22:08,785 --> 00:22:16,549
terminals There is a name for grammar such
as our example where it is always possible

210
00:22:16,549 --> 00:22:21,804
to choose unique production to use in a
leftmost derivation of any string in the

211
00:22:21,804 --> 00:22:26,863
language, in the simple manner that we
illustrated. At each step we looked only

212
00:22:26,863 --> 00:22:32,442
at the first unmatched symbol of the input
and we were able to make the unique choice

213
00:22:32,442 --> 00:22:37,112
correctly. Such a grammar is called
[laugh] of one. Standing for leftmost

214
00:22:37,112 --> 00:22:42,107
derivation, that's the first L, left to
right scan, that's the second L, and one

215
00:22:42,107 --> 00:22:46,471
symbol of look ahead. It is normal for a
programming language to have an

216
00:22:46,471 --> 00:22:51,120
[inaudible] of one grammar. Probably as
this theory became widely understood by

217
00:22:51,120 --> 00:22:56,122
designers of programming language as they
saw the advantages of keeping the language

218
00:22:56,122 --> 00:23:01,380
parse-able in this simple way and made
choices to preserve this ability. In the

219
00:23:01,380 --> 00:23:06,311
same argument we gave for our example
grammar tells us all [inaudible] one

220
00:23:06,311 --> 00:23:11,374
grammars are unambiguous. And remember
unambiguity is vital for a programming

221
00:23:11,374 --> 00:23:16,108
language as we must assign a unique
meaning to any legal program of the

222
00:23:16,108 --> 00:23:22,032
language. For balanced parens we found the
first, simplest grammar that we wrote down

223
00:23:22,032 --> 00:23:27,259
was ambiguous but we were able to redesign
the grammar to make unambiguous. One might

224
00:23:27,259 --> 00:23:34,636
hope for an algorithm that would do that
for any ambiguous grammar But, alas, there

225
00:23:34,636 --> 00:23:40,695
can not be such an algorithm. There are
certain contexts for languages for which

226
00:23:40,695 --> 00:23:48,113
every grammar is ambiguous. Such languages
are called inherently ambiguous. We're not

227
00:23:48,113 --> 00:23:53,734
going to get into the proofs of what I
just said, but I'll give you an example of

228
00:23:53,734 --> 00:23:59,217
an inherently ambiguous language. The set
of strings of the form some number of

229
00:23:59,217 --> 00:24:04,560
zeroes followed by some number of ones,
followed by some numbers of twos. Such

230
00:24:04,560 --> 00:24:10,459
that either the numbers of zeroes and ones
are the same. That would be this condition

231
00:24:10,459 --> 00:24:16,219
I equals J. Or the numbers of ones or twos
are the same. That would be J equals K. Or

232
00:24:16,219 --> 00:24:21,370
both, of course. The problem is that any
grammar for this language must generate at

233
00:24:21,370 --> 00:24:25,848
least some of the strings with equal
numbers of all three symbols in two

234
00:24:25,848 --> 00:24:30,511
different ways. In the first derivation,
or parse tree, the grammar forces the

235
00:24:30,511 --> 00:24:35,664
numbers of zeros and ones to be the same
using the same trick we used to generate a

236
00:24:35,664 --> 00:24:41,228
set of strings with the form zero to the
n, one to the n. The grammar can generate

237
00:24:41,228 --> 00:24:46,970
any number of twos to go along but it may
happen to generate the same number of twos

238
00:24:46,970 --> 00:24:52,374
as zeros and ones. The second derivation
of parse tree makes sure the numbers of

239
00:24:52,374 --> 00:24:57,508
ones and twos are the same, but may
accidentally generate the same number of

240
00:24:57,508 --> 00:25:02,559
zeros as well. Here's an example grammar.
It is ambiguous but that doesn't prove the

241
00:25:02,559 --> 00:25:06,658
language is inherently ambiguous. As I
said, we're not going give that proof,

242
00:25:06,658 --> 00:25:11,194
it's very complicated. However you might
wish to play around with grammars from the

243
00:25:11,194 --> 00:25:15,784
same language and see how you are forced
to do something like this grammar in order

244
00:25:15,784 --> 00:25:22,119
to generate exactly the language you want.
To be easy and be familiar to observe that

245
00:25:22,119 --> 00:25:27,738
A will generate all strains with one or
more zeros followed by exactly the same

246
00:25:27,738 --> 00:25:33,286
number of ones. We've seen essential this
pair of productions, yes, as a complete

247
00:25:33,286 --> 00:25:39,282
grammar before and it should also be
obvious that "B" generates any string of

248
00:25:39,282 --> 00:25:45,911
one or more two's and nothing else
Likewise C generates the strings of one or

249
00:25:45,911 --> 00:25:55,350
more zeros. And D generates one or more
ones followed by exactly the same number

250
00:25:55,350 --> 00:26:04,496
of twos. Now, all derivations start with
"S" and the first production replaces it

251
00:26:04,496 --> 00:26:11,182
by either "AB" or CD. If we go with AB,
then we get strings where the zeros and

252
00:26:11,182 --> 00:26:17,399
ones match, while if we go with CD, then
the ones and twos match. As a result,

253
00:26:17,399 --> 00:26:23,782
strings like 012 with the same number of
each symbol will have two left most

254
00:26:23,782 --> 00:26:30,247
derivations. One starting with S goes to
AB, that's this, and the other starting

255
00:26:30,247 --> 00:26:38,460
with S goes to CD. Here are the two
derivations for zero, one, two.