Good morning, good afternoon, good evening, or good whatever-or-whenever you are actually listening to this. Today we're going to start on a completely new tack, where we learn about a second important class of languages beyond the regular languages. This class is the context-free languages. They include all the regular languages and more. The most common description of these languages is a context-free grammar. We'll see the definition of these grammars and the way these grammars define languages, which is through derivations. We're also going to introduce a related notation called Backus-Naur Form.

The two most important applications of context-free grammars are probably the processing of natural languages and of computer languages. I am going to focus on computer languages, where almost every language today has its syntax defined by a context-free grammar in the Backus-Naur notation. Grammars are also essential when designing a parser for the language: that part of the compiler that puts together the tokens of the language into the proper structure. For example, the parser discovers the order in which the arithmetic operators of an arithmetic expression apply, and it groups statements properly so their execution sequence is as intended by the programmer.

A context-free grammar defines a language by a mechanism we will soon learn. Every regular language has a context-free grammar describing it, but there are also languages that can be described by a grammar yet are not regular. On the other hand, the context-free grammar is still a relatively weak formalism: there are some languages that are simple to describe yet have no context-free grammar. Many of the languages in the context-free class that are outside the regular languages involve nested structures, such as the patterns of left and right parentheses that we called balanced.

The central elements of a grammar are variables. These are symbols that generate particular sets of strings. One of these symbols, called the start symbol, will generate the entire language, but we can add many others to help us in that definition. The variables, or rather the sets of strings they generate, are defined recursively in terms of one another. The rules that define the languages of the variables are called productions. Each production has a variable on the left, say A; an arrow; and zero or more symbols on the right, say X and Y, that serve as the definition. A rule like A → XY says that the concatenation of the languages represented by X and Y is a subset of the language represented by A. A variable may have several productions, and its language is thus defined to be the union of the languages described by the right side of each of its productions. But remember, all of this may be recursive, so grammars are in fact far more powerful than the regular expressions we can build from unions and concatenations.

For our first example, let's consider a language that we showed not to be a regular language: the set of strings of the form 0^n1^n, some number of 0's followed by the same number of 1's. Here is a grammar for this language. It has two productions, or rules, and only one variable, S. The first production, S → 01, is a basis rule. It says that the string 01 is in the language of S. The second, S → 0S1, is an inductive rule. It says that if w is a string in the language of S, then 0w1 is also in the language. That rule lets us build longer and longer strings, at each step adding one 0 to the beginning and one 1 to the end. So we always have the same number of 0's and 1's, with the 0's preceding the 1's.
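To make the basis and inductive rules concrete, here is a minimal Python sketch (the function name `generate` is my own illustration, not part of the lecture): starting from the basis string 01, each use of the inductive rule wraps one more 0 on the left and one more 1 on the right.

```python
def generate(n: int) -> str:
    """Derive 0^n 1^n: apply the inductive rule S -> 0S1 (n-1) times,
    then finish with the basis rule S -> 01."""
    s = "01"                 # basis: S -> 01
    for _ in range(n - 1):   # induction: S -> 0S1 adds one 0 and one 1
        s = "0" + s + "1"
    return s

print([generate(n) for n in range(1, 4)])   # ['01', '0011', '000111']
```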
Now let's make precise our informal introduction to what context-free grammars look like. The terminals of the grammar are analogous to the input symbols of an automaton; they form the alphabet for the language being defined. The variables, or nonterminals, are something like the states of an automaton, but they are more powerful. One variable is called the start symbol; it is the language of this variable that the grammar defines. The other variables are used as auxiliaries, but we can think of them as internal to the grammar, and their languages are not visible outside. The productions of the grammar, which are akin to the transition function of an automaton, have the form of a variable on the left, sometimes called the head; an arrow; and a string of zero or more symbols, which can be terminals or variables. This string is sometimes called the body of the production.

We have conventions about the letters used for certain symbols and strings. These conventions are more complex than the convention we used to distinguish input symbols (little a, little b, and so on) from strings (w, x, and so on), but they're really important as a reminder of the roles different components play, and they are worth committing to memory early on. First, we use capital letters at the beginning of the alphabet as variables. However, S is also normally a variable; in fact it will be the start symbol in many grammars. On the other hand, lowercase letters at the beginning of the alphabet are terminals. This convention agrees with an earlier convention that made these letters stand for input symbols, since there is a good analogy between the inputs of an automaton and the terminals of a grammar. Capital letters near the end of the alphabet are used for symbols that can be either terminals or variables; we typically don't know which. Lowercase letters at the end of the alphabet stand for strings of terminals only. Again, this matches our earlier convention. And we use Greek letters at the beginning of the Greek alphabet for strings that may consist of both terminals and variables mixed.

We'll design a grammar for the language of strings of the form 0^n1^n. The terminal alphabet is {0, 1}, of course. We need only one variable in this case; we'll call it S. S will be the start symbol — there's no other option, since it is the only variable. And here are the productions, which we explained earlier in our informal discussion: S → 01 and S → 0S1. The first production generates only the string 01, and the second production puts a 0 and a 1 at the beginning and end, respectively, of a shorter string in the language.

A derivation consists of a sequence of strings that typically have both terminals and variables, although they can have only one kind of symbol or even be empty. We start with the string consisting of just the start symbol. At each step, we find a variable to replace, say A. The productions for A, or the A-productions, are those that have A on the left side, that is, as the head of the production. We replace this A by the right side, or body, of one of these productions. We can then repeat the process as many times as we like, until we are left with only terminals, at which point no replacement is possible and we have in fact generated a string that is in the language of the grammar.
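Before looking at the replacement step in detail, here is one way the four parts of a grammar might be represented in code; the dict-and-string representation is my own choice for illustration, not part of the formal definition.

```python
# The 0^n 1^n grammar: terminals, variables, a start symbol, and
# productions mapping each head to the list of its bodies.
cfg = {
    "terminals": {"0", "1"},
    "variables": {"S"},
    "start": "S",
    "productions": {"S": ["01", "0S1"]},   # S -> 01 | 0S1
}

# Sanity check: every head is a variable, and every body is a string
# over the terminals and variables.
for head, bodies in cfg["productions"].items():
    assert head in cfg["variables"]
    for body in bodies:
        assert all(c in cfg["terminals"] | cfg["variables"] for c in body)
```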
Here's what the replacement looks like. You take any string that has an occurrence of some variable A, with anything — terminals and/or variables — to its left and right; that is what the α and β in the string αAβ suggest. You take an A-production whose body is γ, and you replace the A by γ, giving αγβ.

Here is an example derivation. We always start with the string consisting of just the start symbol S. We choose the second production, so S is replaced by 0S1. Now we replace the S, again using the second production, giving 00S11. And again we replace the S, but this time we use the first production, whose body is just 01, giving 000111. We are now left with a string that has only terminals, and this string cannot be subjected to further replacements. You should be aware that in more complicated grammars, variables can be replaced by strings that contain two or more variables; when that happens we have lots of choices of which variable to replace at each step, and derivations can be much more complicated.

The double-arrow symbol ⇒ represents a single step of a derivation, and just as we extended delta to many steps for an automaton, we need a notation that means "any number of steps" of a derivation. This is the arrow-star symbol ⇒*, and we define it inductively. The basis is that in zero steps a string can't change, so any string derives-star itself: α ⇒* α. The inductive step lets us get from α to β using any number of steps, including zero, and then, with one more step, get from β to γ. The conclusion is then that α goes in some number of steps to γ: α ⇒* γ.

Here's an example using the same grammar as before and the same derivation sequence we just discussed. We can conclude that S goes to itself in zero steps — that is, we just start there and don't go anywhere — so S ⇒* S. Then, inductively, S ⇒* 0S1, S ⇒* 00S11, and S ⇒* 000111. Notice also that we don't have to start with the start symbol. It is also true, for example, that 0S1 ⇒* 000111. That is, anything along the derivation sequence derives-star itself and anything at a later position along the sequence.

A sentential form of a grammar is any string derivable from the start symbol; that is, S ⇒* the sentential form. The sentential form can consist of any mix of terminals and nonterminals. The language of the grammar G is the set of terminal strings w such that the start symbol of G derives w.

Here's an example grammar that's just a little different from the previous one. Here the basis rule is S → ε rather than S → 01. Note that ε is a perfectly legitimate body for a production, and its effect is to make the variable on the left disappear. As a result, this grammar can derive the empty string, along with all the other strings that have some number of 0's followed by the same number of 1's. That is, the language of this grammar is {0^n1^n : n ≥ 0}, while the previous grammar had n ≥ 1 as the condition on n.

The class of languages called the context-free languages consists of all those languages that are defined by some context-free grammar. We now see that there are context-free languages that are not regular languages, such as the 0^n1^n example just given. However, there are languages that are not context-free, and the intuition is that context-free grammars can count two things, but not three. Thus an example of a language that is not context-free is {0^n1^n2^n : n ≥ 1}. That is, you cannot match both the 0's against the 1's and the 1's against the 2's at the same time.
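As a quick sanity check, here is the derivation S ⇒ 0S1 ⇒ 00S11 ⇒ 000111 carried out mechanically in Python; the helper `replace` (my own name, not lecture code) performs the single step αAβ ⇒ αγβ described above.

```python
GRAMMAR = {"S": ["01", "0S1"]}            # S -> 01 | 0S1

def replace(form: str, pos: int, body: str) -> str:
    """One derivation step: rewrite alpha A beta as alpha gamma beta,
    where A is the variable at index pos and gamma is the body of
    one of the A-productions."""
    var = form[pos]
    assert var in GRAMMAR and body in GRAMMAR[var], "need an A-production"
    return form[:pos] + body + form[pos + 1:]

print(replace("S", 0, "0S1"))     # 0S1
print(replace("0S1", 1, "0S1"))   # 00S11
print(replace("00S11", 2, "01"))  # 000111, terminals only: S =>* 000111
```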
I want to introduce a notation called BNF, or Backus-Naur Form, which you may have seen in manuals for programming languages, and which is closely related to context-free grammars. BNF has a number of extensions of the grammar notation we use; these extensions are useful in manuals but don't add any power. BNF-style notations were used for two of the original programming languages: John Backus used it in the original description of Fortran, and Peter Naur used it in the original description of Algol.

In BNF you usually use a word to describe a variable, for example ⟨statement⟩, where the intent is that this variable will generate all the strings that are valid statements of the programming language. Conventionally these words are put in angle brackets that tell us they are variables rather than terminals. Some terminals of a programming language are single characters, just as in our formal context-free grammars; for example, the plus sign or the semicolon are commonly terminals in the grammar for a programming language. However, other terminals are really reserved words like "if" or "while", and we see these shown either in bold or underlined, depending on the style used, to remind us that they are single symbols.

In BNF we often find ::= used in productions rather than the arrow. We also find a vertical bar used to list several production bodies that have the same head. That is a useful convention that we shall use in our formal grammars as well. For example, our original grammar can be written with S appearing once on the left side and the bodies of the two S-productions listed with a bar between them: S → 01 | 0S1.

To indicate "one or more occurrences," we follow a symbol or bracketed expression by "...". So here's an example. We have one variable, ⟨digit⟩, with the obvious ten productions all grouped together with bars, one body for each of the digits 0 through 9. Then we have one production for the variable ⟨unsigned integer⟩, with right side ⟨digit⟩..., that is, one or more digits.

In general, we can replace α... by two productions. Let A be a new variable that generates all sequences of one or more α's. A has two productions, with bodies Aα and just α: A → Aα | α. You should be able to see how A can be replaced by a sequence of n α's: just use the first production n−1 times and then the second production once. For example, here is a grammar for unsigned integers, where the BNF ⟨digit⟩... has been replaced by these two productions, with U standing for the unsigned integer and D for the digit: U → UD | D. These generate any sequence of one or more D's. Then D generates each of the ten strings consisting of a single digit.

We can make part of a production body optional if we surround it by square brackets. For example, many programming languages have both an if-then and an if-then-else construct for statements. We can see this as an if-then statement with an optional else clause, and in BNF we put square brackets around the else clause to make it optional. We can replace an optional α by a new variable A. This variable has two productions, one with body α and the other with the empty string as its body: A → α | ε. Thus the α can be either there or not when we expand the new variable A. In the resulting grammar fragment, S → iCtSA, we're using i for "if", t for "then", e for "else", and the semicolon is another terminal standing for itself. S is the start symbol, standing for statement, and C is another variable, representing conditions.
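Here is a sketch of the two BNF-elimination rules just described; the function names and dict representation are my own, not from the lecture. The α... shortcut becomes a fresh variable with productions A → Aα | α, and the optional [α] becomes a fresh variable with productions A → α | ε (the empty string "").

```python
def eliminate_one_or_more(grammar: dict, alpha: str, new_var: str) -> str:
    """Replace BNF 'alpha ...' by new_var -> new_var alpha | alpha."""
    grammar[new_var] = [new_var + alpha, alpha]
    return new_var

def eliminate_optional(grammar: dict, alpha: str, new_var: str) -> str:
    """Replace BNF '[alpha]' by new_var -> alpha | epsilon ("")."""
    grammar[new_var] = [alpha, ""]
    return new_var

# The lecture's unsigned-integer example: <digit>... becomes U -> UD | D.
grammar = {"D": [str(d) for d in range(10)]}   # D -> 0 | 1 | ... | 9
eliminate_one_or_more(grammar, "D", "U")
print(grammar["U"])                            # ['UD', 'D']

# The if-then[-else] example: the optional else clause ';eS' becomes
# A -> ;eS | epsilon, used in the production S -> iCtSA.
eliminate_optional(grammar, ";eS", "A")
print(grammar["A"])                            # [';eS', '']
```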
We really need to add productions for conditions, but I haven't done so in this fragment of a real grammar. Notice that A is a variable standing for an optional else clause. It can be replaced by a semicolon, an "else", and another statement if we want the else clause there, or by ε if we just want an if-then statement with no else clause.

Curly braces are used in BNF for grouping several elements together. You need this, for example, if you want a repeating group of elements in a "..." (one-or-more) construct. It is common in programming languages to allow a statement list to be one or more statements, with the statements separated by semicolons, so there is one fewer semicolon than there are statements. That is, a statement list consists of one statement, followed optionally by one or more groups, each group consisting of a semicolon and a statement. The curly braces form the group; the "..." applied to the group says one or more of these groups; and the square brackets say that the whole thing is optional.

To translate groups to our original notation, just create a new variable A for the group; A has one production whose body is the group. Here's an example of a production that uses all three BNF features: one-or-more, optional, and grouping. It says a list of statements L is a single statement S, optionally followed by one or more groups, each group consisting of a semicolon and a statement. The first thing we'll do is replace the group, the semicolon and S, by a new variable A. A has one production, A → ;S, in which it is replaced by the thing it stands for. Next we'll introduce a new variable B, which stands for the optional A.... B has two productions: it is replaced either by A..., which is one choice, or by the empty string. Finally, we replace the A... in the B-productions by a new variable C. The productions for C, namely C → CA | A, allow C to be replaced by any sequence of one or more A's.

When a sentential form has a number of variables, we can replace any one of them at any step. But what string of terminals a variable ultimately gets replaced by is independent of what else is in the sentential form. That's actually where the term "context-free" comes from. As a result, we have many different derivations of the same string of terminals. We can restore some order to the world by requiring that a particular variable be replaced at each step, although we cannot demand that any particular production be used for the replacement. One reasonable rule is to require that the leftmost variable be the one replaced at each step. This rule restricts us to what are called leftmost derivations. Similarly, we could require that the rightmost variable be replaced at each step, and that gives us the rightmost derivations.

The double arrow with an LM subscript, ⇒lm, represents one step of a leftmost derivation. That is, on the left of the arrow we must have a string of the form wAα. Since w, by our convention, has terminals only, A must be the leftmost variable in this string. On the right is the same string with the body, say β, of some A-production replacing the A: wβα. The symbol consisting of the double arrow with the subscript LM and a star, ⇒*lm, means "becomes by a sequence of zero or more leftmost derivation steps."
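Here is a minimal sketch of a leftmost-derivation step (the helper and the toy grammar are my own, not from the lecture): in a sentential form wAα, the variable replaced must be the leftmost one.

```python
GRAMMAR = {"S": ["AB"], "A": ["a"], "B": ["b"]}   # a toy two-variable grammar

def leftmost_step(form: str, grammar: dict, body: str) -> str:
    """Replace the leftmost variable in `form` by `body`."""
    for i, c in enumerate(form):
        if c in grammar:                     # c is the leftmost variable
            assert body in grammar[c], "must use one of its productions"
            return form[:i] + body + form[i + 1:]
    raise ValueError("terminal string: no replacement possible")

# S =>lm AB =>lm aB =>lm ab : A, being leftmost, must be replaced before B.
form = "S"
for body in ["AB", "a", "b"]:
    form = leftmost_step(form, GRAMMAR, body)
    print(form)                              # AB, then aB, then ab
```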
Let's introduce another very simple grammar that generates a language that is not a regular language. This grammar has only one variable, but unlike the 0^n1^n grammar, it has productions with more than one variable in the body. The grammar, S → SS | (S) | (), generates the strings of balanced parentheses: those strings of parentheses that are legal in arithmetic expressions. The last production, with body (), says that a pair of matching parentheses is balanced; of course the left parenthesis must come first. The middle production, S → (S), says that if we put matching parentheses around a balanced string, it is still balanced. And the first production, S → SS, says that the concatenation of two balanced strings is balanced. We would also need to prove that every string of balanced parentheses can be generated by this grammar; the proof is not too hard, but we're not going to do it here.

Here's an example of a leftmost derivation. We start with just S, so that is the leftmost variable, and we replace it by two S's at the first step, giving SS. Next, the first of these S's must be replaced in a leftmost derivation. We choose the second production for the replacement, giving (S)S. At the third step, the first of the S's must again be replaced, and here we choose the last production for the replacement, giving (())S. At the last step, we have only one S, and that naturally is the one we replace. We've chosen to replace it using the last of the three productions, so now we have the terminal string (())() and are done.

The arrow-star-leftmost notation, ⇒*lm, can be used to express zero or more leftmost derivation steps. So, for example, S is related to the terminal string (())() by this relationship. It is also related to itself and to all the other steps in the derivation by the same relationship, and in fact each step is related to itself and to all the following steps. Here is an example of a derivation that is not leftmost: the problem is that at the second step the second S, rather than the first, was replaced.

Rightmost derivations are defined quite analogously to leftmost derivations. The arrow with an RM subscript, ⇒rm, means that the rightmost variable is replaced at the step. Notice that the string on the left, αAw, has w — which must be a string of only terminals — following the variable A that is replaced. Thus A is surely the rightmost variable. And the arrow with the RM subscript and a star, ⇒*rm, means zero or more steps of a rightmost derivation.

Here's the balanced-parentheses grammar again, and now a rightmost derivation of the same string as before: S ⇒rm SS ⇒rm S() ⇒rm (S)() ⇒rm (())(). Notice that at the second step, the second S got replaced rather than the first. S is related to the terminal string by the arrow-star-rightmost operator, and as with leftmost derivations, each step in the rightmost derivation is related by the operator to itself and to all the steps that follow.

Here's an example of a derivation that is neither leftmost nor rightmost. See how, at the third step, the middle S is replaced. Also notice that the second step is correct but ambiguous in a concerning way: one of the S's got replaced by two S's, but we don't know which.
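Here is a sketch reproducing the lecture's rightmost derivation of (())() in the balanced-parentheses grammar; `rightmost_step` (my own helper, mirroring the leftmost sketch above) simply scans from the right instead of the left.

```python
GRAMMAR = {"S": ["SS", "(S)", "()"]}         # S -> SS | (S) | ()

def rightmost_step(form: str, grammar: dict, body: str) -> str:
    """Replace the rightmost variable in `form` by `body`."""
    for i in range(len(form) - 1, -1, -1):   # scan right to left
        if form[i] in grammar:               # the rightmost variable
            assert body in grammar[form[i]], "must use one of its productions"
            return form[:i] + body + form[i + 1:]
    raise ValueError("terminal string: no replacement possible")

# S =>rm SS =>rm S() =>rm (S)() =>rm (())()
form = "S"
for body in ["SS", "()", "(S)", "()"]:
    form = rightmost_step(form, GRAMMAR, body)
    print(form)                              # SS, S(), (S)(), (())()
```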