Good morning, good afternoon, good evening, or good whatever-or-whenever you are actually listening to this. Today we're going to start on a completely new tack, where we learn about a second important class of languages beyond the regular languages. This class is the context-free languages. They include all the regular languages and more. The most common description of these languages is a context-free grammar. We'll see the definition of these grammars and the way these grammars define languages, which is through derivations. We're also going to introduce a related notation called Backus-Naur Form.

The two most important applications of context-free grammars are probably the processing of natural languages and of computer languages. I am going to focus on computer languages, where almost every language today has its syntax defined by a context-free grammar in the Backus-Naur notation. Grammars are also essential when designing a parser for the language: that part of the compiler that puts together the tokens of the language into the proper structure. For example, the parser discovers the order in which the arithmetic operators of an arithmetic expression apply, and it groups statements properly so their execution sequence is as intended by the programmer.

A context-free grammar defines a language by a mechanism we will soon learn. Every regular language has a context-free grammar describing it, but there are also languages that can be described by a grammar yet are not regular. On the other hand, the context-free grammar is still a relatively weak formalism: there are some languages that are simple to describe yet have no context-free grammar. Many of the languages in the context-free class that are outside the regular languages involve nested structures, such as the patterns of left and right parentheses that we called balanced.

The central elements of a grammar are variables. These are symbols that generate particular sets of strings. One of these symbols, called the start symbol, will generate the entire language, but we can add many others to help us in that definition. The variables, or rather the sets of strings they generate, are defined recursively in terms of one another. The rules that define the languages of the variables are called productions. Each production has a variable on the left, say A; an arrow; and zero or more symbols on the right, say X and Y, that serve as the definition. A rule like A → XY says that the concatenation of the languages represented by X and Y is a subset of the language represented by A. A variable may have several productions, and its language is thus defined to be the union of the languages described by the right side of each of its productions. But remember, all of this may be recursive, so grammars are in fact far more powerful than the regular expressions we can build from unions and concatenations.

For our first example, let's consider a language that we showed not to be a regular language: the set of strings of the form 0^n1^n, some number of 0's followed by the same number of 1's. Here is a grammar for this language. It has two productions, or rules, and only one variable, S. The first production, S → 01, is a basis rule. It says that the string 01 is in the language of S. The second, S → 0S1, is an inductive rule. It says that if w is a string in the language of S, then 0w1 is also in the language. That rule lets us build longer and longer strings, at each step adding one 0 to the beginning and one 1 to the end. So we always have the same number of 0's and 1's, with the 0's preceding the 1's.
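To make the basis and inductive rules concrete, here is a minimal Python sketch (the function name `generate` is my own illustration, not part of the lecture): starting from the basis string 01, each use of the inductive rule wraps one more 0 on the left and one more 1 on the right.

```python
def generate(n: int) -> str:
    """Derive 0^n 1^n: apply the inductive rule S -> 0S1 (n-1) times,
    then finish with the basis rule S -> 01."""
    s = "01"                 # basis: S -> 01
    for _ in range(n - 1):   # induction: S -> 0S1 adds one 0 and one 1
        s = "0" + s + "1"
    return s

print([generate(n) for n in range(1, 4)])   # ['01', '0011', '000111']
```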
Now let's make precise our informal introduction to what context-free grammars look like. The terminals of the grammar are analogous to the input symbols of an automaton; they form the alphabet for the language being defined. The variables, or nonterminals, are something like the states of an automaton, but they are more powerful. One variable is called the start symbol; it is the language of this variable that the grammar defines. The other variables are used as auxiliaries, but we can think of them as internal to the grammar, and their languages are not visible outside. The productions of the grammar, which are akin to the transition function of an automaton, have the form of a variable on the left, sometimes called the head; an arrow; and a string of zero or more symbols, which can be terminals or variables. This string is sometimes called the body of the production.

We have conventions about the letters used for certain symbols and strings. These conventions are more complex than the convention we used to distinguish input symbols (little a, little b, and so on) from strings (w, x, and so on), but they're really important as a reminder of the roles different components play, and they are worth committing to memory early on. First, we use capital letters at the beginning of the alphabet as variables. However, S is also normally a variable; in fact it will be the start symbol in many grammars. On the other hand, lowercase letters at the beginning of the alphabet are terminals. This convention agrees with an earlier convention that made these letters stand for input symbols, since there is a good analogy between the inputs of an automaton and the terminals of a grammar. Capital letters near the end of the alphabet are used for symbols that can be either terminals or variables; we typically don't know which. Lowercase letters at the end of the alphabet stand for strings of terminals only. Again, this matches our earlier convention. And we use Greek letters at the beginning of the Greek alphabet for strings that may consist of both terminals and variables mixed.

We'll design a grammar for the language of strings of the form 0^n1^n. The terminal alphabet is {0, 1}, of course. We need only one variable in this case; we'll call it S. S will be the start symbol — there's no other option, since it is the only variable. And here are the productions, which we explained earlier in our informal discussion: S → 01 and S → 0S1. The first production generates only the string 01, and the second production puts a 0 and a 1 at the beginning and end, respectively, of a shorter string in the language.

A derivation consists of a sequence of strings that typically have both terminals and variables, although they can have only one kind of symbol or even be empty. We start with the string consisting of just the start symbol. At each step, we find a variable to replace, say A. The productions for A, or the A-productions, are those that have A on the left side, that is, as the head of the production. We replace this A by the right side, or body, of one of these productions. We can then repeat the process as many times as we like, until we are left with only terminals, at which point no replacement is possible and we have in fact generated a string that is in the language of the grammar.
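Before looking at the replacement step in detail, here is one way the four parts of a grammar might be represented in code; the dict-and-string representation is my own choice for illustration, not part of the formal definition.

```python
# The 0^n 1^n grammar: terminals, variables, a start symbol, and
# productions mapping each head to the list of its bodies.
cfg = {
    "terminals": {"0", "1"},
    "variables": {"S"},
    "start": "S",
    "productions": {"S": ["01", "0S1"]},   # S -> 01 | 0S1
}

# Sanity check: every head is a variable, and every body is a string
# over the terminals and variables.
for head, bodies in cfg["productions"].items():
    assert head in cfg["variables"]
    for body in bodies:
        assert all(c in cfg["terminals"] | cfg["variables"] for c in body)
```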
Here's what the replacement looks like. You take any string that has an occurrence of some variable A, with anything — terminals and/or variables — to its left and right; that is what the α and β in the string αAβ suggest. You take an A-production whose body is γ, and you replace the A by γ, giving αγβ.

Here is an example derivation. We always start with the string consisting of just the start symbol S. We choose the second production, so S is replaced by 0S1. Now we replace the S, again using the second production, giving 00S11. And again we replace the S, but this time we use the first production, whose body is just 01, giving 000111. We are now left with a string that has only terminals, and this string cannot be subjected to further replacements. You should be aware that in more complicated grammars, variables can be replaced by strings that contain two or more variables; when that happens we have lots of choices of which variable to replace at each step, and derivations can be much more complicated.

The double-arrow symbol ⇒ represents a single step of a derivation, and just as we extended delta to many steps for an automaton, we need a notation that means "any number of steps" of a derivation. This is the arrow-star symbol ⇒*, and we define it inductively. The basis is that in zero steps a string can't change, so any string derives-star itself: α ⇒* α. The inductive step lets us get from α to β using any number of steps, including zero, and then, with one more step, get from β to γ. The conclusion is then that α goes in some number of steps to γ: α ⇒* γ.

Here's an example using the same grammar as before and the same derivation sequence we just discussed. We can conclude that S goes to itself in zero steps — that is, we just start there and don't go anywhere — so S ⇒* S. Then, inductively, S ⇒* 0S1, S ⇒* 00S11, and S ⇒* 000111. Notice also that we don't have to start with the start symbol. It is also true, for example, that 0S1 ⇒* 000111. That is, anything along the derivation sequence derives-star itself and anything at a later position along the sequence.

A sentential form of a grammar is any string derivable from the start symbol; that is, S ⇒* the sentential form. The sentential form can consist of any mix of terminals and nonterminals. The language of the grammar G is the set of terminal strings w such that the start symbol of G derives w.

Here's an example grammar that's just a little different from the previous one. Here the basis rule is S → ε rather than S → 01. Note that ε is a perfectly legitimate body for a production, and its effect is to make the variable on the left disappear. As a result, this grammar can derive the empty string, along with all the other strings that have some number of 0's followed by the same number of 1's. That is, the language of this grammar is {0^n1^n : n ≥ 0}, while the previous grammar had n ≥ 1 as the condition on n.

The class of languages called the context-free languages consists of all those languages that are defined by some context-free grammar. We now see that there are context-free languages that are not regular languages, such as the 0^n1^n example just given. However, there are languages that are not context-free, and the intuition is that context-free grammars can count two things, but not three. Thus an example of a language that is not context-free is {0^n1^n2^n : n ≥ 1}. That is, you cannot match both the 0's against the 1's and the 1's against the 2's at the same time.
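As a quick sanity check, here is the derivation S ⇒ 0S1 ⇒ 00S11 ⇒ 000111 carried out mechanically in Python; the helper `replace` (my own name, not lecture code) performs the single step αAβ ⇒ αγβ described above.

```python
GRAMMAR = {"S": ["01", "0S1"]}            # S -> 01 | 0S1

def replace(form: str, pos: int, body: str) -> str:
    """One derivation step: rewrite alpha A beta as alpha gamma beta,
    where A is the variable at index pos and gamma is the body of
    one of the A-productions."""
    var = form[pos]
    assert var in GRAMMAR and body in GRAMMAR[var], "need an A-production"
    return form[:pos] + body + form[pos + 1:]

print(replace("S", 0, "0S1"))     # 0S1
print(replace("0S1", 1, "0S1"))   # 00S11
print(replace("00S11", 2, "01"))  # 000111, terminals only: S =>* 000111
```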
I want to introduce a notation called BNF, or Backus-Naur Form, which you may have seen in manuals for programming languages, and which is closely related to context-free grammars. BNF has a number of extensions of the grammar notation we use; these extensions are useful in manuals but don't add any power. BNF-style notations were used for two of the original programming languages: John Backus used it in the original description of Fortran, and Peter Naur used it in the original description of Algol.

In BNF you usually use a word to describe a variable, for example ⟨statement⟩, where the intent is that this variable will generate all the strings that are valid statements of the programming language. Conventionally these words are put in angle brackets that tell us they are variables rather than terminals. Some terminals of a programming language are single characters, just as in our formal context-free grammars; for example, the plus sign or the semicolon are commonly terminals in the grammar for a programming language. However, other terminals are really reserved words like "if" or "while", and we see these shown either in bold or underlined, depending on the style used, to remind us that they are single symbols.

In BNF we often find ::= used in productions rather than the arrow. We also find a vertical bar used to list several production bodies that have the same head. That is a useful convention that we shall use in our formal grammars as well. For example, our original grammar can be written with S appearing once on the left side and the bodies of the two S-productions listed with a bar between them: S → 01 | 0S1.

To indicate "one or more occurrences," we follow a symbol or bracketed expression by "...". So here's an example. We have one variable, ⟨digit⟩, with the obvious ten productions all grouped together with bars, one body for each of the digits 0 through 9. Then we have one production for the variable ⟨unsigned integer⟩, with right side ⟨digit⟩..., that is, one or more digits.

In general, we can replace α... by two productions. Let A be a new variable that generates all sequences of one or more α's. A has two productions, with bodies Aα and just α: A → Aα | α. You should be able to see how A can be replaced by a sequence of n α's: just use the first production n−1 times and then the second production once. For example, here is a grammar for unsigned integers, where the BNF ⟨digit⟩... has been replaced by these two productions, with U standing for the unsigned integer and D for the digit: U → UD | D. These generate any sequence of one or more D's. Then D generates each of the ten strings consisting of a single digit.

We can make part of a production body optional if we surround it by square brackets. For example, many programming languages have both an if-then and an if-then-else construct for statements. We can see this as an if-then statement with an optional else clause, and in BNF we put square brackets around the else clause to make it optional. We can replace an optional α by a new variable A. This variable has two productions, one with body α and the other with the empty string as its body: A → α | ε. Thus the α can be either there or not when we expand the new variable A. In the resulting grammar fragment, S → iCtSA, we're using i for "if", t for "then", e for "else", and the semicolon is another terminal standing for itself. S is the start symbol, standing for statement, and C is another variable, representing conditions.
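Here is a sketch of the two BNF-elimination rules just described; the function names and dict representation are my own, not from the lecture. The α... shortcut becomes a fresh variable with productions A → Aα | α, and the optional [α] becomes a fresh variable with productions A → α | ε (the empty string "").

```python
def eliminate_one_or_more(grammar: dict, alpha: str, new_var: str) -> str:
    """Replace BNF 'alpha ...' by new_var -> new_var alpha | alpha."""
    grammar[new_var] = [new_var + alpha, alpha]
    return new_var

def eliminate_optional(grammar: dict, alpha: str, new_var: str) -> str:
    """Replace BNF '[alpha]' by new_var -> alpha | epsilon ("")."""
    grammar[new_var] = [alpha, ""]
    return new_var

# The lecture's unsigned-integer example: <digit>... becomes U -> UD | D.
grammar = {"D": [str(d) for d in range(10)]}   # D -> 0 | 1 | ... | 9
eliminate_one_or_more(grammar, "D", "U")
print(grammar["U"])                            # ['UD', 'D']

# The if-then[-else] example: the optional else clause ';eS' becomes
# A -> ;eS | epsilon, used in the production S -> iCtSA.
eliminate_optional(grammar, ";eS", "A")
print(grammar["A"])                            # [';eS', '']
```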
We really need to add productions for conditions, but I haven't done so in this fragment of a real grammar. Notice that A is a variable standing for an optional else clause. It can be replaced by a semicolon, an "else", and another statement if we want the else clause there, or by ε if we just want an if-then statement with no else clause.

Curly braces are used in BNF for grouping several elements together. You need this, for example, if you want a repeating group of elements in a "..." (one-or-more) construct. It is common in programming languages to allow a statement list to be one or more statements, with the statements separated by semicolons, so there is one fewer semicolon than there are statements. That is, a statement list consists of one statement, followed optionally by one or more groups, each group consisting of a semicolon and a statement. The curly braces form the group; the "..." applied to the group says one or more of these groups; and the square brackets say that the whole thing is optional.

To translate groups to our original notation, just create a new variable A for the group; A has one production whose body is the group. Here's an example of a production that uses all three BNF features: one-or-more, optional, and grouping. It says a list of statements L is a single statement S, optionally followed by one or more groups, each group consisting of a semicolon and a statement. The first thing we'll do is replace the group, the semicolon and S, by a new variable A. A has one production, A → ;S, in which it is replaced by the thing it stands for. Next we'll introduce a new variable B, which stands for the optional A.... B has two productions: it is replaced either by A..., which is one choice, or by the empty string. Finally, we replace the A... in the B-productions by a new variable C. The productions for C, namely C → CA | A, allow C to be replaced by any sequence of one or more A's.

When a sentential form has a number of variables, we can replace any one of them at any step. But what string of terminals a variable ultimately gets replaced by is independent of what else is in the sentential form. That's actually where the term "context-free" comes from. As a result, we have many different derivations of the same string of terminals. We can restore some order to the world by requiring that a particular variable be replaced at each step, although we cannot demand that any particular production be used for the replacement. One reasonable rule is to require that the leftmost variable be the one replaced at each step. This rule restricts us to what are called leftmost derivations. Similarly, we could require that the rightmost variable be replaced at each step, and that gives us the rightmost derivations.

The double arrow with an LM subscript, ⇒lm, represents one step of a leftmost derivation. That is, on the left of the arrow we must have a string of the form wAα. Since w, by our convention, has terminals only, A must be the leftmost variable in this string. On the right is the same string with the body, say β, of some A-production replacing the A: wβα. The symbol consisting of the double arrow with the subscript LM and a star, ⇒*lm, means "becomes by a sequence of zero or more leftmost derivation steps."
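Here is a minimal sketch of a leftmost-derivation step (the helper and the toy grammar are my own, not from the lecture): in a sentential form wAα, the variable replaced must be the leftmost one.

```python
GRAMMAR = {"S": ["AB"], "A": ["a"], "B": ["b"]}   # a toy two-variable grammar

def leftmost_step(form: str, grammar: dict, body: str) -> str:
    """Replace the leftmost variable in `form` by `body`."""
    for i, c in enumerate(form):
        if c in grammar:                     # c is the leftmost variable
            assert body in grammar[c], "must use one of its productions"
            return form[:i] + body + form[i + 1:]
    raise ValueError("terminal string: no replacement possible")

# S =>lm AB =>lm aB =>lm ab : A, being leftmost, must be replaced before B.
form = "S"
for body in ["AB", "a", "b"]:
    form = leftmost_step(form, GRAMMAR, body)
    print(form)                              # AB, then aB, then ab
```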
Let's introduce another very simple grammar that generates a language that is not a regular language. This grammar has only one variable, but unlike the 0^n1^n grammar, it has productions with more than one variable in the body. The grammar, S → SS | (S) | (), generates the strings of balanced parentheses: those strings of parentheses that are legal in arithmetic expressions. The last production, with body (), says that a pair of matching parentheses is balanced; of course the left parenthesis must come first. The middle production, S → (S), says that if we put matching parentheses around a balanced string, it is still balanced. And the first production, S → SS, says that the concatenation of two balanced strings is balanced. We would also need to prove that every string of balanced parentheses can be generated by this grammar; the proof is not too hard, but we're not going to do it here.

Here's an example of a leftmost derivation. We start with just S, so that is the leftmost variable, and we replace it by two S's at the first step, giving SS. Next, the first of these S's must be replaced in a leftmost derivation. We choose the second production for the replacement, giving (S)S. At the third step, the first of the S's must again be replaced, and here we choose the last production for the replacement, giving (())S. At the last step, we have only one S, and that naturally is the one we replace. We've chosen to replace it using the last of the three productions, so now we have the terminal string (())() and are done.

The arrow-star-leftmost notation, ⇒*lm, can be used to express zero or more leftmost derivation steps. So, for example, S is related to the terminal string (())() by this relationship. It is also related to itself and to all the other steps in the derivation by the same relationship, and in fact each step is related to itself and to all the following steps. Here is an example of a derivation that is not leftmost: the problem is that at the second step the second S, rather than the first, was replaced.

Rightmost derivations are defined quite analogously to leftmost derivations. The arrow with an RM subscript, ⇒rm, means that the rightmost variable is replaced at the step. Notice that the string on the left, αAw, has w — which must be a string of only terminals — following the variable A that is replaced. Thus A is surely the rightmost variable. And the arrow with the RM subscript and a star, ⇒*rm, means zero or more steps of a rightmost derivation.

Here's the balanced-parentheses grammar again, and now a rightmost derivation of the same string as before: S ⇒rm SS ⇒rm S() ⇒rm (S)() ⇒rm (())(). Notice that at the second step, the second S got replaced rather than the first. S is related to the terminal string by the arrow-star-rightmost operator, and as with leftmost derivations, each step in the rightmost derivation is related by the operator to itself and to all the steps that follow.

Here's an example of a derivation that is neither leftmost nor rightmost. See how, at the third step, the middle S is replaced. Also notice that the second step is correct but ambiguous in a concerning way: one of the S's got replaced by two S's, but we don't know which.
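Here is a sketch reproducing the lecture's rightmost derivation of (())() in the balanced-parentheses grammar; `rightmost_step` (my own helper, mirroring the leftmost sketch above) simply scans from the right instead of the left.

```python
GRAMMAR = {"S": ["SS", "(S)", "()"]}         # S -> SS | (S) | ()

def rightmost_step(form: str, grammar: dict, body: str) -> str:
    """Replace the rightmost variable in `form` by `body`."""
    for i in range(len(form) - 1, -1, -1):   # scan right to left
        if form[i] in grammar:               # the rightmost variable
            assert body in grammar[form[i]], "must use one of its productions"
            return form[:i] + body + form[i + 1:]
    raise ValueError("terminal string: no replacement possible")

# S =>rm SS =>rm S() =>rm (S)() =>rm (())()
form = "S"
for body in ["SS", "()", "(S)", "()"]:
    form = rightmost_step(form, GRAMMAR, body)
    print(form)                              # SS, S(), (S)(), (())()
```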