Today we're going introduce the idea of regular expressions. This is a simple algebraic notation that describes exactly the regular languages. In practice it's very common. To have. A regular expression like notation to, to describe patterns, such as patterns in text and to have. Behind the scenes, a compiler from regular expressions into finite automata. In particular, deterministic finite automata. Because these are the things that actually can be executed. So, after introducing the notion of a regular expression, we're going to, show you that. You can convert any regular expression into an automaton and also any automaton into a regular expression, that proves that the language is accepted by regular expressions and the various kinds of final automaton that we have met are exactly the same. The algebra you are most familiar with is probably arithmetic using operators plus and times and operating on numbers. The regular expression algebra operates on languages and it has three specialized operators. Regular expressions like algebraic expressions built up by applying these operators to operate. So regular expressions then describe languages of an algebra. And they as I said describe the regular languages. We'll use L and E as the notation for the language described by the regular expression E. And we will describe regular expressions and their languages recursively. Regular expressions use three operations, the union, concatenation, and the Kleene star. Okay, a union of languages is the usual thing, since languages are sets. And here's a, a very simple evolve. Here too. Sets, their languages, each with three strings and with two strings. Zero one happens to be in both so it will be once in the union. One, one, one is there, zero is there, and zero, zero is there, so that's the common notion of the union on sets or languages. Concatenation is. Also a fairly simple operation. We'll denote the concatenation of two languages, say L and M, by juxtaposition, that is L, M with no punctuation in between them. The language LM contains every string that is W concatenated with X, such that W is in L and X is in M. So, here's an example, we are in effect multiplying the two languages, that is we'll take 01 from the first language, and we can concatenate it with 00. That gives us 0100. We could also take 01 again, concatenate it with 01, and that gives us that. And one, one, one [inaudible] with the two strings here gives us these two strings and the result and one zero concatenated with zero, zero and zero one gives us. The final two strings in the result, okay. Kleene star is probably something that you're least familiar with. If L is a language, then L star which will, is called the Kleene star, or just the star operator, is the set of strings that you can form by concatenating zero or more strings from L, in any order. That is, you can take, One string from L, you take no strings from L. That would give you the empty string. You take one string from L. You take two strings from L, they can be any two strings, they don't have to be the same. You can concatenate them. You can take three strings from L, concatenate them and so on. Incidentally Stephen Kleene was the fellow who invented regular expressions, and showed that the describe the same languages that finite automata described. Okay, if L is a language, then L star, the Kleene star, or just star operator, is the set of strings formed by concatenating zero or more strings from L in any order. Thus, L star would consist of the set containing the empty tree. Empty string is always in the star of any [inaudible]. Because that represents no choices of string from L. Then the union with L itself. And then we take two strings from L so we take L concatenated with L. We can take three strings and so on. Any [inaudible] from L, anything you can form by concatenating any number of strings from L will be in the language L star. So here's an example. Language l is just as two strings. Zero And one zero. The star of that language. Well, we can take no. No choices, that would give us the empty string. We can take one or the other string. That will give us these two. We can take two choices from the language L so we if both choices is zero we get 00. We take zero from the first choice and ten for the second that is 0100. All it takes ten for the first choice and zero for the second choice that gives us 100 or we can make both choices be ten that give us. In the string 1010 have any choices and so on. There are three parts to the basis in the, in the Regular Expressions definition. The first part is for the single symbol. If A is the symbol. Then A also denotes a regular expression. It denotes, the language, that, this regular expression denotes is the language with one string. That string has length one and the one position of that string has A in it. Okay, by the way, in order to distinguish A as a symbol or string from A as a regular expression, we usually make the regular expression bold-faced, but context should, help you to distinguish, Strings, symbols, and regular expressions anyway. Okay. The second part of the basis is the symbol epsilon. Okay, this is a regular expression and it's language is the language that has one string. That string is the empty string. And the third part of the basis is the empty set symbol. This is a regular expression and it's language is the empty language. Okay, the inductive part of the definition also happens to have three parts. 'Kay. For the first part we can connect any two regular expressions by a plus sign. And this plus sign represents set union. That is the language of e one plus e two. Is the union of languages that e one and e two denote. The second part. Involves the concatenation operator. We can write one regular expression next to another to denote the concatenation of their languages. That is E1 followed by E2 denotes the concatenation of the languages that E1 and E2 denote. As we shall discuss in a minute, we sometimes need parentheses to group expressions properly. So in some circumstances we need to put parentheses around E1 and or E2. So we might, for example, see something like that, okay. The third part is the star operator. If we follow a regular expression E by a star, then the language we denote is the clean enclosure of the language that E denotes. Again, it is in some circumstances necessary. Put parentheses around the E to make sure the operators inside the expression E group properly like that. As with other algebras, we can and must use parentheses to force the intended order for operations. For regular expressions, the order is star, then concatenation, then plus. [inaudible] of plus, the lowest [inaudible] operator is analogous to addition in arithmetic. Concatenation is analogous to multiplication, and star as analogous to exponentiation. Although, in the case of star, [inaudible]. No power to which its argument is raised. In a sense, star means raised to all powers. For example. The regular expression zero one represents the concatenation of the language consisting of one string which is zero. And the language consisting of one string one. The results of the language containing one string zero one. In general. Any string of symbols as a regular expression represents the language that contains only that one string. The language of expression zero one plus zero. That's this. Is the union of the language containing only the strings zero one and the language containing only the string zero. The language of zero concatenated with one plus zero, that's this expression, is the two strings. 01 and 00. Notice that we need parentheses to force the plus the group first. Without them since concatenation take precedence over plus we get the interpretation of the second example. That is this one and obviously you get somewhat different languages. The language zero star is the star of the language containing only the string zero. This is all strings of 0's including the empty string. Here's a little more complicated example. It denotes the language that we've been playing with when we talked about automana, that is, all strings of 0's and 1's without two consecutive 1's. To see why this works in every such string, that is any string in the language, each one is either followed immediately by a zero. Where it comes at the end of the string. This part of the expression. Zero plus one, zero, star. Denotes all strings in which every one is, is followed by a zero. Okay. These strings are surely in the language that we want, but we also want those strings that are followed by a final one. Thus we can concatenate the language of zero plus one, zero star. With. The union of two languages. One is epsilon, and concatenating with epsilon, just give us the same strings. That is, that is those that don't have a final one. And we can also concatenate with the language containing the string one. That gives us any string that, in which every one is followed by a zero, and then it also has a final one, okay? We're going to show the regular expressions, and find exactly the regular languages. We already have three equivalent representations to the regular languages, DFAs, NFAs, and Epsilon NFAs. We need to show that for every regular expression, there is some automaton that defines the same language. And for this job, we may as well pick the most powerful of the three varieties of automaton, the Epsilon NFA. For the other direction, we need to show that every regular language is defined by some regular expression. Here, we may as well with the most restrictive variety of automaton, the DFA. We'll begin with, the process of how you convert a regular expression to an epsilon NFA. The proof is an induction on the number of operators, the operators of course being the plus concatenation and the star, that appear in the regular expression. And we're always going to construct an automaton that has a special form which I'll show you on the next slide. The special fall of epsilon NFA we construct as suggested by this sketch. They can be any number of states in the middle, but only one starts [inaudible] as true for any automaton. And only one final state, which is a restriction. More importantly, as we build larger automata from this from smaller ones, we never allow an arc into the middle or into the final state. The only outside arcs must come into the star state. That is, we might add arcs like that but you can't add an arc that goes with state that's inside. Okay. Likewise you can't add an arch that goes to the final state. [inaudible] We never have an ark from any of these states. Except the final state. To some place on the outside. So you can't install some arc leaving like that. And notice that we put quotes around the term final, in final state, because although this state is, is final if this is the entire automaton that we construct. If it's a piece of a larger automaton, then the final state will no longer be final in the, ... >> Allow the larger automaton. >> Okay. >> The induction is on the number of operators in the regular expression. And here are the bases cases, the expressions with zero operators. Okay, a regular expression without operators has to be one of the bases cases from the definition or regular expressions. If the expression is a symbol A, then the epsilon [inaudible] we need consists of only a start and a final state, with an arc labeled A from start to final. Obviously, this automaton accepts the language of the regular expression A. That being only the one string A. If the expression is epsilon, then it is sort of the same, except the arc is labeled epsilon. And if the expression is the empty set symbol, and we just have the start and the final state, and no way to get from one to the other, okay? Now what do you get in the induction? There are three cases, depending on whether the outermost operator's union concatenation is star. Here's the instruction for union. Suppose our expression is E1+E2. Both E1 and E2 have fewer operators than the entire expression. So the inductive hypothesis applies to these sub expressions. We may thus assume that there is an epsilon NFA for, of the desired form for E1 and likewise for E2. We form the epsilon NFA for E1 plus E2 by introducing a new start state and a new final state. The old final states are no longer final. There are epsilon transitions from the new start state to each of the two old start states, and there are epsilon transitions from the old states to the new final state. If there's a path from the new start state to the new final state, it must consist of epsilon transitions at the beginning and end, with a path that is either wholly within E1 or wholly within E2. So for example the path might look like, this go here and then we go around to go through the same statement many times finally comes the old final state of E1 and then out to the new final state. Notice the fact that there is no way to get into the green thing labeled for E1. And no way to get out of it except as I've shown. Guarantees that the path must from, the path from here to here must either go. Through E1 or through E2. There are no other options. ?Kay. Here's the construction for concatenation of E1 and E2. Again we assume by the inductor hypothesis that there is an Epsilon NFA of the right form for E1 and dido for E2. We add an epsilon arc from the final state of the automaton for E1. And that state, of course, then, is your final. The start state of the automaton for E2. The start state with the bigger automaton is the start state of the automaton for E1. And the final state is the final state of the automaton for E2. The new automatons the concatenation of the languages. E1 and E2. Any path from the start to final state must first pass through E1 but as it will do something like this Then it will take this epsilon arch and then it will do something in two. There is no other option because you can't get into or out of the green areas except as shown. Finally, here's the construction for star. Suppose we have an automaton for some expression E, and we want to construct the automaton for E star. We add a new start and final states, and the old final state is no longer final. There's an epsilon arc from the new start state to the new final state, so epsilon is always in the language of this new automaton. There are three other epsilon arcs as shown. The first of these epsilon arcs, brings us to the start state for, the automaton for E. We must reverse the automaton for E, following a path whose label is any string in the language for E No. When we reach the old final state of this automaton, they have a choice. We follow the [inaudible] r in the new final state. Then we're done. We've found the path. That is, has a label that is one string from the language of e. Or, we have another option. We can go from the old final state back to the old start state. Maybe take another path to there. Now we have a path that has a label that is the concatenation that is two strains from the language of E. And we can go back again. Take another path. Make, a label of that path would have now, concatenation of three strings for me, and so on. We can do this as many times as we like. But then, finally, we go off to the final state, new final. And, we are, done with that. So, this automaton really does have a language that's E star. You can go through the automaton for E. Zero times by just, just going around it. Or you can go through it once, twice, three times as many times as you'd like. Now we're going to do the construction the other way. We'll start with a DFA and construct a regular expression to find the same language. The proof is inductive and to make the induction work, we need to assume the states are named one through N. There's no harm in making this assumption since the names of the states do not influence the language in automanotic sets. The induction is on the maximum number of a state that is allowed to be in the middle of a path. An idea that is explained on the next slide. We're going to talk about k-Paths, which are paths that can go from any state to any state. But in the middle that is excluding the endpoints only states numbered K or less can be found. The inductive on k, and it states that there is a regular expression, whose language is the set of labels of all k-paths who state I to state j. We start with 0-Paths, paths as a basis, and by the time we get to n-Paths, there's no restriction on, on paths at all, so we have a regular expression describing the language of the automaton itself. Notice that an n-Path represents no restriction on paths at all since there are no states whose names are higher than N. To get a regular expression for the language of the DFA we take the union of the expressions for the n-Path that describe how you get from the start state to each of the final states. For example, he's a little automatoned. The 0-Paths from state two to state three are the language of the regular expression zero. The reason is that we can only follow an arc from two to three, and there's only one. The one that has label zero. That is, that's the zero is describes the only path that gets the state two to three without going through any another state. 1-Path from two to three we can go through state one but not through two or three. Plus we can still follow the arch from two to three label zero but we can also follow the arch labeled one from state two to state one. That's this. And from there, the arc labeled one. From state one to state three. Notice that once we get to state one, we cannot go back to two in a 1-Path. And if we go to three, we're finished, the path must end. The 2-Paths from state two to state three are more varied. The only think it cannot do is pass through state three. Thus, from state two, we can go to state one and back to two zero or more times. The labels along each such path is 1-0. So 1-0 star. Describes the labels we reached by following this path any number of times. Including zero times. After oscillating as many times as we wish. In warming up in state two. We can then follow the zero arc to three. That's that part of the regular expression. This path from state two to state three are described by the regular expression. One zero Star zero. That's this whole part. An alternative plan is to oscillate between state two and one. Winding up in one. The labels of all these paths are described by one, that is you can to state one for the first time, and then you can go 01, 01 as many times as you like including, including zero times. Finally though you have to follow the path from one to three, the arch from one to three, in fact. And, so one followed by one, sorry, one followed by zero one star, followed by another one, describes all of those paths that start in two. Go from one to two several times. Wind up in one. And then go to three. The regular expression for the labels of all 2-Paths. From state two to state three. Is the union of these two expressions that we have just described. The 3-Paths from state to state three also have a regular expression. But it is sufficiently convicted that we're going to save that example until we have seen the complete construction. Okay. Now, here's the induction of how you, go from a DFA to a regular expression. Okay. The basis is K equals zero. Notice that a 0-Path. Can have no intermediate states at all, plus the set of 0-Paths from the state I to state J, can only consist of a single arc plus, in the special case that I=J, no arcs at all. We can construct a regular expression for this language. Connect with pluses each of the labels and in addition, if I=J, then add an epsilon. So for example, here's state I, here's state J, and suppose it has an arc, labeled A, B, and C. Then it's regular expression is A+B+C. And if there's also an arc, let's say, from I to I, in addition, let's say that there is a loop with labels D and E on the state I. Then regular expression for the paths from I to I would be D+E+epsilon. We now need to introduce some notation. Let Rij super K, be the regular expression for the set of labels of the K paths from state I to state J. We've really seen the basis case, K equals zero, that is Rij super zero is the sum of the labels on the arc from state I to state J. If there's no such arc, then use the empty sets symbol as the regular expression, but we add epsilon if I equals J. Use our example DFA again. R one two super zero equals zero, since that is the label of the arc from the state one to state two, and of course the state one is not equal to state two. Now consider our R-1-1 super zero. There's no arc from state one to itself, so we start out with the empty set. However, since the beginning and the end states are the same, we have epsilon. Okay. And incidentally, notice that there's an algebraic law. That is, the empty set union anything is that other thing. So, whenever you see an empty set symbol plus something else, you can just get rid of the empty set and plus, and just take what's. Now we shall do the inductive step where we assume we've written all the regular expressions with super K minus one, and we need to write the expressions for the K paths. A K path. I would never go through state K at all or does so one or more times. That let's us write the expression for Rij super K in terms of expressions for K minus one paths. The first term covers the case that the K path doesn't even go through the state K once. That is every K minus one path is a K path. All other k pads. Gets to state k at least once. Thus, we start out as Rik super k minus one. Which generates us the labels of all the pads that get us from state I to state k for the first time. Then comes Rkk super K-1, all starred. It generates the labels of all paths to go from state K to state K zero or more times with those paths not passing through any state K or higher. Finally, we, we concatenate with the language of the expression Rkj super K-1. That generates the labels of all paths from state K to state J that do not pass through state K or higher. Here's a picture of what a k pad looks like with the vertical axis representing the state number. We've shown I and J above K. Although they can be at any height. That is for example I could be down here. So could J, could be down there. Or they can be higher. Doesn't matter. So one possibility is that the K path never goes through state K. Or it could go through state K for the first time but only to a lower states. And from k to k, you could go get through lower states zero or more times and, finally, you go from k to j through lower number states, number states numbered lower than k and that represents all the paths that go from I to j through k and lower numbered states. Finally, as we have hinted, the regular expression with the same language is the given DFA is formed by taking the union overall final states "J" of "Rij super N" where "i" is the start state. So here's, here our example of DFA again. Now we have to install a start state and some final states. So let's assume that state two is the start state and three is the only final state. So the regular expression that we want is just R23 super three. Okay. Here's the expression for r23 super three in terms of the super two expressions. This expression is special because state three appears in two places as both k and j and as a result we can simplify it as shown to r23 super two concatenated with r33 super two star. The reason is that if e is any regular expression, such as r33 super two. Then. E star E. Is the concatenation of one or more Es. Okay. Now we have. Another expression are 2,3, super two. That you can think of as, itself concatenated with none of the e's. Remember e is actually r33 super two. So this is r23 super two concatenated with no e's. This is R323 Supertube concatenated with EE, that's one or more Es. So together you have R23 Super two in candidate with zero one more Es and that's what that expression is, is saying. Again, remember, E is R33 Super two. Okay, so, we figured out what R 23 super two is, we did that earlier, so lets just use it here. And here's an expression for R33 super two. I won't give the details but remember that this expression has to represent paths from three to three that never go through three. So you could only jump from one to two and back again until it returns to three. And here's the simplified expression composed of R23 super two and R33 super two. That last being starred. So, what you have at the end is a regular expression whose language is the same as the regular expression of this automaton order. Each of the three types of automatons, the DFA, NFA, and the epsilon NFA that we discussed, and regular expressions as well, define exactly the same set of languages, regular languages. That is, if you remember, we started with the DFA. And we show that any nfa could be converted to dfa. That's the subset construction. We then show that any x one nfa could be converted to an ordinary nfa. That was the closure construction. And earlier in this lecture, we showed that any regular expression can be converted to an epsilon NFA and now we just showed that every DFA can be converted to a regular expression. That says any language defined by any one of these four notations is defined by all the others. Okay, we're going to finish up with a little discussion with the algebraic laws for regular expressions. We have several times argued that two different regular expressions represent the same language and therefore one can be substituted for the other. Now these are examples of the sort of algebraic laws that we find in the algebra of regular expressions of arithmetic, Boolean algebra, and other algebras. The laws for regular expressions are not too different from those for arithmetic, but we have to be very careful to observe the differences. First, plus and concatenation behave almost like plus and times for arithmetic. The plus operator which remember, is a set union, is associative and communicative just like addition. Remember, ultimate, the commutative law says that X plus Y equals Y plus X. That is the order of the arguments doesn't matter. The associative law says that you can group the operator in any order. Or formally, x plus y grouped plus z equals x plus the grouping of y plus z. That is, you can apply plus to the first two arguments first, or apply it to the last two arguments first. It doesn't matter, the result is the same. Concatenation, like multiplication, is associative. Moreover, concatenation distributes over union just like multiplication distributes over addition. That is x. Concatenated with y plus z. Equals xy plus xa. That is xy plus z equals xy plus xz. That is the distributive law. And the big difference between regular expressions in arithmetic is that concatenation is not commutative. Although multiplication obviously is. That is, the regular expressions, AB and BA, [sound] don't denote the same languages. That is, the first AB just denotes the language that contains the string AB, while the second, BA, denotes the language that contains only the string BA. Okay. Arithmetic also has certain constants that serve as identities for addition and multiplication. Obviously zero is the identity for addition, and that adding zero to any number gives you that number back. Likewise,1 is the identity for multiplication because multiplying any number by one, gives you that number. Finally zero is also the annihilator for multiplication because multiplying any number by zero gives you zero. Union and concatenation have analogous identities as we should see the identity for union is the annihilator of concatenation just like the identity for addition is the annihilator for multiplication. Now, the identity for union as we actually have mentioned is the empty set. Obviously taking the union of the empty set with any set yields that set. The language containing only the empty string is the identity for calculation. Calculating any string as with the empty string yields as you can calculating the language represented by the regular session r [inaudible] with the other language say the line represented by [inaudible] would result in the same language the one represented by r. Finally the empty set is the annihilator from concatenation. That is we take any language, say the language represented by regular expression R and concatenate it with the empty language, we get the empty language. Why? The result of the concatenation requires that we take strings from both languages and concatenate them in all possible ways. But you can't find any strings in the empty language to use in the concatenation of strings.