Today we're going introduce the idea of
regular expressions. This is a simple
algebraic notation that describes exactly
the regular languages. In practice it's
very common. To have. A regular expression
like notation to, to describe patterns,
such as patterns in text and to have.
Behind the scenes, a compiler from regular
expressions into finite automata. In
particular, deterministic finite automata.
Because these are the things that actually
can be executed. So, after introducing the
notion of a regular expression, we're
going to, show you that. You can convert
any regular expression into an automaton
and also any automaton into a regular
expression, that proves that the language
is accepted by regular expressions and the
various kinds of final automaton that we
have met are exactly the same. The algebra
you are most familiar with is probably
arithmetic using operators plus and times
and operating on numbers. The regular
expression algebra operates on languages
and it has three specialized operators.
Regular expressions like algebraic
expressions built up by applying these
operators to operate. So regular
expressions then describe languages of an
algebra. And they as I said describe the
regular languages. We'll use L and E as
the notation for the language described by
the regular expression E. And we will
describe regular expressions and their
languages recursively. Regular expressions use three
operations, the union, concatenation, and the Kleene
star. Okay, a union of languages is the
usual thing, since languages are sets. And
here's a, a very simple evolve. Here too.
Sets, their languages, each with three
strings and with two strings. Zero one
happens to be in both so it will be once
in the union. One, one, one is there, zero
is there, and zero, zero is there, so
that's the common notion of the union on
sets or languages. Concatenation is. Also
a fairly simple operation. We'll denote
the concatenation of two languages, say L
and M, by juxtaposition, that is L, M with
no punctuation in between them. The
language LM contains every string that is
W concatenated with X, such that W is in L
and X is in M. So, here's an example, we
are in effect multiplying the two
languages, that is we'll take 01 from the
first language, and we can concatenate it
with 00. That gives us 0100. We could also
take 01 again, concatenate it with 01, and
that gives us that. And one, one, one
[inaudible] with the two strings here
gives us these two strings and the result
and one zero concatenated with zero, zero
and zero one gives us. The final two
strings in the result, okay. Kleene star
is probably something that you're least
familiar with. If L is a language, then L
star which will, is called the Kleene
star, or just the star operator, is the
set of strings that you can form by
concatenating zero or more strings from L,
in any order. That is, you can take, One
string from L, you take no strings from L.
That would give you the empty string. You
take one string from L. You take two
strings from L, they can be any two
strings, they don't have to be the same.
You can concatenate them. You can take
three strings from L, concatenate them and
so on. Incidentally Stephen Kleene was
the fellow who invented regular
expressions, and showed that the describe
the same languages that finite automata
described. Okay, if L is a language, then
L star, the Kleene star, or just star
operator, is the set of strings formed by
concatenating zero or more strings from L
in any order. Thus, L star would consist
of the set containing the empty tree.
Empty string is always in the star of any
[inaudible]. Because that represents no
choices of string from L. Then the union
with L itself. And then we take two
strings from L so we take L concatenated
with L. We can take three strings and so
on. Any [inaudible] from L, anything you
can form by concatenating any number of
strings from L will be in the language L
star. So here's an example. Language l is
just as two strings. Zero And one zero.
The star of that language. Well, we can
take no. No choices, that would give us
the empty string. We can take one or the
other string. That will give us these two.
We can take two choices from the language
L so we if both choices is zero we get 00.
We take zero from the first choice and ten
for the second that is 0100.
All it takes ten for the first choice and
zero for the second choice that gives us
100 or we can make both choices be ten
that give us. In the string 1010 have any
choices and so on.
There are three parts to the basis in the,
in the Regular Expressions definition.
The first part is
for the single symbol. If A is the symbol.
Then A also denotes a regular expression.
It denotes, the language, that, this
regular expression denotes is the language
with one string. That string has length
one and the one position of that string
has A in it. Okay, by the way, in order to
distinguish A as a symbol or string from A
as a regular expression, we usually make
the regular expression bold-faced, but
context should, help you to distinguish,
Strings, symbols, and regular expressions
anyway. Okay. The second part of the basis
is the symbol epsilon. Okay, this is a
regular expression and it's language is
the language that has one string. That
string is the empty string. And the third
part of the basis is the empty set symbol.
This is a regular expression and it's
language is the empty language. Okay, the
inductive part of the definition also
happens to have three parts. 'Kay. For the
first part we can connect any two regular
expressions by a plus sign. And this plus
sign represents set union. That is the
language of e one plus e two. Is the union
of languages that e one and e two denote.
The second part. Involves the
concatenation operator. We can write one
regular expression next to another to
denote the concatenation of their
languages. That is E1 followed by E2
denotes the concatenation of the languages
that E1 and E2 denote. As we shall discuss
in a minute, we sometimes need parentheses
to group expressions properly. So in some
circumstances we need to put parentheses
around E1 and or E2. So we might, for
example, see something like that, okay.
The third part is the star operator. If we
follow a regular expression E by a star,
then the language we denote is the clean
enclosure of the language that E denotes.
Again, it is in some circumstances
necessary. Put parentheses around the E to
make sure the operators inside the
expression E group properly like that. As
with other algebras, we can and must use
parentheses to force the intended order
for operations. For regular expressions,
the order is star, then concatenation,
then plus. [inaudible] of plus, the lowest
[inaudible] operator is analogous to
addition in arithmetic. Concatenation is
analogous to multiplication, and star as
analogous to exponentiation. Although, in
the case of star, [inaudible]. No power to
which its argument is raised. In a sense,
star means raised to all powers. For
example. The regular expression zero one
represents the concatenation of the
language consisting of one string which is
zero. And the language consisting of one
string one. The results of the language
containing one string zero one. In
general. Any string of symbols as a
regular expression represents the language
that contains only that one string. The
language of expression zero one plus zero.
That's this. Is the union of the language
containing only the strings zero one and
the language containing only the string
zero. The language of zero concatenated
with one plus zero, that's this
expression, is the two strings. 01 and 00.
Notice that we need parentheses to force
the plus the group first. Without them
since concatenation take precedence over
plus we get the interpretation of the
second example. That is this one and
obviously you get somewhat different
languages. The language zero star is the
star of the language containing only the
string zero. This is all strings of 0's
including the empty string. Here's a
little more complicated example. It
denotes the language that we've been
playing with when we talked about
automana, that is, all strings of 0's and
1's without two consecutive 1's. To see
why this works in every such string, that
is any string in the language, each one is
either followed immediately by a zero.
Where it comes at the end of the string.
This part of the expression. Zero plus
one, zero, star. Denotes all strings in
which every one is, is followed by a zero.
Okay. These strings are surely in the
language that we want, but we also want
those strings that are followed by a final
one. Thus we can concatenate the language
of zero plus one, zero star. With. The
union of two languages. One is epsilon,
and concatenating with epsilon, just give
us the same strings. That is, that is
those that don't have a final one. And we
can also concatenate with the language
containing the string one. That gives us
any string that, in which every one is
followed by a zero, and then it also has a
final one, okay? We're going to show the
regular expressions, and find exactly the
regular languages. We already have three
equivalent representations to the regular
languages, DFAs, NFAs, and Epsilon NFAs.
We need to show that for every regular
expression, there is some automaton that
defines the same language. And for this
job, we may as well pick the most powerful
of the three varieties of automaton, the
Epsilon NFA. For the other direction, we
need to show that every regular language
is defined by some regular expression.
Here, we may as well with the most
restrictive variety of automaton, the DFA.
We'll begin with, the process of how you
convert a regular expression to an epsilon
NFA. The proof is an induction on the
number of operators, the operators of
course being the plus concatenation and
the star, that appear in the regular
expression. And we're always going to
construct an automaton that has a special
form which I'll show you on the next
slide. The special fall of epsilon NFA we
construct as suggested by this sketch.
They can be any number of states in the
middle, but only one starts [inaudible] as
true for any automaton. And only one final
state, which is a restriction. More
importantly, as we build larger automata
from this from smaller ones, we never
allow an arc into the middle or into the
final state. The only outside arcs must
come into the star state. That is, we
might add arcs like that but you can't add
an arc that goes with state that's inside.
Okay. Likewise you can't add an arch that
goes to the final state. [inaudible] We
never have an ark from any of these
states. Except the final state. To some
place on the outside. So you can't install
some arc leaving like that. And notice
that we put quotes around the term final,
in final state, because although this
state is, is final if this is the entire
automaton that we construct. If it's a
piece of a larger automaton, then the
final state will no longer be final in
the, ... >> Allow the larger automaton. >>
Okay. >> The induction is on the number of
operators in the regular expression. And
here are the bases cases, the expressions
with zero operators. Okay, a regular
expression without operators has to be one
of the bases cases from the definition or
regular expressions. If the expression is
a symbol A, then the epsilon [inaudible]
we need consists of only a start and a
final state, with an arc labeled A from
start to final. Obviously, this automaton
accepts the language of the regular
expression A. That being only the one
string A. If the expression is epsilon,
then it is sort of the same, except the
arc is labeled epsilon. And if the
expression is the empty set symbol, and we
just have the start and the final state,
and no way to get from one to the other,
okay? Now what do you get in the
induction? There are three cases,
depending on whether the outermost
operator's union concatenation is star.
Here's the instruction for union. Suppose
our expression is E1+E2. Both E1 and E2
have fewer operators than the entire
expression. So the inductive hypothesis
applies to these sub expressions. We may
thus assume that there is an epsilon NFA
for, of the desired form for E1 and
likewise for E2. We form the epsilon NFA
for E1 plus E2 by introducing a new start
state and a new final state. The old final
states are no longer final. There are
epsilon transitions from the new start
state to each of the two old start states,
and there are epsilon transitions from the
old states to the new final state. If
there's a path from the new start state to
the new final state, it must consist of
epsilon transitions at the beginning and
end, with a path that is either wholly
within E1 or wholly within E2. So for
example the path might look like, this go
here and then we go around to go through
the same statement many times finally
comes the old final state of E1 and then
out to the new final state. Notice the
fact that there is no way to get into the
green thing labeled for E1. And no way to
get out of it except as I've shown.
Guarantees that the path must from, the
path from here to here must either go.
Through E1 or through E2. There are no
other options. ?Kay. Here's the
construction for concatenation of E1 and
E2. Again we assume by the inductor
hypothesis that there is an Epsilon NFA of
the right form for E1 and dido for E2. We
add an epsilon arc from the final state of
the automaton for E1. And that state, of
course, then, is your final. The start
state of the automaton for E2. The start
state with the bigger automaton is the
start state of the automaton for E1. And
the final state is the final state of the
automaton for E2. The new automatons the
concatenation of the languages. E1 and E2.
Any path from the start to final state
must first pass through E1 but as it will
do something like this Then it will take
this epsilon arch and then it will do
something in two. There is no other option
because you can't get into or out of the
green areas except as shown. Finally,
here's the construction for star. Suppose
we have an automaton for some expression
E, and we want to construct the automaton
for E star. We add a new start and final
states, and the old final state is no
longer final. There's an epsilon arc from
the new start state to the new final
state, so epsilon is always in the
language of this new automaton. There are
three other epsilon arcs as shown. The
first of these epsilon arcs, brings us to
the start state for, the automaton for E.
We must reverse the automaton for E,
following a path whose label is any string
in the language for E No. When we reach
the old final state of this automaton,
they have a choice. We follow the
[inaudible] r in the new final state. Then
we're done. We've found the path. That is,
has a label that is one string from the
language of e. Or, we have another option.
We can go from the old final state back to
the old start state. Maybe take another
path to there. Now we have a path that has
a label that is the concatenation that is
two strains from the language of E. And we
can go back again. Take another path.
Make, a label of that path would have now,
concatenation of three strings for me, and
so on. We can do this as many times as we
like. But then, finally, we go off to the
final state, new final. And, we are, done
with that. So, this automaton really does
have a language that's E star. You can go
through the automaton for E. Zero times by
just, just going around it. Or you can go
through it once, twice, three times as
many times as you'd like. Now we're going
to do the construction the other way.
We'll start with a DFA and construct a
regular expression to find the same
language. The proof is inductive and to
make the induction work, we need to assume
the states are named one through N.
There's no harm in making this assumption
since the names of the states do not
influence the language in automanotic
sets. The induction is on the maximum
number of a state that is allowed to be in
the middle of a path. An idea that is
explained on the next slide. We're going
to talk about k-Paths, which are paths
that can go from any state to any state.
But in the middle that is excluding the
endpoints only states numbered K or less
can be found. The inductive on k, and it
states that there is a regular expression,
whose language is the set of labels of all
k-paths who state I to state j. We start
with 0-Paths, paths as a basis, and by
the time we get to n-Paths, there's no
restriction on, on paths at all, so we
have a regular expression describing the
language of the automaton itself. Notice
that an n-Path represents no restriction
on paths at all since there are no states
whose names are higher than N. To get a
regular expression for the language of the
DFA we take the union of the expressions
for the n-Path that describe how you get
from the start state to each of the final
states. For example, he's a little
automatoned. The 0-Paths from state two
to state three are the language of the
regular expression zero. The reason is
that we can only follow an arc from two to
three, and there's only one. The one that
has label zero. That is, that's the zero
is describes the only path that gets the
state two to three without going through
any another state. 1-Path from two to
three we can go through state one but not
through two or three. Plus we can still follow
the arch from two to three label zero
but we can also follow the arch labeled
one from state two to state one. That's this.
And from there, the arc labeled one.
From state one to state three.
Notice that once we get to state one,
we cannot go back to two in a 1-Path.
And if we go to three, we're finished, the path
must end. The 2-Paths from state two
to state three are more varied. The only think
it cannot do is pass through state three.
Thus, from state two, we can go to state
one and back to two zero or more times.
The labels along
each such path is 1-0. So 1-0 star.
Describes the labels we reached by
following this path any number of times.
Including zero times. After oscillating as
many times as we wish. In warming up in
state two. We can then follow the zero arc
to three. That's that part of the
regular expression. This path from state
two to state three are described by the
regular expression. One zero Star zero.
That's this whole part. An alternative
plan is to oscillate between state two and
one. Winding up in one. The labels of all
these paths are described by one, that is
you can to state one for the first time,
and then you can go 01, 01 as many times
as you like including, including zero
times. Finally though you have to follow
the path from one to three, the arch from
one to three, in fact. And, so one
followed by one, sorry, one followed by
zero one star, followed by another one,
describes all of those paths that start in
two. Go from one to two several times.
Wind up in one. And then go to three. The
regular expression for the labels of all
2-Paths. From state two to state three.
Is the union of these two expressions that
we have just described. The 3-Paths
from state to state three also have a
regular expression. But it is
sufficiently convicted that we're going to
save that example until we have seen the
complete construction. Okay. Now, here's
the induction of how you, go from a DFA to
a regular expression. Okay. The basis is K
equals zero. Notice that a 0-Path. Can
have no intermediate states at all, plus
the set of 0-Paths from the state I to
state J, can only consist of a single arc
plus, in the special case that I=J, no
arcs at all. We can construct a regular
expression for this language. Connect with
pluses each of the labels and in addition,
if I=J, then add an epsilon. So for
example, here's state I, here's state J,
and suppose it has an arc, labeled A, B,
and C. Then it's regular expression is
A+B+C. And if there's also an arc, let's
say, from I to I, in addition, let's say
that there is a loop with labels D and E
on the state I. Then regular expression for
the paths from I to I would be D+E+epsilon.
We now need to introduce some
notation. Let Rij super K, be the regular
expression for the set of labels of the K
paths from state I to state J. We've
really seen the basis case, K equals zero,
that is Rij super zero is the sum of the
labels on the arc from state I to state J.
If there's no such arc, then use the empty
sets symbol as the regular expression, but
we add epsilon if I equals J. Use our
example DFA again. R one two super zero
equals zero, since that is the label of
the arc from the state one to state two,
and of course the state one is not equal
to state two. Now consider our R-1-1 super
zero. There's no arc from state one to
itself, so we start out with the empty
set. However, since the beginning and the
end states are the same, we have epsilon.
Okay. And incidentally, notice that
there's an algebraic law. That is, the
empty set union anything is that other
thing. So, whenever you see an empty set
symbol plus something else, you can just
get rid of the empty set and plus, and
just take what's. Now we shall do the
inductive step where we assume we've
written all the regular expressions with
super K minus one, and we need to write
the expressions for the K paths. A K path.
I would never go through state K at all or
does so one or more times. That let's us
write the expression for Rij super K in
terms of expressions for K minus one
paths. The first term covers the case that
the K path doesn't even go through the
state K once. That is every K minus one
path is a K path. All other k pads. Gets
to state k at least once. Thus, we
start out as Rik super k minus one. Which
generates us the labels of all the pads
that get us from state I to state k for
the first time. Then comes Rkk super K-1,
all starred. It generates the labels of
all paths to go from state K to state K
zero or more times with those paths not
passing through any state K or higher.
Finally, we, we concatenate with the
language of the expression Rkj super K-1.
That generates the labels of all paths
from state K to state J that do not pass
through state K or higher. Here's a
picture of what a k pad looks like with
the vertical axis representing the state
number. We've shown I and J above K.
Although they can be at any height. That
is for example I could be down here. So
could J, could be down there. Or they can
be higher. Doesn't matter. So one
possibility is that the K path never goes
through state K. Or it could go through
state K for the first time but only to a
lower states. And from k to k, you could
go get through lower states zero or more
times and, finally, you go from k to j
through lower number states, number states
numbered lower than k and that represents
all the paths that go from I to j through
k and lower numbered states. Finally, as
we have hinted, the regular expression
with the same language is the given
DFA is formed by taking the union
overall final states "J" of "Rij super
N" where "i" is the start state.
So here's, here our example of DFA again.
Now we have to install a start state and some
final states. So let's assume that state
two is the start state and three is the
only final state. So the regular
expression that we want is just R23 super
three. Okay. Here's the expression for r23
super three in terms of the super two
expressions. This expression is special
because state three appears in two places
as both k and j and as a result we can
simplify it as shown to r23 super two
concatenated with r33 super two star. The
reason is that if e is any regular
expression, such as r33 super two. Then. E
star E. Is the concatenation of one or
more Es. Okay. Now we have. Another
expression are 2,3, super two. That you
can think of as, itself concatenated with
none of the e's. Remember e is actually
r33 super two. So this is r23 super two
concatenated with no e's. This is R323
Supertube concatenated with E<i>E, that's
one or more Es. So together you have R23</i>
Super two in candidate with zero one more
Es and that's what that expression is, is
saying. Again, remember, E is R33
Super two. Okay, so, we figured out what R
23 super two is, we did that earlier, so
lets just use it here. And here's an
expression for R33 super two. I won't give
the details but remember that this
expression has to represent paths from
three to three that never go through
three. So you could only jump from one to
two and back again until it returns to
three. And here's the simplified
expression composed of R23 super two and
R33 super two. That last being starred.
So, what you have at the end is a regular
expression whose language is the same as
the regular expression of this automaton
order. Each of the three types of
automatons, the DFA, NFA, and the epsilon
NFA that we discussed, and regular
expressions as well, define exactly the
same set of languages, regular languages.
That is, if you remember, we started with
the DFA. And we show that any nfa could be
converted to dfa. That's the subset
construction. We then show that any x one
nfa could be converted to an ordinary nfa.
That was the closure construction. And
earlier in this lecture, we showed that
any regular expression can be converted to
an epsilon NFA and now we just showed that
every DFA can be converted to a regular
expression. That says any language defined
by any one of these four notations is
defined by all the others. Okay, we're
going to finish up with a little
discussion with the algebraic laws for
regular expressions. We have several times
argued that two different regular
expressions represent the same language
and therefore one can be substituted for
the other. Now these are examples of the
sort of algebraic laws that we find in the
algebra of regular expressions of
arithmetic, Boolean algebra, and other
algebras. The laws for regular expressions
are not too different from those for
arithmetic, but we have to be very careful
to observe the differences. First, plus
and concatenation behave almost like plus
and times for arithmetic. The plus
operator which remember, is a set union,
is associative and communicative just like
addition. Remember, ultimate, the
commutative law says that X plus Y equals
Y plus X. That is the order of the
arguments doesn't matter. The associative
law says that you can group the operator
in any order. Or formally, x plus y
grouped plus z equals x plus the grouping
of y plus z. That is, you can apply plus
to the first two arguments first, or apply
it to the last two arguments first. It
doesn't matter, the result is the same.
Concatenation, like multiplication, is
associative. Moreover, concatenation
distributes over union just like
multiplication distributes over addition.
That is x. Concatenated with y plus z.
Equals xy plus xa. That is xy plus z
equals xy plus xz. That is the
distributive law. And the big difference
between regular expressions in arithmetic
is that concatenation is not commutative.
Although multiplication obviously is. That
is, the regular expressions, AB and BA,
[sound] don't denote the same languages.
That is, the first AB just denotes the
language that contains the string AB,
while the second, BA, denotes the language
that contains only the string BA. Okay.
Arithmetic also has certain constants that
serve as identities for addition and
multiplication. Obviously zero is the
identity for addition, and that adding
zero to any number gives you that number
back. Likewise,1 is the identity for
multiplication because multiplying any
number by one, gives you that number.
Finally zero is also the annihilator for
multiplication because multiplying any
number by zero gives you zero. Union and
concatenation have analogous identities as
we should see the identity for union is
the annihilator of concatenation just like
the identity for addition is the
annihilator for multiplication. Now, the
identity for union as we actually have
mentioned is the empty set. Obviously
taking the union of the empty set with any
set yields that set. The language
containing only the empty string is the
identity for calculation. Calculating any
string as with the empty string yields as
you can calculating the language
represented by the regular session r
[inaudible] with the other language say
the line represented by [inaudible] would
result in the same language the one
represented by r. Finally the empty set is
the annihilator from concatenation. That
is we take any language, say the language
represented by regular expression R and
concatenate it with the empty language, we
get the empty language. Why? The result of
the concatenation requires that we take
strings from both languages and
concatenate them in all possible ways. But
you can't find any strings in the empty
language to use in the concatenation of
strings.