Today, we're going to give some
hints about how regular expressions are
used in practice. I'm going to mention
some of the extensions that are found in
various Unix commands. I'll also talk
about some specifics of text processing
algorithms, and concentrate on the
important task of lexical analysis: the
part of a compiler that looks at the entire
program being compiled and breaks it up
into tokens, which are sequences of
characters that logically belong together.
Many systems use regular expressions of
some sort to describe patterns. Often
these are embedded in the proprietary code
of companies, but there are also some quite
visible ones such as a number of UNIX
commands. I'm going to tell you one
particular story involving proprietary
software before moving on to generalities
regarding text processing. Junglee was a start-up founded by three of
my PhD students and two others in 1994. At
the time, the web was really new, and they
had the idea of building systems to
integrate web pages into more useful
products, and were doing that for about
two years when they got a contract from
Yahoo to build a facility that would let
Yahoo visitors search for books and get a
table comparing the price, the delivery
charge, and the delivery delay at different
booksellers. Immediately upon deployment
of that product, Amazon bought Junglee
in order to shut down the comparison
shopping capability. Interestingly, Amazon
did not understand that I bought from them
not because they were the cheapest, but
because I was confident that they would
deliver what I asked for, and deliver it
on time. And apparently, I was not alone
in that thinking. But the world of online
commerce was new, and Amazon could not be
sure of the impact of automated comparison
shopping. As an interesting aside,
Amazon actually got good value for
Junglee, because one of its founders,
Anand Rajaraman, while at Amazon, was the
inventor of Mechanical Turk. But the first
paid work that Junglee got was a
contract from the Washington Post to
produce an online table of the employment
opportunities offered by companies that
were placing print classified ads
at The Post. This job was not trivial.
Junglee had to go to several hundred
websites and extract the information about
each job automatically. If you look at one
site, you can probably figure out how to
do it: which links to follow from the home
page to get to the employment pages, what's
there, and how to navigate the HTML
source text to find the critical
information about the job, such as the
title and salary. But you need to do this
for each site. And to make matters worse,
the Junglee guys discovered that
these sites were evolving. That is, not
only did the jobs change, but the structure of
the pages, or even the whole website,
changed. The result was that approximately
once a week, the reader for any given site
would break and have to be redesigned. So
the Junglee guys developed a regular
expression language for describing how to
navigate a website to extract the
information they needed. The input symbols
were the usual characters such as letters,
so they could look for important words
like salary. They also treated HTML tags,
like OL (that's this), as input symbols,
because they gave important hints about
the page structure. For instance, a page might
say, "send salary requirements to," but that
doesn't indicate the salary for a
particular job. But when you get to a list
of jobs, and that is indicated by the
ordered-list tags, that's the OL,
only after that does "salary" mean that the
number following is the salary for a job.
Another important kind of input element
is a link, which indicates that it's
necessary to move to another page. Once
this regular expression language was
implemented, it was easier to write a
regular expression to find key information
like title or salary than to write the
code to process web pages directly. Thus,
there was an increase in productivity as
they added sites to their database.
Moreover, when a site changed, it was
relatively easy to modify the regular
expression to track the changes in the
site. The architecture of this system
developed at Junglee appears in many
places. The input language is regular
expressions plus actions which are
arbitrary code executed when the regular
expression is recognized. In this case,
the actions would be things like return
this integer as a salary. The regular
expression is compiled into a DFA, or
sometimes into an NFA that is simulated,
effectively by executing the lazy version
of the subset construction as needed.
Simulation of the DFA or NFA goes on
exactly as we have described it
theoretically. Each input symbol causes a
state change and nothing more. The magic
happens when an accepting state is
reached. Then the associated action is
executed and this action allows the
regular expression processor to interact
with the rest of the world in some useful
way.
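To make that architecture concrete, here is a minimal sketch in Python of the regular-expressions-plus-actions idea. The patterns, the salary example, and the scan routine are all invented for illustration; this is not Junglee's actual code, and a real system would compile the patterns into a single DFA rather than trying them one by one.

```python
import re

# A minimal sketch of the "regular expressions plus actions" architecture.
# The patterns and actions below are invented for illustration.
rules = [
    (re.compile(r"<ol>", re.IGNORECASE), lambda m: print("entering a job list")),
    (re.compile(r"salary[:\s]*\$?(\d[\d,]*)", re.IGNORECASE),
     lambda m: print("salary found:", m.group(1))),
]

def scan(text):
    # Simulate the automaton over the input; when a pattern is matched
    # (an accepting state is reached), execute its associated action.
    pos = 0
    while pos < len(text):
        for pattern, action in rules:
            m = pattern.match(text, pos)
            if m:
                action(m)          # the action interacts with the outside world
                pos = m.end()
                break
        else:
            pos += 1               # no rule matched here; advance one character

scan("<OL> <li>Junior Engineer, salary: $55,000</li>")
```

The point is only the shape of the system: a state change per input symbol, and an action fired whenever a regular expression is recognized.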
We're now going to talk about the Unix extensions to regular expressions.
There are many commands in Unix that have
some sort of regular expression notation
for input. An important example is the
"grep" command which stands for Global
Regular Expression and Print. Most of the
regular-expression-based languages, even
though they have extensions to the
notation we've covered so far, define
only regular languages. There are also
some commands that have additional
extensions that allow non-regular
languages to be recognized, but we're not
going to go into these features.
Incidentally, it is no coincidence that
regular expressions figure heavily into
the original Unix commands. Before he did
Unix, Ken Thompson had worked on a system
for processing regular expressions by
converting them only as far as an NFA and
simulating the NFA in code. We're now
going to meet some of the Unix
extensions to regular expressions. You can
put brackets around any list of characters
as a shorthand for the same characters
connected by plus signs in the notation we
have used until now. You can also describe
a sequence of symbols that are consecutive
in the ASCII order of characters by giving
the first character, a dash and then the
last character. For example [a-z], that's
this, stands for any lower case letter
because the lower case letters have
consecutive codes in ASCII. You can
represent any letter by putting a dash
between lower case a and z and the same
for upper case, that's this, [a-zA-Z]. Notice that
the upper and lower case letters are not
consecutive in ASCII, so you cannot
represent these 52 characters with a single
range. Incidentally, as we see, characters
like brackets and dash have special
meanings in Unix regular expressions. So
if you want to use one of these characters
as itself, you need to precede it by a
backslash. So, for example, backslash left
bracket is a real left bracket. It's not
the left bracket that you find in
a range expression. And the character dot
or period is shorthand for any character.
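As a quick illustration, here are those conventions shown with Python's re module, which follows the same Unix-style rules for character classes, ranges, escapes, and the dot.

```python
import re

# Unix-style character classes, ranges, escapes, and dot.
print(re.findall(r"[a-z]", "Ab3c"))        # ['b', 'c']  any lowercase letter
print(re.findall(r"[A-Za-z]", "Ab3c"))     # ['A', 'b', 'c']  any of the 52 letters
print(re.findall(r"\[", "a[b]"))           # ['[']  backslash makes '[' literal
print(re.findall(r"a.c", "abc a-c a\nc"))  # ['abc', 'a-c']  dot: any character (except newline)
```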
Here are some more Unix changes to the
regular expression notation we learned.
Okay, the union operator is actually
represented by a vertical bar, instead of
the plus. But the plus symbol is another
operator used like star, and meaning one
or more. That is, in the Unix notation,
E+, that's that, is shorthand for E
concatenated with E*. So for example,
[a-z]+ means one or more lower
case letters. The question mark operator
is also used like star but means zero or
one of. That is, E? is shorthand for E plus
epsilon. So for example, [AB]? means an
optional A or B. We would write it in our
original notation as A plus B plus
epsilon.
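Here are those three operators again, shown with Python's re module for concreteness.

```python
import re

# '|' for union, '+' for one-or-more, '?' for zero-or-one.
print(re.fullmatch(r"cat|dog", "dog") is not None)   # True   union
print(re.fullmatch(r"[a-z]+", "hello") is not None)  # True   one or more lowercase letters
print(re.fullmatch(r"[a-z]+", "") is not None)       # False  '+' needs at least one
print(re.fullmatch(r"[AB]?x", "Ax") is not None)     # True   optional A or B
print(re.fullmatch(r"[AB]?x", "x") is not None)      # True   the optional part may be absent
```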
You may remember our DFA example for recognizing strings that end in "ing".
It involved a lot of explanation as we
considered where to go from each of the
states that represented some progress
toward finding "ing". However, there is an
easy regular expression for the same
language using the Unix dot: it's just .*ing,
like that. Okay. Or even if we
didn't have the dot in our notation, we could
simply replace it by all the legal input
symbols connected by pluses. In fact, it's
even much easier to design an NFA for this
language than it is to design a DFA. Okay,
here is an NFA. Essentially, it guesses
when it has seen the 'i' that begins the
final "ing". Thus, even on an input like the
first 'i' in skiing, it can remain in the
start state. That is, it can go from the
start state to itself on an 'i' if it likes.
And in fact, it always does, because
the NFA always does everything. Anyway,
it can remain in the start
state on the first 'i' and then only travel to
the right, that is here, on the second 'i'.
There's no need to worry about what to do
in a state like the one that represents
the discovery of 'i' and 'n': where do you go
if the next input is not 'g'? It doesn't
matter, because you'll still be here, and
another sequence of states will get you
where you actually need to go.
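Here is a small sketch, in Python, of simulating that NFA by carrying along the set of states it could be in; the state numbering is an assumption made for the example.

```python
# Simulating the NFA for "strings ending in ing".
# State 0 loops on every character and, on 'i', also guesses that the final
# "ing" has begun; states 1 and 2 track the 'n' and 'g'; state 3 accepts.
def ends_in_ing(s):
    current = {0}                       # set of states the NFA could be in
    for c in s:
        nxt = set()
        for state in current:
            if state == 0:
                nxt.add(0)              # stay in the start state on anything
                if c == 'i':
                    nxt.add(1)          # guess this 'i' starts the final "ing"
            elif state == 1 and c == 'n':
                nxt.add(2)
            elif state == 2 and c == 'g':
                nxt.add(3)
        current = nxt
    return 3 in current                 # accept if some guess reached state 3

print(ends_in_ing("skiing"))   # True: the NFA stays put on the first 'i'
print(ends_in_ing("singer"))   # False
```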
Now let's talk a little bit about lexical analysis,
the breaking of an input program into its
basic units called tokens. For example,
every identifier in a program is a token.
Likewise, the reserved words like if or while
are tokens. Many single characters are
tokens by themselves. In typical
programming languages, semicolon is a token
used for separating statements, plus is a
token indicating addition, less than is a
token by itself indicating the less than
comparison operator. There are also
multi-character operators, such as the less-than
sign followed by the equal sign, which together indicate
the less-than-or-equal comparison. There are
tools, like Lex, or its open source version
flex, that let you write a regular
expression for each token. You can also
provide a piece of code as an action to be
executed, when an instance of that regular
expression is recognized. For example, the
code for when an integer is found might
simply return that integer. As an example,
the expression for identifiers might be
the one shown here. That's this. Using the
Unix notation, this expression describes
identifiers as any letter, that's this,
followed by zero or more (that's the star) letters or
digits. In many languages, identifiers
allow some more options, for example,
underscore might be included as if it were
another digit so it would just appear in
this list here. Okay, underscore. In Lex
you write an action, which is arbitrary
code to be executed when the regular
expression for a token is matched. In the
simplest cases, all this code does is
return an integer code representing the
token found. But the action might be much
more complicated. For example, if an
identifier is found, the action might
involve installing that identifier in a
symbol table where all identifiers used by
the program are stored.
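Here is a rough Python sketch of that style of lexical analyzer: one regular expression per token, each paired with an action. The token names, codes, and symbol-table scheme are invented for illustration, and a real Lex-generated scanner compiles everything into a single automaton instead of trying the patterns one at a time.

```python
import re

# One regular expression per token, each paired with an action.
symbol_table = {}

def identifier_action(text):
    # Install the identifier in the symbol table, then report the token.
    symbol_table.setdefault(text, len(symbol_table))
    return ("ID", text)

token_rules = [
    (re.compile(r"[A-Za-z][A-Za-z0-9]*"), identifier_action),
    (re.compile(r"\d+"), lambda text: ("INT", int(text))),
    (re.compile(r"<="), lambda text: ("LE", text)),   # listed before '<' so the longer token is tried first
    (re.compile(r"<"),  lambda text: ("LT", text)),
    (re.compile(r";"),  lambda text: ("SEMI", text)),
]

def next_token(source, pos):
    for pattern, action in token_rules:
        m = pattern.match(source, pos)
        if m:
            return action(m.group()), m.end()
    return None, pos + 1    # skip characters (e.g. blanks) no rule matches

tok, pos = next_token("count <= 10;", 0)
print(tok)                  # ('ID', 'count'), and 'count' is now in symbol_table
```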
When building a lexical analyzer using regular
expressions for the tokens, there are some
ambiguities that need to be
resolved. Two examples are illustrated here.
For one, reserved words like "if" also match
the expression for identifiers. But "if"
is not a legal identifier. So we have to
make sure that the lexical analyzer does
the right thing on "if". Okay. For another,
when we see less than, we don't
immediately know if it's a token by itself
or a part of a larger token which would be
less than or equal in this case. We need
to make sure we don't prematurely declare
victory and return less than, when less
than or equal, is intended by the
programmer. A good architecture for
building lexical analysis code from
regular expressions is to start by
converting each regular expression to an
Epsilon NFA. Each of these Epsilon NFAs
has its own final state, with which the
action for that regular expression is
associated. We combine all these epsilon
NFAs by introducing a new start state. The
start state has epsilon transitions to the
start states for each of the NFAs. Looks
something like this. Here's the new start
state. Here are all the old start states
and their automata. And we've just put
transitions on epsilon to each one of
them. Okay. All the final states of the
NFAs remain final and they each have their
associated actions. So for example, these
are all final states. An NFA can
have several final states, in fact. After
that combination we convert to DFA or
perhaps an NFA without epsilon transitions
which we will then simulate.
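Here is a small sketch of that combination step in Python; the data representation (a dictionary per NFA, with None standing for an epsilon transition) and the two tiny example NFAs are just one possible choice made for the example.

```python
# Combine one epsilon-NFA per token under a new start state.
# Each NFA is a dict: {state: {symbol: set_of_next_states}}.
def combine(nfas):
    combined = {"start": {None: set()}}    # fresh start state, None = epsilon
    finals = {}                            # final state -> its action
    for name, (nfa, start, final, action) in nfas.items():
        # copy the token's NFA in, tagging states so they stay distinct
        for state, moves in nfa.items():
            combined[(name, state)] = {sym: {(name, t) for t in targets}
                                       for sym, targets in moves.items()}
        combined["start"][None].add((name, start))   # epsilon edge to its start
        finals[(name, final)] = action                # keep its action attached
    return combined, finals

# Example: one tiny NFA recognizing "if", one recognizing a single digit.
nfas = {
    "IF":    ({0: {"i": {1}}, 1: {"f": {2}}, 2: {}}, 0, 2, lambda: "IF"),
    "DIGIT": ({0: {d: {1} for d in "0123456789"}, 1: {}}, 0, 1, lambda: "DIGIT"),
}
combined, finals = combine(nfas)
print(combined["start"])    # epsilon transitions to both old start states
```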
Okay, we need to give the regular expressions an
ordering, and this ordering determines the
priority among actions. A typical
ordering puts all the reserved words
ahead of identifiers, so that way, if the
DFA discovers that the next token is if,
it in principle does not know whether to
execute the action for the reserved word
if, or for identifiers, or both. But the
fact that if takes priority over
identifiers says that this token should be
treated as a reserved word and not as an
identifier. However to make all this work
right, the DFA needs a special
ability: the ability to take an input
symbol that has been read and put it back at the
front of the input stream. That input will
then be read again, typically the next
time the lexical analyzer is
told to look for the next token. Here's an
example of why the ability to restore an
input symbol just read back to the front
of the input is important. Suppose the
lexical analyzer is told to find the
next token and the first character it
reads on the input is the less-than
symbol. It has to read the next input, and
if that input is an equal sign then we
can be sure the token is less than or
equal. But, if the input is anything else,
for example a blank, a letter, or a digit,
then that character must be put back on
the input and less than, by itself,
declared to be the token. For another
example if the lexical analyzer has read
the characters 'i' and 'f' in its quest for a
token, it might have seen the reserved word
"if". But we won't know until it reads the
next character if that character is a
letter or a digit, anything that can
extend that identifier, then we do not have
the reserved word "if", we have a longer
identifier. However, if the next character
is not one that can extend that identifier
such as a blank or a left parenthesis,
then we really do have the reserved word
"if". In that case, the next character must
be pushed back onto the front of the input.
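Here is a sketch, in Python, of that read-ahead-and-push-back behavior for the two cases just described: less-than versus less-than-or-equal, and the reserved word "if" versus a longer identifier. The class and the token names are invented for the example.

```python
# Read ahead one character and push it back when it is not part of the token.
class Input:
    def __init__(self, text):
        self.chars = list(text)
        self.pos = 0
    def read(self):
        c = self.chars[self.pos] if self.pos < len(self.chars) else None
        self.pos += 1
        return c
    def push_back(self):
        self.pos -= 1            # the character will be read again next time

def next_token(inp):
    c = inp.read()
    while c == " ":
        c = inp.read()           # skip blanks
    if c == "<":
        nxt = inp.read()
        if nxt == "=":
            return "LE"          # the two characters together are one token
        inp.push_back()          # not '='; restore it and declare '<' the token
        return "LT"
    if c is not None and c.isalpha():
        word = c
        nxt = inp.read()
        while nxt is not None and (nxt.isalpha() or nxt.isdigit()):
            word += nxt
            nxt = inp.read()
        inp.push_back()          # the character that ended the identifier goes back
        return "IF" if word == "if" else ("ID", word)
    return c

inp = Input("if x<=iff")
print(next_token(inp), next_token(inp), next_token(inp), next_token(inp))
# IF ('ID', 'x') LE ('ID', 'iff')
```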