So that's regular expression pattern matching, which we implement by constructing and simulating a nondeterministic finite-state automaton. It is really one of the most ingenious algorithms that we look at in this course. Next we'll look at some applications.

The most famous application is called grep, and it's what Ken Thompson implemented in the original Unix. It was a very important tool for programmers working on the really tiny computers available at that time: being able to ask questions of this type about programs really mattered, and it enabled the development of bigger and bigger computational infrastructure, like the Unix operating system that's still widely used today.

Grep is a simple command: take a regular expression as a command-line argument, and print the lines from standard input that have some substring matched by the regular expression. Programmers use it all the time, for example to figure out which variables they use in which parts of which programs. Many programmers still use grep every day.

Here's how we implement it using the code we've developed so far. The first thing is to create the RE that we're going to match. To make the problem more like substring search, we only care whether the pattern appears anywhere in the line at all, so we put ".*" before and after it, and we enclose the result in parentheses, because our code is simpler when the RE is enclosed in parentheses. We build an NFA from that regular expression. Then, as long as there is a line on standard input, we read the line and ask whether the NFA recognizes it: the constructor builds the NFA, and recognizes() runs the simulation. If the NFA recognizes the string, we print the line; if it doesn't (the accept state is not in the set of states reachable for that string), we don't. We do that for every line of the input. It's an amazingly simple implementation, yet conceptually quite elegant: an efficient realization of this basic process, and absolutely correct.
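Here is a minimal sketch of that grep client, assuming the NFA class (with its constructor and recognizes() method) developed earlier in the lecture, plus the StdIn/StdOut libraries used throughout the course:

```java
public class GREP {
    public static void main(String[] args) {
        // wrap the pattern so a match anywhere on the line counts,
        // and parenthesize it, since our NFA code expects that form
        String regexp = "(.*" + args[0] + ".*)";
        NFA nfa = new NFA(regexp);             // build the NFA from the RE
        while (StdIn.hasNextLine()) {
            String line = StdIn.readLine();
            if (nfa.recognizes(line))          // simulate the NFA on this line
                StdOut.println(line);
        }
    }
}
```

For the crossword application described below, an invocation might look like `% java GREP "f....n" < words.txt` (a made-up puzzle pattern), which prints every six-letter dictionary word that starts with f and ends with n.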
The bottom line on performance: in the worst case, it's the same as for the elementary substring-search algorithm that we looked at when we first started talking about string searching, and that's really remarkable. The brute-force algorithm that you might come up with, trying to match the single pattern string at every character position, takes time proportional to M times N in the worst case. But here we have a regular expression, which specifies an infinite set of patterns, and we can tell whether any one of that infinite set of patterns is matched by our string in the same worst-case time. It's really an amazing algorithm: that's Ken Thompson's grep.

Once you have grep available, crossword puzzles become a lot easier. It's standard in Unix to have a dictionary file, and if you can't find one on your system, we have one on the book site (words.txt) that simply has all the words in the dictionary, all lowercase, one per line. If you have a crossword puzzle where you're missing a couple of letters, you can just grep in words.txt with don't-care characters for those letters, and it will tell you the words you can put there: in this case, the ones that have exactly that many letters. That's a typical application, and I'm sure lots of people use it that way.

Now, for industrial-strength regular expressions, there are a lot of things that people expect nowadays. Our implementation doesn't have character classes, and it doesn't have the extra metacharacters like + and the other extensions of closure. There's the idea of capturing, which actually gets you the part of the string that matches the regular expression. And there's the idea of greedy versus reluctant matching. For example, take the regular expression <blink>.*</blink>, which describes the tags you'd find in HTML for blinking text, and apply it to a page containing two blink elements. There are two possibilities: so-called reluctant matching gives the tightest matches (here you'd get two of them, one per element), while greedy matching grabs as much text as it possibly can (one match spanning both elements). In different applications you might want different things. These features are not difficult to add to the basic code that we've developed, and an industrial-strength implementation has to have all of them.

Many programming systems have grep-like facilities with all of these features. Again, this originated in Unix in the 1970s, and nowadays every language has to have some kind of extended regular-expression facility. There are Unix commands like grep and awk, old programmers' tools like Emacs, and modern tools like Perl, PHP, Python, and JavaScript: all of these have regular-expression processing, and programmers demand this facility. In fact, Perl is an example of an entire language built around regular expressions, with command-line options to run a script on each line of input, replacement facilities, and many other things. It's certainly a worthwhile facility for any programmer to know, and many programmers use these tools every day.

Still, to go back to the slide that was filled up with one enormous regular expression: you want to use these things wisely. There are definitely computational tasks where it might be better to just write a Java program than to force the job into a regular expression. Every tool has its place; try not to do everything with the regular-expression language.

In Java, these facilities are built into various parts of the Java library, and understanding how the implementation works helps you understand how to use them. One simple facility is in Java's String library: if you have a string called input and another string called regexp, the matches() method returns a boolean that says whether the input string is in the language described by the regular expression. So a simple stub, which I call validate, takes a regular expression and an input string from the command line and tells you whether the string is in the language described by the regular expression. You can ask, say, whether ident123 is a legal Java identifier by giving it the regexp for legal Java identifiers that we looked at earlier, and so forth. That's one thing that's built into Java: in Java programs, you can use regular-expression pattern matching in that simple way.
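Here is a minimal sketch of that validate client; the class name and the identifier RE in the usage example are reconstructions of what appears on the lecture slides, while String.matches() is the standard Java library method:

```java
public class Validate {
    public static void main(String[] args) {
        String regexp = args[0];
        String input  = args[1];
        // matches() returns true if and only if the entire input string
        // is in the language described by the regular expression
        System.out.println(input.matches(regexp));
    }
}
```

For example, `% java Validate "[$_A-Za-z][$_A-Za-z0-9]*" ident123` prints true, while replacing ident123 with 123ident prints false.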
Another kind of task that's well handled by the Java implementation is the idea of harvesting: we want to print all the substrings of the input that match a given RE. Say we're interested in the fragile X syndrome: we want to harvest from the DNA all the patterns that start with gcg, end with ctg, and have any number of cgg or agg triples inside. Or maybe you want the same program to harvest all the email addresses from a web page. It's kind of amazing that the same program can work for two such diverse tasks; that's a testimony to the utility of this abstraction.

So how do we do harvesting in Java? Java has two classes, Pattern and Matcher, that implement regular-expression pattern matching essentially as we did, by separating the construction of the machine from the simulation. Here's what the code looks like for a harvester. We take our regular expression as the first command-line argument, and we set up an input stream from the second command-line argument, which can be a file or a web page, and read it all in. Then we use Java's Pattern class to build a pattern, which is essentially an NFA constructed from the regular expression. The Pattern class has a method called matcher() that essentially creates an NFA simulator for that text. That machine not only does matching the way ours did: it also finds each match and keeps track of what's called the group, the substring that caused the match. It adds the extra code to do these useful things, and it takes just one line of code for each of those facilities: create the NFA, create the simulator, find the next match, and get the substring that caused the match. That's an implementation of a harvester that can go through and do things like find fragile X indicators in a chromosome or collect email addresses from a web page. And of course, you could extend it to print out the indices where the matches occur, and other things. This is a very powerful facility to have in a programming language, and Java certainly has it.
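Here is a self-contained sketch of that harvester using java.util.regex; it reads its input from a file with standard Java I/O (the lecture's version uses the course's input library, which can also read from a web page):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Harvester {
    public static void main(String[] args) throws Exception {
        String regexp = args[0];
        // read the entire input file into one string
        String input = new String(Files.readAllBytes(Paths.get(args[1])));

        Pattern pattern = Pattern.compile(regexp); // in effect, build the NFA
        Matcher matcher = pattern.matcher(input);  // in effect, create the simulator
        while (matcher.find())                     // search for the next match
            System.out.println(matcher.group());   // the substring that caused the match
    }
}
```

An invocation like `% java Harvester "gcg(cgg|agg)*ctg" chromosomeX.txt` (the filename is made up for illustration) would print each fragile-X-style repeat found in the file.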
Now, there's a caveat about this that's really important, and it brings back an idea we talked about with hashing: the algorithmic-complexity attack, where someone who knows something about how a website or other facility is implemented can exploit that knowledge. It's possible to implement regular-expression pattern matching in a not-so-efficient way, and typically the implementations we see out in the wild, like the ones in Unix, Java, and Perl, do not guarantee quadratic performance the way ours does. They do not guarantee that the running time is proportional to the product of the length of the string and the length of the regular expression; in fact, they can go exponential. Going back to Kleene's theorem: it's relatively simple to prove, and it's not so difficult to develop an implementation that follows it directly, but such an implementation can take exponential time. You can try it yourself in Java or Perl: look for a match for a pathological regular expression in a string of the right form, and each time you add just a couple of characters to the string, the running time doubles. Going back to our lectures on algorithm performance: if adding a character or two doubles the running time, that means exponential time. So if you're using a facility with one of these exponential-time implementations, you're exposed. There's what's called the SpamAssassin regular expression, which takes exponential time on certain email addresses, and somebody can create trouble by sending such addresses to a mail server: the server will take exponential time just trying to determine whether the message is spam. By having an inefficient algorithm, a server like that is definitely subject to such an attack. Generally, you don't want exponential algorithms, and you particularly don't want them when arbitrary clients out there can use them to cause trouble for you.

Another pitfall is things that look like regular expressions but aren't actually regular expressions. It's common to support the back-reference notation \1, which matches the subexpression that was matched earlier. So (.+)\1 describes strings that are two copies of something with at least one character, like couscous. A similar one describes strings of 1s whose count is not prime. You can specify pretty sophisticated computations just with these little back-references, but there's a fundamental problem: the languages, the sets of strings, that you specify with such notation are not regular. That is, you can't write a true regular expression to specify them. For example, for strings of the form ww, for some string w, no regular expression describes all such strings. There are even scientific applications, like complemented palindromes in genomics, which are like palindromes except that a's are complemented to t's and c's to g's; that's another example. If the language is not regular, then Kleene's theorem doesn't hold, and there's no NFA that corresponds to it. In fact, as we'll see in a couple of lectures when we talk about intractability, nobody knows an efficient method with guaranteed performance for matching with back-references. So if you allow back-references, you're pretty much subject to performance attacks like the one we just mentioned.
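Both pitfalls are easy to try for yourself in Java. Here is a small sketch: the first part shows the back-reference from above, and the second runs a doubling experiment on a pathological RE (the specific regexp (a|aa)*b is an assumption, standing in for the one on the slide):

```java
public class Pitfalls {
    public static void main(String[] args) {
        // back-reference: (.+)\1 describes strings that are two copies
        // of some nonempty string, a language that is provably not regular
        System.out.println("couscous".matches("(.+)\\1"));  // true: "cous" + "cous"
        System.out.println("almonds".matches("(.+)\\1"));   // false

        // doubling experiment: Java's backtracking implementation takes
        // exponential time here, so each extra character or two roughly
        // doubles the running time (the last iterations take a while)
        for (int n = 25; n <= 45; n++) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < n; i++) sb.append('a');
            long start = System.currentTimeMillis();
            sb.toString().matches("(a|aa)*b");  // always fails: there is no 'b'
            System.out.println(n + " a's: "
                + (System.currentTimeMillis() - start) + " ms");
        }
    }
}
```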
And that's where we'll finish up this lecture. This is an amazingly elegant, efficient, and ingenious algorithm. It's based on the theory of computation that has been studied since the 1930s, and that theory is the basis for much of the programming that we do; it's really worth understanding, and grep is a wonderful example of why. It also takes us to the next level of algorithms: writing programs that translate a program to machine code. Actually, what we've looked at, both for KMP and for grep, are simple examples of compilers. For KMP, we took a string and built a DFA; for grep, we took an RE and built an NFA. And all javac does is take Java-language code and translate it to bytecode. It uses more complicated theorems and constructions from the theory of computation to get the job done, because the Java language is not regular either, but it really is worthwhile to think about the progression from KMP to a Java compiler. With a Java compiler, you have a program that you want to compile: you check whether it's legal, you produce bytecode, and then a Java simulator (the virtual machine) runs that bytecode, which is not so substantially different from what we just did for grep, with a stage along the way. It's quite interesting to see this kind of context for such a practically useful and important algorithm.

To summarize: from the standpoint of the programmer, we implemented substring search by simulating a DFA, and we implemented regular-expression pattern matching by simulating an NFA. From the standpoint of the theoretician, what's interesting about these abstractions is that DFAs, NFAs, and REs are equivalent in power: if you can describe a set of strings with one of them, you can describe that set of strings with any of them. Also interesting are the limitations: there are sets of strings, like palindromes, that you can't specify with a DFA or an NFA. And from the standpoint of a student in an algorithms class, it really is worthwhile to see that these principles are not just theory; they are of practical use, and they illustrate an essential paradigm that runs throughout computer science: find an intermediate abstraction that we can understand, make sure we pick the right one, and then write clients that use it and implementations that realize it. By building the right intermediate abstraction, we can solve important practical problems. Those are the lessons that grep teaches us.