1
00:00:03,520 --> 00:00:08,057
So that's a regular expression,
pattern-matching that we implement by

2
00:00:08,238 --> 00:00:12,412
constructing and simulating a
nondeterministic finite state machine.

3
00:00:12,412 --> 00:00:16,950
Really, one of the most ingenious
algorithms that we look at in this course.

4
00:00:17,131 --> 00:00:23,376
next we'll look at some applications. so,
the most famous is called, Grep and this

5
00:00:23,376 --> 00:00:29,835
is what Ken Thompson implemented in the
initial Unix. And it was a very, very

6
00:00:29,835 --> 00:00:37,602
important tool for programmers working on
the tiny really tiny, computers that were

7
00:00:37,602 --> 00:00:44,470
available at that time. it was really
important to be able to ask questions

8
00:00:44,470 --> 00:00:50,131
about programs of this type. and it
allowed the development of bigger and

9
00:00:50,131 --> 00:00:55,420
bigger computational infrastructure like
the Unix Operating System that's still

10
00:00:55,420 --> 00:01:01,326
widely used today. so Grep was a simple
command. what you want to do is take a

11
00:01:01,326 --> 00:01:06,876
regular expression as a command line
argument and print the lines from standard

12
00:01:06,876 --> 00:01:11,940
input to have some substring that's
matched by the regular expression.

13
00:01:11,940 --> 00:01:16,869
Programmers use that all the time to try
to figure out what variables they use in

14
00:01:16,869 --> 00:01:21,739
what part of what programming. many
programmers, use Grep every day still

15
00:01:21,739 --> 00:01:30,248
today. and here's how we implement it
using the code that you know, we've talked

16
00:01:30,248 --> 00:01:38,889
about so far. so first thing is, is create
the RE that we're going to match to make

17
00:01:38,889 --> 00:01:45,979
it more like subscreen search. We just
care for the things in there anywhere at

18
00:01:45,979 --> 00:01:53,867
all. So we put a dot star before and after
and we enclose it in parenthesis just cuz,

19
00:01:53,335 --> 00:02:00,603
cuz our code simplified, if our RE's
enclosed in parenthesis. Build an NFA from

20
00:02:00,603 --> 00:02:07,016
that regular expression. and then as long
as there's an align in standard input, we

21
00:02:07,016 --> 00:02:12,589
read the line and we ask if the NFA
recognizes it. So the constructor builds

22
00:02:12,589 --> 00:02:18,776
the NFA and recognizes is the simulation.
if one NFA on that string, if it's there,

23
00:02:18,776 --> 00:02:25,390
you print it out. if it doesn't recognize
it the final accept state is not in the

24
00:02:25,390 --> 00:02:31,079
substrateq you could get to, for that
string, it won't print it out. And we just

25
00:02:31,079 --> 00:02:40,896
do that for every line in the input.
pretty, amazingly simple implementation,

26
00:02:40,896 --> 00:02:49,809
although, quite elegant, conceptually.
It's a very, Elegant, and, and, ef ficient

27
00:02:49,809 --> 00:02:55,750
permutation of this basic process.
Absolutely correct. so, in the bottom line

28
00:02:55,750 --> 00:03:01,758
for it is in the worst case it's the same
as for the, elementary substring search

29
00:03:01,758 --> 00:03:07,567
algorithm that we looked at when we first
started talking about string searching and

30
00:03:07,567 --> 00:03:12,453
it's really amazing. the brute force
algorithm that you come up with of

31
00:03:12,453 --> 00:03:17,734
tri-matching the string, the single string
at every character position. The worst

32
00:03:17,734 --> 00:03:22,818
case is time proportional to M times N.
But here we have a regular expression

33
00:03:22,818 --> 00:03:27,703
which specifies an infinite set of
patterns. And we can tell if any one of

34
00:03:27,703 --> 00:03:33,295
those infinite set of patterns is matched
by our string in the same worst case kind.

35
00:03:33,479 --> 00:03:40,754
it's really an amazing algorithm that's
Ken Thompson, Grep. Once you have Grep

36
00:03:40,754 --> 00:03:46,982
available, then crossword puzzles become a
lot easier. a lot of it's standard in UNIX

37
00:03:46,982 --> 00:03:52,795
to a dictionary or if you can't find one
on your system, we have it on the book

38
00:03:52,795 --> 00:03:58,331
site that simply has all the words on the
dictionary, all the lower case one per

39
00:03:58,331 --> 00:04:04,589
line. and if you have a crossword puzzle
where you're missing a couple of letters

40
00:04:04,811 --> 00:04:11,864
you can just Grep for in words.txt for
and, and leave, don't care is for those

41
00:04:11,864 --> 00:04:17,951
letters, and it'll tell you the, the
things you can put in there. In this case

42
00:04:18,174 --> 00:04:24,113
it's the one that has exactly that many
letters. It's stricter. so that's a

43
00:04:24,113 --> 00:04:31,239
typical, application and I'm sure lots of
people use it that way now. now for

44
00:04:31,462 --> 00:04:38,026
industrial strength nowadays there's, a
lot of things that people expect when

45
00:04:38,026 --> 00:04:43,725
using regular expressions. our
implementation doesn't have character

46
00:04:43,725 --> 00:04:50,930
classes. it doesn't have the extra meta
characters like plus and other things. the

47
00:04:50,930 --> 00:04:58,205
idea of capturing which actually get the
part of the string that satisfies, that

48
00:04:58,205 --> 00:05:05,480
matches the regular expression of various
different extensions of closure. and then

49
00:05:05,480 --> 00:05:11,440
there's the idea of greedy versus
reluctant matching. So, for example, if

50
00:05:11,440 --> 00:05:19,423
you have a regular expression, blink.
/blink which you'd find in HML for the

51
00:05:19,423 --> 00:05:26,246
blinking text. there's two possibilities.
so called reluctant matching would be like

52
00:05:26,246 --> 00:05:32,658
the tightest match, matches that you'd
get. You h ave two of them but then, the

53
00:05:32,658 --> 00:05:38,659
greedy matching would get as much text as
it possibly could in different

54
00:05:38,659 --> 00:05:44,907
applications, you might want different
things. Capturing means give me the

55
00:05:44,907 --> 00:05:50,937
substring that matches. and so those
things are not difficult to add to the

56
00:05:50,937 --> 00:05:56,191
basic code that we've done and all that
has to be in industrial strength

57
00:05:56,391 --> 00:06:02,044
implementation. many programming systems
have Grep that have all these sorts of

58
00:06:02,044 --> 00:06:09,914
things. again originated, originated in
Unix in the 1970s and nowadays, every

59
00:06:09,914 --> 00:06:16,411
language has to have some kind of extended
regular expression facility. there's,

60
00:06:17,252 --> 00:06:23,825
language Unix commands like Grep and lock,
and old programmer's tools like EAMCs.

61
00:06:23,825 --> 00:06:30,169
There's modern programmer's tools like
Perl or PHP Python and Javascript. All of

62
00:06:30,169 --> 00:06:37,373
these things have regular expression
processing. and so, programmers demand

63
00:06:37,373 --> 00:06:44,631
this facility nowadays. in fact, Perl is
an example of an entire language that's

64
00:06:44,631 --> 00:06:51,398
built on regular expressions. and again
there is various command line options you

65
00:06:51,398 --> 00:06:58,168
can say run it for each line and you have
replacement facilities and many other

66
00:06:58,168 --> 00:07:05,086
things that is, is certainly a worthwhile
facility for any programmer to use

67
00:07:05,086 --> 00:07:11,268
nowadays and many programmers use these
things. Now to, again to go back to the

68
00:07:11,268 --> 00:07:17,376
slide that was filled up with a regular
expression you want to use these things

69
00:07:17,376 --> 00:07:23,464
wisely there are definitely computational
tasks, where it might be better to just

70
00:07:23,464 --> 00:07:29,123
write a Java program, not try to use a
regular expression. So every tool has it's

71
00:07:29,123 --> 00:07:34,931
place and try not to do everything with
the regular expression language. in Java

72
00:07:35,237 --> 00:07:44,108
various facilities, are built in to
different parts of the Java library. and

73
00:07:44,108 --> 00:07:51,928
so and understanding how the
implementation works. you can understand

74
00:07:51,928 --> 00:07:58,084
how to use these. so one simple,
implementation is in the Java string

75
00:07:58,084 --> 00:08:03,174
library. so if there's a, you have a
string called input and another string

76
00:08:03,174 --> 00:08:10,447
called regexp and the matches method in
the string library will return the bullion

77
00:08:10,447 --> 00:08:16,221
that says if the strings in the line which
you know, described by the regular

78
00:08:16,221 --> 00:08:23,680
expression. so this is just a simple stub
that I call validate that takes a regular

79
00:08:23,680 --> 00:08:28,839
expression and an input from the command
line, unless you know if the, the, the

80
00:08:29,251 --> 00:08:34,479
string is in the line that's described by
the regular expression. So, like as I dent

81
00:08:34,479 --> 00:08:40,395
one, two, three, and leave a little Java
identifier giving that regexp legal Java

82
00:08:40,395 --> 00:08:46,379
identifiers that we looked at earlier and
so forth. so that's one thing that's built

83
00:08:46,379 --> 00:08:52,570
in to Java. So in Java programs you can
use regular expression pattern-matching in

84
00:08:52,570 --> 00:09:01,700
that simple way. another thing, another
kind of task that is better handled with a

85
00:09:01,897 --> 00:09:06,410
that's handled with this Java
implementation as the idea of harvesting.

86
00:09:06,650 --> 00:09:13,296
so we want to print all the sub strings of
the input that match a given RE. So, say

87
00:09:13,296 --> 00:09:21,525
this is the fragile X syndrome we want to
harvest all the patterns from the DNA

88
00:09:21,525 --> 00:09:29,218
that, have this property that starts with
gcg and ends with ctg and has any number

89
00:09:29,486 --> 00:09:36,772
of these triple gcg or cth, triples
inside. or maybe you want to harvest the

90
00:09:36,772 --> 00:09:42,910
same program harvests all the e-mail
addresses from a web page. That's kind of

91
00:09:42,910 --> 00:09:50,233
amazing that the same program can work for
two such diverse tasks. that's, that's a

92
00:09:50,233 --> 00:09:56,167
testimony to the utility of this
abstraction. so how do we do harvesting

93
00:09:56,167 --> 00:10:03,301
within Java? Well it's got a, Java has two
classes called pattern and matcher that

94
00:10:03,301 --> 00:10:09,378
implement regular expression
pattern-matching basically by, as we did,

95
00:10:09,378 --> 00:10:16,687
separating the construction of the machine
from the simulation. so this is what the

96
00:10:16,687 --> 00:10:22,698
code looks like for harvester. We take our
regular expression as the first command

97
00:10:22,698 --> 00:10:28,204
line argument and we setup an input stream
from the second command line argument, can

98
00:10:28,204 --> 00:10:34,433
be a file or a webpage. and then we read
the input stream. then we use Java's

99
00:10:34,433 --> 00:10:41,312
pattern class, to, to build a pattern
which essentially is an NFA that is

100
00:10:41,312 --> 00:10:49,972
constructed from the regular expression.
in the pattern class, has a method called

101
00:10:49,972 --> 00:10:59,023
a matcher and that essentially creates an
NFA stimulator from that NFA for that

102
00:10:59,023 --> 00:11:07,003
text. and, so that's a machine that not
only does matching in the way that we did,

103
00:11:07,003 --> 00:11:14,023
but it also finds a match and keeps track
of what's called group, which is the

104
00:11:14,023 --> 00:11:20,854
substring that caus e the match. so it
adds that extra code to do these things

105
00:11:21,093 --> 00:11:28,032
that are useful in this illustrates one
line of code for each one of those

106
00:11:28,032 --> 00:11:33,936
facilities. That is, create the NFA,
create the simulator find the next match

107
00:11:33,936 --> 00:11:39,590
and get the substring that cause the
match. That's an implementation of

108
00:11:39,590 --> 00:11:45,794
harvester that can go through and do
things like get fragile X indicators from

109
00:11:45,794 --> 00:11:51,605
a chromosome or get e-mail addresses from
a web page. And of course, you could

110
00:11:51,605 --> 00:11:57,888
extend that to print out the index of
where they occur and other things. this is

111
00:11:57,888 --> 00:12:04,020
a very powerful facility to have in a
programming language and Java certainly

112
00:12:04,020 --> 00:12:10,701
has it. Now there is a caveat about this
that's really important. this introduces

113
00:12:10,701 --> 00:12:16,751
the idea that we talked about with hashing
and this is another example of the

114
00:12:17,007 --> 00:12:23,314
algorithmic complexity attack. And that
is, if you know something about the way

115
00:12:23,314 --> 00:12:30,147
that a website is implemented at facility.
in this case, it's possible to implement

116
00:12:30,356 --> 00:12:35,723
regular expression pattern-matching in a
not so efficient way and atypically,

117
00:12:35,723 --> 00:12:41,370
actually, the implementations that we see
out in the wild like the ones in Unix or,

118
00:12:41,161 --> 00:12:46,180
or Java or Perl, do not guarantee
quadratic performance the way we have.

119
00:12:46,180 --> 00:12:51,827
They do not guarantee the running time's
going to be proportional to the product or

120
00:12:51,827 --> 00:12:57,543
the length of the string and the length of
the regular expression. In fact, they go

121
00:12:57,543 --> 00:13:04,326
exponential. again, going back to Kleene's
theorem, it's relatively simple to prove.

122
00:13:04,326 --> 00:13:09,947
it's not so difficult to develop an
implementation that is more like Kleene's

123
00:13:10,925 --> 00:13:17,198
theorem, but it can take exponential time.
and you can try it yourself in Java or

124
00:13:17,198 --> 00:13:23,878
Perl, if you try to look for a match for
this regular expression in a string like

125
00:13:23,878 --> 00:13:30,639
that, you add just a couple of characters
to the string then running time will

126
00:13:30,639 --> 00:13:36,534
double. again, going back to our
algorithmic performance lectures. if

127
00:13:36,534 --> 00:13:42,450
that's what happens, if I just add one
thing and my running time doubles that

128
00:13:42,450 --> 00:13:50,866
means I have exponential time. and so if
I'm using a facility that has one of these

129
00:13:51,168 --> 00:14:00,020
exponential time implementations then,
this is, what's called a SpamAssassin reg

130
00:14:00,020 --> 00:14:05,711
ular expression. it's going to take
exponential time on certain e-mail

131
00:14:05,711 --> 00:14:12,829
addresses and, somebody can create trouble
by sending such addresses to a mail server

132
00:14:13,039 --> 00:14:19,127
the mail server will take exponential time
just trying to determine if it's spam or

133
00:14:19,127 --> 00:14:24,305
not. and by having an inefficient
algorithm, a server like that is

134
00:14:24,515 --> 00:14:29,413
definitely subject to such a tax.
generally, you don't want to have

135
00:14:29,413 --> 00:14:35,431
exponential algorithms you particularly
don't want to have exponential algorithms

136
00:14:35,641 --> 00:14:41,704
if you have arbitrary clients out there
that can cause trouble for you. a another

137
00:14:41,957 --> 00:14:49,029
pitfall is things that kind of look like
regular expressions but aren't actually

138
00:14:49,029 --> 00:14:55,848
regular expressions. So it's common to
have the backslash one notation. So that's

139
00:14:55,848 --> 00:15:04,436
supposed to match the self expression that
was matched earlier. So this thing dot

140
00:15:04,436 --> 00:15:12,616
plus back slash one is a string that is
two copies of something, like couscous or

141
00:15:12,220 --> 00:15:18,954
with at least one character. and this one
is a similar one. This one actually is

142
00:15:19,192 --> 00:15:25,768
number of ones is not prime. it matches,
so you can write pretty sophisticated

143
00:15:25,768 --> 00:15:33,215
computational thing just with these little
references. but there's a problem and it's

144
00:15:33,215 --> 00:15:40,445
a fundamental problem, is that . that
kinds of languages of the sets of strings

145
00:15:40,445 --> 00:15:45,726
that you specify with such notation are
not regular. That is, like I said, it's a

146
00:15:45,726 --> 00:15:52,196
point you can't write a regular expression
to specify. and so these are examples like

147
00:15:52,394 --> 00:15:57,478
strings to the form ww For sum W, you
can't write a regular expression that,

148
00:15:57,676 --> 00:16:02,759
will recognize, that will describe all
such strings. There are even scientific

149
00:16:02,759 --> 00:16:07,843
applications like complemented
palindromes. So that's like a palindrome,

150
00:16:07,843 --> 00:16:15,013
but also complements a's to t's and c's
and g's. So that's another, example. And

151
00:16:15,238 --> 00:16:21,551
if they're not regular, then Kleene's
theorem doesn't hold. there's no NFA that

152
00:16:21,551 --> 00:16:28,426
corresponds to them, and in fact we're,
we'll talk in a couple of lectures about

153
00:16:28,426 --> 00:16:33,889
intractability. in fact, nobody knows an
efficient method for guaranteeing

154
00:16:33,889 --> 00:16:40,071
performance when you have back references.
so, if you're allowing back references

155
00:16:40,287 --> 00:16:46,110
you're pretty much subject to performance
attacks like the one we just mentioned.

156
00:16:46,303 --> 00:16:51,329
and that's where, we'll will finish up
this lecture. this is an amazingly

157
00:16:51,523 --> 00:16:56,742
elegant, and efficient, and ingenious
algorithm, and it's based on the theory of

158
00:16:56,742 --> 00:17:02,220
computation that has been studied since
the 1930's and it's the basis for much of

159
00:17:02,220 --> 00:17:06,924
the programming, programming languages
that we do. it's really worth

160
00:17:06,924 --> 00:17:12,595
understanding this theory and Grep is a
wonderful example of why it's important to

161
00:17:12,595 --> 00:17:18,698
understand theory like this. it really it
takes us to the next level of algorithms

162
00:17:18,698 --> 00:17:25,636
and that's writing programs that translate
a program to machine code. actually, what

163
00:17:25,636 --> 00:17:32,650
we've looked at, both for KMP and for Grep
are examples of compiler, they're simple

164
00:17:32,650 --> 00:17:39,216
examples of compiler. for KMP, we took a
string and we built a DFA. For Grep, we

165
00:17:39,216 --> 00:17:46,110
took an RE and we built an NFA. and all
Java C does is take Java language code and

166
00:17:46,110 --> 00:17:52,624
translate it to a byte code. Now, it uses
more complicated theorems and laws from

167
00:17:52,624 --> 00:17:59,238
theory of computation to get the job done,
because a java program is also not

168
00:17:59,238 --> 00:18:05,820
regular. But, it really is worthwhile
thinking about the progression from KMP to

169
00:18:06,080 --> 00:18:12,402
a Java compiler. Whereas in Java compiler,
you have a program, now, you want to

170
00:18:12,402 --> 00:18:19,678
compile of and know if it's legal you want
to get byte code and then you've got a

171
00:18:19,678 --> 00:18:25,910
java simulator, and that's not so
substantially different from what we just

172
00:18:25,910 --> 00:18:32,714
did for Grep in a stage along the way.
that's quite interesting to see this kind

173
00:18:32,714 --> 00:18:38,823
of context to get us such a practically
useful and important algorithm. so the

174
00:18:38,823 --> 00:18:45,302
summary is we did some from the standpoint
of the program, we implemented substring

175
00:18:45,302 --> 00:18:50,199
search by simulating a DFA and we
implement a regular expression

176
00:18:50,199 --> 00:18:56,729
pattern-matching by simulating an NFA.
from theoretician what's interesting about

177
00:18:56,729 --> 00:19:02,905
these abstractions is that NFAs and DFAs
and RE are equivalent in power. if you can

178
00:19:02,905 --> 00:19:08,931
describe a set of strings for one of them,
you could describe a set of strings with

179
00:19:08,931 --> 00:19:14,213
any of them. so that's interesting. And
also interesting is that there's

180
00:19:14,213 --> 00:19:20,240
limitations, there's sets of strings like
palindromes and so forth, that you can't

181
00:19:20,484 --> 00:19:27,828
specify with a DFA or, o r an NFA. but
from a student in an algorithms class it,

182
00:19:27,584 --> 00:19:34,030
it really is you know, worthwhile to see
that these principles are not just theory.

183
00:19:34,275 --> 00:19:40,189
they're a practical use and this is an
essential paradigm all throughout computer

184
00:19:40,189 --> 00:19:44,712
science. you know, we want to find an
intermediate abstraction that we can

185
00:19:44,712 --> 00:19:49,770
understand and make sure we pick the right
one, and then we pick clients that can use

186
00:19:49,770 --> 00:19:54,947
it in a real implementations that can
implement it. And by building the right

187
00:19:54,947 --> 00:20:00,113
intermediate abstraction, we can solve
some important practical problems. those

188
00:20:00,113 --> 00:20:03,240
are the lessons that Grep teaches us.