1 00:00:03,260 --> 00:00:07,561 Today, we're going to talk about regular expressions. Now, this is another in the 2 00:00:07,561 --> 00:00:12,088 series of ingenious algorithms that we're looking at in the second half of this 3 00:00:12,088 --> 00:00:16,445 course. Now, this is really one of the most widely used and one of the coolest 4 00:00:16,445 --> 00:00:21,086 algorithms that we're going to cover. to get started, we have to talk about what is 5 00:00:21,086 --> 00:00:26,010 a regular expression. And just putting the problem in context we've always talked 6 00:00:26,010 --> 00:00:30,650 about before. in the last lecture, we talked about the substring search problem, 7 00:00:30,650 --> 00:00:35,485 which we're just trying to find one string in a text. Now, we'll talk about a more 8 00:00:35,485 --> 00:00:40,468 general problem, a pattern matching problem, where we find one of a specified 9 00:00:40,468 --> 00:00:45,843 set of strings in the decks. So, that's specified set. rather than talk about one 10 00:00:45,843 --> 00:00:50,301 string, we have to describe a set of strings. And that's what regular 11 00:00:50,301 --> 00:00:55,677 expressions are all about. here's an example from genomics, an actual example 12 00:00:55,677 --> 00:01:00,528 from a real life scientific data processing. So, there's a thing called 13 00:01:00,528 --> 00:01:05,773 Fragile X syndrome, which is a common cause of mental retardation. the human 14 00:01:05,773 --> 00:01:12,810 genome contains is a string that consists of Cs, Ts, As, and Gs, it's the way we've 15 00:01:12,810 --> 00:01:18,939 talked about before. and the, they naturally group themselves into groups of 16 00:01:18,939 --> 00:01:24,613 three. And there's a way to describe a correlation with the syndrome from a 17 00:01:24,613 --> 00:01:31,347 property of the text string or the genome string. the, the actual scientific data is 18 00:01:31,347 --> 00:01:37,807 shown over here on the right, but we'll just work with the text strings. In the in 19 00:01:37,807 --> 00:01:44,714 the English, the description is that you've got somewhere in your genome 20 00:01:44,990 --> 00:01:55,115 repeats of either CGG or AGG, that are bracketed by GCG at the beginning and CTG 21 00:01:55,115 --> 00:02:01,340 at the end. Now, the number of repeats inside is very well. So in this case, 22 00:02:01,340 --> 00:02:08,458 we've got GCJ that's the beginning thing and then we've got a CGG that's one, AGG 23 00:02:08,458 --> 00:02:14,706 that two, CGG that's three. So, we got three repeats on one of these two patterns 24 00:02:14,706 --> 00:02:20,796 and then we have CTG. now, a different genome might have many more of these 25 00:02:20,796 --> 00:02:26,570 triplet patterns inside the number of repeats is a variable but high, high 26 00:02:26,570 --> 00:02:32,064 repeat is correlated with Fragile X syndrome. So, clearly medical p 27 00:02:32,064 --> 00:02:39,054 rofessionals want to know, given a genome does it have a high number of repeats with 28 00:02:39,054 --> 00:02:45,038 this kind of pattern? then we if so then we want to look out for this Fragile X 29 00:02:45,038 --> 00:02:50,445 syndrome. And the specifying the problem in English is one thing, but specifying it 30 00:02:50,445 --> 00:02:56,298 in a way that is enable to writing a program to help us find these patterns is 31 00:02:56,298 --> 00:03:01,188 another. and this is the way that we specify the pattern. Now, this thing is 32 00:03:01,188 --> 00:03:06,967 called a regular expression. Let's take this CGG, and then either a CGG, GCG, and 33 00:03:06,967 --> 00:03:13,190 then either a CGG or an AGG any number of times followed by a CGG. that's specifying 34 00:03:13,190 --> 00:03:19,340 a set of strings and we might find that if the text had one of that specified set. 35 00:03:19,596 --> 00:03:27,271 here's another example where regular expressions appear in real life. there's a 36 00:03:27,879 --> 00:03:36,076 new source highlight program that we use to highlight syntax, highlight keywords in 37 00:03:36,076 --> 00:03:42,685 our, in our text, and, and also gray out comments, and so forth. and this is a very 38 00:03:42,685 --> 00:03:47,886 general program that works for many different kinds of languages and outputs 39 00:03:47,886 --> 00:03:53,492 to many different print formats. and that's all about identifying one of a 40 00:03:53,492 --> 00:03:58,288 specified set of strings in a text and doing something with it. It's an 41 00:03:58,288 --> 00:04:04,837 application of regular expressions in real life. in Google, you can search in public 42 00:04:04,837 --> 00:04:11,640 search code, it's public source code for regular expression search so you can 43 00:04:11,640 --> 00:04:18,979 search for code that contains any one of a specified set of strings. and that's 44 00:04:19,246 --> 00:04:26,780 available facility on the web. There's many, many other applications. Another one 45 00:04:26,780 --> 00:04:33,992 is PROSITE in Biology. there is a bunch of specified patterns that you can use to 46 00:04:33,992 --> 00:04:39,181 search for genome or processing ahm natural language, it's all about string 47 00:04:39,181 --> 00:04:44,847 processing and specifying sets of strings, programming languages themselves. and the 48 00:04:44,847 --> 00:04:50,144 other one we'll look at later on is validating when you are getting data entry 49 00:04:50,331 --> 00:04:55,497 from the web when you type in a credit card number, you are told whether it's 50 00:04:55,497 --> 00:05:00,890 legal or not and that's all done with regular expressions since the string you 51 00:05:00,890 --> 00:05:09,945 typed in one of the specified sets. so, this is very widely used and widely useful 52 00:05:09,945 --> 00:05:17,733 application. so we're going to look into the more general idea that this brings 53 00:05:17,733 --> 00:05:24,072 out, is the idea of parsing text files, and not only finding if the string matches 54 00:05:24,072 --> 00:05:30,316 a pattern, but finding if finding the structure of a string that we can use in 55 00:05:30,525 --> 00:05:36,829 some kind of beneficial way. And we'll come back to it more. so, we're looking at 56 00:05:36,829 --> 00:05:43,169 a basic capability that's widely useful, but it also extends to cover deep issues 57 00:05:43,169 --> 00:05:49,104 and processes in Computer Science. So, what is a regular expression? Well, it's 58 00:05:49,104 --> 00:05:55,365 simply a notation to specify a set of strings. the set could be infinite. we're 59 00:05:55,365 --> 00:06:00,519 going to specify it with a finite pattern but it, it could be infinite. And we make 60 00:06:00,519 --> 00:06:06,174 a regular expression up just from four simple operations. first one is called 61 00:06:06,174 --> 00:06:11,813 concatenation. so, that's we just take a bunch of letters and put them, one after 62 00:06:11,813 --> 00:06:17,252 the other and a regular expression made from concatenation matches exactly 63 00:06:17,252 --> 00:06:22,623 precisely the sequence of characters that are specified and it doesn't match 64 00:06:22,623 --> 00:06:27,792 anything else. So, that's a simple one by itself and that's what we use for 65 00:06:27,792 --> 00:06:33,706 substring search for example. then there's or. so that one, we give two different 66 00:06:33,915 --> 00:06:39,412 regular expressions and put a vertical bar between them. And that says we're 67 00:06:39,412 --> 00:06:45,257 specifying now a set of size two and we're happy to match either string in that set. 68 00:06:45,465 --> 00:06:51,642 there's no other string that we match. and then there's one called closure. And this 69 00:06:51,642 --> 00:06:58,193 is where an infinite comes in. so, when we put a star after a character, or a regular 70 00:06:58,193 --> 00:07:03,345 expression, as we'll see, that's specifying zero or more occurrences of 71 00:07:03,345 --> 00:07:09,131 that character. So, AA matches ABA because that's zero occurrences of B in 72 00:07:09,131 --> 00:07:16,154 between. Or A with eight Bs followed by an A matches, because that's got zero or more 73 00:07:16,154 --> 00:07:22,399 Bs in between two As. but if you don't have an A at the beginning and the end, or 74 00:07:22,399 --> 00:07:27,961 if you don't have all Bs inside, that doesn't match. And that's an infinite 75 00:07:27,961 --> 00:07:33,673 number of strings, zero on any number of occurrences of B. So, with just four 76 00:07:33,673 --> 00:07:39,911 characters were specifying an infinite set of strings. and then, the next operation 77 00:07:39,911 --> 00:07:45,542 to allow us to build regular expressions of arbitrary complexity is si mply 78 00:07:45,548 --> 00:07:52,069 parentheses. if you include regular expressions within a parentheses, then you 79 00:07:52,069 --> 00:07:57,778 can apply the star operator or concatenation to a, another regular 80 00:07:57,778 --> 00:08:04,577 expression. So, these are examples of regular expressions using parentheses. So, 81 00:08:04,577 --> 00:08:09,505 this says, A and then what's inside the parentheses, which is the regular 82 00:08:09,505 --> 00:08:14,499 expression that says either A or B and then followed by AAB. So, that matches 83 00:08:14,499 --> 00:08:19,558 precisely these two strings. either A followed by an A followed by AB or A 84 00:08:19,558 --> 00:08:24,968 followed by B followed by AAA and it doesn't match anything else. or if you put 85 00:08:24,968 --> 00:08:30,193 a more complicated regular expression inside parenthesis with a star, it still 86 00:08:30,193 --> 00:08:35,982 means zero or more occurrences. So, just A would be zero occurrences. and then, this 87 00:08:35,982 --> 00:08:41,516 is one, two, three, four occurrences of AB followed by A. And again, if it's not of 88 00:08:41,516 --> 00:08:47,681 that form, it doesn't match. this one doesn't have AB, and, and neither does 89 00:08:47,681 --> 00:08:53,355 this, it has one AB, but then the next one would have to be either AB or A. There's 90 00:08:53,355 --> 00:08:59,170 been a way to get one of these strings by including zero more occurrences of AB. 91 00:08:59,413 --> 00:09:04,678 these things here are, are precedence order when we're performing the 92 00:09:04,678 --> 00:09:10,430 operations. so the first thing is parenthesis, next thing is a star, next 93 00:09:10,430 --> 00:09:17,073 thing is concatenation and finally the or operation. In this table then completely 94 00:09:17,073 --> 00:09:23,150 specifies what we mean by regular expression. It's a sequence of characters. 95 00:09:23,374 --> 00:09:29,062 alphabet characters like A and B and metacharacters, like or and star and 96 00:09:29,062 --> 00:09:35,393 parenthesis that we use to describe a possibly infinite set of strings. Now in 97 00:09:35,631 --> 00:09:41,743 the real world it's often worthwhile and, and useful to add some additional 98 00:09:41,743 --> 00:09:48,252 operations for convenience. for example, the wild card operation needs to have a 99 00:09:48,252 --> 00:09:57,612 dot and another metacharacter that matches any letter at all. and that's shorthand 100 00:09:57,612 --> 00:10:04,422 for listing all the letters separated by or and then but that's a lot of characters 101 00:10:04,422 --> 00:10:10,516 and so, with one character, we can match the same regular expression. and so this 102 00:10:10,516 --> 00:10:16,402 one you can use like to solve a crossword puzzle, for example. this one matches all 103 00:10:16,402 --> 00:10:22,218 the words that will have alternating U with all the seven letter words that have 104 00:10:22,218 --> 00:10:28,090 alternating Us. and so there's a couple. but there's some other ones tha sort of 105 00:10:28,090 --> 00:10:32,561 have alternating use, but not quite. And they don't match, so it's a wild card 106 00:10:32,561 --> 00:10:39,942 character. and then, there's classes of character, so capital A-Z means any 107 00:10:39,942 --> 00:10:46,121 capital letter. and then a-z means any lower case letter. so any capitalized 108 00:10:46,121 --> 00:10:52,629 letter, any lower case letter followed by any number of lower case letters will 109 00:10:52,629 --> 00:10:59,137 match something like this but it won't match something like that, just classes of 110 00:10:59,137 --> 00:11:05,892 characters. And again, this is shorthand for typing all the characters and using or 111 00:11:06,140 --> 00:11:12,155 at least one. So, rather than star, which says, zero or more occurrences, plus means 112 00:11:12,155 --> 00:11:18,743 one or more occurrences. so that's another example of a convenient shorthand or 113 00:11:18,743 --> 00:11:25,395 exactly K putting in braces that I want to specify, this specifies I want exactly 114 00:11:25,395 --> 00:11:31,300 five digits, followed by a dash, followed by exactly four digits. So, that's a way 115 00:11:31,300 --> 00:11:37,209 to specify, you know, something like a legal plus for zip code. and it doesn't 116 00:11:37,209 --> 00:11:43,580 match other things. now in, in practice, in, in, in terms of practical clients 117 00:11:43,805 --> 00:11:50,101 these types of shorthands are extremely important in terms of the theory and the 118 00:11:50,101 --> 00:11:56,622 idea of processing regular expressions since we can specify them all with the 119 00:11:56,622 --> 00:12:02,851 basic operations we'll pretty much ignore them in our code. so, in every, in every 120 00:12:02,851 --> 00:12:09,111 case we could write [A-E] + if we wanted we could write this longhand of it, but of 121 00:12:09,111 --> 00:12:13,855 course users are going to write the shorthand. but still, if we have a 122 00:12:13,855 --> 00:12:19,390 mechanism for dealing with this, then we can certainly take care of the shorthand. 123 00:12:19,390 --> 00:12:26,185 so, just with those basic operations and, and also the shorthand is helpful and our 124 00:12:26,185 --> 00:12:33,685 notation is amazingly expressive. There's really a lot you can do with it. so, for 125 00:12:33,685 --> 00:12:40,493 example, you can do substring search with a regular expression. one way to express 126 00:12:40,493 --> 00:12:45,839 the substring search problem is to put your string preceded by dot star and put a 127 00:12:45,839 --> 00:12:50,315 dot star at the end, you know, and that's asking if your string is in the infinite 128 00:12:50,315 --> 00:12:55,164 language described by this regular expression, is equivalent to the substring 129 00:12:55,164 --> 00:13:00,448 search problem. It matches if and only, if that string is in there so, for example, 130 00:13:00,448 --> 00:13:05,664 in these two and not in that two so notation covers the substring search 131 00:13:05,664 --> 00:13:11,580 problem. But then also, it can specify easily useful sets of strings that, that, 132 00:13:11,580 --> 00:13:17,221 you know, that we often want to check. So this is similar to the zip plus four, the 133 00:13:17,221 --> 00:13:22,311 three digits followed by a dash followed by two digits, followed by a dash, 134 00:13:22,311 --> 00:13:28,159 followed by four digits. that's a good start towards knowing whether the string 135 00:13:28,159 --> 00:13:34,472 is a legal Social Security Number or not. Or is it a legal e-mail address or not. 136 00:13:34,687 --> 00:13:40,142 this is a simplified version but it basically says, you ought to have at least 137 00:13:40,142 --> 00:13:46,028 one letter followed by an @, followed by some letters, followed by a dot, followed 138 00:13:46,028 --> 00:13:51,985 by edu or com or of course to cover all e-mail addresses, you need to consider 139 00:13:51,985 --> 00:13:58,157 some more possibilities. but this is a, a quick and succinct way to specify a set of 140 00:13:58,157 --> 00:14:05,226 string or Java identifier, what's a legal Java identifier. so, this is one way to 141 00:14:05,226 --> 00:14:11,149 specify legal Java identifier. And indeed, programming languages like Java use 142 00:14:11,149 --> 00:14:17,380 regular expressions like this to define what they mean by a legal identifier, and 143 00:14:17,380 --> 00:14:23,841 then, the kinds of processes that we're going to talk about to actually check that 144 00:14:23,841 --> 00:14:29,610 what you typed is legal and echoes all the way up to the program itself. the, you 145 00:14:29,610 --> 00:14:35,082 know, is, is this a legal Java program? It's almost but not quite, as we'll see in 146 00:14:35,082 --> 00:14:39,814 regular expression pattern matching problem. Very, very expressive in a 147 00:14:39,814 --> 00:14:46,131 widely-used language. Now, what's, what's really interesting about this topic and 148 00:14:46,131 --> 00:14:51,978 we'll just I, I touch on it, even though it's extremely important to what we're 149 00:14:51,978 --> 00:14:57,671 going to do is that just from the standpoint of the theory of computation 150 00:14:58,098 --> 00:15:03,365 starting with the, the time of Turing and others in the 30s', our regular 151 00:15:03,365 --> 00:15:09,841 expressions play an important role in our understanding of the theory of computation 152 00:15:10,624 --> 00:15:16,531 and that understanding actually has led to that algorithm that we're going to talk 153 00:15:16,531 --> 00:15:23,459 about. That's a very interesting aspect of this. but still, even given that you know, 154 00:15:23,459 --> 00:15:29,870 regular expressions are just useful in everyday life. this is an example of 155 00:15:30,116 --> 00:15:37,542 someone that used the regular expression to screen a job candidate. and, and this 156 00:15:37,542 --> 00:15:44,848 is actually illegal to look for words like Enron, or Kerry, or Bush or Gore or 157 00:15:45,905 --> 00:15:52,771 Republican. to decide whether a, a, a job candidate's job application will get 158 00:15:52,771 --> 00:16:00,165 through. This one uses an exclamation point, which mean just end with anything. 159 00:16:00,429 --> 00:16:10,427 it's like dot star. so someone, a typical computer user can learn to regular 160 00:16:10,427 --> 00:16:20,517 expressions to with great effect. even Google supports, a regular Google search 161 00:16:20,517 --> 00:16:29,287 window supports a star as a full word wild card and or for union. Although not full 162 00:16:29,514 --> 00:16:36,254 regular expression pattern matching. although that day is surely coming. so 163 00:16:36,486 --> 00:16:42,756 people who just get into it just a little bit and start to see the utility of 164 00:16:42,756 --> 00:16:49,412 regular expressions certainly get excited about it. And you can find examples of 165 00:16:49,412 --> 00:16:56,301 this all through the web. this is an X, XKCD comic about it where there's a 166 00:16:56,301 --> 00:17:02,106 problem, you know, how can we search through 200 megabytes of emails looking 167 00:17:02,106 --> 00:17:08,143 for something formated like an address. And lawyers want to do this all the time 168 00:17:08,143 --> 00:17:13,737 nowadays, and it's hopeless. But now, I know regular expressions. and the regular 169 00:17:13,737 --> 00:17:19,945 expression super being swoops to the rescue and types a regular expression in 170 00:17:19,945 --> 00:17:25,463 Perl, which is a, a widely-used language that's based on regular expressions that 171 00:17:25,463 --> 00:17:31,119 we'll talk about briefly in a minute and swoops away. So, there is somewhat of a 172 00:17:31,119 --> 00:17:36,913 feeling what people don't know just a little bit that regular expressions can be 173 00:17:36,913 --> 00:17:42,006 used to solve any problem. that's not quite true as we know from the theory of 174 00:17:42,006 --> 00:17:46,894 computation and we'll also touch on that a little bit. so the, the question is, can 175 00:17:46,894 --> 00:17:51,905 the average programmer learn to use regular expressions effectively? And 176 00:17:51,905 --> 00:17:56,982 there's all kinds of evidence out there that lots of programmers use, use in 177 00:17:57,182 --> 00:18:02,995 extensively. But not to pour cold water on it but we have to take everything with 178 00:18:02,995 --> 00:18:08,540 some restraint. And here's an example of a regular expression that you can find out 179 00:18:08,540 --> 00:18:14,485 on the web. it's written in Perl, but it's a regular expression to check whether an 180 00:18:14,485 --> 00:18:20,882 -email address is va lid or not. now, that is quite a regular expression. and one 181 00:18:20,882 --> 00:18:26,342 might argue whether that's really an effective way to use regular expressions. 182 00:18:26,342 --> 00:18:32,340 Maybe there's a simpler and easier way to check for valid e-mail, e-mail addresses. 183 00:18:32,543 --> 00:18:38,272 but anyway, this one is out there and is used probably being used efficiently 184 00:18:38,474 --> 00:18:43,835 somewhere to the end, effectively somewhere today. but, but really for 185 00:18:43,835 --> 00:18:49,492 people who are educated in, in Computer Science, need to know that regular 186 00:18:49,492 --> 00:18:55,894 expressions are definitely powerful tools, but writing in regular expression really 187 00:18:55,894 --> 00:19:00,613 is like writing a program. you have, you have to understand the underlying 188 00:19:00,613 --> 00:19:06,115 programming model. It's so often even more so than many programming languages since 189 00:19:06,115 --> 00:19:11,550 there's so few rules, it's easier to write a regular expression than it is to read 190 00:19:11,550 --> 00:19:17,223 one. And if it's difficult to read, it means it's difficult to debug. and so, you 191 00:19:17,223 --> 00:19:23,353 can find a lot of flaming on the web about this. and this is a particular apt, 192 00:19:23,353 --> 00:19:29,188 particularly apt quote. Some people, when confronted with a problem I think I know, 193 00:19:29,188 --> 00:19:34,893 I'll use regular expressions. Now, they have two problems. And I think that's a 194 00:19:34,893 --> 00:19:40,609 good way to summarize the bottom line. That regular expressions are amazingly 195 00:19:40,609 --> 00:19:45,806 powerful and expressive. But you can, they can get very complex in real in real 196 00:19:45,806 --> 00:19:51,225 applications. still, there's plenty of real applications where they can be 197 00:19:51,225 --> 00:19:56,941 succinct and effective. And so, we're going to look next at regular expression 198 00:19:56,941 --> 00:19:59,020 pattern matching algorithms.