Today, we're going to give some hints about how regular expressions are used in practice. I'm going to mention some of the extensions that are found in various Unix commands. I'll also talk about some specifics of text-processing algorithms, and concentrate on the important task of lexical analysis. That is the part of a compiler that looks at the entire program being compiled and breaks it up into tokens, which are sequences of characters that logically belong together. Many systems use regular expressions of some sort to describe patterns. Often these are embedded in the proprietary code of a company, but there are also some quite visible ones, such as a number of Unix commands. I'm going to tell you one particular story involving proprietary software before moving on to generalities regarding text processing. Junglee was a startup founded by three of my PhD students and two others in 1994. At the time, the web was really new, and they had the idea of building systems to integrate web pages into more useful products. They had been doing that for about two years when they got a contract from Yahoo to build a facility that would let Yahoo visitors search for books and get a table [inaudible] the price, the delivery charge, and delivery delay at different booksellers. Immediately upon deployment of that product, Amazon bought Junglee in order to shut down the comparison-shopping capability.
Interestingly, Amazon did not understand that I bought from them not because they were the cheapest, but because I was confident that they would deliver what I asked for, and deliver it on time. And apparently, I was not alone in that thinking. But the world of online commerce was new, and Amazon could not be sure of the impact of automated comparison shopping. As an interesting [inaudible], Amazon actually got good value for Junglee, because one of its founders, Anand Rajaraman, while at Amazon, was the inventor of Mechanical Turk. But the first paid work that Junglee got was a contract from the Washington Post to produce an online table of the employment opportunities offered by companies that were placing print classified ads at The Post. This job was not trivial. Junglee had to go to several hundred websites and extract the information about each job automatically. If you look at one site, you can probably figure out how to do it: which links to follow from the home page to get to the employment pages, what's there, and how to navigate the HTML source text to find the critical information about the job, such as the title and salary. But you need to do this for each site. And to make matters worse, the Junglee guys discovered that these sites were evolving. That is, not only did the jobs change, but the structure of the pages, or even the whole website, changed.
The result was that approximately once a week, the reader for any given site would break and have to be redesigned. So the Junglee guys developed a regular-expression language for describing how to navigate a website to extract the information they needed. The input symbols were the usual characters, such as letters, so they could look for important words like "salary". They also treated HTML tags, like OL (that is, this), as input symbols, because they gave important hints about the page structure. For instance, a page might say "send salary requirements to", but that doesn't indicate the salary for a particular job. But when you get to a list of jobs, indicated by the ordered-list tags (that's the OL), only after that does "salary" mean that the number following is the salary for a job. Another important kind of input element is a link, which indicates that it's necessary to move to another page. Once this regular-expression language was implemented, it was easier to write a regular expression to find key information like title or salary than to write the code to process web pages directly. Thus, there was an increase in productivity as they added sites to their database. Moreover, when a site changed, it was relatively easy to modify the regular expression to track the changes in the site. The architecture of this system developed at Junglee appears in many places.
The input language is regular expressions plus actions, which are arbitrary code executed when the regular expression is recognized. In this case, the actions would be things like "return this integer as a salary". The regular expression is compiled into a DFA, or sometimes into an NFA that is simulated, effectively by executing the lazy version of the subset construction as needed. Simulation of the DFA or NFA goes on exactly as we have described it theoretically. Each input symbol causes a state change and nothing more. The magic happens when an accepting state is reached. Then the associated action is executed, and this action allows the regular-expression processor to interact with the rest of the world in some useful way. We're now going to talk about the Unix extensions to regular expressions. There are many commands in Unix that have some sort of regular-expression notation for input. An important example is the "grep" command, which stands for Global Regular Expression and Print. Most of the regular-expression-based languages, even though they have extensions to the notation we've covered so far, define only regular languages. There are also some commands that have additional extensions that allow non-regular languages to be recognized, but we're not going to go into those features.
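To make the pattern-plus-action idea concrete, here is a minimal sketch in Python (not Junglee's actual language; the patterns, the OL tag rule, and the salary format are invented here purely for illustration):

```python
import re

# A toy version of the pattern-plus-action architecture: each regular
# expression carries an action that runs when the pattern is recognized.
# (Illustrative only -- the <ol> and salary patterns are made up.)
rules = [
    (re.compile(r"<ol>", re.IGNORECASE), lambda m: "entering job list"),
    (re.compile(r"salary[:\s]*\$?(\d+)", re.IGNORECASE),
     lambda m: int(m.group(1))),
]

def scan(text):
    """Run each pattern over the text and fire its action on every match."""
    results = []
    for pattern, action in rules:
        for m in pattern.finditer(text):
            results.append(action(m))
    return results

print(scan("<OL> Salary: $90000"))  # ['entering job list', 90000]
```

The point is the division of labor: the patterns do the recognizing, and the actions (here, trivial lambdas) do whatever interaction with the rest of the world is needed.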
Incidentally, it is no coincidence that regular expressions figure heavily in the original Unix commands. Before he did Unix, Ken Thompson had worked on a system for processing regular expressions by converting them only as far as an NFA and simulating the NFA in code. We're now going to meet some of the Unix extensions to regular expressions. You can put brackets around any list of characters as a shorthand for the same characters connected by plus signs in the notation we have used until now. You can also describe a sequence of symbols that are consecutive in the ASCII order of characters by giving the first character, a dash, and then the last character. For example, [a-z], that's this, stands for any lowercase letter, because the lowercase letters have consecutive codes in ASCII. You can represent any letter by putting dashes between lowercase a and z, and the same for uppercase; that's this. Okay. Notice that the uppercase and lowercase letters are not consecutive in ASCII, so you cannot represent these 52 characters with a single range. Incidentally, as we see, characters like brackets and dash have special meanings in Unix regular expressions. So if you want to use one of these characters as itself, you need to precede it by a backslash. So, for example, backslash left bracket is a real left bracket; it's not the left bracket that you find in a range expression.
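These class and escape conventions survive in most modern regex libraries. Here is a quick demonstration using Python's re module, which inherits the Unix notation:

```python
import re

# Unix-style character classes and escaping, as inherited by Python's re.
assert re.fullmatch(r"[a-z]", "q")          # any one lowercase letter
assert re.fullmatch(r"[A-Za-z]", "Q")       # upper and lower need two ranges
assert re.fullmatch(r"[a-z]", "5") is None  # a digit is outside the range
assert re.fullmatch(r"\[", "[")             # backslash makes '[' literal
print("all character-class examples hold")
```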
And the character dot, or period, is shorthand for any character. Here are some more Unix changes to the regular-expression notation we learned. Okay, the union operator is actually represented by a vertical bar instead of the plus. But the plus symbol is another operator, used like star, and meaning "one or more". That is, in the Unix notation, E+, that's that, is shorthand for E concatenated with E*. So, for example, [a-z]+ means one or more lowercase letters. The question-mark operator is also used like star but means "zero or one of". That is, E? is shorthand for E plus epsilon. So, for example, [AB]? means an optional A or B. We would write it in our original notation as A plus B plus epsilon. You may remember our DFA example for recognizing strings that end in "ing". It involved a lot of explanation as we considered where to go from each of the states that represented some progress toward finding "ing". However, there is an easy regular expression for the same language using the Unix dot; it's just .*ing, like that. Okay. Or even if we didn't have the dot in our notation, we could simply replace it by all the legal input symbols connected by pluses. In fact, it's even much easier to design an NFA for this language than it is to design a DFA. Okay, here is an NFA. Essentially, it guesses when it has seen the 'i' that begins the final "ing".
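The operators just described can be checked with Python's re module, which uses the same notation; fullmatch tests whether the whole string is in the language of the expression:

```python
import re

# '+' is one-or-more, '?' is zero-or-one, '.' is any single character.
assert re.fullmatch(r"[a-z]+", "skiing")   # one or more lowercase letters
assert re.fullmatch(r"[AB]?", "A")         # an optional A or B...
assert re.fullmatch(r"[AB]?", "")          # ...including the empty string

# Strings ending in "ing": dot-star plays the role of "any input symbols".
ends_in_ing = re.compile(r".*ing")
assert ends_in_ing.fullmatch("skiing")
assert not ends_in_ing.fullmatch("singer")
print("all operator examples hold")
```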
Thus, even on an input like the first 'i' in "skiing", it can remain in the start state. That is, it can go from the start state to itself on an 'i' if it likes. And in fact, it always does like, because the NFA always does everything. Okay, anyway, it can remain in the start state on the first 'i' and then only travel to the right, that is, here, on the second 'i'. There's no need to worry about what to do in a state like the one that represents the discovery of 'i' and 'n': where do you go if the next input is not 'g'? It doesn't matter, because you'll still be here, and another sequence of states will get you where you actually need to go. Now let's talk a little bit about lexical analysis, the breaking of an input program into its basic units called tokens. For example, every identifier in a program is a token. Likewise, reserved words like "if" or "while" are tokens. Many single characters are tokens by themselves. In typical programming languages, semicolon is a token used for separating statements, plus is a token indicating addition, and less-than is a token by itself indicating the less-than comparison operator. There are also multi-character operators, such as the less-than-or-equal sign, whose characters together indicate the less-than-or-equal comparison. There are tools, like Lex, or its open-source version Flex, that let you write a regular expression for each token.
You can also provide a piece of code as an action to be executed when an instance of that regular expression is recognized. For example, the code for when an integer is found might simply return that integer. As an example, the expression for identifiers might be the one shown here. That's this. Using the Unix notation, this expression describes identifiers as any letter, that's this, followed by zero or more (that's the star) letters or digits. In many languages, identifiers allow some more options; for example, underscore might be included as if it were another digit, so it would just appear in this list here. Okay, underscore. In Lex you write an action, which is arbitrary code to be executed when the regular expression for a token is matched. In the simplest cases, all this code does is return an integer code representing the token found. But the action might be much more complicated. For example, if an identifier is found, the action might involve installing that identifier in a symbol table, where all identifiers used by the program are stored. When building a lexical analyzer using regular expressions for the tokens, there are some ambiguities that need to be resolved. Two examples are illustrated here. For one, reserved words like "if" also match the expression for identifiers. But "if" is not a legal identifier, so we have to make sure that the lexical analyzer does the right thing on "if". Okay.
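A sketch of this Lex-style setup in plain Python (a toy, not flex itself): each token gets a pattern, and because Python's alternation tries alternatives in order, listing the reserved word before the identifier rule, and the two-character operator before the one-character one, stands in for Lex's priority and longest-match rules.

```python
import re

# A toy lexical analyzer in the style of Lex (a sketch, not flex itself).
# IF before ID resolves the reserved-word ambiguity; LE before LT plays
# the role of preferring the longer match.
rules = [
    ("IF",   r"if\b"),                   # reserved word, ahead of ID
    ("ID",   r"[A-Za-z][A-Za-z0-9]*"),   # the identifier expression above
    ("LE",   r"<="),                     # two-character operator first
    ("LT",   r"<"),
    ("NUM",  r"[0-9]+"),
    ("SKIP", r"\s+"),                    # whitespace, discarded
]
master = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in rules))

def tokenize(program):
    return [(m.lastgroup, m.group())
            for m in master.finditer(program)
            if m.lastgroup != "SKIP"]

print(tokenize("if ifx <= 42"))
# [('IF', 'if'), ('ID', 'ifx'), ('LE', '<='), ('NUM', '42')]
```

Note how "ifx" is scanned as one identifier rather than the reserved word "if" followed by "x": the \b in the IF rule refuses to match when a letter or digit follows, which is exactly the lookahead issue discussed below.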
For another, when we see less-than, we don't immediately know if it's a token by itself or part of a larger token, which would be less-than-or-equal in this case. We need to make sure we don't prematurely declare victory and return less-than when less-than-or-equal is intended by the programmer. A good architecture for building lexical-analysis code from regular expressions is to start by converting each regular expression to an epsilon-NFA. Each of these epsilon-NFAs has its own final state, with which the action for that regular expression is associated. We combine all these epsilon-NFAs by introducing a new start state. The start state has epsilon transitions to the start states of each of the NFAs. It looks something like this. Here's the new start state. Here are all the old start states and their automata. And we've just put transitions on epsilon to each one of them. Okay. All the final states of the NFAs remain final, and they each have their associated actions. So, for example, these are all final states. An NFA can have several final states, in fact. After that combination, we convert to a DFA, or perhaps an NFA without epsilon transitions, which we will then simulate. Okay, we need to give the regular expressions an ordering, and this ordering determines the priority among actions.
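The combining step can be sketched directly on a small NFA data structure. This is a hypothetical minimal representation invented for illustration, with None standing for an epsilon transition:

```python
# Sketch of the combining step on a minimal NFA representation
# (hypothetical encoding; None stands for an epsilon transition).

def combine(nfas):
    """nfas: list of (start, transitions, accept, action) tuples."""
    new_start = "S0"
    transitions = {new_start: []}
    actions = {}
    for start, trans, accept, action in nfas:
        transitions[new_start].append((None, start))  # epsilon to old start
        for state, moves in trans.items():
            transitions.setdefault(state, []).extend(moves)
        actions[accept] = action  # each final state keeps its own action
    return new_start, transitions, actions

# Two single-token machines, for '<' and ';':
lt = ("lt0", {"lt0": [("<", "lt1")]}, "lt1", "TOKEN_LT")
semi = ("s0", {"s0": [(";", "s1")]}, "s1", "TOKEN_SEMI")
start, trans, actions = combine([lt, semi])
print(trans[start])  # [(None, 'lt0'), (None, 's0')]
print(actions)       # {'lt1': 'TOKEN_LT', 's1': 'TOKEN_SEMI'}
```

The output shows exactly the picture from the lecture: one fresh start state with an epsilon move to each machine's old start state, while every old final state stays final with its action attached.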
A typical ordering puts all the reserved words ahead of identifiers. That way, if the DFA discovers that the next token is "if", it in principle does not know whether to execute the action for the reserved word "if", or for identifiers, or both. But the fact that "if" takes priority over identifiers says that this token should be treated as a reserved word and not as an identifier. However, to make all this work right, the DFA actions need a special ability: the ability to take an input symbol that has been read and put it back at the front of the input stream. That input will then be read again, typically the next time the lexical analyzer is told to look for the next token. Here's an example of why the ability to restore an input symbol that's just been read back to the front of the input is important. Suppose the lexical analyzer is told to find the next token, and the first character it reads on the input is the less-than symbol. It has to read the next input, and if that input is an equal sign, then we can be sure the token is less-than-or-equal. But if the input is anything else, for example a blank, a letter, or a digit, then that character must be put back on the input, and less-than, by itself, declared to be the token. For another example, if the lexical analyzer has read the characters 'i' and 'f' in its quest for a token, it might have seen the reserved word "if".
But we won't know until it reads the next character. If that character is a letter or a digit, anything that can extend the identifier, then we do not have the reserved word "if"; we have a longer identifier. However, if the next character is not one that can extend the identifier, such as a blank or a left parenthesis, then we really do have the reserved word "if". In that case, the next character must be pushed back onto the front of the input.
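Both pushback cases, less-than versus less-than-or-equal and "if" versus a longer identifier, can be sketched as follows. This is a toy scanner, not production lexer code; "pushing back" a character is modeled simply as not advancing the position past a character that was examined but not consumed:

```python
# A sketch of longest-match scanning with one character of pushback.
# Returning a position before a character we looked at IS the pushback:
# the next call to next_token() will read that character again.

def next_token(text, pos):
    """Return (token_name, new_pos) for the token starting at pos."""
    if text[pos] == "<":
        if pos + 1 < len(text) and text[pos + 1] == "=":
            return ("LE", pos + 2)      # consumed both characters
        return ("LT", pos + 1)          # other character stays on the input
    if text[pos].isalpha():
        end = pos
        while end < len(text) and text[end].isalnum():
            end += 1                    # extend the identifier greedily
        word = text[pos:end]
        # the character at 'end' was examined but not consumed: pushback
        return ("IF" if word == "if" else "ID", end)
    return ("OTHER", pos + 1)

print(next_token("<=3", 0))    # ('LE', 2)
print(next_token("<3", 0))     # ('LT', 1)
print(next_token("if(", 0))    # ('IF', 2)
print(next_token("iffy(", 0))  # ('ID', 4)
```

In the second and fourth calls, the scanner looked at the '3' and the '(' respectively, but the returned position leaves them unread, exactly the restore-to-the-front-of-the-input behavior described above.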