Today, we're going to give some hints about how regular expressions are used in practice. I'm going to mention some of the extensions that are found in various Unix commands. I'll also talk about some specifics of text-processing algorithms, and concentrate on the important task of lexical analysis: the part of a compiler that reads the program being compiled and breaks it up into tokens, which are sequences of characters that logically belong together.

Many systems use regular expressions of some sort to describe patterns. Often these are embedded in the proprietary code of a company, but there are also some quite visible ones, such as a number of Unix commands. I'm going to tell you one particular story involving proprietary software before moving on to generalities regarding text processing.

Junglee was a startup founded by three of my PhD students and two others in 1994. At the time, the web was really new, and they had the idea of building systems to integrate web pages into more useful products. They had been doing that for about two years when they got a contract from Yahoo to build a facility that would let Yahoo visitors search for books and get a table showing the price, the delivery charge, and the delivery delay at different booksellers. Immediately upon deployment of that product, Amazon bought Junglee in order to shut down the comparison-shopping capability. Interestingly, Amazon did not understand that I bought from them not because they were the cheapest, but because I was confident that they would deliver what I asked for, and deliver it on time. And apparently, I was not alone in that thinking. But the world of online commerce was new, and Amazon could not be sure of the impact of automated comparison shopping. As an interesting aside, Amazon actually got good value for Junglee, because one of its founders, Anand Rajaraman, while at Amazon, was the inventor of Mechanical Turk.

But the first paid work that Junglee got was a contract from The Washington Post to produce an online table of the employment opportunities offered by companies that were placing print classified ads at the Post. This job was not trivial. Junglee had to go to several hundred websites and extract the information about each job automatically. If you look at one site, you can probably figure out how to do it: which links to follow from the home page to get to the employment pages, what is there, and how to navigate the HTML source text to find the critical information about the job, such as the title and salary. But you need to do this for each site. And to make matters worse, the Junglee guys discovered that these sites were evolving. That is, not only did the jobs change, but the structure of the pages, or even the whole website, changed. The result was that approximately once a week, the reader for any given site would break and have to be redesigned.

So the Junglee guys developed a regular expression language for describing how to navigate a website to extract the information they needed. The input symbols were the usual characters, such as letters, so they could look for important words like "salary". They also treated HTML tags, like <ol>, as input symbols, because they gave important hints about the page structure. For instance, a page might say "send salary requirements to", but that doesn't indicate the salary for a particular job.
But when you get to a list of jobs, which is indicated by ordered-list tags like <ol>, only then does "salary" mean that the number following it is the salary for a job. Another important kind of input element is a link, which indicates that it is necessary to move to another page. Once this regular expression language was implemented, it was easier to write a regular expression to find key information like the title or salary than to write code to process the web pages directly. Thus, there was an increase in productivity as they added sites to their database. Moreover, when a site changed, it was relatively easy to modify the regular expression to track the changes in the site.

The architecture of this system developed at Junglee appears in many places. The input language is regular expressions plus actions, which are arbitrary code executed when the regular expression is recognized. In this case, the actions would be things like "return this integer as a salary." The regular expression is compiled into a DFA, or sometimes into an NFA that is simulated, effectively by executing the lazy version of the subset construction as needed. Simulation of the DFA or NFA goes on exactly as we have described it theoretically: each input symbol causes a state change and nothing more. The magic happens when an accepting state is reached. Then the associated action is executed, and this action allows the regular expression processor to interact with the rest of the world in some useful way.

We're now going to talk about the Unix extensions to regular expressions. There are many commands in Unix that have some sort of regular expression notation for input. An important example is the "grep" command, which stands for Global Regular Expression Print. Most of these regular-expression-based languages, even though they have extensions beyond the notation we've covered so far, define only regular languages. There are also some commands with additional features that allow non-regular languages to be recognized, but we're not going to go into those features. Incidentally, it is no coincidence that regular expressions figure heavily in the original Unix commands. Before he did Unix, Ken Thompson had worked on a system for processing regular expressions by converting them only as far as an NFA and simulating the NFA in code.

We're now going to meet some of the Unix extensions to regular expressions. You can put brackets around any list of characters as a shorthand for the same characters connected by plus signs in the notation we have used until now. You can also describe a sequence of symbols that are consecutive in the ASCII order by giving the first character, a dash, and then the last character. For example, [a-z] stands for any lowercase letter, because the lowercase letters have consecutive codes in ASCII. You can represent any letter by putting both the lowercase range and the uppercase range in one bracket expression, as in [a-zA-Z]. Notice that the uppercase and lowercase letters are not consecutive in ASCII, so you cannot represent these 52 characters with a single range. Incidentally, as we see, characters like the brackets and the dash have special meanings in Unix regular expressions, so if you want to use one of these characters as itself, you need to precede it by a backslash. For example, \[ is a real left bracket, not the left bracket that begins a range expression. And the character dot, or period, is shorthand for any character.
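As an aside, Python's re module uses essentially this same Unix-style notation, so a short sketch (purely illustrative, not part of any system discussed in this lecture) can show these conventions in action:

import re

print(re.fullmatch(r"[a-z]", "q") is not None)      # True: any one lowercase letter
print(re.fullmatch(r"[a-zA-Z]", "Q") is not None)   # True: letters need both ranges,
print(re.fullmatch(r"[a-zA-Z]", "7") is not None)   # False: upper and lower case are
                                                    #        not consecutive in ASCII
print(re.fullmatch(r"\[", "[") is not None)         # True: backslash makes '[' literal
print(re.fullmatch(r".", "#") is not None)          # True: dot matches any character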
Here are some more Unix changes to the regular expression notation we have learned. The union operator is actually represented by a vertical bar instead of the plus. But the plus symbol is another operator, used like star and meaning "one or more." That is, in the Unix notation, E+ is shorthand for E concatenated with E*. So, for example, [a-z]+ means one or more lowercase letters. The question-mark operator is also used like star, but means "zero or one of." That is, E? is shorthand for E plus epsilon. So, for example, [AB]? means an optional A or B; we would write it in our original notation as A + B + epsilon.

You may remember our DFA example for recognizing strings that end in "ing". It involved a lot of explanation as we considered where to go from each of the states that represented some progress toward finding "ing". However, there is an easy regular expression for the same language using the Unix dot: it's just .*ing. And even if we didn't have the dot in our notation, we could simply replace it by all the legal input symbols connected by pluses. In fact, it's even much easier to design an NFA for this language than it is to design a DFA. Here is an NFA: essentially, it guesses when it has seen the 'i' that begins the final "ing". Thus, even on an input like the first 'i' in "skiing", it can remain in the start state. That is, it can go from the start state to itself on an 'i' if it likes, and in fact it always does, because the NFA always does everything it can. So it can remain in the start state on the first 'i' and only travel to the right on the second 'i'. There's no need to worry about what to do in a state like the one that represents the discovery of 'i' and 'n': where do you go if the next input is not 'g'? It doesn't matter, because you'll still be in the start state, and another sequence of states will get you where you actually need to go.

Now let's talk a little bit about lexical analysis, the breaking of an input program into its basic units, called tokens. For example, every identifier in a program is a token. Likewise, reserved words like "if" or "while" are tokens. Many single characters are tokens by themselves: in typical programming languages, semicolon is a token used for separating statements, plus is a token indicating addition, and less-than is a token by itself indicating the less-than comparison operator. There are also multi-character operators, such as the less-than sign followed by the equal sign, which together indicate the less-than-or-equal comparison.

There are tools, like Lex or its open-source version Flex, that let you write a regular expression for each token. You can also provide a piece of code as an action to be executed when an instance of that regular expression is recognized. For example, the code for when an integer is found might simply return that integer. As an example, the expression for identifiers might be [A-Za-z][A-Za-z0-9]*. Using the Unix notation, this expression describes identifiers as any letter followed by zero or more letters or digits. In many languages, identifiers allow some more options; for example, underscore might be included as if it were another digit, so it would just appear in the second bracket list. In Lex you write an action, which is arbitrary code to be executed when the regular expression for a token is matched. In the simplest cases, all this code does is return an integer code representing the token found. But the action might be much more complicated. For example, if an identifier is found, the action might involve installing that identifier in a symbol table where all the identifiers used by the program are stored.
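To make the pattern-plus-action idea concrete, here is a minimal Python sketch of it. This is not Lex's actual input language; the token labels and helper functions are assumptions made only for illustration. Each regular expression is paired with a function that runs on the matched text:

import re

def identifier_action(text):
    # In a real lexer this might install the identifier in a symbol table;
    # here we just return a labeled pair.
    return ("ID", text)

def integer_action(text):
    # Convert the matched digits into an integer value.
    return ("INT", int(text))

# Each entry pairs a compiled regular expression with the action to run
# when that expression matches at the front of the remaining input.
token_spec = [
    (re.compile(r"[A-Za-z][A-Za-z0-9_]*"), identifier_action),
    (re.compile(r"[0-9]+"), integer_action),
]

def match_token(s):
    for pattern, action in token_spec:
        m = pattern.match(s)
        if m:
            return action(m.group(0))
    return None

print(match_token("salary42"))   # ('ID', 'salary42')
print(match_token("1999"))       # ('INT', 1999)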
When building a lexical analyzer using regular expressions for the tokens, there are some ambiguities that need to be resolved. Two examples are illustrated here. For one, reserved words like "if" also match the expression for identifiers, but "if" is not a legal identifier, so we have to make sure the lexical analyzer does the right thing on "if". For another, when we see a less-than sign, we don't immediately know whether it is a token by itself or part of a larger token, which would be less-than-or-equal in this case. We need to make sure we don't prematurely declare victory and return less-than when less-than-or-equal is intended by the programmer.

A good architecture for building lexical-analysis code from regular expressions is to start by converting each regular expression to an epsilon-NFA. Each of these epsilon-NFAs has its own final state, with which the action for that regular expression is associated. We combine all these epsilon-NFAs by introducing a new start state, which has epsilon transitions to the start states of each of the NFAs. All the final states of the NFAs remain final, and they each keep their associated actions; the combined NFA can have several final states, in fact. After that combination, we convert to a DFA, or perhaps to an NFA without epsilon transitions, which we then simulate.

We also need to give the regular expressions an ordering, and this ordering determines the priority among actions. A typical ordering puts all the reserved words ahead of identifiers. That way, when the DFA discovers that the next token is "if", it in principle does not know whether to execute the action for the reserved word "if", or for identifiers, or both; but the fact that "if" takes priority over identifiers says that this token should be treated as a reserved word and not as an identifier.

However, to make all this work right, the lexical analyzer needs a special ability: the ability to take an input symbol that has been read and put it back at the front of the input stream. That input will then be read again, typically the next time the lexical analyzer is told to look for the next token. Here's an example of why the ability to restore an input symbol just read back to the front of the input is important. Suppose the lexical analyzer is told to find the next token, and the first character it reads on the input is the less-than symbol. It has to read the next input, and if that input is an equal sign, then we can be sure the token is less-than-or-equal. But if the input is anything else, for example a blank, a letter, or a digit, then that character must be put back on the input, and less-than, by itself, is declared to be the token.
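To make the less-than versus less-than-or-equal case concrete, here is a toy Python sketch. The function name and token labels are assumptions made for illustration, not part of Lex or any real lexer; the "push back" simply means the position only advances past what was actually consumed:

def next_token(chars, pos):
    # Return (token, new_position); only the '<' case is sketched here.
    if chars[pos] == "<":
        if pos + 1 < len(chars) and chars[pos + 1] == "=":
            return ("LE", pos + 2)   # '<' and '=' together form one token
        # Whatever followed '<' is effectively pushed back: the position
        # advances past '<' only, so that character will be read again.
        return ("LT", pos + 1)
    return ("OTHER", pos + 1)

print(next_token("<= 5", 0))   # ('LE', 2)
print(next_token("<x", 0))     # ('LT', 1): the 'x' stays on the input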
For another example, if the lexical analyzer has read the characters 'i' and 'f' in its quest for a token, it might have seen the reserved word "if". But we won't know until it reads the next character. If that character is a letter or a digit, anything that can extend the identifier, then we do not have the reserved word "if"; we have a longer identifier. However, if the next character is not one that can extend the identifier, such as a blank or a left parenthesis, then we really do have the reserved word "if". In that case, the next character must be pushed back onto the front of the input.
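Putting the pieces together, here is a small self-contained Python sketch (the reserved-word set and token names are assumptions for the example; a real tool like Lex generates this kind of code from the regular expressions themselves). It shows both resolution rules at work: reserved words take priority over identifiers, and the analyzer takes the longest match it can, so "iffy" is one identifier rather than the reserved word "if" followed by "fy":

import re

RESERVED = {"if", "while"}                  # assumed reserved-word set
ID = re.compile(r"[A-Za-z][A-Za-z0-9_]*")   # identifier: letter, then letters/digits/underscore
NUM = re.compile(r"[0-9]+")                 # integer: one or more digits

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = ID.match(text, pos) or NUM.match(text, pos)
        if m:
            word = m.group(0)                   # longest possible match: "iffy" is one identifier
            if word in RESERVED:
                kind = word.upper()             # reserved words take priority over identifiers
            elif word[0].isalpha():
                kind = "ID"
            else:
                kind = "NUM"
            tokens.append((kind, word))
            pos = m.end()
        elif text.startswith("<=", pos):
            tokens.append(("LE", "<="))
            pos += 2
        elif text[pos] == "<":
            tokens.append(("LT", "<"))          # the following character stays on the input
            pos += 1
        else:
            tokens.append(("CHAR", text[pos]))
            pos += 1
    return tokens

print(tokenize("if iffy <= 42"))
# [('IF', 'if'), ('ID', 'iffy'), ('LE', '<='), ('NUM', '42')]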