1 00:00:00,000 --> 00:00:04,599 Today we are going to discuss algorithms for answering questions about regular 2 00:00:04,599 --> 00:00:08,791 languages or really about the representations for the languages such as 3 00:00:08,791 --> 00:00:13,448 [inaudible]. The questions we can resolve for [inaudible] include many we cannot 4 00:00:13,448 --> 00:00:18,339 resolve for programs in general. Examples include whether a given string is accepted 5 00:00:18,339 --> 00:00:22,530 by a given [inaudible], that's the membership problem. Or whether a given 6 00:00:22,530 --> 00:00:26,897 [inaudible] accepts any string at all, the emptiness problem. As part of our 7 00:00:26,897 --> 00:00:31,147 discussion we are going to prove an important theorem called the pumping 8 00:00:31,147 --> 00:00:35,685 [inaudible] which lets us show certain languages not to be regular. Automotive 9 00:00:35,685 --> 00:00:40,393 theory talks about many different classes of languages including context feeling, 10 00:00:40,393 --> 00:00:44,984 redigit cursive and recursive, inumeral languages. Regards to meet each of these 11 00:00:44,984 --> 00:00:49,634 classes but for the moment we know only one class, the regular languages. When we 12 00:00:49,634 --> 00:00:54,109 investigate a class of languages there are two important issues. The first is 13 00:00:54,109 --> 00:00:58,758 decision properties. We shall see that there are many questions we might like to 14 00:00:58,758 --> 00:01:02,885 ask about a language or rather its representation, such as whether it is 15 00:01:02,885 --> 00:01:07,418 empty, finite or infinite. Its good to know there are algorithms to answer such 16 00:01:07,418 --> 00:01:12,612 questions ... At least for the regular languages. Unfortunately as we meet larger 17 00:01:12,612 --> 00:01:17,817 classes of languages, we find that in general the larger the class of languages 18 00:01:17,817 --> 00:01:22,496 the less likely there is to be an algorithms to answer questions about 19 00:01:22,496 --> 00:01:27,503 languages in that class. The second important issue is closure of properties 20 00:01:27,503 --> 00:01:32,577 of the class. These involve applying operations such as union to languages in 21 00:01:32,577 --> 00:01:37,519 the class. We are going to defer the discussion of closure of properties to 22 00:01:37,519 --> 00:01:42,703 another lecture although I'll give you an example on the next slide. [inaudible] 23 00:01:42,703 --> 00:01:47,393 properties are statements that when we apply certain operations to languages in 24 00:01:47,393 --> 00:01:51,906 the class, the result is also in the class. For example we say that a class of 25 00:01:51,906 --> 00:01:56,655 languages is closed under union, if given two languages in the class, the union of 26 00:01:56,655 --> 00:02:01,227 those languages is also in the class. So if i have two regular languages i can 27 00:02:01,227 --> 00:02:06,093 represent them by regular expressions and connect those expressions by a plus, with 28 00:02:06,093 --> 00:02:10,313 the appropriate parentheses, to get a regular expression for their union. 29 00:02:10,313 --> 00:02:14,651 Similar constructions work for their concatenation and closure. Now let us 30 00:02:14,651 --> 00:02:20,994 address the main topic. Of this lecture, decision properties of regular languages. 31 00:02:20,994 --> 00:02:28,092 We've used both formal and informal ways of describing languages. The formal ways 32 00:02:28,092 --> 00:02:34,051 include [inaudible] and regular expressions, each of which defines a 33 00:02:34,051 --> 00:02:40,273 language by a precise mathematical definition. But we've also described 34 00:02:40,273 --> 00:02:47,371 languages informally by pro statements and set formers such as this. Or even more 35 00:02:47,371 --> 00:02:52,939 informal statements like this. Okay. However you can't answer questions about a 36 00:02:52,939 --> 00:02:57,226 language unless you have a formal description. For example, an instance 37 00:02:57,226 --> 00:03:02,003 where we'll talk shortly about testing whether a regular language is infinite, 38 00:03:02,003 --> 00:03:06,841 given one of its formal representations. It looks like the last of the informal 39 00:03:06,841 --> 00:03:13,741 descriptions describes an infinite set, but does the word ?some' Mean any or some 40 00:03:13,741 --> 00:03:19,246 one particular number like ten for example that I had in mind. Thus we are only going 41 00:03:19,246 --> 00:03:24,621 to use formal descriptions of languages when we talk about algorithms for deciding 42 00:03:24,621 --> 00:03:29,775 things about those languages. Thus a decision property for a class of languages 43 00:03:29,775 --> 00:03:33,773 is an algorithm that takes a formal representative of the language. The 44 00:03:33,773 --> 00:03:38,559 algorithm answers some particular question about the language such as whether or not 45 00:03:38,559 --> 00:03:44,369 the language described by the representation is empty. Here are a few 46 00:03:44,369 --> 00:03:50,400 examples of why we might be interested in decision properties of regular languages. 47 00:03:51,740 --> 00:03:58,004 Both involve protocols represented by [inaudible] If we ask whether the language 48 00:03:58,004 --> 00:04:02,254 as such in the [inaudible] is finite we are, in effect, asking whether the 49 00:04:02,254 --> 00:04:06,970 protocol it represents is guaranteed to terminate. Or if we make the final states 50 00:04:06,970 --> 00:04:11,336 of the [inaudible] be error states, then asking if its language is empty, is 51 00:04:11,336 --> 00:04:16,168 tantamount to asking whether the protocol can fail. And remember we couldn't answer 52 00:04:16,168 --> 00:04:21,000 either of these questions about programs in general, so we couldn't get the answers 53 00:04:21,000 --> 00:04:25,600 to these questions about protocols by looking at the code that implements them. 54 00:04:26,060 --> 00:04:30,753 Another use for decision properties of regular languages involves minimizing 55 00:04:30,753 --> 00:04:35,263 their representations. For instance, the "F" "A"'s are a good representations for 56 00:04:35,263 --> 00:04:39,957 certain kinds of digital circuits. Those that have memory. We usually want the 57 00:04:39,957 --> 00:04:44,955 smallest circuit to accomplish a task and a good first step is to find a DFA that 58 00:04:44,955 --> 00:04:49,831 does what we want and has the smallest number of states of any DFA for the same 59 00:04:49,831 --> 00:04:54,665 language. It turns out that we can determine whether two DFA's are 60 00:04:54,665 --> 00:04:59,357 equivalent. That is whether they define the same language. That lets us find the 61 00:04:59,357 --> 00:05:03,812 minimum state DFA equivalent to any given DFA. Again, we do none of this for 62 00:05:03,812 --> 00:05:08,564 programs in general, you cant tell whether two programs do the same thing, and we 63 00:05:08,564 --> 00:05:13,138 cant find the smallest equivalent for a given program even though we know in 64 00:05:13,138 --> 00:05:18,691 principle that one must exist. The membership question for regular languages 65 00:05:18,691 --> 00:05:24,191 answered by an algorithm that takes a DFA and a string and tells whether or not the 66 00:05:24,191 --> 00:05:29,166 string is accepted by the DFA. The algorithm is the obvious one. Simulate the 67 00:05:29,166 --> 00:05:37,100 DFA on the input. Here is an example, it's something we've was seen several times 68 00:05:37,100 --> 00:05:42,880 before each time I show it to you I use a different style of presentation, but the 69 00:05:42,880 --> 00:05:48,095 idea is always the same. Here, the DFA for strings without consecutive ones, 70 00:05:48,095 --> 00:05:53,452 something which I know we have seen before, and the input string is zero one 71 00:05:53,452 --> 00:05:58,245 zero one, one. That obviously has consecutive ones so it shouldn't be 72 00:05:58,245 --> 00:06:04,854 accepted. Well, let's see what happens when we simulate it. We read a zero we 73 00:06:04,854 --> 00:06:12,254 stay in A. When we read a one we go to B. We read a zero we go back to A. Read a one 74 00:06:12,254 --> 00:06:18,696 we go to B, read another one we go to C. When we simulate this DFA on the input, we 75 00:06:18,696 --> 00:06:26,148 see that in D the string gets [inaudible] to state C, and it is not accepted. Now 76 00:06:26,148 --> 00:06:31,001 you might wonder what if the regular language were represented by an NFA or a 77 00:06:31,001 --> 00:06:35,778 regular expression for example Then you first need to convert the representation 78 00:06:35,778 --> 00:06:40,508 to a DFA and then simulate the DFA. It is possible to convert from many of the full 79 00:06:40,508 --> 00:06:44,610 representations we know about to any of the others using the circles of 80 00:06:44,610 --> 00:06:49,225 conversions. So doing might exponentiate the size of the description but there is 81 00:06:49,225 --> 00:06:53,783 still an algorithm to do the conversion, and that's all we need to show there is, 82 00:06:53,783 --> 00:06:58,456 for example, and algorithm to tell, given a regular expression and a string, Whether 83 00:06:58,456 --> 00:07:02,965 the string is in the language o the regular expression. Generally, proofs of 84 00:07:02,965 --> 00:07:08,536 closure or decision properties require either a DFA or regular expression by the 85 00:07:08,536 --> 00:07:14,037 way. The emptiness problem is given a representation for a regular language does 86 00:07:14,037 --> 00:07:20,698 its language contain any string at all. Okay, we are going to assume the line is 87 00:07:20,698 --> 00:07:25,718 represented by a DFA. Obviously if by some other representation, then converted to a 88 00:07:25,718 --> 00:07:30,853 DFA. Finding reachable states requires a breadth first a depth first search from 89 00:07:30,853 --> 00:07:35,346 the start state. I'm not going to assume you are familiar with these search 90 00:07:35,346 --> 00:07:40,259 techniques, but it is fairly easy to think of some way of searching a graph from a 91 00:07:40,259 --> 00:07:44,932 single node by following all arcs out of the node and marking those nodes you 92 00:07:44,932 --> 00:07:49,905 visited. Then follow arcs out of the nodes you visited and mark any other nodes you 93 00:07:49,905 --> 00:07:55,150 visit. Keep doing that until you can not mark any other nodes. The mark [inaudible] 94 00:07:55,150 --> 00:07:59,778 of those that are reached from the start state on some input. If at least one final 95 00:07:59,778 --> 00:08:04,072 state is marked, then the DFA accepts at least one input. If no final state is 96 00:08:04,072 --> 00:08:08,700 marked then it is impossible for the DFA to accept anything and its language empty. 97 00:08:09,540 --> 00:08:24,785 Here is an example. Here is your start state. We might mark, mark it, mark these 98 00:08:24,785 --> 00:08:37,929 guys. Maybe mark these guys as we go. But if all your final states are out here, and 99 00:08:37,929 --> 00:08:45,810 never get marked. Then obviously no matter how complex this automaton is, you cant 100 00:08:45,810 --> 00:08:50,349 reach your final state, and you cant accept anything. Now if you have a 101 00:08:50,349 --> 00:08:55,601 automaton with three states it is pretty easy to tell what is reachable and what 102 00:08:55,601 --> 00:09:00,530 isn't. But if you have a million states represented by some table, then it is 103 00:09:00,530 --> 00:09:05,717 hardly easy to tell. But fortunately we have a straight forward search algorithm 104 00:09:05,717 --> 00:09:11,071 regardless of how large the automaton or graph is. Now let us take up a more 105 00:09:11,071 --> 00:09:15,664 difficult problem, but one that we can still solve. We'd like to know whether or 106 00:09:15,664 --> 00:09:20,547 not the language defined by DFA is finite or infinite. The first fact we're going to 107 00:09:20,547 --> 00:09:25,256 prove is that if the language of the DFA contains any string of length n or more, 108 00:09:25,256 --> 00:09:29,733 where n is the number of states of the DFA, then the DFA contains an infinite 109 00:09:29,733 --> 00:09:35,780 number of strings. Surely if the DFA doesn't accept any string of money equal 110 00:09:35,780 --> 00:09:40,446 to or greater than N then it accepts only a finite number of strings. However it 111 00:09:40,446 --> 00:09:45,111 doesn't seem feasible to test membership for all input strings of length and or 112 00:09:45,111 --> 00:09:49,894 more since there are an infinite number of them. Can we ever finish? If not then we 113 00:09:49,894 --> 00:09:54,754 really don't have an algorithm for testing infiniteness. But as we shall see it is 114 00:09:54,754 --> 00:09:59,968 possible to limit the length of strings we have to test to twice the number of state 115 00:09:59,968 --> 00:10:04,630 so we have a really large and ugly but finite task, and we really do have an 116 00:10:04,630 --> 00:10:09,414 algorithm. So let's try to prove the point we need, that if the DFA accepts any 117 00:10:09,414 --> 00:10:14,505 string whose length is at least the number of states N, that it accepts an infinite 118 00:10:14,505 --> 00:10:20,369 number of strings First observe that a string of a length N or more has at least 119 00:10:20,369 --> 00:10:26,187 N plus one states along its path. For example a string a string of length two 120 00:10:26,187 --> 00:10:34,140 has three states on its path, here is a typical picture we might have. String AB. 121 00:10:35,520 --> 00:10:44,100 Looks like that. Notice no matter how many, well, the number of [inaudible] 122 00:10:44,100 --> 00:10:51,620 symbols in the string is the number of arcs and there's always one more node than 123 00:10:51,620 --> 00:11:00,093 there are arcs in the path. That's why there will be, n plus one states for a 124 00:11:00,093 --> 00:11:07,548 string of length n. Now if there are only N different states, and there are N plus 125 00:11:07,548 --> 00:11:13,725 one states along the path, then two states along the path must be the same. That's 126 00:11:13,725 --> 00:11:19,903 called the pigeonhole principle, you might remember. Here is a picture of the path 127 00:11:19,903 --> 00:11:25,775 for string W, which we are breaking up as X, Y, and Z. X is the prefix of W that 128 00:11:25,775 --> 00:11:31,572 gets the DFA to the fist state that repeats on the path, which we call state 129 00:11:31,572 --> 00:11:38,470 Q, that's right here of course. Then Y is a part of W that gets the [inaudible] back 130 00:11:38,470 --> 00:11:44,966 to Q for the first time. That is the end of Y is the second occurrence of Q. Notice 131 00:11:44,966 --> 00:11:52,314 that therefore Y cannot be the empty strain although X and Q might be Finally, 132 00:11:52,314 --> 00:11:58,625 Z is the rest of W and we know it gets DFA to a final state because W is accepted by 133 00:11:58,625 --> 00:12:04,016 the DFA. Notice that the path label Z may have states that also appear earlier but 134 00:12:04,016 --> 00:12:08,825 it doesn't matter, the important thing is that we identify the first repeating state 135 00:12:08,825 --> 00:12:16,422 "Q". The claim that X, followed by I repetitions of Y, followed by Z, is also 136 00:12:16,422 --> 00:12:23,707 accepted by the DFA for any integer I. To see why, X takes us to state Q, so we 137 00:12:23,707 --> 00:12:33,257 could for example go like this and then we could skip Y altogether. Just follow Z and 138 00:12:33,257 --> 00:12:40,467 we get to a final state. That tells us that XZ is in the language. Or, we could 139 00:12:40,467 --> 00:12:48,686 follow X to state Q, we could go around loops many, many times. And then finally 140 00:12:48,686 --> 00:12:57,006 go off following Z, and again get to the final state. Or, after I use of input Y 141 00:12:57,006 --> 00:13:05,121 for any I, however large, we follow input Z and we accept. This proves that X1 or 142 00:13:05,121 --> 00:13:11,360 the IZ is accepted for any I. Remember that Y cannot be empty so all these 143 00:13:11,360 --> 00:13:16,580 accepted strings are different. Thus the DSA accept an infinite number of strings, 144 00:13:16,580 --> 00:13:21,735 one for each I. Remember, we still do not have an algorithm because we can't test 145 00:13:21,735 --> 00:13:26,697 the infinite number of strings that lengths equal or greater to N, however we 146 00:13:26,697 --> 00:13:32,233 don't have to Because it is sufficient to test strings of length between N and 2N 147 00:13:32,233 --> 00:13:37,601 minus one and there are a finite number of such strings. When we prove this statement 148 00:13:37,601 --> 00:13:42,273 we show at least have an algorithm although it is a rather time-consuming 149 00:13:42,273 --> 00:13:47,578 one. Now we picked Y to be the first cycle on the path. So the length of XY cannot be 150 00:13:47,578 --> 00:13:52,692 greater than N. That is, some state within the first N plus one states on the path 151 00:13:52,692 --> 00:13:58,992 surely repeats. We also know that if length of "Y" is at least one, so "Y" lies 152 00:13:58,992 --> 00:14:05,901 between one and N then if W is the shortest accepted string of length at 153 00:14:05,901 --> 00:14:11,764 least n, then we claim that W can not be as long as 2n. But suppose it were. Now 154 00:14:11,764 --> 00:14:16,789 XZ, as we know, is another accepted string, and the length of XZ is the length 155 00:14:16,789 --> 00:14:22,278 of W minus the length of Y, and the length of Y is at most N. So the length of XZ is 156 00:14:22,278 --> 00:14:27,700 at least N. That means that XZ is a string shorter than W, yet at least N in length 157 00:14:27,700 --> 00:14:33,056 and it is also accepted. But we assume that there was no string accepted that was 158 00:14:33,056 --> 00:14:40,416 shorter than W and also of length of at least N. As a result Given any really long 159 00:14:40,416 --> 00:14:46,283 string W that's accepted we can keep taking out pieces of length between one 160 00:14:46,283 --> 00:14:52,074 and N, that's the Y in each of these diagrams, and we just keep throwing them 161 00:14:52,074 --> 00:15:00,613 away and eventually what's left of W will be between N and 2N minus one. So the 162 00:15:00,613 --> 00:15:06,318 algorithm to decide whether a regular language is infinite is to construct a DFA 163 00:15:06,318 --> 00:15:11,813 for it. And let the DFA have N states. Test all the strings of length between N 164 00:15:11,813 --> 00:15:17,166 and 2N minus one, and say infinite if any of them are accepted. Otherwise say 165 00:15:17,166 --> 00:15:23,234 finite. This is a terrible algorithm. If there are K input symbols and N states, 166 00:15:23,234 --> 00:15:28,412 then the number of strings we have to simulate is about K to the power 2N. That 167 00:15:28,412 --> 00:15:33,787 is a lot of work and there is a much more efficient algorithm. One that takes time 168 00:15:33,787 --> 00:15:39,228 proportional to the number of transitions, that is K times N, if implemented right. I 169 00:15:39,228 --> 00:15:44,669 wanted to give you the argument about the length of strings because it's important 170 00:15:44,669 --> 00:15:50,175 when we take up the pumping [inaudible], a technique for showing languages not to be 171 00:15:50,175 --> 00:15:57,085 regular. We already discussed searching forward from a node in a graph to find all 172 00:15:57,085 --> 00:16:02,541 the nodes you can reach. So we can eliminate the states that are not 173 00:16:02,541 --> 00:16:08,714 reachable from the start state. We then want to eliminate states that don't reach 174 00:16:08,714 --> 00:16:13,755 a final state. This algorithm is the same except start by marking the final states 175 00:16:13,755 --> 00:16:20,298 and follow the arcs backwards. Now there is an elegant algorithm for finding cycles 176 00:16:20,298 --> 00:16:25,308 using depth first search that takes time proportional to the number of edges or 177 00:16:25,308 --> 00:16:30,255 transitions. I am going to trust that you'll meet this algorithm in a course on 178 00:16:30,255 --> 00:16:36,283 algorithms if a data structures, if you haven't already done so. However, here is 179 00:16:36,283 --> 00:16:40,980 a simple way to test for a cycle in a graph, it takes time proportional to the 180 00:16:40,980 --> 00:16:45,375 number of nodes or states, times the number of arcs or transitions. We are 181 00:16:45,375 --> 00:16:50,313 going to do the same thing for each node N. Starting at N, search forward until you 182 00:16:50,313 --> 00:16:55,860 either can reach no more nodes. Or you discover that you can reach N. That is, 183 00:16:55,860 --> 00:17:01,905 here's node N, we explore forward it, and if we're lucky at some point we reach a 184 00:17:01,905 --> 00:17:08,252 node that has an arc back to N. If you can reach N, then you have a cycle and you can 185 00:17:08,252 --> 00:17:13,768 conclude the language of the DFA is infinite. If not, try the same process 186 00:17:13,768 --> 00:17:19,812 from another node. If you exhaust all the nodes as starting points and you still 187 00:17:19,812 --> 00:17:25,631 haven't found the cycle, then there are none and you conclude the language is 188 00:17:25,631 --> 00:17:30,241 finite. This is a good time to introduce the pumping lemma for regular languages 189 00:17:30,241 --> 00:17:34,386 because we have essentially proved it during our analysis of the [inaudible] of 190 00:17:34,386 --> 00:17:43,060 this problem Here is the statement of the pumping lemma. For every regular language 191 00:17:43,060 --> 00:17:50,044 L there is an integer N. Which happens to be the number of states which sum DFA for 192 00:17:50,044 --> 00:17:56,078 L, such that if every string w and l, whose length is at least n, we can break w 193 00:17:56,078 --> 00:18:04,555 into w=xyz where Y is the first label of sub-string of W that goes from a state to 194 00:18:04,555 --> 00:18:10,008 the, the same state, as we saw in the previous slides, such that three things 195 00:18:10,008 --> 00:18:15,970 are true. First the prefix of XY is short, it is of length at most N. We are sure of 196 00:18:15,970 --> 00:18:22,576 that by making y the label of the first cycle we encounter. Second, Y is not the 197 00:18:22,576 --> 00:18:28,151 empty string. We are sure of this because Y connects two different occurrences of 198 00:18:28,151 --> 00:18:35,438 the same state along the path of W. And lastly, XY to the IZ is in L for all 199 00:18:35,438 --> 00:18:41,692 integers I. The statement is particularly complex because it is of the form for all 200 00:18:41,692 --> 00:18:46,643 there exists, for all there exists. But here's how we use it. Think of a game 201 00:18:46,643 --> 00:18:52,562 played between you and an adversary. You pick the language L that you want to show 202 00:18:52,562 --> 00:18:58,265 is not regular and suppose the adversary claims it is regular Then the adversary 203 00:18:58,265 --> 00:19:03,762 has to provide that their exist parts while you play the for all parts. You have 204 00:19:03,762 --> 00:19:09,397 already picked L. Now the adversary has to pick M. He can pick a number as large as 205 00:19:09,397 --> 00:19:14,345 he wants , but once picked, it is finalized and the game proceeds. Now you 206 00:19:14,345 --> 00:19:20,117 get to pick the string W, subject only to the constraint that it is at least as long 207 00:19:20,117 --> 00:19:25,481 as N, the number the adversary picked. Next the advisory has to break your W up 208 00:19:25,481 --> 00:19:30,985 into X Y Z subject to the constraints of the length of X Y as at most N and the 209 00:19:30,985 --> 00:19:36,627 length of Y as at least one. You win the game by picking an I so if your X want to 210 00:19:36,627 --> 00:19:42,132 be I Z is not an L. However, in a proof we don't know what moves an advisory will 211 00:19:42,132 --> 00:19:47,633 make, but to win we want to cover all possible moves that is we know the picked 212 00:19:47,633 --> 00:19:53,259 N, but we don't know N's actual value. So we must pick W in terms of N. Similarly, 213 00:19:53,259 --> 00:19:58,815 we know W equals X, Y, Z, but we don't know exactly where Y is, except that it is 214 00:19:58,815 --> 00:20:04,370 not empty and it is among the first N positions of W. Thus, our argument that X, 215 00:20:04,370 --> 00:20:09,844 Y to the I, Z is not an [inaudible] for any of these possible Y's. Now, let's see 216 00:20:09,844 --> 00:20:14,669 an example. Let us pick this language as L. It is the set of strings consisting of 217 00:20:14,669 --> 00:20:19,736 some number of zeros followed by the same number of ones and we have claimed before 218 00:20:19,736 --> 00:20:26,853 that it is an example of a non regular language. Now we're going to prove it. Now 219 00:20:26,853 --> 00:20:32,460 the adversary picks N, we don't know what N is, but we know it has some fixed value. 220 00:20:33,120 --> 00:20:41,229 Now we get to choose W in terms of N. And we pick, W equals zero to the N one to the 221 00:20:41,229 --> 00:20:47,128 N, that is N zeros followed by N ones. But then the adversary gets to break W into X, 222 00:20:47,128 --> 00:20:53,822 Y, Z and you don't know exactly how it is broken. But we know enough about X, Y, Z 223 00:20:53,822 --> 00:21:00,495 to show that there is some string, in particular the case I=2, that the pumping 224 00:21:00,495 --> 00:21:07,696 lemma says has to be in the language L. But obviously it isn't because it has more 225 00:21:07,696 --> 00:21:16,255 0's than 1's. That is, We know that Y, being part of the first N positions of W, 226 00:21:16,255 --> 00:21:23,811 Can have only zeros. So two Ys have more zeros than one Y and the number one which 227 00:21:23,811 --> 00:21:30,150 are all contained within a Z doesn't change. We next take up the question of 228 00:21:30,150 --> 00:21:36,323 testing whether two regular languages are the same. We've supposedly given 229 00:21:36,323 --> 00:21:44,220 representation that the two languages, L and M Whatever representation we are given 230 00:21:44,220 --> 00:21:49,881 we convert to DFAs, then we have to combine those DFAs into a single DFA, that 231 00:21:49,881 --> 00:21:57,680 in a sense, runs both DFAs in parallel, we call it the product DFA. Suppose these two 232 00:21:57,680 --> 00:22:05,748 DFA's have states Q and R in the product DFA the states are pairs, one state from Q 233 00:22:05,748 --> 00:22:12,403 the other from R The start state of the [inaudible] DFA is the pair consisting of 234 00:22:12,403 --> 00:22:19,195 the start state from each DFA. For the transitions to the product DFA suppose we 235 00:22:19,195 --> 00:22:25,766 have a state that is the pair QR And suppose A is the input symbol from which 236 00:22:25,766 --> 00:22:31,467 we want to figure out the transition. We look at the transition function for the 237 00:22:31,467 --> 00:22:38,040 first DFA, Say Delta L, and we see where Q goes on input A. So here's Q and on input 238 00:22:38,040 --> 00:22:46,275 A it goes to some state like that. Then we look at the transition function for the 239 00:22:46,275 --> 00:22:54,410 second DFA. Say, delta M, and we see where R goes on input A. So here's R on input A 240 00:22:54,410 --> 00:23:02,085 it goes somewhere here. Okay. Then in the product DFA, the transition from the state 241 00:23:02,085 --> 00:23:10,557 Qr ... Then in the product DFA the transition from the state QR on input A is 242 00:23:10,557 --> 00:23:20,722 the state pair that is the first component is delta L of Qa, and whose second 243 00:23:20,722 --> 00:23:28,316 component is delta M of Ra. That is we simulate the two transitions in parallel. 244 00:23:28,316 --> 00:23:36,197 Here is a little example. Here are the two given DFAs. We will call this the orange 245 00:23:36,197 --> 00:23:46,407 DFA and that the purple DFA. And here is the product DFA. For example let's figure 246 00:23:46,407 --> 00:23:54,947 out the transition from AC on zero, so here's state AC and I look in the orange 247 00:23:54,947 --> 00:24:03,254 [inaudible] a on a zero goes to a itself And in the purple automaton, C on a zero 248 00:24:03,254 --> 00:24:10,483 goes to D. So the Combination AC goes to AD, and you see that transition here. For 249 00:24:10,483 --> 00:24:18,772 another example where does AD go on input one? Well, the orange automaton says A 250 00:24:18,772 --> 00:24:27,529 goes to B on one. The purple automaton says D goes to C on one. So, on one AD 251 00:24:27,529 --> 00:24:35,292 goes to BC. That's this transition there. The algorithm for testing whether two 252 00:24:35,292 --> 00:24:40,060 DFA's are equivalent, that is whether they accept the same language, begins by 253 00:24:40,060 --> 00:24:44,828 constructing the product DFA. Make the final states of the product DFA be all 254 00:24:44,828 --> 00:24:49,967 those pairs such that one is a final state and the other isn't. If string W reaches 255 00:24:49,967 --> 00:24:55,106 one of these final states in the product, then W is accepted by one of the original 256 00:24:55,106 --> 00:24:59,812 DFA's and not the other. Thus the two languages are not the same. Only if the 257 00:24:59,812 --> 00:25:04,827 product DFA with this selection of final states has an empty language of the two 258 00:25:04,827 --> 00:25:13,421 DFA's equivalent. Here's an example AC has made a final state because in the original 259 00:25:13,421 --> 00:25:20,647 [inaudible] c is final and a is not. Likewise BD is final because B is final 260 00:25:20,647 --> 00:25:26,835 but D is not. We now see that the two original DFA's are not equivalent. It 261 00:25:26,835 --> 00:25:31,608 happens that the final state B, D is not reachable from the start state so there 262 00:25:31,608 --> 00:25:35,793 are no strings that the orange [inaudible], [inaudible], [inaudible] but 263 00:25:35,793 --> 00:25:41,086 the purple one does not accept. However, AC is also a final state. And it is 264 00:25:41,086 --> 00:25:46,960 obviously reachable from the final states simply because it is the final state. That 265 00:25:46,960 --> 00:25:52,764 is the empty string distinguishes between the orange and purple automata. The empty 266 00:25:52,764 --> 00:25:59,148 string is accepted by the orange automaton but not the purple one. A related question 267 00:25:59,148 --> 00:26:04,728 to ask about regular languages is whether one is contained in the other. The test 268 00:26:04,728 --> 00:26:09,412 is, in a sense, one half of the equivalence test we just saw. Start by 269 00:26:09,412 --> 00:26:14,234 building the product automaton. But we have to define the final states 270 00:26:14,234 --> 00:26:19,744 differently. How will you do that? That is L is not contained in M if, and only if, 271 00:26:19,744 --> 00:26:26,058 there is some string W that is in L but not in M. Such a string would get the DFA 272 00:26:26,058 --> 00:26:31,338 for L to a final state, but would not get the DFA for M to a final state. So the 273 00:26:31,338 --> 00:26:36,885 question of containment is the same as the question of whether there is any string, 274 00:26:36,885 --> 00:26:42,098 W, that gets the product automaton to a state QR, where Q is final and R is not. 275 00:26:42,098 --> 00:26:47,177 Here B is the only final state of the first automaton and D is the only non 276 00:26:47,177 --> 00:26:52,709 final state of the second automaton, so only BD is final. Okay. As we observed 277 00:26:52,709 --> 00:26:58,357 before, BD is not reachable from the start state. It has arcs out but no arcs in. 278 00:26:58,357 --> 00:27:03,862 Thus the language of the product automaton is empty. And we conclude that the 279 00:27:03,862 --> 00:27:09,224 language of the orange automaton is a subset of the language of the purple 280 00:27:09,224 --> 00:27:15,229 automaton. Next we're going to attack the problem of given a DFA find the equivalent 281 00:27:15,229 --> 00:27:20,778 DFA with the fewest states. There is an obvious dumb algorithm, just consider all 282 00:27:20,778 --> 00:27:26,581 the DFA's with the same input alphabet but a smaller number of states, there's a huge 283 00:27:26,581 --> 00:27:31,814 but finite number of such states, sorry, got to do that over. Next, we're going to 284 00:27:31,814 --> 00:27:36,693 attack the problem of, given a DFA, find the equivalent DFA with the fewest states. 285 00:27:36,693 --> 00:27:41,572 There's an obvious dumb algorithm. Just consider all the DFAs with the same input 286 00:27:41,572 --> 00:27:46,451 alphabet, but a smaller number of states. There's a huge but finite number of such 287 00:27:46,451 --> 00:27:51,451 [inaudible] so in principal we can solve this problem. This time we're not going to 288 00:27:51,451 --> 00:27:56,210 dwell on the bad algorithm, but talk you through the good algorithm immediately. 289 00:27:56,210 --> 00:28:00,898 The key idea is to build a table of pairs of states and figure out which pairs are 290 00:28:00,898 --> 00:28:05,587 distinguishable in the sense that there is some input string that leads one of the 291 00:28:05,587 --> 00:28:09,994 pair to a final state and the other to a non-final state. Otherwise, states are 292 00:28:09,994 --> 00:28:16,276 indistinguishable and they can be merged into a single state. Here is the 293 00:28:16,276 --> 00:28:22,708 [inaudible] automaton we saw way back in the beginning of the course. Now, let's 294 00:28:22,708 --> 00:28:31,233 look at the states forty thirty here and add in ... Input S takes them both to an 295 00:28:31,233 --> 00:28:39,307 accepting state that is [inaudible] and input O takes them both to deuce, here and 296 00:28:39,307 --> 00:28:47,080 here. Thus, no further inputs could ever distinguish 4030 from add in. Similarly 297 00:28:47,080 --> 00:28:55,108 thirty forty and [inaudible] are indistinguishable Now we can deduce that 298 00:28:55,108 --> 00:29:06,234 thirty all , that's here, and deuce. Are also indistinguishable, on input S They go 299 00:29:06,234 --> 00:29:14,975 to forty thirty and add in, respectively, that is, thirty all goes to forty thirty 300 00:29:14,975 --> 00:29:22,044 on S. And deuce goes to add in on S. But we know that those two states are 301 00:29:22,044 --> 00:29:29,763 indistinguishable so we'll never be able to distinguish 30 all from deuce by a 302 00:29:29,763 --> 00:29:39,720 sequence beginning with S. And further, on input O well, 30 all goes to 30-40. And 303 00:29:39,720 --> 00:29:45,184 deuce goes to add out. And we said we cap these thing with these two states. So 304 00:29:45,184 --> 00:29:50,859 there is no string [inaudible] that can distinguish 30 all from deuce. We are now 305 00:29:50,859 --> 00:29:56,323 going to talk about how we find the distinguishable states. The basis is pairs 306 00:29:56,323 --> 00:30:01,858 that are distinguishable by the empty string. These are the pairs that have one 307 00:30:01,858 --> 00:30:08,873 final and one non-final state. For the inductive step we can mark a pair QR if 308 00:30:08,873 --> 00:30:15,401 those stay [inaudible] put A to distinguish [inaudible]. If delta of QA 309 00:30:15,401 --> 00:30:22,849 and delta of RA are marked, then they're distinguishable by some string W. That is, 310 00:30:22,849 --> 00:30:30,481 here we have Q, here's Q, and it goes on W to let's say a non final state, and R goes 311 00:30:30,481 --> 00:30:38,681 on W to some final state. For the inductive step we can mark a pair Q, R if 312 00:30:38,681 --> 00:30:45,417 these states go at some input A to distinguishable states, that is to say 313 00:30:45,417 --> 00:30:53,058 here is Q And on A goes to some state, I don't know whet it is, but its certainly 314 00:30:53,058 --> 00:30:59,037 delta Q and A, and here is R and it goes on A to again some other state i don't 315 00:30:59,037 --> 00:31:05,092 know what it is but its a delta of R and A. And then these two states on input W 316 00:31:05,092 --> 00:31:11,374 will say that this one goes to a non-final state and that goes to a final state on 317 00:31:11,374 --> 00:31:18,861 input W. Then, a claim AW distinguishes Q from R, because obviously Q goes on AW to 318 00:31:18,861 --> 00:31:25,854 a non-final state and R goes on the same AW to a final state. After no more marks 319 00:31:25,854 --> 00:31:32,242 are possible, the unmarked pairs are equivalent and can be merged into one 320 00:31:32,242 --> 00:31:39,862 state. This point may be obvious, but we are going to need it in what follows. If 321 00:31:39,862 --> 00:31:47,009 there is no string W that distinguishes state P and Q, and there is no string that 322 00:31:47,009 --> 00:31:54,069 distinguishes Q from R. Then how can we distinguish P from R? That would mean that 323 00:31:54,069 --> 00:32:00,780 some string W leads one of P and R, say R to a final state, so here is P and W, 324 00:32:00,780 --> 00:32:07,557 let's say it leads it to a final state. If there is no string W that distinguishes 325 00:32:07,557 --> 00:32:12,962 States P and Q, then there is no string that distinguishes Q from R, then how 326 00:32:12,962 --> 00:32:18,864 could some string W distinguish P from R. That would mean that there is some string 327 00:32:18,864 --> 00:32:28,380 W, needs one of P and R To a final state and the other to a non-final state. Let's 328 00:32:28,380 --> 00:32:35,564 say R leads to a final state, so here's W leading P to a non-final state. And here's 329 00:32:35,564 --> 00:32:43,161 R and on the same W gets to a final state. But then W also distinguishes Q from 330 00:32:43,161 --> 00:32:54,650 either P or R. So here's Q. And W leads it to some state. Okay, so let's say that W 331 00:32:54,650 --> 00:33:03,890 leads Q to a final state. Then W distinguishes Q from P, because here, P on 332 00:33:03,890 --> 00:33:11,555 W goes to a non-final state, Q on W goes to a final state. If Q goes to a non-final 333 00:33:11,555 --> 00:33:18,438 state, then it doesn't distinguish, P from Q but it does distinguish R from Q because 334 00:33:18,438 --> 00:33:25,962 Q would then go to a non-final state while R goes to a final state. Incidentally, 335 00:33:25,962 --> 00:33:31,851 distinguishable is not transitive. It is quite possible that W distinguishes P from 336 00:33:31,851 --> 00:33:37,102 Q and also Q from R, but does not distinguish P from R. For example W could 337 00:33:37,102 --> 00:33:43,310 lead both P and R to final state and Q to a non-final state. We are now going to use 338 00:33:43,310 --> 00:33:47,837 the table of in distinguishability to merge states that are indistinguishable. 339 00:33:47,837 --> 00:33:52,364 That gives us the minimum state DFA. Although we must be careful to remove at 340 00:33:52,364 --> 00:33:58,703 some stage, all the states that are not reachable from the start state. Suppose we 341 00:33:58,703 --> 00:34:05,740 have set of indistinguishable states, say Q1 through QK. We're going to replace them 342 00:34:05,740 --> 00:34:11,008 all by a single state that behaves as they all do, called the representative Q. It 343 00:34:11,008 --> 00:34:16,405 can be one of the QI's or some new name we create for this purpose. On any symbol A, 344 00:34:16,405 --> 00:34:20,892 all the indistinguishable states, the QI's, go to states that are also 345 00:34:20,892 --> 00:34:25,965 indistinguishable from one another. For if not, then we can use the distinction 346 00:34:25,965 --> 00:34:31,232 between say the states delta Q1 of A and delta Q2 of A to distinguish between Q1 347 00:34:31,232 --> 00:34:36,304 and Q2. But we already know that Q1 and Q2 are indistinguishable so that can't 348 00:34:36,304 --> 00:34:42,646 happen. Thus, make the transition for state Q on input A be the representative 349 00:34:42,646 --> 00:34:49,037 for the indistinguishable states delta Q1A and so on. Let's work with the DFA that we 350 00:34:49,037 --> 00:34:54,616 constructed from the NFA that represented the moves on a chess board. We are going 351 00:34:54,616 --> 00:35:00,183 to make it easier to work with by renaming the states to be single letters. For 352 00:35:00,183 --> 00:35:05,750 example A is the set containing only one, and G is the set containing one, three, 353 00:35:05,750 --> 00:35:13,173 Five, seven, and nine. Here is a little trick for arranging pairs in a triangle so 354 00:35:13,173 --> 00:35:17,715 that each pair appears exactly once. Notice that the rows are the states in 355 00:35:17,715 --> 00:35:22,780 order, except for the last state G, which doesn't appear there. Then the columns are 356 00:35:22,780 --> 00:35:28,155 labeled by the states in backwards order except for the first state A We begin the 357 00:35:28,155 --> 00:35:33,362 table of indistinguishabilities by marking each pair that consists of a final state 358 00:35:33,362 --> 00:35:38,259 and a non-final state. Here, the final states are F and G. So pairs that have one 359 00:35:38,259 --> 00:35:43,094 of these and one of the other states A through E are marked. Let's look at the 360 00:35:43,094 --> 00:35:47,805 transitions on input R. Notice that the column for R has only states B and D. 361 00:35:47,805 --> 00:35:52,454 Since we've not yet distinguished B from D there's no way input R can help 362 00:35:52,454 --> 00:35:57,923 distinguish other pairs of states at this point. However we have more luck with 363 00:35:57,923 --> 00:36:03,795 input B. Some states go to final states on input B, namely C, D, E, and G. And 364 00:36:03,795 --> 00:36:09,666 others, A, B, and F, go to non-final states, thus we can distinguish any of C, 365 00:36:09,666 --> 00:36:16,101 D, E, or G from any of A, B, or F. Some of these pairs are already distinguished, but 366 00:36:16,101 --> 00:36:24,478 we get seven new pairs marked in red here. At the next step we discovered two more 367 00:36:24,478 --> 00:36:30,289 distinguishable pairs, CD and CE. For example C and D lead to F and G 368 00:36:30,289 --> 00:36:37,179 respectively, on input B. So we know that whatever string W distinguishes F from G, 369 00:36:37,179 --> 00:36:43,717 D followed by W will distinguish P from D now we can lock the pair A, B. These 370 00:36:43,717 --> 00:36:49,450 states transition on input R to B and D respectively. And we already know that we 371 00:36:49,450 --> 00:36:56,164 can distinguish B from D. Unfortunately, D and E can never be marked because on both 372 00:36:56,164 --> 00:37:04,237 inputs they go to the same state. Okay, we are now done. We have distinguished every 373 00:37:04,237 --> 00:37:09,429 pair of states except for D and E. Moreover, we can never distinguish these 374 00:37:09,429 --> 00:37:19,001 states since they both go to D on input R and to G on input B. Thus, D and E form an 375 00:37:19,001 --> 00:37:25,026 indistinguishable group of states and we can replace them by representative. We 376 00:37:25,026 --> 00:37:30,822 choose the new state A which is the representative and all transitions from 377 00:37:30,822 --> 00:37:36,999 other states to D or E are replaced by transitions to H. The rows for D and E are 378 00:37:36,999 --> 00:37:42,796 replaced by a row for H. As D and E transition to D on input R, H transitions 379 00:37:42,796 --> 00:37:52,000 to itself on R. The transition on input B for H is to G, which is the same as it was 380 00:37:52,000 --> 00:37:57,685 for both D and E. As we mentioned, collapsing indistinguishable states to a 381 00:37:57,685 --> 00:38:02,156 single state goes a long way to finding the minimum state equivalent to a given 382 00:38:02,156 --> 00:38:07,246 DFA. But there's one other issue that in distinguishability doesn't address. The 383 00:38:07,246 --> 00:38:11,907 possible existence of unreachable states that are cluttering up the transition 384 00:38:11,907 --> 00:38:16,628 diagram or table. But it is easy to find such states and we can either eliminate 385 00:38:16,628 --> 00:38:20,522 them from the original DFA or we can eliminate them after merging 386 00:38:20,522 --> 00:38:27,790 indistinguishable states. It doesn't matter. Now we've done our best to combine 387 00:38:27,790 --> 00:38:33,977 states that we know how to combine. But it is in principle possible that there is 388 00:38:33,977 --> 00:38:39,811 some other smaller DFA that we can't get by combining states of our DFA. And 389 00:38:39,811 --> 00:38:49,771 fortunately that can't happen as we shall now see. Here's the proof that there is 390 00:38:49,771 --> 00:38:54,773 nothing smaller than the DFA we get by merging states and eliminating unreachable 391 00:38:54,773 --> 00:39:03,940 states. Suppose As are DSA and D is a hypothetical equivalent with fewer states 392 00:39:06,060 --> 00:39:11,878 Imagine we combine the states of A and B to form a larger DFA. It doesn't matter 393 00:39:11,878 --> 00:39:17,478 what the start state of the combined [inaudible] is but the final states are 394 00:39:17,478 --> 00:39:22,642 those of A and B. We need to use distinguishable in its contra-positive 395 00:39:22,642 --> 00:39:28,388 form, that is if W distinguishes delta of Q, A from delta of P, A then surely A, W 396 00:39:28,388 --> 00:39:33,988 distinguishes Q from P. So if Q and P are indistinguishable, then so are their 397 00:39:33,988 --> 00:39:40,810 successors on any input A. Here is an informal illustration of the proof 398 00:39:40,810 --> 00:39:45,928 technique. We start off with the fact that the start dates of automata A and B are 399 00:39:45,928 --> 00:39:52,672 surely indistinguishable because the automata accept the same language Now 400 00:39:52,672 --> 00:40:00,118 suppose the start states go to some states, P and Q on input A. P and Q must 401 00:40:00,118 --> 00:40:04,596 be indisguishable because if they were distinguishable then we could distinguish 402 00:40:04,596 --> 00:40:12,680 the start dates and we know we cannot do that Now suppose, Q and P go on input B to 403 00:40:13,017 --> 00:40:21,080 other states R and S. Then R and S must be indistinguishable for the same reason. 404 00:40:24,680 --> 00:40:29,880 Formally we shall prove that every state Q of A is in [inaudible] of some state of 405 00:40:29,880 --> 00:40:34,767 being. The proof is in the induction on the length of the shortest string that 406 00:40:34,767 --> 00:40:39,717 gets you to Q from a start state. Notice that because we eliminated unreachable 407 00:40:39,717 --> 00:40:47,685 states we know there is such a shorter string For the basis, the state of A that 408 00:40:47,685 --> 00:40:52,263 is reachable from the start state by a string of length zero, which of course is 409 00:40:52,263 --> 00:40:58,343 the start state itself. We know that this state of A is indistinguishable from the 410 00:40:58,343 --> 00:41:05,465 start state of B because the languages are the same. For the induction, assume the 411 00:41:05,465 --> 00:41:11,530 inductive hypothesis for strings shorter then W and suppose W is a shorter string 412 00:41:11,530 --> 00:41:17,448 getting A to state Q. Let W equal X, A that is A is the last symbol of W and X is 413 00:41:17,448 --> 00:41:25,785 all the rest of W. We can apply the inductive hypothesis to X because it is 414 00:41:25,785 --> 00:41:32,003 shorter than W. We know X gets A to some state R that is indistinguishable from 415 00:41:32,003 --> 00:41:40,291 some state P of B. But then A takes state R on input A to state Q and we know B 416 00:41:40,291 --> 00:41:46,929 takes state P on input A to some state, say S. Then Q must be indistinguishable 417 00:41:46,929 --> 00:41:53,850 from S using the argument that we saw two slides ago. Okay. Now we use the 418 00:41:53,850 --> 00:41:57,935 transitivity of indistinguishable to argue that no two states of A are 419 00:41:57,935 --> 00:42:02,193 indistinguishable from the same state of B. For if they were they would be 420 00:42:02,193 --> 00:42:07,145 indistinguishable from each other. But we A, cannot have indistinguishable states 421 00:42:07,145 --> 00:42:13,027 because we merged them all constructing A Thus, B has at least as many states of A, 422 00:42:13,027 --> 00:42:19,243 even though we started off assuming that there was no relationship between the 423 00:42:19,243 --> 00:42:25,066 automata A and B except that they each accepted the same language. So, that 424 00:42:25,066 --> 00:42:31,676 concludes the, entire argument that says that by throwing away unreachable states 425 00:42:31,676 --> 00:42:38,286 and then merging indistinguishable states you get an automaton that is a small, that 426 00:42:38,286 --> 00:42:43,480 is has as few states as any other automaton for the same language.