1 00:00:00,000 --> 00:00:04,977 There is a pumping lemma for context free languages analogous to the pumping lemma 2 00:00:04,977 --> 00:00:09,654 for regular languages. The goal of this lecture is to [inaudible] improve this 3 00:00:09,654 --> 00:00:14,271 lemma, which is naturally more complicated than the pumping lemma for regular 4 00:00:14,271 --> 00:00:19,008 languages. And then use it to show certain languages are not context free. Let's 5 00:00:19,008 --> 00:00:23,626 review the pumping lemma for regular languages. It said that any sufficiently 6 00:00:23,626 --> 00:00:28,183 long string W in a regular language has some short non empty piece near the 7 00:00:28,183 --> 00:00:33,100 beginning that we could pump. That is, we could repeat it as many times as we liked, 8 00:00:33,100 --> 00:00:38,208 including zero. And the resulting stream would also be in the language. We found 9 00:00:38,208 --> 00:00:43,075 the strength of pump by looking for the first state to repeat as a finite 10 00:00:43,075 --> 00:00:48,205 automaton process that's input W. The pumping level of a context free language 11 00:00:48,205 --> 00:00:53,334 says that we can find two pieces in any long string Z and pump them in tandem. 12 00:00:53,334 --> 00:00:58,464 That is, we can repeat each of them I times or any I equal to the greater than 13 00:00:58,464 --> 00:01:03,659 zero, and the result will be in the same language. We'll also see that these two 14 00:01:03,659 --> 00:01:08,792 strings are close together in Z, and one can be empty but not both. Here's the 15 00:01:08,792 --> 00:01:14,483 statement of the ? For context free languages. We can see it as the same sort 16 00:01:14,483 --> 00:01:19,597 game with an adversary that we talked about in connection with regular 17 00:01:19,597 --> 00:01:25,432 languages. For every context free language L, that is you get to pick L presumably 18 00:01:25,432 --> 00:01:30,978 the one you want to show isn't really context free, there's an integer N. This 19 00:01:30,978 --> 00:01:36,741 is something the adversary gets to pick but once it's picked it's fixed for the 20 00:01:36,741 --> 00:01:42,300 rest of the game. Such that for every string Z and L of length at least N And 21 00:01:42,300 --> 00:01:48,092 here, you get to pick the Z to focus on. You can break Z into five pieces - U, V, 22 00:01:48,092 --> 00:01:53,659 W, X, Y, such that three things are true. The adversary gets to pick how Z is 23 00:01:53,659 --> 00:01:58,774 broken, but subject to two constraints we'll see in just a moment. And 24 00:01:58,774 --> 00:02:04,792 incidentally, V and X are the substrings that get pumped. The first constraint is 25 00:02:04,792 --> 00:02:10,884 that the middle three component, V, W, and X are short, no longer than the length of 26 00:02:10,884 --> 00:02:16,984 them put together. Remember that V and X will get pumped, so that's a. Not only are 27 00:02:16,984 --> 00:02:22,890 they short but they appear within a bounded distance from each other within Z. 28 00:02:22,890 --> 00:02:29,175 The second condition that the adversary has to respect is that V and X cannot both 29 00:02:29,175 --> 00:02:35,536 be empty, although one can. And if all the above is satisfied and L really is context 30 00:02:35,536 --> 00:02:41,518 free then for all integers I equal to or greater than zero: U, V to the I, W, X to 31 00:02:41,518 --> 00:02:47,804 the Y, Y is also an L. We win the game and prove is not context free by allowing the 32 00:02:47,804 --> 00:02:53,683 adversary's choices of N and. Breakup of Z, subject to the constraints one and two. 33 00:02:53,683 --> 00:02:59,350 And then picking an I, such that UV to the IW, X the IY is not an L. The proof the, 34 00:02:59,350 --> 00:03:04,875 the pumping lemma for context free languages starts with a [inaudible] normal 35 00:03:04,875 --> 00:03:10,754 form grammar for L. Technically, it is for L minus epsilon, but the empty string will 36 00:03:10,754 --> 00:03:15,995 never meet the condition of being of length at least N. So its presence of 37 00:03:15,995 --> 00:03:21,379 absence doesn't matter. The [inaudible] grammar has M variables for sub M. So 38 00:03:21,379 --> 00:03:27,639 we're going to let M be two to the M. And now lets consider any string, z and l, of 39 00:03:27,639 --> 00:03:34,256 length at least n. That's two to the m again. We're going to prove first that any 40 00:03:34,256 --> 00:03:40,956 [inaudible] tree in the c and f grammar, for string z of length at least n equals 41 00:03:40,956 --> 00:03:47,821 two to the m, must have a path of length m plus two or more from the root to a leaf. 42 00:03:47,821 --> 00:03:54,477 Well, actually. Proof the contra-positive. That is, suppose there's a parse tree Z 43 00:03:54,477 --> 00:04:01,581 with no path longer than M plus one. Such a path has M nodes labeled by variables at 44 00:04:01,581 --> 00:04:08,263 the top, or the beginning, and one node labeled by a terminal at the end. That is, 45 00:04:08,263 --> 00:04:15,198 here is a typical path. That is, here is a typical path. It will have variables here, 46 00:04:15,198 --> 00:04:22,049 here, and here and then a terminal there. Okay. If we forget about the terminals at 47 00:04:22,049 --> 00:04:29,009 the leaves. Part strain at CNF grammar is a binary tree. Let's say at each level the 48 00:04:29,009 --> 00:04:35,541 number of notes is almost gonna double. Thus there is only one note at the top 49 00:04:35,541 --> 00:04:42,157 level, the root, there can be only two notes at the next level four at the third 50 00:04:42,157 --> 00:04:48,689 and eight at the fourth and so on. In general there will be at most two to the 51 00:04:48,689 --> 00:04:55,305 power M -one at the end level. But this tree has it most m levels with variables 52 00:04:55,305 --> 00:05:01,085 and then along each. Variables. The last variable has one trial with a terminal as 53 00:05:01,085 --> 00:05:06,401 its label. The largest number of leaves occur if all the paths on the tree have 54 00:05:06,401 --> 00:05:11,314 exactly N variables. If some paths terminate before level N, there will be 55 00:05:11,314 --> 00:05:16,833 fewer leaves. Thus there are at most two to the N minus one nodes that are labeled 56 00:05:16,833 --> 00:05:22,284 by variables and have a single trial with a terminal label. And thus there are at 57 00:05:22,284 --> 00:05:27,264 most two to the N minus one leaves. Finally, we therefore can conclude that 58 00:05:27,264 --> 00:05:32,905 the length of the yield is at most. Most of the two to the M minus one power M 59 00:05:32,905 --> 00:05:39,274 minus one. [inaudible] Power of m minus one is n over two. So z is of length n 60 00:05:39,274 --> 00:05:44,574 that cannot be the yield of any three. That has passed limited to length m plus 61 00:05:44,574 --> 00:05:49,940 one or less. Therefore, we conclude that somewhere on the [inaudible] is the path 62 00:05:49,940 --> 00:05:55,092 of this m plus two. Now we're ready to prove the pumping level. We just proved 63 00:05:55,092 --> 00:06:00,515 that Z's parsed tree has a path of length at least M+2. Only the last [inaudible] on 64 00:06:00,515 --> 00:06:05,740 any path can be labeled by a terminal. So there are at least M+1 [inaudible] with 65 00:06:05,740 --> 00:06:12,063 variables along this path. Let's focus on one of the longest paths on the. Surely 66 00:06:12,063 --> 00:06:18,755 there are at least ten plus one variables along this longest path. Remember that M 67 00:06:18,755 --> 00:06:23,767 is the number of variables of the grammar. So, along this path, there are two low 68 00:06:23,767 --> 00:06:28,843 nodes labeled by the same variable, call it A. In fact, to make sure we pump short 69 00:06:28,843 --> 00:06:34,045 pieces, let's look only at the bottom most M+1 variables along this path, which could 70 00:06:34,045 --> 00:06:39,749 be much longer than M+1. And we know that two of them must be the same. On the next 71 00:06:39,749 --> 00:06:47,075 slide we'll see what the pause stream must look like. Here's the picture of the 72 00:06:47,075 --> 00:06:51,979 [inaudible] tree for z. We've showed the path we've focused on and the lowest 73 00:06:51,979 --> 00:06:57,138 repeating variables along that path. The purple tree is rooted at the lower a. And 74 00:06:57,138 --> 00:07:02,912 the yellow tree with the purple tree within it is rooted up at upper a. Let W 75 00:07:02,912 --> 00:07:10,533 be the yield of the purple tree. V and X the portions of the yield with the yellow 76 00:07:10,533 --> 00:07:15,161 tree that precede and follow W respectively. And let U and V be the 77 00:07:15,161 --> 00:07:21,841 portions of Z that precede V and follow X respectively. Let's look at the yellow. 78 00:07:21,841 --> 00:07:27,336 Since the path shown is as long as any other, and that path has at most N+1 79 00:07:27,336 --> 00:07:33,056 variables, we know by the lemma one that we just proved that the yield of the 80 00:07:33,056 --> 00:07:39,078 yellow plus purple is no longer then two to the power M or N. That is, the length 81 00:07:39,078 --> 00:07:44,415 of VWX is no more than N. But v and x both can't be empty. Why? That's a useful 82 00:07:44,415 --> 00:07:49,666 property of [inaudible] normal form of grammars. Since the two A's shown in the 83 00:07:49,666 --> 00:07:55,182 tree are different nodes, the upper A must have a child to the left or right of the 84 00:07:55,182 --> 00:08:00,631 path shown. That's a consequence of the fact that we eliminated unit production so 85 00:08:00,631 --> 00:08:05,797 all bodies that have variables have at least two of them. Moreover, once we have 86 00:08:05,797 --> 00:08:10,804 a variable not on the path, there are no epsilon productions so we must generate 87 00:08:10,804 --> 00:08:15,687 from this variable at least one terminal. That is all we need to conclude that 88 00:08:15,687 --> 00:08:21,008 either V or X or both have length at least one. Now we can take advantage of the fact 89 00:08:21,008 --> 00:08:26,203 that we have two As along the path. We can get rid of V and X by pumping zero times. 90 00:08:26,203 --> 00:08:30,898 That is, we know the purple tree can substitute for the yellow because both 91 00:08:30,898 --> 00:08:35,345 trees have the same variable A at the root. If the original on the left 92 00:08:35,345 --> 00:08:40,425 satisfies the conditions of a parse tree, that is, every interior node is the head 93 00:08:40,425 --> 00:08:45,379 of a production whose body is the labels of his children, then the same will be 94 00:08:45,379 --> 00:08:50,395 true of the smaller parse tree on the right. We conclude that UWY is also in the 95 00:08:50,395 --> 00:08:56,749 language. Well, we could pump twice, that is, replace the purple tree by the yellow 96 00:08:56,749 --> 00:09:03,600 which has the purple within it and you get a par [inaudible] whose yield is uvvwxxy. 97 00:09:04,140 --> 00:09:09,177 And in the previous tree representing pumping twice, we could again replace the 98 00:09:09,177 --> 00:09:13,959 purple tree by the yellow, and get a bigger tree, whose yield is U, three V's, 99 00:09:13,959 --> 00:09:18,805 W, three X'es, and then Y. In the same manner, we could get parse trees for all 100 00:09:18,805 --> 00:09:23,715 strings [inaudible] form. U, V to the I, W, X, the IVY, for any integer I equal to 101 00:09:23,715 --> 00:09:28,561 or greater than zero. These strings are therefore all in the language L. That 102 00:09:28,561 --> 00:09:34,150 proves the pumping [inaudible] for context free languages. Let's look at an example 103 00:09:34,150 --> 00:09:39,220 of how the pumping [inaudible] can be used to show a language not to be context free. 104 00:09:40,060 --> 00:09:46,007 This language which involves matching the counts of two blocks of zeros is context 105 00:09:46,007 --> 00:09:52,230 free. A grammar of PDA4 is easy to construct. The ideas are very much like 106 00:09:52,230 --> 00:09:59,035 what we saw for the language O to the N, one to the N. But give the language three 107 00:09:59,035 --> 00:10:03,992 blocks of zeros, all of which must be the same length and we're suddenly outside 108 00:10:03,992 --> 00:10:09,742 what context free grammars can do. We can prove that using the problem for 109 00:10:09,742 --> 00:10:17,478 context-free languages. So we'll pick this language L. And the adversary now gets to 110 00:10:17,478 --> 00:10:24,130 pick N. We don't know N, but we do know it's fixed for the rest of the game. We 111 00:10:24,130 --> 00:10:30,133 get to pick Z, so let's pick zero to the N, one, zero to the N, one, zero to the N. 112 00:10:30,133 --> 00:10:36,520 That is, each block of zeroes is of length equal to whatever N the adversary picked. 113 00:10:36,900 --> 00:10:45,626 Now the adversary gets to break I, G up into Z equals U, V, W, X, Y. But he must 114 00:10:45,626 --> 00:10:51,594 chose these substrings, such that DWX together are no longer than N and D and X 115 00:10:51,594 --> 00:10:57,411 can not both be picked to be the empty string. There two cases depending upon 116 00:10:57,411 --> 00:11:03,380 whether the advisory picks D and X to have zeros or not. The first case, suppose 117 00:11:03,380 --> 00:11:09,726 there are no zeros among D and X and since they can not both be empty. There must me 118 00:11:09,726 --> 00:11:15,770 at least one, one among them. But then if we pump zero times to get the string UWY. 119 00:11:15,770 --> 00:11:21,510 We know that there is at most one, one in this string. The pumping number claims it 120 00:11:21,510 --> 00:11:27,040 as in the language l. But it can't be, because all strings in l have exactly two 121 00:11:27,040 --> 00:11:33,179 1s. In the second and last case, v and x have exactly one zero among them. D W X 122 00:11:33,179 --> 00:11:40,092 has length at most N. So these three sub strings cannot extend from the first block 123 00:11:40,092 --> 00:11:48,357 of 0s to the last because N plus two positions separate those blocks. So again 124 00:11:48,357 --> 00:11:54,666 consider U, W, I which if L is context free it must be an L. Removing V and X 125 00:11:54,666 --> 00:12:01,228 must leave at lease one of the three blocks of N 0's intact, so it still has N 126 00:12:01,228 --> 00:12:08,211 0's but V and X have at least one zero so when U, W, Y at least one of the blocks of 127 00:12:08,211 --> 00:12:14,857 0's is fewer than N 0's. We conclude that in this case too U, W, Y cannot be an L 128 00:12:14,857 --> 00:12:17,802 and those L cannot be context free.