1
00:00:00,000 --> 00:00:04,599
Today we are going to discuss algorithms
for answering questions about regular

2
00:00:04,599 --> 00:00:08,791
languages or really about the
representations for the languages such as

3
00:00:08,791 --> 00:00:13,448
[inaudible]. The questions we can resolve
for [inaudible] include many we cannot

4
00:00:13,448 --> 00:00:18,339
resolve for programs in general. Examples
include whether a given string is accepted

5
00:00:18,339 --> 00:00:22,530
by a given [inaudible], that's the
membership problem. Or whether a given

6
00:00:22,530 --> 00:00:26,897
[inaudible] accepts any string at all, the
emptiness problem. As part of our

7
00:00:26,897 --> 00:00:31,147
discussion we are going to prove an
important theorem called the pumping

8
00:00:31,147 --> 00:00:35,685
[inaudible] which lets us show certain
languages not to be regular. Automotive

9
00:00:35,685 --> 00:00:40,393
theory talks about many different classes
of languages including context feeling,

10
00:00:40,393 --> 00:00:44,984
redigit cursive and recursive, inumeral
languages. Regards to meet each of these

11
00:00:44,984 --> 00:00:49,634
classes but for the moment we know only
one class, the regular languages. When we

12
00:00:49,634 --> 00:00:54,109
investigate a class of languages there are
two important issues. The first is

13
00:00:54,109 --> 00:00:58,758
decision properties. We shall see that
there are many questions we might like to

14
00:00:58,758 --> 00:01:02,885
ask about a language or rather its
representation, such as whether it is

15
00:01:02,885 --> 00:01:07,418
empty, finite or infinite. Its good to
know there are algorithms to answer such

16
00:01:07,418 --> 00:01:12,612
questions ... At least for the regular
languages. Unfortunately as we meet larger

17
00:01:12,612 --> 00:01:17,817
classes of languages, we find that in
general the larger the class of languages

18
00:01:17,817 --> 00:01:22,496
the less likely there is to be an
algorithms to answer questions about

19
00:01:22,496 --> 00:01:27,503
languages in that class. The second
important issue is closure of properties

20
00:01:27,503 --> 00:01:32,577
of the class. These involve applying
operations such as union to languages in

21
00:01:32,577 --> 00:01:37,519
the class. We are going to defer the
discussion of closure of properties to

22
00:01:37,519 --> 00:01:42,703
another lecture although I'll give you an
example on the next slide. [inaudible]

23
00:01:42,703 --> 00:01:47,393
properties are statements that when we
apply certain operations to languages in

24
00:01:47,393 --> 00:01:51,906
the class, the result is also in the
class. For example we say that a class of

25
00:01:51,906 --> 00:01:56,655
languages is closed under union, if given
two languages in the class, the union of

26
00:01:56,655 --> 00:02:01,227
those languages is also in the class. So
if i have two regular languages i can

27
00:02:01,227 --> 00:02:06,093
represent them by regular expressions and
connect those expressions by a plus, with

28
00:02:06,093 --> 00:02:10,313
the appropriate parentheses, to get a
regular expression for their union.

29
00:02:10,313 --> 00:02:14,651
Similar constructions work for their
concatenation and closure. Now let us

30
00:02:14,651 --> 00:02:20,994
address the main topic. Of this lecture,
decision properties of regular languages.

31
00:02:20,994 --> 00:02:28,092
We've used both formal and informal ways
of describing languages. The formal ways

32
00:02:28,092 --> 00:02:34,051
include [inaudible] and regular
expressions, each of which defines a

33
00:02:34,051 --> 00:02:40,273
language by a precise mathematical
definition. But we've also described

34
00:02:40,273 --> 00:02:47,371
languages informally by pro statements and
set formers such as this. Or even more

35
00:02:47,371 --> 00:02:52,939
informal statements like this. Okay.
However you can't answer questions about a

36
00:02:52,939 --> 00:02:57,226
language unless you have a formal
description. For example, an instance

37
00:02:57,226 --> 00:03:02,003
where we'll talk shortly about testing
whether a regular language is infinite,

38
00:03:02,003 --> 00:03:06,841
given one of its formal representations.
It looks like the last of the informal

39
00:03:06,841 --> 00:03:13,741
descriptions describes an infinite set,
but does the word ?some' Mean any or some

40
00:03:13,741 --> 00:03:19,246
one particular number like ten for example
that I had in mind. Thus we are only going

41
00:03:19,246 --> 00:03:24,621
to use formal descriptions of languages
when we talk about algorithms for deciding

42
00:03:24,621 --> 00:03:29,775
things about those languages. Thus a
decision property for a class of languages

43
00:03:29,775 --> 00:03:33,773
is an algorithm that takes a formal
representative of the language. The

44
00:03:33,773 --> 00:03:38,559
algorithm answers some particular question
about the language such as whether or not

45
00:03:38,559 --> 00:03:44,369
the language described by the
representation is empty. Here are a few

46
00:03:44,369 --> 00:03:50,400
examples of why we might be interested in
decision properties of regular languages.

47
00:03:51,740 --> 00:03:58,004
Both involve protocols represented by
[inaudible] If we ask whether the language

48
00:03:58,004 --> 00:04:02,254
as such in the [inaudible] is finite we
are, in effect, asking whether the

49
00:04:02,254 --> 00:04:06,970
protocol it represents is guaranteed to
terminate. Or if we make the final states

50
00:04:06,970 --> 00:04:11,336
of the [inaudible] be error states, then
asking if its language is empty, is

51
00:04:11,336 --> 00:04:16,168
tantamount to asking whether the protocol
can fail. And remember we couldn't answer

52
00:04:16,168 --> 00:04:21,000
either of these questions about programs
in general, so we couldn't get the answers

53
00:04:21,000 --> 00:04:25,600
to these questions about protocols by
looking at the code that implements them.

54
00:04:26,060 --> 00:04:30,753
Another use for decision properties of
regular languages involves minimizing

55
00:04:30,753 --> 00:04:35,263
their representations. For instance, the
"F" "A"'s are a good representations for

56
00:04:35,263 --> 00:04:39,957
certain kinds of digital circuits. Those
that have memory. We usually want the

57
00:04:39,957 --> 00:04:44,955
smallest circuit to accomplish a task and
a good first step is to find a DFA that

58
00:04:44,955 --> 00:04:49,831
does what we want and has the smallest
number of states of any DFA for the same

59
00:04:49,831 --> 00:04:54,665
language. It turns out that we can
determine whether two DFA's are

60
00:04:54,665 --> 00:04:59,357
equivalent. That is whether they define
the same language. That lets us find the

61
00:04:59,357 --> 00:05:03,812
minimum state DFA equivalent to any given
DFA. Again, we do none of this for

62
00:05:03,812 --> 00:05:08,564
programs in general, you cant tell whether
two programs do the same thing, and we

63
00:05:08,564 --> 00:05:13,138
cant find the smallest equivalent for a
given program even though we know in

64
00:05:13,138 --> 00:05:18,691
principle that one must exist. The
membership question for regular languages

65
00:05:18,691 --> 00:05:24,191
answered by an algorithm that takes a DFA
and a string and tells whether or not the

66
00:05:24,191 --> 00:05:29,166
string is accepted by the DFA. The
algorithm is the obvious one. Simulate the

67
00:05:29,166 --> 00:05:37,100
DFA on the input. Here is an example, it's
something we've was seen several times

68
00:05:37,100 --> 00:05:42,880
before each time I show it to you I use a
different style of presentation, but the

69
00:05:42,880 --> 00:05:48,095
idea is always the same. Here, the DFA for
strings without consecutive ones,

70
00:05:48,095 --> 00:05:53,452
something which I know we have seen
before, and the input string is zero one

71
00:05:53,452 --> 00:05:58,245
zero one, one. That obviously has
consecutive ones so it shouldn't be

72
00:05:58,245 --> 00:06:04,854
accepted. Well, let's see what happens
when we simulate it. We read a zero we

73
00:06:04,854 --> 00:06:12,254
stay in A. When we read a one we go to B.
We read a zero we go back to A. Read a one

74
00:06:12,254 --> 00:06:18,696
we go to B, read another one we go to C.
When we simulate this DFA on the input, we

75
00:06:18,696 --> 00:06:26,148
see that in D the string gets [inaudible]
to state C, and it is not accepted. Now

76
00:06:26,148 --> 00:06:31,001
you might wonder what if the regular
language were represented by an NFA or a

77
00:06:31,001 --> 00:06:35,778
regular expression for example Then you
first need to convert the representation

78
00:06:35,778 --> 00:06:40,508
to a DFA and then simulate the DFA. It is
possible to convert from many of the full

79
00:06:40,508 --> 00:06:44,610
representations we know about to any of
the others using the circles of

80
00:06:44,610 --> 00:06:49,225
conversions. So doing might exponentiate
the size of the description but there is

81
00:06:49,225 --> 00:06:53,783
still an algorithm to do the conversion,
and that's all we need to show there is,

82
00:06:53,783 --> 00:06:58,456
for example, and algorithm to tell, given
a regular expression and a string, Whether

83
00:06:58,456 --> 00:07:02,965
the string is in the language o the
regular expression. Generally, proofs of

84
00:07:02,965 --> 00:07:08,536
closure or decision properties require
either a DFA or regular expression by the

85
00:07:08,536 --> 00:07:14,037
way. The emptiness problem is given a
representation for a regular language does

86
00:07:14,037 --> 00:07:20,698
its language contain any string at all.
Okay, we are going to assume the line is

87
00:07:20,698 --> 00:07:25,718
represented by a DFA. Obviously if by some
other representation, then converted to a

88
00:07:25,718 --> 00:07:30,853
DFA. Finding reachable states requires a
breadth first a depth first search from

89
00:07:30,853 --> 00:07:35,346
the start state. I'm not going to assume
you are familiar with these search

90
00:07:35,346 --> 00:07:40,259
techniques, but it is fairly easy to think
of some way of searching a graph from a

91
00:07:40,259 --> 00:07:44,932
single node by following all arcs out of
the node and marking those nodes you

92
00:07:44,932 --> 00:07:49,905
visited. Then follow arcs out of the nodes
you visited and mark any other nodes you

93
00:07:49,905 --> 00:07:55,150
visit. Keep doing that until you can not
mark any other nodes. The mark [inaudible]

94
00:07:55,150 --> 00:07:59,778
of those that are reached from the start
state on some input. If at least one final

95
00:07:59,778 --> 00:08:04,072
state is marked, then the DFA accepts at
least one input. If no final state is

96
00:08:04,072 --> 00:08:08,700
marked then it is impossible for the DFA
to accept anything and its language empty.

97
00:08:09,540 --> 00:08:24,785
Here is an example. Here is your start
state. We might mark, mark it, mark these

98
00:08:24,785 --> 00:08:37,929
guys. Maybe mark these guys as we go. But
if all your final states are out here, and

99
00:08:37,929 --> 00:08:45,810
never get marked. Then obviously no matter
how complex this automaton is, you cant

100
00:08:45,810 --> 00:08:50,349
reach your final state, and you cant
accept anything. Now if you have a

101
00:08:50,349 --> 00:08:55,601
automaton with three states it is pretty
easy to tell what is reachable and what

102
00:08:55,601 --> 00:09:00,530
isn't. But if you have a million states
represented by some table, then it is

103
00:09:00,530 --> 00:09:05,717
hardly easy to tell. But fortunately we
have a straight forward search algorithm

104
00:09:05,717 --> 00:09:11,071
regardless of how large the automaton or
graph is. Now let us take up a more

105
00:09:11,071 --> 00:09:15,664
difficult problem, but one that we can
still solve. We'd like to know whether or

106
00:09:15,664 --> 00:09:20,547
not the language defined by DFA is finite
or infinite. The first fact we're going to

107
00:09:20,547 --> 00:09:25,256
prove is that if the language of the DFA
contains any string of length n or more,

108
00:09:25,256 --> 00:09:29,733
where n is the number of states of the
DFA, then the DFA contains an infinite

109
00:09:29,733 --> 00:09:35,780
number of strings. Surely if the DFA
doesn't accept any string of money equal

110
00:09:35,780 --> 00:09:40,446
to or greater than N then it accepts only
a finite number of strings. However it

111
00:09:40,446 --> 00:09:45,111
doesn't seem feasible to test membership
for all input strings of length and or

112
00:09:45,111 --> 00:09:49,894
more since there are an infinite number of
them. Can we ever finish? If not then we

113
00:09:49,894 --> 00:09:54,754
really don't have an algorithm for testing
infiniteness. But as we shall see it is

114
00:09:54,754 --> 00:09:59,968
possible to limit the length of strings we
have to test to twice the number of state

115
00:09:59,968 --> 00:10:04,630
so we have a really large and ugly but
finite task, and we really do have an

116
00:10:04,630 --> 00:10:09,414
algorithm. So let's try to prove the point
we need, that if the DFA accepts any

117
00:10:09,414 --> 00:10:14,505
string whose length is at least the number
of states N, that it accepts an infinite

118
00:10:14,505 --> 00:10:20,369
number of strings First observe that a
string of a length N or more has at least

119
00:10:20,369 --> 00:10:26,187
N plus one states along its path. For
example a string a string of length two

120
00:10:26,187 --> 00:10:34,140
has three states on its path, here is a
typical picture we might have. String AB.

121
00:10:35,520 --> 00:10:44,100
Looks like that. Notice no matter how
many, well, the number of [inaudible]

122
00:10:44,100 --> 00:10:51,620
symbols in the string is the number of
arcs and there's always one more node than

123
00:10:51,620 --> 00:11:00,093
there are arcs in the path. That's why
there will be, n plus one states for a

124
00:11:00,093 --> 00:11:07,548
string of length n. Now if there are only
N different states, and there are N plus

125
00:11:07,548 --> 00:11:13,725
one states along the path, then two states
along the path must be the same. That's

126
00:11:13,725 --> 00:11:19,903
called the pigeonhole principle, you might
remember. Here is a picture of the path

127
00:11:19,903 --> 00:11:25,775
for string W, which we are breaking up as
X, Y, and Z. X is the prefix of W that

128
00:11:25,775 --> 00:11:31,572
gets the DFA to the fist state that
repeats on the path, which we call state

129
00:11:31,572 --> 00:11:38,470
Q, that's right here of course. Then Y is
a part of W that gets the [inaudible] back

130
00:11:38,470 --> 00:11:44,966
to Q for the first time. That is the end
of Y is the second occurrence of Q. Notice

131
00:11:44,966 --> 00:11:52,314
that therefore Y cannot be the empty
strain although X and Q might be Finally,

132
00:11:52,314 --> 00:11:58,625
Z is the rest of W and we know it gets DFA
to a final state because W is accepted by

133
00:11:58,625 --> 00:12:04,016
the DFA. Notice that the path label Z may
have states that also appear earlier but

134
00:12:04,016 --> 00:12:08,825
it doesn't matter, the important thing is
that we identify the first repeating state

135
00:12:08,825 --> 00:12:16,422
"Q". The claim that X, followed by I
repetitions of Y, followed by Z, is also

136
00:12:16,422 --> 00:12:23,707
accepted by the DFA for any integer I. To
see why, X takes us to state Q, so we

137
00:12:23,707 --> 00:12:33,257
could for example go like this and then we
could skip Y altogether. Just follow Z and

138
00:12:33,257 --> 00:12:40,467
we get to a final state. That tells us
that XZ is in the language. Or, we could

139
00:12:40,467 --> 00:12:48,686
follow X to state Q, we could go around
loops many, many times. And then finally

140
00:12:48,686 --> 00:12:57,006
go off following Z, and again get to the
final state. Or, after I use of input Y

141
00:12:57,006 --> 00:13:05,121
for any I, however large, we follow input
Z and we accept. This proves that X1 or

142
00:13:05,121 --> 00:13:11,360
the IZ is accepted for any I. Remember
that Y cannot be empty so all these

143
00:13:11,360 --> 00:13:16,580
accepted strings are different. Thus the
DSA accept an infinite number of strings,

144
00:13:16,580 --> 00:13:21,735
one for each I. Remember, we still do not
have an algorithm because we can't test

145
00:13:21,735 --> 00:13:26,697
the infinite number of strings that
lengths equal or greater to N, however we

146
00:13:26,697 --> 00:13:32,233
don't have to Because it is sufficient to
test strings of length between N and 2N

147
00:13:32,233 --> 00:13:37,601
minus one and there are a finite number of
such strings. When we prove this statement

148
00:13:37,601 --> 00:13:42,273
we show at least have an algorithm
although it is a rather time-consuming

149
00:13:42,273 --> 00:13:47,578
one. Now we picked Y to be the first cycle
on the path. So the length of XY cannot be

150
00:13:47,578 --> 00:13:52,692
greater than N. That is, some state within
the first N plus one states on the path

151
00:13:52,692 --> 00:13:58,992
surely repeats. We also know that if
length of "Y" is at least one, so "Y" lies

152
00:13:58,992 --> 00:14:05,901
between one and N then if W is the
shortest accepted string of length at

153
00:14:05,901 --> 00:14:11,764
least n, then we claim that W can not be
as long as 2n. But suppose it were. Now

154
00:14:11,764 --> 00:14:16,789
XZ, as we know, is another accepted
string, and the length of XZ is the length

155
00:14:16,789 --> 00:14:22,278
of W minus the length of Y, and the length
of Y is at most N. So the length of XZ is

156
00:14:22,278 --> 00:14:27,700
at least N. That means that XZ is a string
shorter than W, yet at least N in length

157
00:14:27,700 --> 00:14:33,056
and it is also accepted. But we assume
that there was no string accepted that was

158
00:14:33,056 --> 00:14:40,416
shorter than W and also of length of at
least N. As a result Given any really long

159
00:14:40,416 --> 00:14:46,283
string W that's accepted we can keep
taking out pieces of length between one

160
00:14:46,283 --> 00:14:52,074
and N, that's the Y in each of these
diagrams, and we just keep throwing them

161
00:14:52,074 --> 00:15:00,613
away and eventually what's left of W will
be between N and 2N minus one. So the

162
00:15:00,613 --> 00:15:06,318
algorithm to decide whether a regular
language is infinite is to construct a DFA

163
00:15:06,318 --> 00:15:11,813
for it. And let the DFA have N states.
Test all the strings of length between N

164
00:15:11,813 --> 00:15:17,166
and 2N minus one, and say infinite if any
of them are accepted. Otherwise say

165
00:15:17,166 --> 00:15:23,234
finite. This is a terrible algorithm. If
there are K input symbols and N states,

166
00:15:23,234 --> 00:15:28,412
then the number of strings we have to
simulate is about K to the power 2N. That

167
00:15:28,412 --> 00:15:33,787
is a lot of work and there is a much more
efficient algorithm. One that takes time

168
00:15:33,787 --> 00:15:39,228
proportional to the number of transitions,
that is K times N, if implemented right. I

169
00:15:39,228 --> 00:15:44,669
wanted to give you the argument about the
length of strings because it's important

170
00:15:44,669 --> 00:15:50,175
when we take up the pumping [inaudible], a
technique for showing languages not to be

171
00:15:50,175 --> 00:15:57,085
regular. We already discussed searching
forward from a node in a graph to find all

172
00:15:57,085 --> 00:16:02,541
the nodes you can reach. So we can
eliminate the states that are not

173
00:16:02,541 --> 00:16:08,714
reachable from the start state. We then
want to eliminate states that don't reach

174
00:16:08,714 --> 00:16:13,755
a final state. This algorithm is the same
except start by marking the final states

175
00:16:13,755 --> 00:16:20,298
and follow the arcs backwards. Now there
is an elegant algorithm for finding cycles

176
00:16:20,298 --> 00:16:25,308
using depth first search that takes time
proportional to the number of edges or

177
00:16:25,308 --> 00:16:30,255
transitions. I am going to trust that
you'll meet this algorithm in a course on

178
00:16:30,255 --> 00:16:36,283
algorithms if a data structures, if you
haven't already done so. However, here is

179
00:16:36,283 --> 00:16:40,980
a simple way to test for a cycle in a
graph, it takes time proportional to the

180
00:16:40,980 --> 00:16:45,375
number of nodes or states, times the
number of arcs or transitions. We are

181
00:16:45,375 --> 00:16:50,313
going to do the same thing for each node
N. Starting at N, search forward until you

182
00:16:50,313 --> 00:16:55,860
either can reach no more nodes. Or you
discover that you can reach N. That is,

183
00:16:55,860 --> 00:17:01,905
here's node N, we explore forward it, and
if we're lucky at some point we reach a

184
00:17:01,905 --> 00:17:08,252
node that has an arc back to N. If you can
reach N, then you have a cycle and you can

185
00:17:08,252 --> 00:17:13,768
conclude the language of the DFA is
infinite. If not, try the same process

186
00:17:13,768 --> 00:17:19,812
from another node. If you exhaust all the
nodes as starting points and you still

187
00:17:19,812 --> 00:17:25,631
haven't found the cycle, then there are
none and you conclude the language is

188
00:17:25,631 --> 00:17:30,241
finite. This is a good time to introduce
the pumping lemma for regular languages

189
00:17:30,241 --> 00:17:34,386
because we have essentially proved it
during our analysis of the [inaudible] of

190
00:17:34,386 --> 00:17:43,060
this problem Here is the statement of the
pumping lemma. For every regular language

191
00:17:43,060 --> 00:17:50,044
L there is an integer N. Which happens to
be the number of states which sum DFA for

192
00:17:50,044 --> 00:17:56,078
L, such that if every string w and l,
whose length is at least n, we can break w

193
00:17:56,078 --> 00:18:04,555
into w=xyz where Y is the first label of
sub-string of W that goes from a state to

194
00:18:04,555 --> 00:18:10,008
the, the same state, as we saw in the
previous slides, such that three things

195
00:18:10,008 --> 00:18:15,970
are true. First the prefix of XY is short,
it is of length at most N. We are sure of

196
00:18:15,970 --> 00:18:22,576
that by making y the label of the first
cycle we encounter. Second, Y is not the

197
00:18:22,576 --> 00:18:28,151
empty string. We are sure of this because
Y connects two different occurrences of

198
00:18:28,151 --> 00:18:35,438
the same state along the path of W. And
lastly, XY to the IZ is in L for all

199
00:18:35,438 --> 00:18:41,692
integers I. The statement is particularly
complex because it is of the form for all

200
00:18:41,692 --> 00:18:46,643
there exists, for all there exists. But
here's how we use it. Think of a game

201
00:18:46,643 --> 00:18:52,562
played between you and an adversary. You
pick the language L that you want to show

202
00:18:52,562 --> 00:18:58,265
is not regular and suppose the adversary
claims it is regular Then the adversary

203
00:18:58,265 --> 00:19:03,762
has to provide that their exist parts
while you play the for all parts. You have

204
00:19:03,762 --> 00:19:09,397
already picked L. Now the adversary has to
pick M. He can pick a number as large as

205
00:19:09,397 --> 00:19:14,345
he wants , but once picked, it is
finalized and the game proceeds. Now you

206
00:19:14,345 --> 00:19:20,117
get to pick the string W, subject only to
the constraint that it is at least as long

207
00:19:20,117 --> 00:19:25,481
as N, the number the adversary picked.
Next the advisory has to break your W up

208
00:19:25,481 --> 00:19:30,985
into X Y Z subject to the constraints of
the length of X Y as at most N and the

209
00:19:30,985 --> 00:19:36,627
length of Y as at least one. You win the
game by picking an I so if your X want to

210
00:19:36,627 --> 00:19:42,132
be I Z is not an L. However, in a proof we
don't know what moves an advisory will

211
00:19:42,132 --> 00:19:47,633
make, but to win we want to cover all
possible moves that is we know the picked

212
00:19:47,633 --> 00:19:53,259
N, but we don't know N's actual value. So
we must pick W in terms of N. Similarly,

213
00:19:53,259 --> 00:19:58,815
we know W equals X, Y, Z, but we don't
know exactly where Y is, except that it is

214
00:19:58,815 --> 00:20:04,370
not empty and it is among the first N
positions of W. Thus, our argument that X,

215
00:20:04,370 --> 00:20:09,844
Y to the I, Z is not an [inaudible] for
any of these possible Y's. Now, let's see

216
00:20:09,844 --> 00:20:14,669
an example. Let us pick this language as
L. It is the set of strings consisting of

217
00:20:14,669 --> 00:20:19,736
some number of zeros followed by the same
number of ones and we have claimed before

218
00:20:19,736 --> 00:20:26,853
that it is an example of a non regular
language. Now we're going to prove it. Now

219
00:20:26,853 --> 00:20:32,460
the adversary picks N, we don't know what
N is, but we know it has some fixed value.

220
00:20:33,120 --> 00:20:41,229
Now we get to choose W in terms of N. And
we pick, W equals zero to the N one to the

221
00:20:41,229 --> 00:20:47,128
N, that is N zeros followed by N ones. But
then the adversary gets to break W into X,

222
00:20:47,128 --> 00:20:53,822
Y, Z and you don't know exactly how it is
broken. But we know enough about X, Y, Z

223
00:20:53,822 --> 00:21:00,495
to show that there is some string, in
particular the case I=2, that the pumping

224
00:21:00,495 --> 00:21:07,696
lemma says has to be in the language L.
But obviously it isn't because it has more

225
00:21:07,696 --> 00:21:16,255
0's than 1's. That is, We know that Y,
being part of the first N positions of W,

226
00:21:16,255 --> 00:21:23,811
Can have only zeros. So two Ys have more
zeros than one Y and the number one which

227
00:21:23,811 --> 00:21:30,150
are all contained within a Z doesn't
change. We next take up the question of

228
00:21:30,150 --> 00:21:36,323
testing whether two regular languages are
the same. We've supposedly given

229
00:21:36,323 --> 00:21:44,220
representation that the two languages, L
and M Whatever representation we are given

230
00:21:44,220 --> 00:21:49,881
we convert to DFAs, then we have to
combine those DFAs into a single DFA, that

231
00:21:49,881 --> 00:21:57,680
in a sense, runs both DFAs in parallel, we
call it the product DFA. Suppose these two

232
00:21:57,680 --> 00:22:05,748
DFA's have states Q and R in the product
DFA the states are pairs, one state from Q

233
00:22:05,748 --> 00:22:12,403
the other from R The start state of the
[inaudible] DFA is the pair consisting of

234
00:22:12,403 --> 00:22:19,195
the start state from each DFA. For the
transitions to the product DFA suppose we

235
00:22:19,195 --> 00:22:25,766
have a state that is the pair QR And
suppose A is the input symbol from which

236
00:22:25,766 --> 00:22:31,467
we want to figure out the transition. We
look at the transition function for the

237
00:22:31,467 --> 00:22:38,040
first DFA, Say Delta L, and we see where Q
goes on input A. So here's Q and on input

238
00:22:38,040 --> 00:22:46,275
A it goes to some state like that. Then we
look at the transition function for the

239
00:22:46,275 --> 00:22:54,410
second DFA. Say, delta M, and we see where
R goes on input A. So here's R on input A

240
00:22:54,410 --> 00:23:02,085
it goes somewhere here. Okay. Then in the
product DFA, the transition from the state

241
00:23:02,085 --> 00:23:10,557
Qr ... Then in the product DFA the
transition from the state QR on input A is

242
00:23:10,557 --> 00:23:20,722
the state pair that is the first component
is delta L of Qa, and whose second

243
00:23:20,722 --> 00:23:28,316
component is delta M of Ra. That is we
simulate the two transitions in parallel.

244
00:23:28,316 --> 00:23:36,197
Here is a little example. Here are the two
given DFAs. We will call this the orange

245
00:23:36,197 --> 00:23:46,407
DFA and that the purple DFA. And here is
the product DFA. For example let's figure

246
00:23:46,407 --> 00:23:54,947
out the transition from AC on zero, so
here's state AC and I look in the orange

247
00:23:54,947 --> 00:24:03,254
[inaudible] a on a zero goes to a itself
And in the purple automaton, C on a zero

248
00:24:03,254 --> 00:24:10,483
goes to D. So the Combination AC goes to
AD, and you see that transition here. For

249
00:24:10,483 --> 00:24:18,772
another example where does AD go on input
one? Well, the orange automaton says A

250
00:24:18,772 --> 00:24:27,529
goes to B on one. The purple automaton
says D goes to C on one. So, on one AD

251
00:24:27,529 --> 00:24:35,292
goes to BC. That's this transition there.
The algorithm for testing whether two

252
00:24:35,292 --> 00:24:40,060
DFA's are equivalent, that is whether they
accept the same language, begins by

253
00:24:40,060 --> 00:24:44,828
constructing the product DFA. Make the
final states of the product DFA be all

254
00:24:44,828 --> 00:24:49,967
those pairs such that one is a final state
and the other isn't. If string W reaches

255
00:24:49,967 --> 00:24:55,106
one of these final states in the product,
then W is accepted by one of the original

256
00:24:55,106 --> 00:24:59,812
DFA's and not the other. Thus the two
languages are not the same. Only if the

257
00:24:59,812 --> 00:25:04,827
product DFA with this selection of final
states has an empty language of the two

258
00:25:04,827 --> 00:25:13,421
DFA's equivalent. Here's an example AC has
made a final state because in the original

259
00:25:13,421 --> 00:25:20,647
[inaudible] c is final and a is not.
Likewise BD is final because B is final

260
00:25:20,647 --> 00:25:26,835
but D is not. We now see that the two
original DFA's are not equivalent. It

261
00:25:26,835 --> 00:25:31,608
happens that the final state B, D is not
reachable from the start state so there

262
00:25:31,608 --> 00:25:35,793
are no strings that the orange
[inaudible], [inaudible], [inaudible] but

263
00:25:35,793 --> 00:25:41,086
the purple one does not accept. However,
AC is also a final state. And it is

264
00:25:41,086 --> 00:25:46,960
obviously reachable from the final states
simply because it is the final state. That

265
00:25:46,960 --> 00:25:52,764
is the empty string distinguishes between
the orange and purple automata. The empty

266
00:25:52,764 --> 00:25:59,148
string is accepted by the orange automaton
but not the purple one. A related question

267
00:25:59,148 --> 00:26:04,728
to ask about regular languages is whether
one is contained in the other. The test

268
00:26:04,728 --> 00:26:09,412
is, in a sense, one half of the
equivalence test we just saw. Start by

269
00:26:09,412 --> 00:26:14,234
building the product automaton. But we
have to define the final states

270
00:26:14,234 --> 00:26:19,744
differently. How will you do that? That is
L is not contained in M if, and only if,

271
00:26:19,744 --> 00:26:26,058
there is some string W that is in L but
not in M. Such a string would get the DFA

272
00:26:26,058 --> 00:26:31,338
for L to a final state, but would not get
the DFA for M to a final state. So the

273
00:26:31,338 --> 00:26:36,885
question of containment is the same as the
question of whether there is any string,

274
00:26:36,885 --> 00:26:42,098
W, that gets the product automaton to a
state QR, where Q is final and R is not.

275
00:26:42,098 --> 00:26:47,177
Here B is the only final state of the
first automaton and D is the only non

276
00:26:47,177 --> 00:26:52,709
final state of the second automaton, so
only BD is final. Okay. As we observed

277
00:26:52,709 --> 00:26:58,357
before, BD is not reachable from the start
state. It has arcs out but no arcs in.

278
00:26:58,357 --> 00:27:03,862
Thus the language of the product automaton
is empty. And we conclude that the

279
00:27:03,862 --> 00:27:09,224
language of the orange automaton is a
subset of the language of the purple

280
00:27:09,224 --> 00:27:15,229
automaton. Next we're going to attack the
problem of given a DFA find the equivalent

281
00:27:15,229 --> 00:27:20,778
DFA with the fewest states. There is an
obvious dumb algorithm, just consider all

282
00:27:20,778 --> 00:27:26,581
the DFA's with the same input alphabet but
a smaller number of states, there's a huge

283
00:27:26,581 --> 00:27:31,814
but finite number of such states, sorry,
got to do that over. Next, we're going to

284
00:27:31,814 --> 00:27:36,693
attack the problem of, given a DFA, find
the equivalent DFA with the fewest states.

285
00:27:36,693 --> 00:27:41,572
There's an obvious dumb algorithm. Just
consider all the DFAs with the same input

286
00:27:41,572 --> 00:27:46,451
alphabet, but a smaller number of states.
There's a huge but finite number of such

287
00:27:46,451 --> 00:27:51,451
[inaudible] so in principal we can solve
this problem. This time we're not going to

288
00:27:51,451 --> 00:27:56,210
dwell on the bad algorithm, but talk you
through the good algorithm immediately.

289
00:27:56,210 --> 00:28:00,898
The key idea is to build a table of pairs
of states and figure out which pairs are

290
00:28:00,898 --> 00:28:05,587
distinguishable in the sense that there is
some input string that leads one of the

291
00:28:05,587 --> 00:28:09,994
pair to a final state and the other to a
non-final state. Otherwise, states are

292
00:28:09,994 --> 00:28:16,276
indistinguishable and they can be merged
into a single state. Here is the

293
00:28:16,276 --> 00:28:22,708
[inaudible] automaton we saw way back in
the beginning of the course. Now, let's

294
00:28:22,708 --> 00:28:31,233
look at the states forty thirty here and
add in ... Input S takes them both to an

295
00:28:31,233 --> 00:28:39,307
accepting state that is [inaudible] and
input O takes them both to deuce, here and

296
00:28:39,307 --> 00:28:47,080
here. Thus, no further inputs could ever
distinguish 4030 from add in. Similarly

297
00:28:47,080 --> 00:28:55,108
thirty forty and [inaudible] are
indistinguishable Now we can deduce that

298
00:28:55,108 --> 00:29:06,234
thirty all , that's here, and deuce. Are
also indistinguishable, on input S They go

299
00:29:06,234 --> 00:29:14,975
to forty thirty and add in, respectively,
that is, thirty all goes to forty thirty

300
00:29:14,975 --> 00:29:22,044
on S. And deuce goes to add in on S. But
we know that those two states are

301
00:29:22,044 --> 00:29:29,763
indistinguishable so we'll never be able
to distinguish 30 all from deuce by a

302
00:29:29,763 --> 00:29:39,720
sequence beginning with S. And further, on
input O well, 30 all goes to 30-40. And

303
00:29:39,720 --> 00:29:45,184
deuce goes to add out. And we said we cap
these thing with these two states. So

304
00:29:45,184 --> 00:29:50,859
there is no string [inaudible] that can
distinguish 30 all from deuce. We are now

305
00:29:50,859 --> 00:29:56,323
going to talk about how we find the
distinguishable states. The basis is pairs

306
00:29:56,323 --> 00:30:01,858
that are distinguishable by the empty
string. These are the pairs that have one

307
00:30:01,858 --> 00:30:08,873
final and one non-final state. For the
inductive step we can mark a pair QR if

308
00:30:08,873 --> 00:30:15,401
those stay [inaudible] put A to
distinguish [inaudible]. If delta of QA

309
00:30:15,401 --> 00:30:22,849
and delta of RA are marked, then they're
distinguishable by some string W. That is,

310
00:30:22,849 --> 00:30:30,481
here we have Q, here's Q, and it goes on W
to let's say a non final state, and R goes

311
00:30:30,481 --> 00:30:38,681
on W to some final state. For the
inductive step we can mark a pair Q, R if

312
00:30:38,681 --> 00:30:45,417
these states go at some input A to
distinguishable states, that is to say

313
00:30:45,417 --> 00:30:53,058
here is Q And on A goes to some state, I
don't know whet it is, but its certainly

314
00:30:53,058 --> 00:30:59,037
delta Q and A, and here is R and it goes
on A to again some other state i don't

315
00:30:59,037 --> 00:31:05,092
know what it is but its a delta of R and
A. And then these two states on input W

316
00:31:05,092 --> 00:31:11,374
will say that this one goes to a non-final
state and that goes to a final state on

317
00:31:11,374 --> 00:31:18,861
input W. Then, a claim AW distinguishes Q
from R, because obviously Q goes on AW to

318
00:31:18,861 --> 00:31:25,854
a non-final state and R goes on the same
AW to a final state. After no more marks

319
00:31:25,854 --> 00:31:32,242
are possible, the unmarked pairs are
equivalent and can be merged into one

320
00:31:32,242 --> 00:31:39,862
state. This point may be obvious, but we
are going to need it in what follows. If

321
00:31:39,862 --> 00:31:47,009
there is no string W that distinguishes
state P and Q, and there is no string that

322
00:31:47,009 --> 00:31:54,069
distinguishes Q from R. Then how can we
distinguish P from R? That would mean that

323
00:31:54,069 --> 00:32:00,780
some string W leads one of P and R, say R
to a final state, so here is P and W,

324
00:32:00,780 --> 00:32:07,557
let's say it leads it to a final state. If
there is no string W that distinguishes

325
00:32:07,557 --> 00:32:12,962
States P and Q, then there is no string
that distinguishes Q from R, then how

326
00:32:12,962 --> 00:32:18,864
could some string W distinguish P from R.
That would mean that there is some string

327
00:32:18,864 --> 00:32:28,380
W, needs one of P and R To a final state
and the other to a non-final state. Let's

328
00:32:28,380 --> 00:32:35,564
say R leads to a final state, so here's W
leading P to a non-final state. And here's

329
00:32:35,564 --> 00:32:43,161
R and on the same W gets to a final state.
But then W also distinguishes Q from

330
00:32:43,161 --> 00:32:54,650
either P or R. So here's Q. And W leads it
to some state. Okay, so let's say that W

331
00:32:54,650 --> 00:33:03,890
leads Q to a final state. Then W
distinguishes Q from P, because here, P on

332
00:33:03,890 --> 00:33:11,555
W goes to a non-final state, Q on W goes
to a final state. If Q goes to a non-final

333
00:33:11,555 --> 00:33:18,438
state, then it doesn't distinguish, P from
Q but it does distinguish R from Q because

334
00:33:18,438 --> 00:33:25,962
Q would then go to a non-final state while
R goes to a final state. Incidentally,

335
00:33:25,962 --> 00:33:31,851
distinguishable is not transitive. It is
quite possible that W distinguishes P from

336
00:33:31,851 --> 00:33:37,102
Q and also Q from R, but does not
distinguish P from R. For example W could

337
00:33:37,102 --> 00:33:43,310
lead both P and R to final state and Q to
a non-final state. We are now going to use

338
00:33:43,310 --> 00:33:47,837
the table of in distinguishability to
merge states that are indistinguishable.

339
00:33:47,837 --> 00:33:52,364
That gives us the minimum state DFA.
Although we must be careful to remove at

340
00:33:52,364 --> 00:33:58,703
some stage, all the states that are not
reachable from the start state. Suppose we

341
00:33:58,703 --> 00:34:05,740
have set of indistinguishable states, say
Q1 through QK. We're going to replace them

342
00:34:05,740 --> 00:34:11,008
all by a single state that behaves as they
all do, called the representative Q. It

343
00:34:11,008 --> 00:34:16,405
can be one of the QI's or some new name we
create for this purpose. On any symbol A,

344
00:34:16,405 --> 00:34:20,892
all the indistinguishable states, the
QI's, go to states that are also

345
00:34:20,892 --> 00:34:25,965
indistinguishable from one another. For if
not, then we can use the distinction

346
00:34:25,965 --> 00:34:31,232
between say the states delta Q1 of A and
delta Q2 of A to distinguish between Q1

347
00:34:31,232 --> 00:34:36,304
and Q2. But we already know that Q1 and Q2
are indistinguishable so that can't

348
00:34:36,304 --> 00:34:42,646
happen. Thus, make the transition for
state Q on input A be the representative

349
00:34:42,646 --> 00:34:49,037
for the indistinguishable states delta Q1A
and so on. Let's work with the DFA that we

350
00:34:49,037 --> 00:34:54,616
constructed from the NFA that represented
the moves on a chess board. We are going

351
00:34:54,616 --> 00:35:00,183
to make it easier to work with by renaming
the states to be single letters. For

352
00:35:00,183 --> 00:35:05,750
example A is the set containing only one,
and G is the set containing one, three,

353
00:35:05,750 --> 00:35:13,173
Five, seven, and nine. Here is a little
trick for arranging pairs in a triangle so

354
00:35:13,173 --> 00:35:17,715
that each pair appears exactly once.
Notice that the rows are the states in

355
00:35:17,715 --> 00:35:22,780
order, except for the last state G, which
doesn't appear there. Then the columns are

356
00:35:22,780 --> 00:35:28,155
labeled by the states in backwards order
except for the first state A We begin the

357
00:35:28,155 --> 00:35:33,362
table of indistinguishabilities by marking
each pair that consists of a final state

358
00:35:33,362 --> 00:35:38,259
and a non-final state. Here, the final
states are F and G. So pairs that have one

359
00:35:38,259 --> 00:35:43,094
of these and one of the other states A
through E are marked. Let's look at the

360
00:35:43,094 --> 00:35:47,805
transitions on input R. Notice that the
column for R has only states B and D.

361
00:35:47,805 --> 00:35:52,454
Since we've not yet distinguished B from D
there's no way input R can help

362
00:35:52,454 --> 00:35:57,923
distinguish other pairs of states at this
point. However we have more luck with

363
00:35:57,923 --> 00:36:03,795
input B. Some states go to final states on
input B, namely C, D, E, and G. And

364
00:36:03,795 --> 00:36:09,666
others, A, B, and F, go to non-final
states, thus we can distinguish any of C,

365
00:36:09,666 --> 00:36:16,101
D, E, or G from any of A, B, or F. Some of
these pairs are already distinguished, but

366
00:36:16,101 --> 00:36:24,478
we get seven new pairs marked in red here.
At the next step we discovered two more

367
00:36:24,478 --> 00:36:30,289
distinguishable pairs, CD and CE. For
example C and D lead to F and G

368
00:36:30,289 --> 00:36:37,179
respectively, on input B. So we know that
whatever string W distinguishes F from G,

369
00:36:37,179 --> 00:36:43,717
D followed by W will distinguish P from D
now we can lock the pair A, B. These

370
00:36:43,717 --> 00:36:49,450
states transition on input R to B and D
respectively. And we already know that we

371
00:36:49,450 --> 00:36:56,164
can distinguish B from D. Unfortunately, D
and E can never be marked because on both

372
00:36:56,164 --> 00:37:04,237
inputs they go to the same state. Okay, we
are now done. We have distinguished every

373
00:37:04,237 --> 00:37:09,429
pair of states except for D and E.
Moreover, we can never distinguish these

374
00:37:09,429 --> 00:37:19,001
states since they both go to D on input R
and to G on input B. Thus, D and E form an

375
00:37:19,001 --> 00:37:25,026
indistinguishable group of states and we
can replace them by representative. We

376
00:37:25,026 --> 00:37:30,822
choose the new state A which is the
representative and all transitions from

377
00:37:30,822 --> 00:37:36,999
other states to D or E are replaced by
transitions to H. The rows for D and E are

378
00:37:36,999 --> 00:37:42,796
replaced by a row for H. As D and E
transition to D on input R, H transitions

379
00:37:42,796 --> 00:37:52,000
to itself on R. The transition on input B
for H is to G, which is the same as it was

380
00:37:52,000 --> 00:37:57,685
for both D and E. As we mentioned,
collapsing indistinguishable states to a

381
00:37:57,685 --> 00:38:02,156
single state goes a long way to finding
the minimum state equivalent to a given

382
00:38:02,156 --> 00:38:07,246
DFA. But there's one other issue that in
distinguishability doesn't address. The

383
00:38:07,246 --> 00:38:11,907
possible existence of unreachable states
that are cluttering up the transition

384
00:38:11,907 --> 00:38:16,628
diagram or table. But it is easy to find
such states and we can either eliminate

385
00:38:16,628 --> 00:38:20,522
them from the original DFA or we can
eliminate them after merging

386
00:38:20,522 --> 00:38:27,790
indistinguishable states. It doesn't
matter. Now we've done our best to combine

387
00:38:27,790 --> 00:38:33,977
states that we know how to combine. But it
is in principle possible that there is

388
00:38:33,977 --> 00:38:39,811
some other smaller DFA that we can't get
by combining states of our DFA. And

389
00:38:39,811 --> 00:38:49,771
fortunately that can't happen as we shall
now see. Here's the proof that there is

390
00:38:49,771 --> 00:38:54,773
nothing smaller than the DFA we get by
merging states and eliminating unreachable

391
00:38:54,773 --> 00:39:03,940
states. Suppose As are DSA and D is a
hypothetical equivalent with fewer states

392
00:39:06,060 --> 00:39:11,878
Imagine we combine the states of A and B
to form a larger DFA. It doesn't matter

393
00:39:11,878 --> 00:39:17,478
what the start state of the combined
[inaudible] is but the final states are

394
00:39:17,478 --> 00:39:22,642
those of A and B. We need to use
distinguishable in its contra-positive

395
00:39:22,642 --> 00:39:28,388
form, that is if W distinguishes delta of
Q, A from delta of P, A then surely A, W

396
00:39:28,388 --> 00:39:33,988
distinguishes Q from P. So if Q and P are
indistinguishable, then so are their

397
00:39:33,988 --> 00:39:40,810
successors on any input A. Here is an
informal illustration of the proof

398
00:39:40,810 --> 00:39:45,928
technique. We start off with the fact that
the start dates of automata A and B are

399
00:39:45,928 --> 00:39:52,672
surely indistinguishable because the
automata accept the same language Now

400
00:39:52,672 --> 00:40:00,118
suppose the start states go to some
states, P and Q on input A. P and Q must

401
00:40:00,118 --> 00:40:04,596
be indisguishable because if they were
distinguishable then we could distinguish

402
00:40:04,596 --> 00:40:12,680
the start dates and we know we cannot do
that Now suppose, Q and P go on input B to

403
00:40:13,017 --> 00:40:21,080
other states R and S. Then R and S must be
indistinguishable for the same reason.

404
00:40:24,680 --> 00:40:29,880
Formally we shall prove that every state Q
of A is in [inaudible] of some state of

405
00:40:29,880 --> 00:40:34,767
being. The proof is in the induction on
the length of the shortest string that

406
00:40:34,767 --> 00:40:39,717
gets you to Q from a start state. Notice
that because we eliminated unreachable

407
00:40:39,717 --> 00:40:47,685
states we know there is such a shorter
string For the basis, the state of A that

408
00:40:47,685 --> 00:40:52,263
is reachable from the start state by a
string of length zero, which of course is

409
00:40:52,263 --> 00:40:58,343
the start state itself. We know that this
state of A is indistinguishable from the

410
00:40:58,343 --> 00:41:05,465
start state of B because the languages are
the same. For the induction, assume the

411
00:41:05,465 --> 00:41:11,530
inductive hypothesis for strings shorter
then W and suppose W is a shorter string

412
00:41:11,530 --> 00:41:17,448
getting A to state Q. Let W equal X, A
that is A is the last symbol of W and X is

413
00:41:17,448 --> 00:41:25,785
all the rest of W. We can apply the
inductive hypothesis to X because it is

414
00:41:25,785 --> 00:41:32,003
shorter than W. We know X gets A to some
state R that is indistinguishable from

415
00:41:32,003 --> 00:41:40,291
some state P of B. But then A takes state
R on input A to state Q and we know B

416
00:41:40,291 --> 00:41:46,929
takes state P on input A to some state,
say S. Then Q must be indistinguishable

417
00:41:46,929 --> 00:41:53,850
from S using the argument that we saw two
slides ago. Okay. Now we use the

418
00:41:53,850 --> 00:41:57,935
transitivity of indistinguishable to argue
that no two states of A are

419
00:41:57,935 --> 00:42:02,193
indistinguishable from the same state of
B. For if they were they would be

420
00:42:02,193 --> 00:42:07,145
indistinguishable from each other. But we
A, cannot have indistinguishable states

421
00:42:07,145 --> 00:42:13,027
because we merged them all constructing A
Thus, B has at least as many states of A,

422
00:42:13,027 --> 00:42:19,243
even though we started off assuming that
there was no relationship between the

423
00:42:19,243 --> 00:42:25,066
automata A and B except that they each
accepted the same language. So, that

424
00:42:25,066 --> 00:42:31,676
concludes the, entire argument that says
that by throwing away unreachable states

425
00:42:31,676 --> 00:42:38,286
and then merging indistinguishable states
you get an automaton that is a small, that

426
00:42:38,286 --> 00:42:43,480
is has as few states as any other
automaton for the same language.