So, the algorithm that we're going to look at for solving the substring search problem is called the Knuth-Morris-Pratt algorithm. It's got an interesting history that we'll get to at the end. But I just want to start by saying that. This is one of the coolest algorithms that we'll cover in this course. And it's not an algorithm that anyone would come up with without a lot of lotta hard work. But understanding this algorithm really gives somebody an appreciation for what's possible with careful algorithmic thinking even for such a simple problem as this. It's a quite ingenious method. So, here's the intuition behind the algorithm. So, say, we're looking for the pattern. B followed by a bunch of a's in the text. So we're going along and so ahm well, we're trying to mismatch right away so we'll move over to first position. And we'll go BAAAA, and then we find a mismatch at the end in this case. So now, what the, brute force method is going to do is back up to move over one position if b doesn't match the a, back up again if b doesn't match the a, and keep going matching the first b against all these a's before getting to the next character in the text. So, we've matched five characters and we mismatched on the sixth. But the key point is that when we got that mismatch all we have is A's and B's. We already matched BAAAA, four A's and we got a mismatch so it's AB.. So, at that point we know what the previous characters in the text are, and we know that if we move over one position, we're going to be trying to match B against A. So we should be able to take advantage of that knowledge, that's the intuition on the method of the Knuth-Morris-Pratt. We don't need to back up the text pointer in this case. When we get the mismatch, we can just keep going. We don't even have to match the first b we know it's there So, We can just keep going in the text pointer and start in the second pattern pointer. The Knuth-Morris-Pratt algorithm is a very clever method that manages to always avoid backup no matter what the pattern is. So, let's take a look at what's involved. It's based on the idea of building a deterministic finite state machine for string searching. Now, the deterministic finite state machine, if you're not familiar or don't remember, it's a very simple thing. You've got a finite number of states. It's always in one of the states. There's a start state and a, and a halt state. And from every state, there's exactly one transition for each possible character in the alphabet. So we're, if we're in a state and we're reading a particular character we move to the state that is indicated by that character. So, if we're in state two and we're reading an A, we move to state three, that's what this line in the table says. If we're in state two and we read A, that happens to be the pattern character is a, and if we see and A, we move to state three. If we see AB or AC,c we go back to state zero. That's what this machine says. It says, the tabular representation and the graphical representation. If we get to stage six then we just say, except we found the pattern and you could see how that is, the pattern and then if we match the pattern character, we move right through this machine till we get to the halt state. So, let's first take a look at exactly how the machine works to make sure that everybody follows that. And then we'll look at how to rebuild this machine. So that's our machine and that's our text up at the top. So, we start at state zero we're reading in a. What are we supposed to do if we're in state zero and we're reading an A? This enter the table and says, we're supposed to go to state one. So, we go to state one. Every, at every step, we consume a text character. A. And in the graphical representation, it says, stay in state one. Or in this representation, it says, if you're in state one and you see an a, stay in state one. So that's what we'll do. We'll read the a and stay in state one. Now, we're in state one and so you can say that in, in state one is reading a's and in a way, looking for something that's not A, No matter how many a's you have, you'll read them right here in state one. So now, we're in state one and we're reading a B and it says, go to state two in that case. So, that's where we'll go, read the B. Now, we're in state two and reading an A, So we go to state three and read the A. Now, we're in state three and see a C. In state three, when we see a C we're supposed to go back to state zero. That's a mismatch our pattern sees not until the end. So now, we have to start the search all over again. Although in the final state machine does not know that, it's just plotting along according to the state transitions they're supposed to do. So, we read the C, C, now we're back in state zero. So now, we're in state zero looking at an a. And we move to one, and read the A. And then we see another A so we stay in one and read the A. And now we see a B, and we go to state two, and read the B. And then we go to state three and read the A, state three and read the A. State four and read the A. State five and read the C. And now, we're in state six. And that means we found the pattern. So, that's the simulation of what the deterministic finite state be. that is corresponding to this pattern would do. And as we'll see the code for implementing this simulation is quite simple. That's a demo of DFA simulation. So let's just take a quick look at what the interpretation of what this DFA is. So, what is it, what does it mean after we just read the ith character of the text? Well the number of the state is the number of characters in the pattern that we've matched up to the current point. So, that is it. At state three, we've matched three characters in the pattern. And what's happened is, that we've matched a prefix of a pattern, the first three characters of the pattern. It's, and actually, it's the longest prefix of the pattern that's also a suffix of the i, first i plus one characters from the text, Text from zero to i. That's the interpretation of what it is. So and if the DFA is in state three after the first six characters of the text it's got ABA is the longest prefix of a pattern. That's also a suffix of first six characters of the text. So, another prefix of the pattern that looks like it might work is ABABA but that's not a suffix of the text string ending at character six. And third, interpretation is useful to understanding what's going on. So given that we've constructed this DFA. It's just the state transition table for the DFA is simply a two-dimensional array. That's indexed by the pattern character. So, if, if you in a, a certain, if you're reading a certain pattern character sorry, if you're reading a certain text character then for each we reset the pattern pointer to be the entry given in the table that corresponds to the current one. So let's go back to the table to be sure about that. So when we're in a particular state like two and we see an a we read, so this would be j we refer to that column. And whatever text character we happen to see, that's how we update j. So that's what this code does. And then, if we ever to get to the transition state, the final state, we just return i minus m. Now, this is just like that alternate implementation of the brute force method that we gave. But the key idea is that, Notice that i never gets decremented. All we're doing is incrementing i and changing j. Either always just referring to the table. When we have a match, the table automatically increments j. We have a mismatch, it figures out the right state to go to. So, that's a very simple and economical and fast implementation of substring search. So the running time is clearly, all it does it look up an entry in the table for every text character. So, that's linear time. The key is, is. How do we build that table that drives the algorithm? And that's the tricky algorithm. So a warning. This is definitely the trickiest algorithm we've looked at so far. And the idea, the importance, again, of no backup is that since the text pointer never decrements, you could use the input stream and just replace text [UNKNOWN] just read the next character in the next input stream. And this algorithm is going to work fine, no matter how big the input stream is. It will just go right, right through. Its not no memory of what the text is. But its got some memory of what the pattern is that's built into the DFA. So, this is the key that, you have a pattern, You can spend some time building this DFA table and doing pre-processing. But then, when you get the text, just, just index into this table for every text character and you're doing this substring search. So, you can build a machine that does this. No backup. Okay. So let's take a look at what it means to construct this DFA. So it depends on the pattern instead of with one character one state for each character in the pattern plus an extra at substate. And let's look at what it means to build the same. So first thing now we do to one character one stage for each character in the pattern. So the first thing we do is deal with the match transitions. So, that's when the you've, you're in state j, That means you've already matched j characters in a pattern. Then if, if the next character matches. So that the character that you have is the, the character that's supposed to match. You're going to so it's, add that character j, that's just a j plus first character, Then you know that you've matched j plus one characters. So, all that means is we can put in the match transitions. So, we have ABABAC, and these guys go ABABAC if you're in state zero and you see an A you go to state one, if you're state one and you see a B, you go to state two and so forth. That's how we get the match transitions to get us all the way through the pattern to the accept state. So that's the, the easy part. And then the hard part is the mismatch transitions. What are you supposed to do if you come against a text character that does not match the current text character? So, for example, if you're in state zero, the pattern starts with an A. If you didn't see if you don't see an A as the first character in the text, If you see a B or a C, then obviously you want to stay in state zero. So, you can think of state zero as, scanning through the text looking for an A. It's going to stay in, in state zero as long as it sees a B or C as soon as it sees an A, it'll go to state one. That completes state zero, you know what to do if you have a, an A, you know what to do if you have a B or a C. What about state one? So, we know that if you're in state one, you saw an A in the text, and you see a B in the text, what are you supposed to do? Well there's two different cases. If you, If you see a B going to state two if you see a C and that's not going to match the first character A in the pattern. So, you go back to state zero. But if you see an A, it's just as if you saw an A, a in state zero, So, you might as well stay in state one. So, that's how we fill in the DFA for state one, If you see a B going state two, you matched. If you see an A, well that matches the first character. So, stay in state one. If you see a C, clearly you have to go back to state zero. Okay, what about the next one? So if we're in state two and we see an A we know that we go on to state three. And what about the mismatch pages? Well, in that case, it's a B or C and again if you're sitting out a B or a c, C you'd better go back to state zero to keep looking for an a. This is state three, and well, now it gets to be a little more complicated. C, you state three, see a B, you succeeded, if you see a C, you go back to state zero. It's not so bad if you see an A, it's just like before. That's going to be like the first one. So seems like we're going along pretty fine. And again if we're in state four, we seen A, you go to five. If you see a B or a C, then you better go back and look for an A again. This one, is the one that's a little more complicated. If you're in state five and you see a B you, you go back to state four. And that's it's a little more work to figure out why that's the case. And then that last case is kind of the essence of the algorithm. So we'll look at a systematic way to be able to figure out what you do on a mismatch in each case. In this case, you only needed that, that for that one state. Otherwise, it was elementary reasoning. So that's a full DFA for KnuthMorrisPratt. A demo of at least thinking about how it's going to be constructed. Okay, let's look a little more carefully and systematically at the construction process for the KnuthMorrisPratt DFA. So the, the start is clear. We're going to go through the pattern. And for systematically fill in the match transitions. If we're in state zero and we see an A, we want to go state one. If we're in state one and we see a B, we're going to, we want to go to state two. State two, so we look up the pattern character and then, whatever that one is, we want to go to the next state. So, we can fill in at least that much automatically. Now the real key is the mismatch transition. So, here is the idea of the mismatched transition. So if you're in stage J and you get a mismatch, the next character in the text does not match the jth character in the pattern. So as pointed out at the beginning as motivation for KnuthMorrisPratt, You know a lot about the text at that point. And you know that the last j, j minus one characters of the text are if you lop off the first character of the pattern, it's from one to j minus one. So, in this case and then, it's followed by the text character that, that you're looking at. So one thing that you could do, so what you want to do, so you know that. And what you could do is simulate the DFA that you have built on that part of the pattern and then take the transition for the character that you just find. So, let's, let's look at this. So let's run this machine, if we'd seen it on the text of, of BABA,. What we want to do, We want to put the machine in the state as if we have backed up, but we don't want to backup. So, if we see a B, we stay in zero, if we see an we, we go to one. Then we see a B we go two. And if we see an A, we go to three. So now we're in state three in if we had a mismatch then the, for the fifth character, what we do on a, on a mismatch here we have to look what happens if we get an A or we get a B. If we'd run it on BABA and we get an A, then we should go back to one. So, what that says is, if we had a mismatch and we saw an A in five we would need to be in state one. Because if we had, had run the thing on the characters that we know, BABAA, we would wind up in state one. And similarly if we were if we got the mismatch on a B, A if we did BABA,, we would be in state three. And, we're in state three and we saw B we'd go to four. So again to summarize, if four instate five and we see a C, we know that's a match, we go to six. If we see an A, we know that the previous five characters in the text were BABAA. So, we can just simulate the machine. Babaa, we're in state one. And ifwe get a mismatch that's a B, then that's DFAB five, Then we know that the previous five characters in the text were BABAB. And that would put us in state four. So that's the simulation of BABA puts us in state three. If we get an A, we go to one. So that means that from five, we go back to one. And if we get a B, we go to four. So, that's how, one way to calculate the mismatched transitions. And as we've noted in the simulation, this is the only non-trivial one for this example. Now, there's a little problem with that, Is that it seems to require j steps to do the simulation. In order to figure out these mismatched transitions, I had to all the way through the pattern shifted over one to figure out this state three. So that seems to be a bit of a problem, but actually it's no problem at all, because we can run this simulation one character at a time as we're building the machine. All we need to do is keep track of the that we would be at if we had run the DFA on the pattern starting at position one. Once, once we get going it's pretty easy but just let's start well, let's just illustrate it by saying, okay, We maintain this state x, Which is where we would be if we if we had run the machine and the pattern had shifted over one. Now when we come to do our mismatches to figure out where the mismatch transitions from five are all we do is look at if we get, if we were to get an A, would be as if we were to state X and got an A. So that's one and if we were to get a B it's as if we were at state X and got a B. And that's state four. So, what we need to do is to compute the mismatch transitions is keep track of state x. And that is where would the thing be if we had run it starting with the, at the pattern one position shifted over. And we want to update that. So, when we're moving to the next state, if we had a match on C, state x gets updated to where it would have gone if it got a C. Because for the next character, when we move j over for the match on C we'll want to have x updated. So, that's the key is keeping track of the state where the machine would be if we had backed up or if we had run it on the pattern shifted over one. So let's take a look at a demo that does the full construction for the KMP DFA. So here, here goes. Again, one state for every character plus an accept. Match transitions are easy. We build those. And we're going to start at position zero. And the mismatched transitions are easy. So now, when we move over x is when we're in position one of the pattern, x is where we would be if we started out without that, that character, Which would be the empty string. So we start out with just filling in zeros. For state zero and we can do that without any further every reason and now we've got x initialized. so now, what we need to do is fill in the mismatch transitions for state one. So, what are they? They're what would happen if we found those characters? And, and we're in state x, If we found an A, we would go to state one. If we found a C, we'd go to state zero. And maybe you noticed, that's just taking the entries corresponding to the entries we need from x's column and putting them in j's column. That's all it is. And then, we need to update x, which is where it would be if we matched a B. So, well stay and x will stay in state zero. So now let's look at state two. So we need to fill in what would happen if we got a B and what would happen if got a C. Well, it's what would happen if we were in state x and we got a B or a C, So what we'll do is move those two zeroes over to column two. And then don't forget we have to update x and x goes where the machine goes if we saw an A, that transitions from two to the three. So, we just move x to state one. So now, we have x in state one, and we're doing position three and now you can see how really simple the algorithm is once it gets going. So now the mismatched transitions are A and C, and that's what we have to fill for column but those mismatched transitions were already computed that's where x would go if we, that's where it would go if we happen to be in state one. So, we move the one and zero from column one over to column three. And again, we update x and when we see a B x goes to state two. And, and again, you can check what, what is x suppose to be. It's, it's suppose to be where you would be if you started the machine on the pattern with the first letter cut off. So, it's, it's suppose to be where would you wind up if you got BAB, BAB and that's a check. Alright, so now it's straightforward, straightforward we have to fill in B and C. We go to x's column and copy over B and C from x's column. Then we update x. That's if you see an A, you go to state three. Now we're ready for state five. We've got x all computed and we need to do A and B. And we get those from X. If it's an A it's a one. And if it's a B it's a four. It's just moved them over. And then when we do the C and get the accept there's no mismatch. That's the we do update x to get ready. But when we get to state six we're done. And again, it's just as a doublecheck, x is where you'd be if you saw BABAC.bac That's the construction of the Knuth-Morris-Pratt DFA efficient algorithm. Really ingenious and as it turns out, simple to implement. Implementation for the DFA construction for Knuth-Morris-Pratt requires remarkably little code so it just goes through with what we did in the demo. So we've got x and for every entry of the DFA and x's column that corresponds to everyone except for the match we just copy x's column to j's column. In fact, we just copy'em all, and then overwrite one corresponding to the match case, to j plus one. So there. And the entry in the column that corresponds to the pattern character, that's where do we go if we match, that's j plus one, all the rest of them are copied from x. So, one forwarded the copy from x then set the match case and the only other thing to do is update x. And how do you update x? You take the [unknown] machine to where it would go if it were at state x and I got the current x pattern, pattern character. So that's the, DFA construction amazingly a line of code. So the only, so flaw in this codec, is that it does take time proportional to R times M where are R is the radix and, and M is the length of the pattern and, and as we'll see this, ways to, get around this. But for relatively small alphabets this is no problem because the search is so efficient this way and the construction as well. But if you're doing this for Unicode for a long pattern maybe that's too much memory to devote to the DFA representation. So the bottom line is and this follows immediately from examining the code is that the substrings can be substring search as linear time. Your only access at most, each character in the pattern, each character in the texts wants quite remarkable. Each pattern character you have accessed once when your making the DN, DFA. And each text character is accessed once when you're simulating the DFA, and that's in the worst case. Now the space again, is proportional to RM because we have all those mismatches. It is possible to develop a version of Knuth-Morris-Pratt that constructs the automaton in time and space proportional to M. It's actually a non-deterministic automaton cuz it's either a match or all mismatches and mismatch might involve multiple hops. So if you are interested you can read about that version of KMP. But it's sufficiently more complicated that you should be, be prepared to study it carefully to really understand it. This algorithm is really interesting because of its history. It was actually independently discovered by two theoreticians and a hacker. So there's Knuth who was one of the fathers of computer science. Who read a paper that was a very esot by Steve Cook, a very esoteric theoretical result that he realized implied that it should be possible to solve the substring search problem in linear time. The theorem really didn't give any way to solve it but it indicated that it should be possible to solve it in linear time. So, Knuth worked on it and figured out a way. And then Pratt who was a student of Knuth's at Stanford at the time figured out a way to take care of this mismatch, independent of the alphabet side. And in the meantime, across the Bay at Berkeley Jim Morris was busy writing a text editor. In those days people were using typewriters. And other people were realizing that computers would be really good at editing text. And so you know, Many people would work on text editing and formatting systems. Ahm it was kind of a badge of honor in those times. Morris actually worked for the computer center at Berkeley. And wanted to have a really bolt-proof text editor that everyone could use. And one of the things he wanted to do was avoid backup. Cuz backup was just really inconvenient, Involved a lot of code. And just something that he didn't want to have. And he basically came up with the same algorithm and the community was kind of small at that time. And theory meets practice. And that's where this paper came from 1977. Morris, then went on to get his PHD, and to work at Xerox Parc. And unfortunately later on another systems programmer came and took a look at Morris's code and couldn't understand it, and put the brute force algorithm back in. But that part of the story is maybe a less successful example of theory meeting practice. That's Knuth-Morris-Pratt algorithm, one of the most ingenious algorithms that we'll see.