1 00:00:03,320 --> 00:00:11,039 As usual, it's very instructive to take a look at the simplest brute force algorithm 2 00:00:11,039 --> 00:00:15,412 for the problem. We'll look at that in a little detail, 3 00:00:15,412 --> 00:00:21,567 because it illustrates, really what, fundamental issues involved with getting 4 00:00:21,567 --> 00:00:27,478 efficient algorithms for this. And it's also the basis for more efficient 5 00:00:27,478 --> 00:00:31,609 algorithms. So the brute force method, we could give 6 00:00:31,609 --> 00:00:37,440 in a beginning programming class. You have your text, you have an index I 7 00:00:37,440 --> 00:00:43,552 that index as in to text, and you have an index j that indexes into the pattern. 8 00:00:43,552 --> 00:00:50,947 And, start out with both i and j at zero, and you compare text to pattern, and you 9 00:00:50,947 --> 00:00:56,916 keep going until you find a mismatch. If you find a mismatch before the end of 10 00:00:56,916 --> 00:01:01,752 the pattern, then what you do is move the pattern over one position, 11 00:01:01,752 --> 00:01:05,795 That corresponds to simply incremental the text pointer. 12 00:01:05,795 --> 00:01:10,919 And then, here we have a mismatch right away, so we move the pattern over one 13 00:01:10,920 --> 00:01:16,045 position, increment the text pointer. And then starting with I at two, we 14 00:01:16,045 --> 00:01:21,604 compare the second text character with j of zero, the first pattern character. 15 00:01:21,604 --> 00:01:25,863 Increment j to one, increment i to three, we have a mismatch. 16 00:01:25,863 --> 00:01:32,646 Mismatch happens at j and +J, j. the jth character the pattern versus the iJ + 17 00:01:32,646 --> 00:01:37,859 character of the text. Now we have a mismatch, move over. 18 00:01:37,859 --> 00:01:45,568 That's increment i to three, compare the first character of the pattern, j = zero 19 00:01:45,568 --> 00:01:50,663 would be the i plus j pattern of the text. That's a mismatch. 20 00:01:50,906 --> 00:01:55,192 Move over one. So we go first one is a match. 21 00:01:55,192 --> 00:02:00,773 The second one j of i is four compare against five, that's a mismatch. 22 00:02:00,773 --> 00:02:04,332 Move the pattern over one, that's a mismatch. 23 00:02:04,574 --> 00:02:10,613 Finally get to i = position six, and we go for j equals zero, one, two, three, all 24 00:02:10,613 --> 00:02:14,050 matches. When j gets to four, which is the number 25 00:02:14,050 --> 00:02:18,700 of characters in the pattern that's when we know we've found a match. 26 00:02:18,905 --> 00:02:23,570 So, in this table, the entry's engraved, just there for a reference. 27 00:02:23,775 --> 00:02:29,057 The black ones are where when we found matches, and the red ones are where we 28 00:02:29,057 --> 00:02:33,379 found mismatches. So, that's, we do the trace before the 29 00:02:33,379 --> 00:02:36,260 code, because the code is extremely simple. 30 00:02:36,495 --> 00:02:42,622 So this is, to implement brute force substring search, to look for a pattern, a 31 00:02:42,622 --> 00:02:49,298 given pattern, in a given text, or for job as index of, this woud be, on the index of 32 00:02:49,298 --> 00:02:55,749 method in, string, takes the argument pattern. So we get the pattern length, and 33 00:02:55,749 --> 00:03:00,889 we get the text length. And we're going to potentially look at 34 00:03:01,146 --> 00:03:07,915 every, straight, the pattern could start at any position in the text from zero to N 35 00:03:07,915 --> 00:03:12,758 - M. And for every value of i, we've create a 36 00:03:12,758 --> 00:03:19,869 new j and we look for the match between the pattern in position j and the text 37 00:03:19,869 --> 00:03:26,460 character of position + j. If we get all, all the way to j = m, then 38 00:03:26,720 --> 00:03:32,183 we found a match in i. If we get a mismatch, then we get a break 39 00:03:32,183 --> 00:03:37,040 before j = m, and we go and try the next value of i. 40 00:03:37,040 --> 00:03:43,085 And if we get all the way through there without returning, then we just return N, 41 00:03:43,085 --> 00:03:48,978 which is one past the index of the last text character, the length of the text. 42 00:03:48,978 --> 00:03:54,117 And that's an indication that the pattern was not found in the text. 43 00:03:54,117 --> 00:03:59,180 Very straightforward implementation of the pattern match algorithm. 44 00:03:59,640 --> 00:04:06,215 Now, the key point is that, This algorithm is fine in, in many 45 00:04:06,215 --> 00:04:10,703 contexts, and actually it's the one that Java's index of uses. 46 00:04:10,913 --> 00:04:14,770 But the real problem is that it can be slow. Number one, 47 00:04:14,770 --> 00:04:19,252 Just the algorithmic problem, It can be slow if there's a lot of, 48 00:04:19,252 --> 00:04:23,243 repeat, repetitive characters in the text and the pattern. 49 00:04:23,243 --> 00:04:28,776 For example, suppose, that the, text is, all A's and a B, and imagine that there is 50 00:04:28,776 --> 00:04:34,378 a million A's and a B or whatever. And then the pattern also was a smaller 51 00:04:34,378 --> 00:04:40,039 copy of the same thing, all A's and a B. Then what happens in this case is that for 52 00:04:40,039 --> 00:04:45,421 every possible position matching the pattern against the text, we go through 53 00:04:45,421 --> 00:04:50,190 all the pattern characters and only find the mismatch on the last one. 54 00:04:50,190 --> 00:04:54,796 So, And then we find a mismatch, we have to go 55 00:04:54,796 --> 00:04:58,725 over one and try them all and find the mismatch in the last one. 56 00:04:58,725 --> 00:05:04,474 And we eventually do find a match, but for every text character we've looked at 57 00:05:04,474 --> 00:05:10,151 almost M - one pattern characters. So, this is a worst case that shows that 58 00:05:10,151 --> 00:05:15,902 the running time of the brute force algorithm can be proportional to M N, 59 00:05:15,902 --> 00:05:20,925 where M is the pattern length. And the pattern could be long, say that 60 00:05:20,925 --> 00:05:26,531 the pattern could be hundred characters and N can be huge, like a billion and 61 00:05:26,531 --> 00:05:30,534 that's going to be slow, Particularly by comparison to the 62 00:05:30,534 --> 00:05:36,504 alternatives that we're going to look at. Now a more important issue than just the 63 00:05:36,504 --> 00:05:39,780 worst case performance is the idea of backup. 64 00:05:39,780 --> 00:05:45,873 As I mentioned, for lots of applications, if we're going to put our machine in, on 65 00:05:45,873 --> 00:05:52,361 one of the wires of the Internet and watch the input go by or if we just take the 66 00:05:52,361 --> 00:05:57,109 abstract standard input model, you don't get to go, you don't get to back up when 67 00:05:57,109 --> 00:06:02,727 you find a mismatch but the brute force algorithm is always backing up. 68 00:06:02,727 --> 00:06:07,080 If we go through, matching our pattern against our text, 69 00:06:07,080 --> 00:06:11,645 When we find a mismatch, We say we want to move the potential 70 00:06:11,645 --> 00:06:16,210 pattern over one position, But that means backing up in the text. 71 00:06:16,445 --> 00:06:23,209 So, we would that's, ways to deal with that by, saving the most recent M text 72 00:06:23,209 --> 00:06:29,422 characters that we've seen. But it's definitely problematic for larger 73 00:06:29,422 --> 00:06:35,949 patterns and certainly inconvenient. so the brute force algorithm uses backup. 74 00:06:36,185 --> 00:06:40,117 And so you could maintain a buffer as I mentioned, 75 00:06:40,117 --> 00:06:46,409 But what we're going to look at is an ingenious way couple of ingenious ways to 76 00:06:46,645 --> 00:06:50,529 avoid having to do backup. To setup for that, 77 00:06:50,792 --> 00:06:56,222 We're going to look at, A slightly different implementation or 78 00:06:56,222 --> 00:07:00,251 alternate implementation of brute force. It, it, 79 00:07:00,251 --> 00:07:05,507 Compares the same characters as the previous, implementation. 80 00:07:05,770 --> 00:07:12,282 But, it, does things in a slightly different order, without, without a second 81 00:07:12,282 --> 00:07:17,226 for loop. And it does the explicit backup, so let's look at that. 82 00:07:17,226 --> 00:07:23,030 So, we have our i and j pointers, and, and we initialize them both at zero. 83 00:07:23,033 --> 00:07:29,467 And so it's a for loop where, I gets incremented on every iteration through the 84 00:07:29,467 --> 00:07:36,550 loop. And so what we do is, as long as we see a 85 00:07:36,550 --> 00:07:43,732 match we also increment J, so that while the pattern is matching we're implementing 86 00:07:43,732 --> 00:07:48,751 both I and J. When we see a mismatch what we do is just 87 00:07:48,751 --> 00:07:53,062 subtract. The current value of j from i the pointer. 88 00:07:53,062 --> 00:07:57,222 That's the next, character that we have to look at. 89 00:07:57,457 --> 00:08:03,737 So, when we find this mismatch, we wanna, subtract the current value of J. 90 00:08:03,973 --> 00:08:07,505 Then increment I at the end of the four loop. 91 00:08:07,505 --> 00:08:14,569 And that puts us right on the, next, text character that we wanna look at and then 92 00:08:14,569 --> 00:08:19,472 we reset j to zero. So this, does the same, Character 93 00:08:19,472 --> 00:08:23,777 comparisons. But it explicitly shows that we're backing 94 00:08:23,777 --> 00:08:28,231 up in the text. And then, if we ever get to the end of the 95 00:08:28,231 --> 00:08:33,947 pattern then we return I minus M. Which is the position of the first 96 00:08:33,947 --> 00:08:39,663 character in the text that matches the pattern which we found there was M 97 00:08:39,663 --> 00:08:42,336 matches. So it's an alternate implem, 98 00:08:42,336 --> 00:08:49,037 implementation that will come back to. So, the ideas that there are a couple of 99 00:08:49,037 --> 00:08:55,477 out rhythm challenges even though this brute force method, is simple it is not 100 00:08:55,477 --> 00:09:00,327 always good enough. And so the first one is just, from a 101 00:09:00,327 --> 00:09:04,541 purely algorthmic stand point. This is a challenge. 102 00:09:04,780 --> 00:09:08,176 Do we need N M? Or M is the length of the pattern? 103 00:09:08,176 --> 00:09:11,982 Or can we do it in time independent of the length of the pattern? 104 00:09:12,158 --> 00:09:16,550 What we want is a linear time guarantee. And that was a fundamental problem 105 00:09:16,550 --> 00:09:19,420 algorithmic problem that people worried about. 106 00:09:19,420 --> 00:09:25,173 And for a, for a, for a bit and we're going to look at the, how people approach 107 00:09:25,173 --> 00:09:30,392 this fundamental algorithmic problem. And then there's the practical challenge 108 00:09:30,392 --> 00:09:36,881 of we don't, we might not have the room or the time to save the text and actually the 109 00:09:36,881 --> 00:09:41,631 judge might not be happy about us saving text away in some computer. 110 00:09:41,832 --> 00:09:44,910 So we want to avoid backup in the text stream. 111 00:09:45,110 --> 00:09:48,322 All we're supposed to know about is our pattern. 112 00:09:48,322 --> 00:09:53,340 And we're supposed to light the light if we find our pattern and that's it. 113 00:09:53,340 --> 00:10:00,354 So what we're gonna see next is a way to deal with both of those challenges at the 114 00:10:00,354 --> 00:10:01,200 same time.