1 00:00:00,000 --> 00:00:05,846 . And that's still not the end of the story. 2 00:00:05,846 --> 00:00:11,131 We're gonna look at one more really interesting al-, algorithm that has, lots 3 00:00:11,131 --> 00:00:14,420 of important applications, called a Rabin Karp Algorithm. 4 00:00:14,596 --> 00:00:18,590 Invented by two Turing award winners. Michael Rabin and Dick Karp. 5 00:00:18,793 --> 00:00:24,028 I can remember hearing about this algorithm, from my friend, Dick Lipton, 6 00:00:24,028 --> 00:00:29,127 who explained it to me over the phone, and, I, he explained it to me in about, 7 00:00:29,127 --> 00:00:33,751 fifteen seconds, and I realized I had to have this, in the book. 8 00:00:33,954 --> 00:00:38,850 And, so now, here we are, presenting it. Th, that was, in the 70's. 9 00:00:38,850 --> 00:00:43,755 So the basic idea for the Rabin-Karp algorithm is, has to do with hashing. 10 00:00:43,941 --> 00:00:47,853 In it's a particular kind of hashing, called modular hashing. 11 00:00:47,853 --> 00:00:51,516 It's just a particular way of computing a hash function. 12 00:00:51,516 --> 00:00:56,235 It's easiest to think about in terms of numbers, although it, it works in all 13 00:00:56,235 --> 00:01:00,271 kinds of situations. Because remember everything in a computer 14 00:01:00,271 --> 00:01:05,363 is encoded as a byte, which can be treated as bytes which could be treated as binary 15 00:01:05,363 --> 00:01:09,253 numbers. And so what we are going to do is in this 16 00:01:09,253 --> 00:01:13,400 case We saw a pattern characters are decimal digits. 17 00:01:13,400 --> 00:01:18,980 And so, we'll treat a sequence of pattern characters as the decimal number. 18 00:01:19,360 --> 00:01:23,872 And modular hashing is, just take a big prime. 19 00:01:23,872 --> 00:01:30,692 And compute the remainder when you divide your number by that prime. 20 00:01:30,692 --> 00:01:38,715 So in this case, 613 is the remainder that you get when you divide 26,535 by 997. 21 00:01:38,715 --> 00:01:45,033 So you can check that. So that's what we're going to use as the 22 00:01:45,033 --> 00:01:47,340 hash function. And that. 23 00:01:47,340 --> 00:01:53,251 This type of hashing is widely used. You have a prime number, we talked about 24 00:01:53,251 --> 00:01:58,308 it when we talked about hashing. It satisfies, it seems to satisfy 25 00:01:58,308 --> 00:02:03,831 something like the uniform hash assumption under various circumstances. 26 00:02:03,831 --> 00:02:10,132 So that's our pattern, a five character pattern, and we're going to keep the small 27 00:02:10,132 --> 00:02:16,821 hash values 613 and this is going to generalize to longer patterns, and we'll 28 00:02:16,821 --> 00:02:22,540 talk about that in a minute. So now, suppose we have this text. 29 00:02:22,778 --> 00:02:27,383 And our pattern happens to occur here in the text. 30 00:02:27,383 --> 00:02:34,768 And what the method is built on is the idea of you take the first five characters 31 00:02:34,768 --> 00:02:41,200 in the text and compute its hash value. In this case, 31,415, mod down 97, is 508. 32 00:02:41,382 --> 00:02:44,364 So that's different so that's not the pattern. 33 00:02:44,547 --> 00:02:49,537 Maybe take the next five characters, that's 201. That's diffent It's not the 34 00:02:49,537 --> 00:02:51,120 pattern. Take the next one. 35 00:02:51,355 --> 00:02:58,098 That's 715, different it's not the pattern 15,926 by 97 is 971, it's not the pattern. 36 00:02:58,098 --> 00:03:03,587 Eventually, when you have the text characters that are the same as the 37 00:03:03,587 --> 00:03:08,920 pattern characters you're gonna get the same result, it's a match. 38 00:03:08,920 --> 00:03:15,114 If the pattern hat, hash equals the text sub-string hash you, you have the 39 00:03:15,114 --> 00:03:19,787 potential for a match. And that's what the algorithm is based on. 40 00:03:19,787 --> 00:03:25,892 Now it seems like, we're doing a lot of calculation, with, making numbers out of 41 00:03:25,892 --> 00:03:29,893 these things. And, and keep doing modular arithmetic on 42 00:03:29,893 --> 00:03:33,612 it. But actually there's, a really, simple way 43 00:03:33,612 --> 00:03:36,910 to, severely limit the amount of calculation. 44 00:03:37,121 --> 00:03:40,490 And give a quick linear algorithm for, search. 45 00:03:40,490 --> 00:03:46,582 Sub-string search. So first thing is how to compute the hash 46 00:03:46,582 --> 00:03:51,030 function. So we take the, just convert the Math. 47 00:03:51,030 --> 00:03:56,639 So R's our Radix. So in this example, we're using ten so we 48 00:03:56,639 --> 00:04:02,537 have decimal numbers. And then the digits, say t's of i That's 49 00:04:02,537 --> 00:04:08,629 the text characters. So we have a number x of I, which is the, 50 00:04:10,040 --> 00:04:18,366 M characters starting at position I. And that's just in Math, tiR to the M-1 51 00:04:19,242 --> 00:04:31,850 So you know, in this case that's two10000+61000+5100+310+5 that's just 52 00:04:31,850 --> 00:04:36,748 Math for that. And our goal is so it's an N digit base 53 00:04:36,748 --> 00:04:42,764 based our integer modular Q And our and our goal is to do the math. 54 00:04:42,764 --> 00:04:50,412 That gives us the remainder that we would get when dividing that by Q well there's 55 00:04:50,670 --> 00:04:56,985 really easy method called Horner's Method that we can use to evaluate a degree in 56 00:04:56,985 --> 00:05:02,992 polynomials just with a multiplied M multiply and add. And we can do the 57 00:05:02,992 --> 00:05:10,308 modular computation all the way through at each step, to keep the numbers less than q 58 00:05:10,308 --> 00:05:16,581 and we still get the same result. And so the idea is, you multiply by R. 59 00:05:16,581 --> 00:05:24,650 You go from left to right through the digits and you just multiply by R and add 60 00:05:24,650 --> 00:05:28,886 the digit, and then do mod q at every time. 61 00:05:28,886 --> 00:05:37,661 So we start with two mod 97 is two. To six mod 987 is two10+6 mod 987 and 62 00:05:37,661 --> 00:05:41,839 that's 26. And then I take that value. 63 00:05:42,191 --> 00:05:47,707 Multiply by ten and add five that's 265 mod 997. 64 00:05:48,059 --> 00:05:54,750 In that case it's, it's 265. So 26510+3 is 2653. 65 00:05:54,750 --> 00:06:00,933 Our remainder is divided by 997, it's 659, so even though our number gets bigger than 66 00:06:00,933 --> 00:06:08,031 997, might take them out every time, we keep our running total less than 997. 67 00:06:08,031 --> 00:06:16,051 And then the last step is to take the 659. Basically we've thrown out a bunch of 68 00:06:16,051 --> 00:06:26,389 multiples of 997 that we don't care about. And 65910+5 mod 997 is exactly equal to 69 00:06:26,389 --> 00:06:30,350 26535 mod 997 and that's 613. That's our value. 70 00:06:31,123 --> 00:06:39,412 So that's A using Horner's method we got a, well known linear time method to do 71 00:06:39,412 --> 00:06:43,550 compute or hash function with this simple code. 72 00:06:43,764 --> 00:06:50,206 And this notice will work even for a huge key that we wouldn't compute a hundred 73 00:06:50,206 --> 00:06:55,574 dig, convert a hundred digit key in to some number to do the calculation. 74 00:06:55,789 --> 00:07:02,517 We do one digit at a time using Horner's method and then we have no limit because 75 00:07:02,517 --> 00:07:07,170 we're always keeping our numbers less than our prime queue. 76 00:07:07,170 --> 00:07:12,130 So that's a first step, so no matter how big the pattern is, 77 00:07:12,431 --> 00:07:17,630 We can efficiently compute a hash or, since that is the first step. 78 00:07:17,956 --> 00:07:28,827 So now the second step for the Rabin Karp algorithm is to realize that if we know xi 79 00:07:28,827 --> 00:07:38,502 mod q we can efficiently compute xi+1 mod q cause they have a lot of digits in 80 00:07:38,502 --> 00:07:44,372 common. And you can just do a little Math to get 81 00:07:44,372 --> 00:07:53,310 to xi+1 you take. Xi, we don't care about the first digit anymore, so you subtract 82 00:07:53,310 --> 00:07:57,380 it off. Multiply by r and then add the new digit. 83 00:07:57,380 --> 00:08:03,610 That's like one step of Horner's method. Now, then you have to take that 84 00:08:03,610 --> 00:08:08,261 computation and you can do mod q all the way through. 85 00:08:08,261 --> 00:08:16,158 All you have to do is pre-compute r to the n+1 mod q And so here's the computation 86 00:08:16,158 --> 00:08:22,476 for one example. If we're at this position 41592 and we 87 00:08:22,476 --> 00:08:32,479 know 41592 mod q we can compute 15926 mod q by subtracting off 40,000. 88 00:08:32,753 --> 00:08:41,596 The TiR-1 and that gives us just the four digits, multiply by the radix add the new 89 00:08:41,596 --> 00:08:51,351 trailing digit and that's the new value. And if we just keep that all mod q then we 90 00:08:51,351 --> 00:09:02,683 can with just a multiply and an add at each step we can keep a running total of 91 00:09:02,683 --> 00:09:07,874 the Modular hash value of the five digit thing. 92 00:09:07,874 --> 00:09:17,266 So, for example, this is the case that, that we just did 4152 on 997 is, is done 93 00:09:17,266 --> 00:09:25,021 by ex, exactly as we said we subtract and then add and then multiply by the radix 94 00:09:25,279 --> 00:09:29,415 mod 997. So, doing those calculations all the way 95 00:09:29,415 --> 00:09:33,810 through the search, we eventually get to a match. 96 00:09:34,125 --> 00:09:38,725 That's again remarkably small amount of code. 97 00:09:38,978 --> 00:09:45,873 We're going to keep a long random prime. Just keep it a little smaller than the 98 00:09:45,873 --> 00:09:54,331 biggest long value to avoid overflow. So we pre-compute r to the m - one mod q 99 00:09:54,331 --> 00:10:00,249 'cause that's the little calculation that we have to do. 100 00:10:00,249 --> 00:10:05,521 We compute the hash function. And for the pattern. 101 00:10:05,521 --> 00:10:13,914 And then, with those pre-computations, the search is extremely straight forward. 102 00:10:14,236 --> 00:10:22,521 So we take our current hash value. And this is just, add a q make sure it's 103 00:10:22,521 --> 00:10:26,881 positive. And subtract off rm times the first 104 00:10:26,881 --> 00:10:31,750 character in then add in the next character mod q. 105 00:10:32,011 --> 00:10:37,229 And that gives us the text hash for the current position. 106 00:10:37,229 --> 00:10:42,968 And then we compare to see if that's equal to the pattern hash. 107 00:10:43,229 --> 00:10:50,359 Now there's this is an introduction to the idea of randomized algorithms. 108 00:10:50,620 --> 00:10:57,490 There's two ways to proceed from here. One way called Monte Carlo version. 109 00:10:57,738 --> 00:10:57,955 Where we guaratee that the algorithm is gonna be quick but with low probability, 110 00:10:57,955 --> 00:11:06,694 it might get the answer wrong. In that version we don't ever bother to check 111 00:11:06,694 --> 00:11:06,694 whether the go through and check all digits to see if there's actually a match. 112 00:11:14,580 --> 00:11:25,723 We take queue large enough so that we're confident the probability of the, to, two 113 00:11:25,726 --> 00:11:30,736 digit numbers or two m digit numbers having the same hash value. 114 00:11:30,736 --> 00:11:36,320 And so low that we're not gonna worry about it, that's called the Monte Carlo 115 00:11:36,320 --> 00:11:40,113 version. The Las Vegas version is guarateed to get 116 00:11:40,113 --> 00:11:44,551 the right answer. In that one we would go and check to make 117 00:11:44,551 --> 00:11:48,774 sure that the M characters match if we have a hash match. 118 00:11:48,989 --> 00:11:55,646 And then if it could be that with such at a low probability could be that there's a 119 00:11:55,646 --> 00:11:59,440 hash match but not a substring match. Then we're just. 120 00:11:59,440 --> 00:12:03,766 Move on. And from a theoretical point of view 121 00:12:04,054 --> 00:12:11,650 there's some very extremely low possibility that one could be slow. 122 00:12:11,650 --> 00:12:15,494 But lets look over at what the analysis says. 123 00:12:15,750 --> 00:12:22,499 So the theory says that if you take a sufficiently large random prime say mn^ 124 00:12:22,499 --> 00:12:28,650 three So a long value maybe you can get that and remember n is huge. 125 00:12:28,650 --> 00:12:36,594 Then the probability of a false collision is about 1/n. so you know in a billion 126 00:12:36,594 --> 00:12:43,258 things you might get, you might get 1/n, you might get a false collision. 127 00:12:43,514 --> 00:12:49,400 So in practice we choose actually q just to be, there's no reason not to choose as 128 00:12:49,400 --> 00:12:54,796 large as we possibly can, not related to m and n.m And then the probably collision is 129 00:12:54,796 --> 00:13:00,653 going to be about 1/q. So we're going to take it to be like the biggest law, and 130 00:13:00,653 --> 00:13:04,404 that means that probably collision is extremely small. 131 00:13:04,601 --> 00:13:10,326 And then you can take your chances. You can do a Monte Carlo version where you 132 00:13:10,326 --> 00:13:13,814 just say I got a match because I got a hash match. 133 00:13:13,814 --> 00:13:16,710 And be confident in the laws of probability. 134 00:13:16,944 --> 00:13:21,548 And not worry about the client getting the wrong answer. 135 00:13:21,782 --> 00:13:27,634 Or you can have the Las Vegas version where you go, go ahead and return the 136 00:13:27,634 --> 00:13:33,208 correct answer. And. And be confident that your client's 137 00:13:33,208 --> 00:13:40,288 not gonna run into a slow case cuz the probability is so, so tiny, 1/q, that you 138 00:13:40,288 --> 00:13:45,370 don't have to worry about it. That's the Rabin-Karp algorithm. 139 00:13:45,554 --> 00:13:48,690 Now, why look at this algorithm? It's linear time. 140 00:13:48,874 --> 00:13:51,702 We have other algorithms that are linear time. 141 00:13:51,702 --> 00:13:56,744 One of the key reasons to, be interested in Rabin-Karp is that it's easy to extend 142 00:13:56,744 --> 00:14:01,724 it to more complicated situations. So, say you wanna look for one of, several 143 00:14:01,724 --> 00:14:05,413 different patterns. Well, you just compute the hatches for 144 00:14:05,413 --> 00:14:10,946 those patterns, and then look for any one them Use, a symbol table, to look for'em. 145 00:14:11,131 --> 00:14:16,172 So, that's a much more, general capability than, we can provide with the other 146 00:14:16,172 --> 00:14:18,509 methods. It also can be extended to do 147 00:14:18,509 --> 00:14:21,996 2-dimensional search and other things like that. 148 00:14:22,215 --> 00:14:27,979 For straight suffering search, it's gonna be a little slower because, there's, 149 00:14:28,198 --> 00:14:33,232 interloop it's kind of long, the arithmatic operation, are gonna be a 150 00:14:33,232 --> 00:14:39,725 little slow, if you wanna do the Las Vegas version you have to back up the text and 151 00:14:39,725 --> 00:14:46,219 you have this, Monte Carlo Las, Las Vegas thing, and you should think about, writing 152 00:14:46,219 --> 00:14:52,421 code to extend it to look for any one of p possible patterns, thats, an interesting, 153 00:14:52,640 --> 00:14:58,170 Algorithmic puzzle as I mentioned is not so difficult to solve. 154 00:14:58,170 --> 00:15:04,521 So here's our summary. We started with a brute force algorithm, 155 00:15:04,521 --> 00:15:10,667 and although typically you don't have this worst case thing. 156 00:15:10,667 --> 00:15:18,863 It works fairly well for typical cases. And then we've got the Knuth-Morris-Prath 157 00:15:18,863 --> 00:15:24,600 Method that can guarantee linear time and has no backup. 158 00:15:24,831 --> 00:15:29,541 And maybe uses extra space unless you use Pratt's version. 159 00:15:29,541 --> 00:15:37,263 The Aboriamor who can get the running time down to n/m which is quite an amazing jump 160 00:15:37,494 --> 00:15:43,826 and quite useful and in a Rabin Karp that's very flexible and extends through 161 00:15:43,826 --> 00:15:48,845 all these other situations. This is a nice mac, microcosm of 162 00:15:49,077 --> 00:15:56,180 algorithmic technology where, really interesting, and unique and path breaking 163 00:15:56,180 --> 00:16:01,792 algorithmic ideas give us. A good algorithm, for even such a simple 164 00:16:01,792 --> 00:16:05,527 problem. That's an introduction to pattern