.
And that's still not the end of the story.
We're gonna look at one more really
interesting al-, algorithm that has, lots
of important applications, called a Rabin
Karp Algorithm.
Invented by two Turing award winners.
Michael Rabin and Dick Karp.
I can remember hearing about this
algorithm, from my friend, Dick Lipton,
who explained it to me over the phone,
and, I, he explained it to me in about,
fifteen seconds, and I realized I had to
have this, in the book.
And, so now, here we are, presenting it.
Th, that was, in the 70's.
So the basic idea for the Rabin-Karp
algorithm is, has to do with hashing.
In it's a particular kind of hashing,
called modular hashing.
It's just a particular way of computing a
hash function.
It's easiest to think about in terms of
numbers, although it, it works in all
kinds of situations.
Because remember everything in a computer
is encoded as a byte, which can be treated
as bytes which could be treated as binary
numbers.
And so what we are going to do is in this
case We saw a pattern characters are
decimal digits.
And so, we'll treat a sequence of pattern
characters as the decimal number.
And modular hashing is, just take a big
prime.
And compute the remainder when you divide
your number by that prime.
So in this case, 613 is the remainder that
you get when you divide 26,535 by 997.
So you can check that.
So that's what we're going to use as the
hash function.
And that.
This type of hashing is widely used.
You have a prime number, we talked about
it when we talked about hashing.
It satisfies, it seems to satisfy
something like the uniform hash assumption
under various circumstances.
So that's our pattern, a five character
pattern, and we're going to keep the small
hash values 613 and this is going to
generalize to longer patterns, and we'll
talk about that in a minute.
So now, suppose we have this text.
And our pattern happens to occur here in
the text.
And what the method is built on is the
idea of you take the first five characters
in the text and compute its hash value.
In this case, 31,415, mod down 97, is 508.
So that's different so that's not the
pattern.
Maybe take the next five characters,
that's 201. That's diffent It's not the
pattern.
Take the next one.
That's 715, different it's not the pattern
15,926 by 97 is 971, it's not the pattern.
Eventually, when you have the text
characters that are the same as the
pattern characters you're gonna get the
same result, it's a match.
If the pattern hat, hash equals the text
sub-string hash you, you have the
potential for a match.
And that's what the algorithm is based on.
Now it seems like, we're doing a lot of
calculation, with, making numbers out of
these things.
And, and keep doing modular arithmetic on
it.
But actually there's, a really, simple way
to, severely limit the amount of
calculation.
And give a quick linear algorithm for,
search.
Sub-string search.
So first thing is how to compute the hash
function.
So we take the, just convert the Math.
So R's our Radix.
So in this example, we're using ten so we
have decimal numbers.
And then the digits, say t's of i That's
the text characters.
So we have a number x of I, which is the,
M characters starting at position I.
And that's just in Math, tiR to the M-1
So you know, in this case that's
two10000+61000+5100+310+5 that's just
Math for that.
And our goal is so it's an N digit base
based our integer modular Q And our and
our goal is to do the math.
That gives us the remainder that we would
get when dividing that by Q well there's
really easy method called Horner's Method
that we can use to evaluate a degree in
polynomials just with a multiplied M
multiply and add. And we can do the
modular computation all the way through at
each step, to keep the numbers less than q
and we still get the same result.
And so the idea is, you multiply by R.
You go from left to right through the
digits and you just multiply by R and add
the digit, and then do mod q at every
time.
So we start with two mod 97 is two.
To six mod 987 is two10+6 mod 987 and
that's 26.
And then I take that value.
Multiply by ten and add five that's 265
mod 997.
In that case it's, it's 265.
So 26510+3 is 2653.
Our remainder is divided by 997, it's 659,
so even though our number gets bigger than
997, might take them out every time, we
keep our running total less than 997.
And then the last step is to take the 659.
Basically we've thrown out a bunch of
multiples of 997 that we don't care about.
And 65910+5 mod 997 is exactly equal to
26535 mod 997 and that's 613.
That's our value.
So that's A using Horner's method we got
a, well known linear time method to do
compute or hash function with this simple
code.
And this notice will work even for a huge
key that we wouldn't compute a hundred
dig, convert a hundred digit key in to
some number to do the calculation.
We do one digit at a time using Horner's
method and then we have no limit because
we're always keeping our numbers less than
our prime queue.
So that's a first step, so no matter how
big the pattern is,
We can efficiently compute a hash or,
since that is the first step.
So now the second step for the Rabin Karp
algorithm is to realize that if we know xi
mod q we can efficiently compute xi+1 mod
q cause they have a lot of digits in
common.
And you can just do a little Math to get
to xi+1 you take. Xi, we don't care about
the first digit anymore, so you subtract
it off.
Multiply by r and then add the new digit.
That's like one step of Horner's method.
Now, then you have to take that
computation and you can do mod q all the
way through.
All you have to do is pre-compute r to the
n+1 mod q And so here's the computation
for one example.
If we're at this position 41592 and we
know 41592 mod q we can compute 15926 mod
q by subtracting off 40,000.
The TiR-1 and that gives us just the four
digits, multiply by the radix add the new
trailing digit and that's the new value.
And if we just keep that all mod q then we
can with just a multiply and an add at
each step we can keep a running total of
the Modular hash value of the five digit
thing.
So, for example, this is the case that,
that we just did 4152 on 997 is, is done
by ex, exactly as we said we subtract and
then add and then multiply by the radix
mod 997.
So, doing those calculations all the way
through the search, we eventually get to a
match.
That's again remarkably small amount of
code.
We're going to keep a long random prime.
Just keep it a little smaller than the
biggest long value to avoid overflow.
So we pre-compute r to the m - one mod q
'cause that's the little calculation that
we have to do.
We compute the hash function.
And for the pattern.
And then, with those pre-computations, the
search is extremely straight forward.
So we take our current hash value.
And this is just, add a q make sure it's
positive.
And subtract off rm times the first
character in then add in the next
character mod q.
And that gives us the text hash for the
current position.
And then we compare to see if that's equal
to the pattern hash.
Now there's this is an introduction to the
idea of randomized algorithms.
There's two ways to proceed from here.
One way called Monte Carlo version.
Where we guaratee that the algorithm is
gonna be quick but with low probability,
it might get the answer wrong. In that
version we don't ever bother to check
whether the go through and check all
digits to see if there's actually a match.
We take queue large enough so that we're
confident the probability of the, to, two
digit numbers or two m digit numbers
having the same hash value.
And so low that we're not gonna worry
about it, that's called the Monte Carlo
version.
The Las Vegas version is guarateed to get
the right answer.
In that one we would go and check to make
sure that the M characters match if we
have a hash match.
And then if it could be that with such at
a low probability could be that there's a
hash match but not a substring match.
Then we're just.
Move on.
And from a theoretical point of view
there's some very extremely low
possibility that one could be slow.
But lets look over at what the analysis
says.
So the theory says that if you take a
sufficiently large random prime say mn^
three So a long value maybe you can get
that and remember n is huge.
Then the probability of a false collision
is about 1/n. so you know in a billion
things you might get, you might get 1/n,
you might get a false collision.
So in practice we choose actually q just
to be, there's no reason not to choose as
large as we possibly can, not related to m
and n.m And then the probably collision is
going to be about 1/q. So we're going to
take it to be like the biggest law, and
that means that probably collision is
extremely small.
And then you can take your chances.
You can do a Monte Carlo version where you
just say I got a match because I got a
hash match.
And be confident in the laws of
probability.
And not worry about the client getting the
wrong answer.
Or you can have the Las Vegas version
where you go, go ahead and return the
correct answer.
And. And be confident that your client's
not gonna run into a slow case cuz the
probability is so, so tiny, 1/q, that you
don't have to worry about it.
That's the Rabin-Karp algorithm.
Now, why look at this algorithm?
It's linear time.
We have other algorithms that are linear
time.
One of the key reasons to, be interested
in Rabin-Karp is that it's easy to extend
it to more complicated situations.
So, say you wanna look for one of, several
different patterns.
Well, you just compute the hatches for
those patterns, and then look for any one
them Use, a symbol table, to look for'em.
So, that's a much more, general capability
than, we can provide with the other
methods.
It also can be extended to do
2-dimensional search and other things like
that.
For straight suffering search, it's gonna
be a little slower because, there's,
interloop it's kind of long, the
arithmatic operation, are gonna be a
little slow, if you wanna do the Las Vegas
version you have to back up the text and
you have this, Monte Carlo Las, Las Vegas
thing, and you should think about, writing
code to extend it to look for any one of p
possible patterns, thats, an interesting,
Algorithmic puzzle as I mentioned is not
so difficult to solve.
So here's our summary.
We started with a brute force algorithm,
and although typically you don't have this
worst case thing.
It works fairly well for typical cases.
And then we've got the Knuth-Morris-Prath
Method that can guarantee linear time and
has no backup.
And maybe uses extra space unless you use
Pratt's version.
The Aboriamor who can get the running time
down to n/m which is quite an amazing jump
and quite useful and in a Rabin Karp
that's very flexible and extends through
all these other situations.
This is a nice mac, microcosm of
algorithmic technology where, really
interesting, and unique and path breaking
algorithmic ideas give us.
A good algorithm, for even such a simple
problem.
That's an introduction to pattern