So now that we understand why we can't have a single hash function which always does well on every single data set, that is, every hash function is subject to pathological data sets, we'll discuss the randomized solution: how we can have a family of hash functions, and if you make a real-time decision about which hash function to use, you're guaranteed to do well on average, no matter what the data is. So let me remind you of the three-part plan that I have for this part of the material. In this video, we'll be covering the first two. Part one, which we'll accomplish on the next slide, will be to propose a mathematical definition of a good random hash function. Formally, we're going to define a universal family of hash functions. Now, what makes this definition useful? Well, two things. In part two, we'll show that there are examples of simple and easy-to-compute hash functions that meet this definition, that are universal in the sense described on the next slide. So that's important. And the third part, which we'll do in the next video, will be the mathematical analysis of the performance of hashing, specifically with chaining, when you use universal hashing. We'll show that if you pick a random function from a universal family, then the expected performance of all of the operations is constant, assuming, of course, that the number of buckets is comparable to the number of objects in the hash table, which we saw earlier is a necessary condition for good performance. So let's go ahead and get started and say what we mean by a good random hash function. For this definition, we'll assume that the universe is fixed. Maybe it's IP addresses, maybe it's our friends' names, maybe it's configurations of a chessboard, whatever. But there's some fixed universe U, and we'll also assume we've decided on the number of buckets n. And we call the set H universal if and only if it meets the following condition.
In English, the condition says that for each pair of distinct elements, the probability that they collide should be no larger than with the gold standard of perfectly uniform random hashing. So for all distinct keys from the universe, call them x and y, what we want is that if we choose a random hash function h from the set script H, the probability that x and y collide (and just to be clear, that means x and y hash to exactly the same bucket under this hash function h) should be no more than 1/n, and don't forget, n is the number of buckets. Again, to interpret this, where does this 1/n come from? We said earlier that an impractical but, in some sense, gold-standard hash function would be to just assign each key a bucket uniformly at random, with different keys being assigned independently. Remember, the reason this is not a practical hash function is that you'd have to remember where everybody went, and that would basically require maintaining a list, which would devolve to the list solution, so you don't want that. You want hash functions where you have to store almost nothing and can evaluate them in constant time. But if we throw out those requirements of small space and small time, then random functions should spread stuff out pretty evenly, right? I mean, that's what they're doing. They're throwing darts completely at random at these n buckets. So what would be the collision probability of two given keys, say of Alice and of Bob, if you're doing everything independently and uniformly at random? Well, first Alice shows up and goes to some totally random bucket, say bucket number seventeen. Now Bob shows up. What's the probability that he collides with Alice? Well, we have these n buckets that Bob could go to, each is equally likely, and there's a collision between Alice and Bob if and only if Bob goes to bucket seventeen.
Since each bucket is equally likely, that's only a one-in-n probability. So really, what this condition is saying is that for each pair of elements, the collision probability should be as small as with the holy grail of perfectly random hashing. So this is a pretty subtle definition, perhaps the most subtle one that we'll see in this entire course. To help you get some facility with it, and to force you to think about it a little deeply, the next quiz, which is probably harder than a typical in-class quiz, asks you to compare this definition of universal hashing with another definition (roughly, that for every key and every bucket, a random function from H maps that key to that bucket with probability exactly 1/n) and asks you to figure out to what extent they're the same definition. The correct answer to this quiz question is the third one: there are hash function families H that satisfy the condition on this slide but are not universal. On the other hand, there are hash function families H which satisfy this property and are universal. So I'm going to give you an example of each. I'd encourage you to think carefully offline about why one is an example and the other a non-example. An easy example showing that sometimes the answer is yes, you can have universal hash function families H which also satisfy this property of the slide, would be to just take H to be the set of all functions mapping the universe to the buckets. That's an awful lot of functions, a huge set, but it's a set nonetheless. And by the symmetry of having all of the functions, it satisfies the property of this slide: it is indeed true that exactly a 1/n fraction of all functions map an arbitrary key k to an arbitrary bucket i. And by the same reasoning, by the same symmetry properties, this family is universal. Really, if you think about it, choosing a function at random from this H is just choosing a completely random function, so it's exactly what we've been calling perfectly random hashing.
And as we discussed on the last slide, that would indeed have a collision probability of exactly 1/n for each pair of distinct keys. So this shows that sometimes you can have both this property and be universal. An example where you have the property on this slide but you're not universal would be to take H to be a quite small family, a family of exactly n functions, each of which is a constant function. There's going to be one function which always maps everything to bucket zero; that's a totally stupid hash function. There's going to be another hash function which always maps everything to bucket number one; that's a different but also totally stupid hash function, and so on. And the nth function will be the constant function that always maps everything to bucket n-1. If you think about it, this very silly set H does indeed satisfy the very reasonable-looking property on this slide. Fix any key, fix any bucket, say bucket number 31. What's the probability that you pick a hash function that maps this key to bucket number 31? Well, independent of what the key is, it's the probability that you pick the constant hash function whose output is always 31. Since there are n different constant functions, that's a one-in-n probability. So that's an example showing that, in some sense, this is not as useful a property as universality. This is really not what you want; it's not strong enough. Universal hashing, that's what you want for strong guarantees. So now that we've spent some time trying to assimilate probably the subtlest definition we've seen so far in this class, let me let you in on a little secret about the role of definitions in mathematics. On the one hand, I think mathematical definitions often get short shrift, especially in, you know, the popular discussion of mathematical research.
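To make this non-example concrete, here's a small sketch (in Python, with an illustrative n = 7 and arbitrary sample keys of my own choosing, not from the lecture) checking both properties for the family of constant functions:

```python
n = 7  # number of buckets (illustrative choice)

# The family of n constant hash functions: the i-th function maps
# every key to bucket i, ignoring the key entirely.
family = [(lambda key, i=i: i) for i in range(n)]

# Property from the slide: for any key x and any bucket i, exactly
# a 1/n fraction of the functions map x to bucket i.
x = 42
for i in range(n):
    frac = sum(1 for h in family if h(x) == i) / n
    assert frac == 1 / n

# But the family is NOT universal: every pair of distinct keys collides
# under every single function, so the collision probability is 1, not <= 1/n.
x, y = 42, 99
collision_frac = sum(1 for h in family if h(x) == h(y)) / n
assert collision_frac == 1.0
```

So the per-bucket property holds by a symmetry over functions, while the collision probability is as bad as it could possibly be, which is exactly the gap the quiz is pointing at.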
That said, it's easy to come up with one reason why that's true, which is that any schmo can come along and write down a mathematical definition. Nobody's stopping you. So what you really need to do is prove that a mathematical definition is useful. How do you indicate the usefulness of a definition? Well, you've got to do two things. First of all, you have to show that the definition is satisfied by objects of interest. For us right now, the objects of interest are hash functions we might imagine implementing, so they should be easy to store and easy to evaluate. So there had better be such hash functions meeting that complicated universal hash function definition. The second thing is that something good had better happen if you meet the definition. And in the context of hashing, what good thing do we want to have happen? We want good performance. So those are the two things that I owe you in these lectures. First of all, a construction of practical hash functions that meet that definition; that's what we'll start on right now. Second of all, why meeting that definition is a sufficient condition for good hash table performance; that will be the next video. In this example, I'm going to focus on IP addresses, although the hash function construction is general, as I hope will be reasonably clear. As many of you know, an IP address is a 32-bit integer consisting of four different 8-bit parts. So let's just go ahead and think of an IP address as a 4-tuple, the way you often see it. And since each of the four parts is eight bits, it's going to be a number between zero and 255. The hash function that we're going to construct is really not going to be so different from the quick-and-dirty functions that we talked about in the last video, although in this case we'll be able to prove that the hash function family is, in fact, universal. And we're again going to use the same compression function.
We're going to take the modulus with respect to a prime number of buckets. The only difference is we're going to multiply these xi's by a random set of coefficients; we're going to take a random linear combination of x1, x2, x3, and x4. Let me be a little more precise. We're going to choose a number of buckets, n, and as we say over and over, the number of buckets should be chosen so it's in the same ballpark as the number of objects you're storing. Let's say that n should be roughly double the number of objects you're storing, as an initial rule of thumb. So, for example, maybe we only want to maintain something in the ballpark of 500 IP addresses, and we can choose n to be a prime like 997. So here's the construction. Remember, we want to produce not just one hash function; the definition is about a universal family of hash functions. So we need a whole set of hash functions, from which we're ultimately going to choose one member at random. How do we construct a whole bunch of hash functions in a simple way? Here's how we do it. We define one hash function, which I'm going to denote by h sub a, where a is a 4-tuple whose components I'm going to call a1, a2, a3, and a4. All of the components of a are integers between zero and n-1, so they're exactly in correspondence with the indices of the buckets. If we have 997 buckets, then each of these ai's is an integer between zero and 996. So it's clear that this defines a whole bunch of functions: for each of the four coefficients, that's four independent choices, you have n options, namely the integers between zero and n-1. So that gives a name to n to the fourth different functions. But what is any given function? How do you actually evaluate one of these functions? Just remember what a hash function is supposed to do.
Remember how it type-checks: it takes as input something from the universe, in this case an IP address, and outputs a bucket number. Here's how we evaluate the hash function h sub a. Remember, a here is a 4-tuple, and an IP address is also a 4-tuple: each component of the IP address is between zero and 255, and each component of a is between zero and n-1, so, for example, between zero and 996. What we do is just take the dot product (the inner product) of the vector a and the vector x, and then take the modulus with respect to the number of buckets. That is, we take a1x1 + a2x2 + a3x3 + a4x4. Now, of course, remember the x's lie between zero and 255 and the ai's lie between zero and n-1, say zero and 996, so when you do these four multiplications you can get a pretty big number; you might well overshoot the number of buckets n. To get back into the range in which the buckets are actually indexed, at the end we take the modulus with respect to the number of buckets. So in the end we do output a number between zero and n-1, as desired. So that's a set of a whole bunch of hash functions, n to the fourth hash functions. And each one meets the criteria of being a good hash function from an implementation perspective, right? Remember, we don't want to have to store much to evaluate a function, and for a given hash function in this family, all we've got to remember are the coefficients a1, a2, a3, and a4. You just have to remember these four numbers. Then, to evaluate a hash function on an IP address, we clearly do a constant amount of work: these four multiplications, the three additions, and then taking the modulus by the number of buckets n. So it's constant time to evaluate, constant space to store. And what's cool is, using just these very simple hash functions, which are constant time to evaluate and constant space to store, this is already enough to meet the definition of a universal family of hash functions.
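The whole construction fits in a few lines. Here's a Python sketch (the function name `draw_hash_function` and the sample IP address are my own; the prime 997 follows the lecture's example):

```python
import random

n = 997  # a prime number of buckets, as in the lecture's example

def draw_hash_function(n):
    """Pick a random member of the family by drawing the 4 coefficients."""
    a = [random.randrange(n) for _ in range(4)]  # each a_i in {0, ..., n-1}
    def h(x):
        # x is an IP address as a 4-tuple of integers in {0, ..., 255}.
        # Inner product of a with x, then mod the bucket count.
        return sum(ai * xi for ai, xi in zip(a, x)) % n
    return h

h = draw_hash_function(n)
bucket = h((171, 64, 78, 3))  # lands somewhere in {0, ..., 996}
```

Note that storing a chosen function means storing just the four coefficients, and evaluating it is four multiplications, three additions, and one mod, matching the constant-space, constant-time claim.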
So this fulfills the first promise that I owed you after subjecting you to that definition of universal hashing. Remember, the first promise was that there are simple, useful examples that meet the definition. And then, of course, I still owe you the why, meaning why this definition is useful, why it leads to good performance. But I want to conclude this video by actually proving this theorem to you, arguing that this is, in fact, a universal family of hash functions. This should be a mostly complete proof, and it will certainly have all of the conceptual ingredients of why the proof works. There will be one spot where I'm a little hand-wavy, because we'd need a little number theory, and I don't want to have a big detour into number theory. And if you think about it, you shouldn't be surprised that basic number theory plays at least some role. As I said, we should choose the number of buckets to be prime, so at some point in the proof you should expect us to use the assumption that n is prime, and pretty much any way you're going to use that assumption will involve at least elementary number theory, okay? But I'll be clear about where I'm being hand-wavy. So what do we have to prove? Let's just quickly review the definition of a universal family of hash functions. We have our set H, and we know exactly what it is. What does it mean that it's universal? It means that for each pair of distinct keys, so in our context, for each pair of distinct IP addresses, the probability that a random hash function from our family script H causes a collision, that is, maps these two IP addresses to the same bucket, should be no worse than with perfectly random hashing: no worse than 1/n, where n is the number of buckets, say 997. So the definition we need to meet is a condition for every pair of distinct keys. Let's just start by fixing two distinct keys. I'm going to assume for this proof that these two IP addresses differ in their fourth component.
That is, I'm going to assume that x4 is different from y4. I hope it's intuitively clear that it shouldn't matter which set of eight bits I'm looking at. They're different IP addresses, so they differ somewhere. If I really wanted, I could have four cases that were totally identical, depending on whether they differ in the first eight bits, the next eight bits, the next eight bits, or the last eight bits. I'm going to show you one case, because the other three are the same. So let's just think of the last eight bits as being different. Now, remember what the definition asks us to prove: that the probability that these two IP addresses collide is at most 1/n. So we need an upper bound on the collision probability with respect to a random hash function from our set of n to the fourth hash functions. I want to be clear on the quantifiers. We're thinking about two fixed IP addresses, for example, the IP address for the New York Times website and the IP address for the CNN website. We're asking, for these two fixed IP addresses, what fraction of our hash functions cause them to collide, right? We'll have some hash functions which map the New York Times and CNN IP addresses to the same bucket, and we'll have other hash functions which do not map those two IP addresses to the same bucket. And we're trying to say that the overwhelming majority send them to different buckets; only a 1/n fraction, at most, send them to the same bucket. So we're asking about the probability, over the choice of a random hash function from our set H, that the function maps the two IP addresses to the same place. The next step is just algebra. I'm just going to take this equation, which indicates when the two IP addresses collide under a hash function.
I'm going to expand the definition of a hash function, remember, it's just this inner product modulo the number of buckets n, and I'm going to rewrite this condition in a more convenient way. All right, so after the algebra, and the dust has settled, we're left with this equation being equivalent to the two IP addresses colliding. So again, we're interested in the fraction of choices of a1, a2, a3, and a4 such that this condition holds, right? It'll hold for some choices of the ai's, it won't hold for others, and we're going to show that it almost never holds: it fails for all but a 1/n fraction of the choices of the ai's. Next we're going to do something a little sneaky. This trick is sometimes called the Principle of Deferred Decisions. The idea is that when you have a bunch of random coin flips, it's sometimes convenient to flip some, but not all, of them. Sometimes fixing part of the randomness clarifies the role that the remaining randomness is going to play. That's what's going to happen here. So let's go ahead and flip the coins which tell us the random choices of a1, a2, and a3. Again, remember, in the definition of a universal family of hash functions, you analyze the collision probability under a random choice of a hash function. What does it mean for us to choose a random hash function? It means a random choice of a1, a2, a3, and a4. So we're making four random choices, and what I'm saying is, let's condition on the outcomes of the first three. Suppose we knew that a1 came up 173, a2 came up 122, and a3 came up 723, but we don't know what a4 is; a4 is still equally likely to be any of zero, one, two, all the way up to n-1. And remember that what we want to prove is that at most a 1/n fraction of the choices of a1, a2, a3, and a4 cause this underlined equation to be true, cause a collision.
So what we're going to show is that for each fixed choice of a1, a2, and a3, at most a 1/n fraction of the choices of a4 cause this equation to hold. And if we can show that for every single choice of a1, a2, and a3, no matter how those random coin flips come out, at most a 1/n fraction of the remaining outcomes satisfy the equation, then we're done: that means at most a 1/n fraction of the overall outcomes can cause the equation to be true. If you haven't seen the Principle of Deferred Decisions before, you might want to think about this a little bit offline, but it's easily justified by, say, two lines of algebra. Okay, so we're done with the setup, and we're ready for the meat of the argument. What we've done is identify an equation, now shown in green, which holds if and only if we have a collision between the two IP addresses. And the question we need to ask is: for fixed choices of a1, a2, and a3, how frequently will the choice of a4 cause this equation to be satisfied, cause a collision? Now, here's why we did this trick of the Principle of Deferred Decisions. By fixing a1, a2, and a3, the right-hand side of this equation is now just some fixed number between zero and n-1. Maybe it's 773, right? The xi's were fixed up front, the yi's were fixed up front, and we fixed a1, a2, a3 at the end of the last slide, and those were the only things involved in the right-hand side. So say this is 773, and over on the left-hand side, x4 is fixed and y4 is fixed, but a4 is still random: an integer equally likely to be any value between zero and n-1. Now here's the key claim, which is that the left-hand side of this green equation is equally likely to be any number between zero and n-1. And I'll tell you the reasons why this key claim is true, although this is the point where we need a little bit of number theory, so I'll be kind of hand-wavy about it.
There are three things we have going for us. The first is that x4 and y4 are different. Remember, our assumption at the beginning of the proof was that the IP addresses differ somewhere, so why not just assume that they differ in the last eight bits for the purposes of the proof. Again, this is not important; if you really wanted to be pedantic, you could have three other cases depending on the other possible bits in which the IP addresses might differ. But anyway, because x4 and y4 are different, that means x4 - y4 is not zero. And in fact, now that I write this, it jogs my memory of something that I should have told you earlier and forgot, which is that the number of buckets n should be larger than the maximum value of the data chunks. So, for example, we definitely want the number of buckets n in this equation to be bigger than x4 and bigger than y4. The reason is that otherwise you could have x4 and y4 being different from each other, but their difference still winds up being zero mod n. For example, suppose n was four, x4 was six, and y4 was ten. Then x4 - y4 would be negative four, and that's actually zero modulo four. So that's not what you want. You want to make sure that if x4 and y4 are different, then their difference is non-zero modulo n, and the way you ensure that is you just make sure n is bigger than each of them. So you should choose a number of buckets bigger than the maximum chunk value. In our IP address example, remember, the chunks don't get bigger than 255, and I was suggesting a number of buckets equal to 997.
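The numerical example just given checks out in one line of code (a toy illustration of the warning, not a realistic bucket count):

```python
# With n = 4 buckets and chunk values x4 = 6, y4 = 10 (too big for n),
# the keys differ but their difference vanishes modulo n:
n, x4, y4 = 4, 6, 10
assert x4 != y4
assert (x4 - y4) % n == 0  # -4 is 0 mod 4: the bad case the lecture warns about

# Choosing n larger than the maximum chunk value rules this out:
n = 997
assert (x4 - y4) % n != 0
```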
Now, in general, this is never a big deal in practice. If you only wanted to use, say, 100 buckets rather than 1000, well, then you could just use smaller chunks: instead of breaking the IP address into 8-bit chunks, you could break it into 6-bit chunks or 4-bit chunks, and that would keep the chunk values smaller than the number of buckets, okay? You choose the number of buckets first, and then you choose how many bits to chunk your data into, and that's how you make sure this is satisfied. So, back to the three things we have going for us in trying to prove this key claim. First, x4 and y4 are different, so their difference is non-zero modulo n. Second, n is prime; that was part of the construction. And third, a4, this final coefficient, is equally likely to take on any value between zero and n-1. So, just as a plausibility argument, let me give you a proof by example. Again, I don't want to detour into elementary number theory, although it's beautiful stuff, so I encourage those of you who are interested to go learn some and figure out exactly how you prove it; you really only need the barest elementary number theory to give a formal proof of this. But just to show you that it's true in some simple examples, let's think about a very small prime. Say there are seven buckets, and let's suppose that the difference between x4 and y4 is two. So, having chosen the parameters, I've set n = 7 and the difference equal to two. What I want to do is step through the seven possible choices of a4 and look at what we get in this blue-circled quantity on the left-hand side of the green equation. We want to say the left-hand side is equally likely to be any of the seven numbers between zero and six, so that means that as we try our seven different choices for a4, we had better get the seven different possible numbers as output.
So, for example, if we set a4 = 0, then the blue-circled quantity is certainly itself zero. If we set a4 = 1, then it's 1 times 2, so we hit two. For a4 = 2, we get 2 times 2, which is four. For a4 = 3, we get 3 times 2, which is six. Now, when we set a4 = 4, we get 4 times 2, which is eight, and modulo seven that's one. For a4 = 5, 5 times 2 is ten, which modulo seven is three. For a4 = 6, 6 times 2 is twelve, which modulo seven is five. So as we step a4 through zero to six, we get the values zero, two, four, six, one, three, five. So indeed we cycle through the seven possible outcomes, one by one. So if a4 is chosen uniformly at random, then indeed this blue-circled quantity will also be uniformly random. Just to give another quick example, we could keep n = 7 and change the difference of x4 and y4. Again, we have no idea what it is, other than that it's non-zero. So maybe instead of two, it's three. Now again, let's step through the seven choices of a4 and see what we get. This time we get zero, then three, then six, then two, then five, then one, and then four. So again, stepping through the seven choices of a4, we get all seven different possibilities on the left-hand side. And it's not an accident of these choices of parameters: as long as n is prime, x4 and y4 are different, and a4 ranges over all possibilities, so will the value on the left-hand side. So by choosing a4 uniformly at random, indeed, the left-hand side is equally likely to be any of its possible values: zero, one, two, up to n-1. And what does that mean? Well, basically it means we're done with our proof, because remember, the right-hand side, circled in pink, is fixed. We fixed a1, a2, and a3, and the x's and y's have been fixed all along, so this is just some number, like 773. And so we know that there's exactly one choice of a4 that will cause the left-hand side to also be equal to 773.
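Both of the step-through examples above, and the general claim for a prime modulus, can be replicated with a few lines of Python (`cycle` is my own name for the helper; the lecture just does this by hand):

```python
n = 7  # a small prime, as in the worked examples

def cycle(diff):
    # Left-hand side of the green equation, i.e. a4 * (x4 - y4) mod n,
    # as a4 steps through 0, 1, ..., n-1.
    return [(a4 * diff) % n for a4 in range(n)]

# Difference 2: values come out 0, 2, 4, 6, 1, 3, 5, every residue once.
assert cycle(2) == [0, 2, 4, 6, 1, 3, 5]
# Difference 3: values come out 0, 3, 6, 2, 5, 1, 4, again a permutation.
assert cycle(3) == [0, 3, 6, 2, 5, 1, 4]

# Because n is prime and diff is non-zero mod n, this holds for every diff:
for diff in range(1, n):
    assert sorted(cycle(diff)) == list(range(n))
```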
Now, a4 has n different possible values, and it's equally likely to take on each one, so there's only a 1/n chance that we get the unlucky choice of a4 that causes the left-hand side to be equal to 773. And of course, there's nothing special about 773; it doesn't matter how the right-hand side comes out. We have only a 1/n chance of being unlucky and having a collision, and that is exactly the condition we were trying to prove. And that establishes the universality of this family of n^4 very simple, very easy to evaluate hash functions.
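As a final sanity check on the theorem, the universality condition can be verified exhaustively on a toy instance (my own scaled-down version: a prime n = 5 and keys with two chunks instead of four, so the family has only 5^2 = 25 members):

```python
from itertools import product

n = 5  # a tiny prime number of buckets (toy instance, not a real deployment)

def h(a, x):
    # Same construction as in the lecture: inner product mod n.
    return sum(ai * xi for ai, xi in zip(a, x)) % n

# All 2-coefficient hash functions, and all 2-chunk keys with chunk
# values below n (the requirement the proof relied on).
coeffs = list(product(range(n), repeat=2))
keys = list(product(range(n), repeat=2))

# For every pair of distinct keys, at most a 1/n fraction of the
# functions in the family cause a collision: the definition of universality.
for x, y in product(keys, keys):
    if x == y:
        continue
    collisions = sum(1 for a in coeffs if h(a, x) == h(a, y))
    assert collisions / len(coeffs) <= 1 / n
```

In fact, for this family the collision fraction comes out to exactly 1/n for every distinct pair, matching the "no worse than perfectly random hashing" gold standard.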