In the last video, we discussed the performance of hash tables implemented using chaining, with one linked list per bucket. In fact, we proved mathematically that if you use a hash function chosen uniformly at random from a universal family, and if you keep the number of buckets comparable to the size of your data set, then you're guaranteed constant-time expected performance. But recall that chaining is not the only implementation of hash tables. There's a second paradigm, also very important, called open addressing. This is where you're only allowed to store one object in each slot, and you keep searching for an empty slot when you need to insert a new object into your hash table. Now, it turns out to be much harder to mathematically analyze hash tables implemented using open addressing, but I do want to say a few words about it to give you the gist of what kind of performance you might hope to expect from these sorts of hash tables. So recall how open addressing works. We're only permitted to store one object in each slot. This is unlike the case with chaining, where we can have an arbitrarily long list in a given bucket of the hash table. With at most one object per slot, open addressing obviously only makes sense when the load factor alpha is less than one, that is, when the number of objects you're storing in your table is less than the number of slots available. Because of this requirement that we have at most one object per slot, we need to demand more of our hash function. Our hash function might ask us to put a given object, say with some IP address, into, say, bucket number 17, but bucket number 17 might already be populated. In that case, we go back to our hash function and ask it where to look for an empty slot next. So maybe it tells us to look next in bucket 41.
If 41 is full, it tells us to look in bucket number 7, and so on. Two specific strategies for producing a probe sequence that we mentioned earlier were double hashing and linear probing. Double hashing is where you use two different hash functions, h1 and h2. h1 tells you which slot to search first, and then every time you find a full slot, you add an increment specified by the second hash function, h2. Linear probing is even simpler: you just have one hash function that tells you where to search first, and then you just add one to the slot number until you find an empty slot. As I mentioned at the beginning, it is quite nontrivial to mathematically analyze the performance of hash tables using these various open addressing strategies. It's not impossible; there is some quite beautiful and quite informative theoretical work that does tell us how these hash tables perform, but that's well outside the scope of this course. So instead, what I want to do is give you a quick and dirty calculation that suggests, at least in an idealized world, what kind of performance we should expect from a well-implemented hash table with open addressing, as a function of the load factor alpha. Precisely, I'm going to introduce a heuristic assumption. It's certainly not true, but we'll adopt it just for a quick and dirty calculation: that we're using a hash function under which each of the n! possible probe sequences is equally likely. Now, no hash function you're ever going to use will actually satisfy this assumption, and if you think about it for a little bit, you'll realize that if you use double hashing or linear probing, you're certainly not satisfying it. Still, this gives us a kind of best-case scenario against which you can compare the performance of your own hash table implementations.
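The two probe-sequence strategies just described can be sketched in a few lines. This is a rough illustration, not code from the course: the function names are mine, and the particular choice of second hash function is just one common way to guarantee the increment is nonzero (a prime table size then makes every increment visit all slots).

```python
def linear_probe_sequence(key, num_slots):
    """Linear probing: start at the hashed slot, then step by 1 each time."""
    start = hash(key) % num_slots
    for i in range(num_slots):
        yield (start + i) % num_slots

def double_hash_sequence(key, num_slots):
    """Double hashing: start at h1(key), then step by h2(key) each time."""
    h1 = hash(key) % num_slots
    # Force the increment into 1..num_slots-1 so the sequence always moves;
    # with a prime table size, every increment is coprime to the size and
    # the sequence visits every slot.
    h2 = 1 + (hash(str(key)) % (num_slots - 1))
    for i in range(num_slots):
        yield (h1 + i * h2) % num_slots

def insert(table, key, probe_sequence=double_hash_sequence):
    """Probe until an empty slot (None) is found; return the slot used."""
    for slot in probe_sequence(key, len(table)):
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")
```

Either generator can be handed to `insert`; the only difference between the two strategies is how the next slot to try is chosen.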
So if you implement a hash table and you're seeing performance as good as what's suggested by this idealized heuristic analysis, then you're home free; you know your hash table is performing great. So what is the line in the sand that gets drawn under this heuristic assumption? What is this idealized hash table performance as a function of the load alpha? Well, here it is. What I'm going to argue next is that under this heuristic assumption, the expected amount of time to insert a new object into the hash table is essentially 1/(1 - alpha), where alpha is the load. Remember, the load is the number of objects in the hash table divided by the number of available slots. So if the hash table is half full, then alpha is 0.5; if it's 75 percent full, then alpha is three-fourths. What this means is that, in this idealized scenario, if you keep the load under control, insertion time is going to be great. Say the load is 50 percent: then alpha is 0.5 and 1/(1 - alpha) = 2, so you expect just two probes before the successful insertion of the new object. And of course, lookup is going to be at least as good as insertion: if you're lucky, a lookup might terminate early when you find what you're looking for, while in the worst case, an unsuccessful search goes all the way to an empty slot, which is the same as insertion. So if alpha is small, bounded away from one, you're getting constant-time performance. On the other hand, as the hash table gets full, as alpha gets close to one, this operation time blows up; it goes to infinity as alpha approaches one. If you insist on a 90 percent full hash table with open addressing, you're going to start seeing ten probes per operation. So with open addressing, you really want to keep the load under control, certainly no more than about 0.7.
Maybe even less than that. To refresh your memory, with chaining, hash tables are perfectly well defined even with load factors bigger than one, and what we derived is that under universal hashing, a weaker assumption, we had an operation time of 1 + alpha for a load of alpha. So with chaining, you just have to keep alpha at most some reasonably small constant; with open addressing, you really have to keep it bounded well below one. So next, let's understand why this observation is true. Why, under the assumption that every probe sequence is equally likely, do we expect a 1/(1 - alpha) running time for hash tables with open addressing? The reason is pretty simple, and we can derive it by analogy with a simple coin-flipping experiment. To motivate the experiment, think just about the very first probe that we do. We get some new object, some new IP address, that we want to insert into our hash table. Let's say our hash table is currently 70 percent full: say there are 100 slots, and 70 are already taken by objects. When we make this first probe, by assumption it's equally likely to be any one of the 100 slots, 70 of which are full and 30 of which are empty. So with probability 1 - alpha, or in this case 30 percent, our first probe will, luckily, find an empty slot, and we'll be done; we'll just insert the new object into that slot. If we get unlucky, with probability 70 percent, we find a slot that's already occupied, and then we have to try again. So we try a new slot, drawn at random, and we again check: is it full or not? Again, with 30 percent probability it's going to be empty and we can stop, and if it's already full, then we try yet again.
So doing random probes, looking for an empty slot, is tantamount to flipping a coin with probability of heads 1 - alpha, or in this example 30 percent, and the number of probes you need until you successfully insert is just the number of times you flip this biased coin until you see heads. In fact, this biased coin-flipping experiment slightly overestimates the expected insertion time under the heuristic assumption, and that's because during an insertion we're never going to try the same slot twice; we're going to try all n buckets in some order, with each of the n! orderings equally likely. So back to our example, where we have a hash table with 100 slots, 70 of which are full. On the first probe, indeed, we have a 30 in 100 chance of getting an empty slot. If that fails, then we're not going to try the same slot again, so there are only 99 remaining possibilities, again 30 of which are empty, since the one we checked last time was full. So we actually have a 30 in 99 chance of getting an empty slot on the second try, a 30 in 98 chance on the third try if the second one fails, and so on. But a valid upper bound is just to assume a 30 percent success probability on every single probe, and that's precisely what this coin-flipping experiment gives us. So the next quiz will ask you to actually compute the expected value of capital N, the number of coin flips needed to get heads, when the probability of heads is 1 - alpha. As a hint, we actually analyzed this exact same coin-flipping experiment, with alpha equal to one half, back when we discussed the expected running time of randomized linear-time selection. All right, so the correct answer is the first one: 1/(1 - alpha). To see why, let's return to our derivation, where we reduced analyzing the expected insertion time to this random variable, the expected number of coin flips until we see heads.
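The coin-flipping experiment is also easy to check empirically. Here's a quick simulation sketch (not from the lecture; the function names and trial count are mine) that estimates the expected number of flips until heads for a coin with P(heads) = 1 - alpha:

```python
import random

def flips_until_heads(p_heads, rng):
    """Flip a coin with P(heads) = p_heads; count flips until the first heads."""
    flips = 1
    while rng.random() >= p_heads:  # tails: flip again
        flips += 1
    return flips

def average_flips(alpha, trials=100_000, seed=0):
    """Estimate E[N] when P(heads) = 1 - alpha (the fraction of empty slots)."""
    rng = random.Random(seed)
    return sum(flips_until_heads(1 - alpha, rng) for _ in range(trials)) / trials
```

For alpha = 0.5 the estimate comes out close to 2, and for alpha = 0.7 close to 1/0.3, about 3.3, matching the 1/(1 - alpha) answer derived next.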
So, I'm gonna solve this exactly the same way we did it back when we analyzed the randomized selection algorithm, and it's quite a sneaky way, but very effective. What we're going to do is express the expected value of capital N in terms of itself, and then solve. How do we do that? Well, on the left-hand side, let's write the expected number of coin flips, the expected value of capital N, and then let's just notice that there are two cases: either the first coin flip is heads or it's not. In either case, you're certainly going to have one coin flip, so let's separate that out and count it separately. With probability alpha, the first coin flip is tails, and then you start all over again, and because this is a memoryless process, the expected number of further coin flips required, given that the first flip was tails, is just the same as the expected number of coin flips in the first place. That gives us E[N] = 1 + alpha * E[N]. So now it's a simple matter to solve this one linear equation for the expected value of N, and we find that it is indeed 1/(1 - alpha), as claimed. Summarizing: under our idealized heuristic assumption, that every single probe sequence is equally likely, the expected insertion time is upper bounded by the expected number of coin flips, which by this argument is at most 1/(1 - alpha). So as long as your load alpha is bounded well below one, you're good; at least in this idealized analysis, your hash table will work extremely quickly. Now, I hope you're regarding this idealized analysis with a bit of skepticism. From a false hypothesis, you can literally derive anything you want, and we started with this heuristic assumption, that all probe sequences are equally likely, which is not satisfied by hash functions you're actually going to use in practice. So should you expect this 1/(1 - alpha) bound to hold in practice or not? Well, that depends to some extent.
It depends on what open addressing strategy you're using, on how good a hash function you're using, and on whether the data is pathological or not. Just to give coarse rules of thumb: if you're using double hashing and you have non-pathological data, I would go ahead and look for this 1/(1 - alpha) bound in practice. So implement your hash table, check its performance as a function of the load factor alpha, and shoot for the 1/(1 - alpha) curve; that's really what you'd like to see. With linear probing, on the other hand, you should not expect to see this 1/(1 - alpha) performance guarantee, even in a totally idealized scenario. Remember, linear probing is the strategy where the hash function tells you where to look first, and then you just scan linearly through the hash table until you find what you're looking for: an empty slot, the object you're looking up, or whatever. So linear probing, even in a best-case scenario, is going to be subject to clumping: you're going to have contiguous groups of slots which are all full, and that's because of the linear probing strategy. I encourage you to do some experiments with implementations to see this for yourself. So because of clumping, with linear probing, even in the idealized scenario, you're not going to see 1/(1 - alpha). You're going to see something worse, though, in idealized situations, still something quite reasonable, and that's the last thing I want to tell you about in this video. Now, needless to say, with linear probing the heuristic assumption is badly false. The heuristic assumption is pretty much always false no matter what hashing strategy you're using, but with linear probing it's, quote unquote, really false. To see that, recall the heuristic assumption says that all n! probe sequences are equally likely.
That would mean your next probe is uniform at random among everything you haven't probed so far, but with linear probing it's totally the opposite: once you know the first slot you're looking in, say bucket 17, you know the rest of the sequence, because it's just a linear scan of the table. So it's pretty much the exact opposite of each successive probe being independent of the previous ones. So, to state a conjectured or idealized performance guarantee for hash tables with linear probing, we're going to replace the blatantly false heuristic assumption with a still false, but more reasonable, heuristic assumption. What do we know? We know that with linear probing the initial probe determines the rest of the sequence. So let's assume that these initial probes are uniform at random and independent for different keys. Of course, once you have the initial probe, you know everything else, but let's assume independence and uniformity among the initial probes. Now, this is a strong assumption, way stronger than assuming you have a universal family of hash functions, and it is not satisfied in practice. But performance guarantees derived under this assumption are typically matched in practice by well-implemented hash tables that use linear probing, so the assumption is still useful for deriving the correct idealized performance of this type of hash table. So what is that performance? Well, this is an utterly classic result from exactly 50 years ago, from 1962, by my colleague, the living legend Don Knuth, author of The Art of Computer Programming. What Knuth proved is that under this weaker heuristic assumption, suitable for linear probing, the expected time to insert an object into a hash table with load factor alpha is worse than 1/(1 - alpha), but
It is still a function of the load alpha only, roughly 1/(1 - alpha) squared, and not a function of the number of objects in the hash table. That is, with linear probing you will not get as good a performance guarantee, but it is still the case that if you keep the load factor bounded away from one, if you make sure the hash table doesn't get too full, you will enjoy constant-time operations on average. So, for example, if with linear probing your hash table is 50 percent full, then you're going to get an expected insertion time of roughly four probes. Note, however, that this quantity blows up pretty rapidly as the hash table grows full: if it's 90 percent full, this is already going to be something like a hundred probes on average. So you really don't want to let hash tables get too full when you're using linear probing. You might well wonder if it's ever worth implementing linear probing, given that it has a worse performance curve, 1/(1 - alpha) squared, than the curve you'd hope for from something like double hashing, 1/(1 - alpha). It's a tricky cost-benefit analysis between linear probing and more complicated but better-performing strategies, and it really depends on the application. There are reasons you do sometimes want to use linear probing; it's actually quite common in practice. For example, it often interacts very well with memory hierarchies. So again, as with all of this hashing discussion, the costs and benefits are very subtle trade-offs between the different approaches. If you have mission-critical code that's using a hash table and you really want to optimize it, try a bunch of prototypes and just test; figure out which one is best for your particular application. Let me conclude the video with a quote from Knuth himself, where he talks about the rapture of proving this 1/(1 - alpha) squared theorem and how it was life-changing.
He says: "I first formulated the following derivation," meaning the proof of that last theorem, "in 1962. Ever since that day, the analysis of algorithms has, in fact, been one of the major themes in my life."