So, before we embark on the analysis, what are we hoping to understand? Well, it seems intuitively clear that there is going to be some trade-off between the two resources of the bloom filter. One resource is space consumption; the other resource is essentially correctness. The more space we use, the larger the number of bits, the fewer errors we'd hope to make, and as we compress the table more and more, reusing the same bits for more and more different objects, presumably the error rate is going to increase. So the goal of the analysis we're about to do is to understand this trade-off precisely, at least at a qualitative level. Once we understand how the trade-off between these two resources behaves, we can ask: is there a sweet spot which gives us a useful data structure, with quite small space and a quite manageable error probability? The way we're going to proceed with the analysis will be familiar to those of you who watched the open addressing video about hash tables. To make the mathematical analysis tractable, I'm going to make a heuristic assumption, a strong assumption which is not really satisfied by the hash functions you would use in practice. We're going to use that assumption to derive a performance guarantee for bloom filters, but as with any implementation, you should check that your implementation is actually getting performance comparable to what the idealized analysis suggests. That said, if you use a good hash function and you have non-pathological data, the hope, and this is borne out in many empirical studies, is that you will see performance comparable to what this heuristic analysis suggests. So, what is the heuristic assumption? Well, it's going to be familiar from my hashing discussions: we're just going to assume that all the hashing is totally random. That is, for each choice of a hash function hi and for each possible object x, the slot, the position of the array which that hash function gives for that object, is uniformly random, and it's independent from all other outputs of all hash functions on all objects. So the setup, then, is that we have n bits, and we have a data set S which we have inserted into our bloom filter. Now, our eventual goal is to understand the error rate, or false positive probability, that is, the chance that an object which we haven't inserted into the bloom filter looks as if it has been inserted into the bloom filter. But as a preliminary step, I want to ask about the population of 1s after we've inserted this data set S into the bloom filter. Specifically, let's focus on a particular position of the array, and by symmetry it doesn't matter which one, and let's ask: what is the probability that this given bit, this given position of the array, has been set to 1 after we've inserted this entire data set S? Alright, so this is a somewhat difficult quiz question, actually. The correct answer is the second answer: it's 1 minus the quantity (1 - 1/n), raised to the power k times the cardinality of S, where k is the number of hash functions and |S| is the number of objects. That's the probability that, say, the first bit of the bloom filter has been set to 1 after the data set S has been inserted. Maybe the easiest way to see this is to first focus on the first answer, which, I claim, is the probability that the first bit is still 0 after the entire data set has been inserted. Then, of course, the probability that it's a 1 is just 1 minus that quantity, which is equal to the second answer.
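Just to restate the quiz answer in symbols, under the fully random hashing assumption described above:

```latex
% A fixed bit stays 0 only if every one of the k|S| independent, uniformly
% random hash evaluations misses it (each hits it with probability 1/n):
\Pr[\text{bit stays } 0] = \left(1 - \frac{1}{n}\right)^{k|S|},
\qquad
\Pr[\text{bit is } 1] = 1 - \left(1 - \frac{1}{n}\right)^{k|S|}.
```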
So let's see why the first choice is the probability that the first bit equals 0. Well, it's initially 0, and remember, bits are only ever set from 0 to 1. So we really need to analyze the probability that this first bit survives all of the darts that get thrown at the bloom filter over the course of this entire data set being inserted. Each of the |S| objects, when it gets inserted, effectively throws k darts uniformly at random and independently at the array, at the bloom filter. Any position a dart hits gets set to 1; maybe it was 1 already, but if it was 0 it gets set to 1, and if it's 1 it stays 1. So how is this first bit going to stay 0? It has to be missed by all of the darts. A given dart is equally likely to land on any of the n bits, so the probability that it lands on this particular bit is only 1/n, and the probability that it fortunately lands somewhere else is 1 - 1/n. So the bit survives a single dart with probability 1 - 1/n, and there are k times |S| darts being thrown, k per object that gets inserted, so the overall probability of eluding all of the darts is (1 - 1/n) raised to the power k times |S|. And the probability that the bit is 1 is just 1 minus that quantity, which is the second option in the quiz. So, let's go ahead and resume our analysis using the answer to that quiz. What did we discover? The probability that a given bit is 1 is 1 minus the quantity (1 - 1/n) raised to the power k times |S|, where n is the number of positions, k is the number of hash functions, and |S| is the number of insertions. That's kind of a messy quantity, so let's recall a simple estimation fact that we used once earlier; you saw it when we analyzed Karger's random contraction algorithm and the benefit of running multiple repetitions of it. The trick is to estimate a quantity of the form 1 + x (or 1 - x) by e^x (or e^{-x}, as the case may be). If you take the function 1 + x, which is a straight line through the points (-1, 0) and (0, 1), and you also look at the function e^x, those two functions kiss at the point (0, 1), and everywhere else e^x lies above 1 + x. So for any real value of x, we can always upper bound the quantity 1 + x by e^x. Let's apply this fact to the quantity here, (1 - 1/n) raised to the k|S|, taking x to be -1/n. That lets us estimate this probability as roughly 1 - e^{-k|S|/n}. Let's simplify a little further by introducing some notation. I'm going to let b denote the number of bits that we're using per object; this is the quantity I was telling you to think of as something like 8 previously. It's the ratio of n, the total number of bits, to the cardinality of S. So this green expression becomes 1 - e^{-k/b}, where b is the number of bits per object. And now we're already seeing the type of trade-off that we were expecting.
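Tracing that estimation step through in symbols, with b = n/|S| denoting the bits per object:

```latex
% Using 1 + x \le e^x with x = -1/n, so that (1 - 1/n) is well approximated by e^{-1/n}:
\Pr[\text{bit is } 1]
  = 1 - \left(1 - \frac{1}{n}\right)^{k|S|}
  \approx 1 - e^{-k|S|/n}
  = 1 - e^{-k/b},
\qquad \text{where } b = \frac{n}{|S|}.
```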
Remember, we're expecting that as we use more and more space, the error rate should go down, and that if you compress the table a lot, reusing the same bits for lots of different objects, that's when you'd start to see a lot of false positives. And indeed, in this light blue expression, if you take the number of bits per object, the amount of space, little b, and let it grow very large, tending to infinity, then this exponent tends to 0, and e to the 0 is 1, so overall this probability of a given bit being 1 is tending to 0. That is, the more bits you have, the bigger the space, the smaller the fraction of 1s and the bigger the fraction of 0s, and that should translate into a smaller false positive probability, as we will make precise on the next and final slide. So let's rewrite the upshot from the last slide: the probability that a given bit is equal to 1 is approximately 1 - e^{-k/b}, where k is the number of hash functions and b is the number of bits we're using per object. Now, this is not quite the quantity we care about. The quantity we care about is the false positive probability, where something looks like it's in the bloom filter even though it has never been inserted. So let's focus on some object x, say some IP address, which has never been inserted into this bloom filter. What has to happen for us to have a successful lookup, a false positive, for this object? Well, each one of its k bits has to be set to 1. We already computed the probability that a given bit is set to 1, and for all k of the bits that indicate x's membership in the bloom filter to be set to 1, we just take the quantity we computed on the previous slide and raise it to the kth power, indicating that it has to happen k different times. So, believe it or not, we now have exactly what we set out to derive: a qualitative understanding of the intuitive trade-off between, on the one hand, the space used and, on the other hand, the error probability, the false positive probability. We're going to give this green circled quantity a name: we'll call it epsilon, for the error rate, and again, all errors are false positives. And again, as b goes to infinity, as we use more and more space, this exponent goes to 0, so 1 - e to that quantity goes to 0 as well, and of course, once we raise it to the kth power, it gets even closer to 0. So the bigger b gets, the smaller this error rate epsilon gets. So now let's get to the punch line. Remember, the question is: is this data structure actually useful? Can we set all of the parameters so that we get both really usefully small space and a tolerable error rate epsilon? And of course, we wouldn't be giving this video if the answer weren't yes. Now, one thing I've been glossing over all along is how to set k, how to choose the number of hash functions. I told you at the very beginning to think of k as a small constant, like 2, 3, 4, or 5. Now that we have this really nice quantitative version of how the error rate and the space trade off against each other, we can answer how to set k, namely, set k optimally. So what do I mean? Well, fix the number of bits that you're using per object: 8, 16, 24, whatever. For fixed b, you can just choose the k that minimizes this green quantity, the error rate epsilon, written out below.
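Putting the false positive probability into one formula (this is just the green quantity from the slide, written out):

```latex
% For an object x not in S, a false positive requires all k of its bits to be 1:
\varepsilon \;\approx\; \left(1 - e^{-k/b}\right)^{k},
\qquad \text{where } k = \text{number of hash functions}, \quad b = \frac{n}{|S|}.
```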
So, how do you minimize this quantity as a function of k? Well, you do it just like you learned in calculus, and I'll leave that as an exercise for you to do in the privacy of your own home. But for fixed b, the way to make this green quantity epsilon as small as possible is to set the number of hash functions k to be roughly (ln 2) times b; the natural log of 2 is a number less than 1, about 0.693. In other words, the number of hash functions in the optimal implementation of the bloom filter scales linearly with the number of bits that you're using per object: it's about 0.693 times the bits per object. Of course, this is generally not going to be an integer, so you just take k to be this number rounded up or rounded down. Continuing the heuristic analysis, now that we know how to set k optimally to minimize the error for a given amount of space, we can plug that value of k back in and see how the space and the error rate trade off against each other, and we get a very nice answer. Specifically, we get that under the optimal choice of the number of hash functions k, the error rate epsilon decreases exponentially in the number of bits that you use per object: it's roughly one half raised to the power (ln 2) times b, that is, roughly (1/2)^{0.693 b}. The key qualitative point here is that epsilon goes down really quickly as you scale b: if you double the number of bits that you're allocating per object, you're squaring the error rate, and for small error rates, squaring makes it much, much smaller. And of course, this is just one equation in two variables; if you prefer, you can solve it to express b, the space requirement, as a function of an error requirement. So, if you know that the tolerance for false positives in your application is one percent, you can just solve for b and figure out how many bits per object you need to allocate. Rewriting, what you get is that the number of bits per object that you need is roughly 1.44 times log base 2 of 1 over epsilon. So, as expected, as epsilon gets smaller and smaller, as you want fewer and fewer errors, the space requirement increases. So, the final question is: is this a useful data structure? Can you set all the parameters so that you get a really interesting space-error trade-off? And the answer is: totally. Let me give you an example. Let's go back to having eight bits of storage per object, so that corresponds to b = 8. Then what this formula indicates is that we should use five or six hash functions, and already you have an error probability of something like two percent, which for a lot of the motivating applications we talked about is already good enough. And if you double the number of bits to, say, sixteen per object, then this error probability becomes really small, in the neighborhood of one in a couple of thousand. So, to conclude: at least in this idealized analysis, which again you should check against any real-world implementation, although empirically it is definitely achievable with a well-implemented bloom filter on non-pathological data, you can get this kind of performance even with a really ridiculously minuscule amount of space per object, much less, generally, than storing the object itself. You get fast inserts and fast lookups; you do have to live with false positives, but with a very controllable error rate, and that's what makes bloom filters a win in a number of applications.
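As a quick numerical sanity check of these formulas, here is a small Python sketch (mine, not code from the lecture; the function names optimal_k, error_rate, and bits_per_object are just illustrative) that reproduces the numbers quoted above:

```python
import math

def optimal_k(b):
    """Optimal number of hash functions for b bits per object: k ~ (ln 2) * b."""
    return math.log(2) * b

def error_rate(b, k=None):
    """Heuristic false positive rate (1 - e^(-k/b))^k; defaults to the optimal k."""
    if k is None:
        k = optimal_k(b)
    return (1.0 - math.exp(-k / b)) ** k

def bits_per_object(epsilon):
    """Bits per object needed for a target error rate: b ~ 1.44 * log2(1/epsilon)."""
    return 1.44 * math.log2(1.0 / epsilon)

print(optimal_k(8))                        # ~5.5, so round to k = 5 or 6
print(error_rate(8, 5), error_rate(8, 6))  # both ~0.02, i.e. about a 2% false positive rate
print(error_rate(16))                      # ~0.0005, in the neighborhood of 1 in 2,000
print(bits_per_object(0.01))               # ~9.6 bits per object for a 1% error tolerance
```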