So, before we embark on the analysis, what are we hoping to understand? Well, it seems intuitively clear that there is going to be some trade-off between the two resources of the bloom filter. One resource is space consumption; the other is essentially correctness. The more space we use, the larger the number of bits, the fewer errors we'd hope to make. And as we compress the table more and more, reusing bits for more and more different objects, presumably the error rate is going to increase. So the goal of the analysis we're about to do is to understand this trade-off precisely, at least at a qualitative level. Once we understand how these two resources trade off against each other, we can ask whether there is a sweet spot that gives us a useful data structure: quite small space and a quite manageable error probability.

The way we're going to proceed with the analysis will be familiar to those of you who watched the open addressing video about hash tables. To make the mathematical analysis tractable, I'm going to make a heuristic assumption, a strong assumption which is not really satisfied by the hash functions you would use in practice. We're going to use that assumption to derive a performance guarantee for bloom filters, but as with any implementation, you should check that your implementation actually gets performance comparable to what the idealized analysis suggests. That said, if you use a good hash function and you have non-pathological data, the hope, and this is borne out by many empirical studies, is that you will see performance comparable to what this heuristic analysis suggests.

So, what is the heuristic assumption? Well, it will again be familiar from our hashing discussions: we're just going to assume that all the hashing is totally random. That is, for each choice of a hash function h_i and for each possible object x, the slot, the position of the array which that hash function gives for that object, is uniformly random, and it is independent of all other outputs of all hash functions on all objects. So the setup, then, is that we have n bits, and we have a data set S which we have inserted into our bloom filter. Now, our eventual goal is to understand the error rate, or the false positive probability: that is, the chance that an object which we haven't inserted into the bloom filter looks as if it has been inserted.
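To make this setup concrete, here is a minimal Python sketch of the data structure being analyzed: an array of n bits, k hash functions, and insert/lookup operations. The sketch is only illustrative; in particular, the hashlib-based hash functions and the class and method names are stand-ins I chose, not anything prescribed in the lecture.

import hashlib

class BloomFilter:
    # n bits, k hash functions; bits only ever flip from 0 to 1.
    def __init__(self, n, k):
        self.n = n
        self.k = k
        self.bits = [0] * n

    def _positions(self, x):
        # Stand-ins for h_1(x), ..., h_k(x): k array positions for object x.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def insert(self, x):
        for position in self._positions(x):
            self.bits[position] = 1

    def lookup(self, x):
        # True means "possibly inserted"; False means "definitely not inserted".
        return all(self.bits[position] for position in self._positions(x))

Note that a lookup can only err in one direction: if x really was inserted, all k of its bits were set to 1, so there are no false negatives. The question below is how often a never-inserted x nonetheless sees all k of its bits set to 1.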
But as a preliminary step, I want to ask about the population of 1s after we've inserted this data set S into the bloom filter. Specifically, let's focus on a particular position of the array, and by symmetry it doesn't matter which one, and ask: what is the probability that a given bit, a given position of this array, has been set to one after we've inserted the entire data set S?

All right, so this is a somewhat difficult quiz question, actually. The correct answer is the second answer: it's one minus the quantity (1 - 1/n) raised to the power k times |S|, where k is the number of hash functions and |S| is the number of objects in S. That's the probability that, say, the first bit of the bloom filter has been set to one after the data set S has been inserted. Maybe the easiest way to see this is to focus first on the first answer. The first answer, I claim, is the probability that the first bit is zero after the entire data set has been inserted; then of course the probability that it is one is just one minus that quantity, which is equal to the second answer. So we just need to understand why the first choice is the probability that the first bit equals zero.

Well, the bit is initially zero, and remember bits are only ever set from zero to one. So we really need to analyze the probability that this first bit survives all of the darts that get thrown at the bloom filter over the course of the entire data set being inserted. For each of the |S| objects inserted, k darts are effectively thrown at the array, at the bloom filter, uniformly at random and independently of each other. Any position a dart hits gets set to one: maybe it was one already, but if it was zero it becomes one, and if it was one it stays one. So how is this first bit going to stay zero? It has to be missed by all of the darts. A given dart is equally likely to land on any of the n bits, so the probability that it lands on this particular bit is only 1/n, and the probability that it lands somewhere else is 1 - 1/n. So the bit survives a single dart with probability 1 - 1/n, and there are k·|S| darts thrown in total, k per object that gets inserted. So the overall probability of eluding all of the darts is (1 - 1/n)^(k·|S|), and the probability that the bit is one is one minus that quantity, which is the second option in the quiz.
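Here is a quick Monte Carlo check of that quiz answer under the totally-random-hashing assumption; the values of n, k, and |S| below are made up purely for illustration.

import random

n, k, s = 1000, 5, 100            # bits, hash functions, inserted objects (illustrative)
trials = 20000
hits = 0
for _ in range(trials):
    # Throw k*s darts uniformly at the n bits; record whether bit 0 gets hit.
    if any(random.randrange(n) == 0 for _ in range(k * s)):
        hits += 1

print("empirical frequency that bit 0 is set:", hits / trials)
print("formula 1 - (1 - 1/n)^(k*|S|):        ", 1 - (1 - 1 / n) ** (k * s))
# Both should come out near 0.39 for these parameters.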
So, let's go ahead and resume our analysis using the answer to that quiz. What did we discover? The probability that a given bit is one is 1 - (1 - 1/n)^(k·|S|), where n is the number of positions in the array, k is the number of hash functions, and |S| is the number of insertions. That's a somewhat messy quantity, so let's recall a simple estimation fact that we used once before; you saw it when we analyzed Karger's random contraction algorithm and the benefit of multiple repetitions of that algorithm. The trick is to estimate a quantity of the form 1 + x (or 1 - x) by e^x (or e^(-x), as the case may be). You take the function 1 + x, which goes through the points (-1, 0) and (0, 1) and is of course a straight line, and you also look at the function e^x. Those two functions kiss at the point (0, 1), and everywhere else e^x lies above 1 + x. So for any real value of x we can always upper bound the quantity 1 + x by e^x.

Let's apply this fact to the quantity here, (1 - 1/n)^(k·|S|), taking x to be -1/n. That gives us the estimate 1 - e^(-k·|S|/n) for the probability that a given bit is one, and since 1/n is tiny, the estimate is essentially exact. Let's simplify a little further by introducing some notation. I'm going to let b denote the number of bits we're using per object, the quantity I was telling you to think of as something like eight earlier. This is the ratio of n, the total number of bits, to |S|, the number of objects inserted. With that notation, the green expression becomes 1 - e^(-k/b), where b is the number of bits per object. And now we're already seeing the kind of trade-off we were expecting.
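A small numerical check, again with illustrative parameters, of how close this estimate is to the exact expression from the quiz:

import math

k, s = 5, 100_000                    # hash functions, inserted objects (illustrative)
for b in (4, 8, 16):                 # bits per object
    n = b * s                        # total number of bits
    exact = 1 - (1 - 1 / n) ** (k * s)
    estimate = 1 - math.exp(-k / b)
    print(f"b={b:2d}  exact={exact:.6f}  estimate={estimate:.6f}")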
Remember, we're expecting that as we use more and more space, the error rate should go down, and that if we compress the table a lot, reusing the same bits for lots of different objects, that's when we should start to see a lot of false positives. In that expression, if you let the number of bits per object, little b, grow very large, tending to infinity, then the exponent k/b tends to zero, e^(-0) is one, and so the probability that a given bit is one tends to zero. That is, the more bits you have, the more space you use, the smaller the fraction of 1s and the bigger the fraction of 0s, and that should translate into a smaller false positive probability, which we will make precise on the next and final slide.

So let's rewrite the upshot from the last slide: the probability that a given bit is equal to one is approximately 1 - e^(-k/b), where k is the number of hash functions and b is the number of bits we're using per object. Now, this is not quite the quantity we care about. The quantity we care about is the false positive probability, where something looks like it's in the bloom filter even though it has never been inserted. So let's focus on some object x, say some IP address, which has never been inserted into this bloom filter. For this object x, not in the data set S, what has to happen for a lookup of x to report a false positive? Well, each one of its k bits, the k positions that indicate x's membership in the bloom filter, has to be set to one. We already computed the probability that a given bit is set to one, so we just take the quantity we computed on the previous slide and raise it to the kth power, indicating that this has to happen k different times (treating those k events as independent, in the spirit of our heuristic assumption).

So, believe it or not, we now have exactly what we set out to derive: a qualitative understanding of the intuitive trade-off between, on the one hand, the space used and, on the other hand, the error probability, the false positive probability. We're going to give this green circled quantity a name: we'll call it epsilon, the error rate, and again, all errors are false positives.
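As a two-line sketch of that formula (the function name here is mine, chosen for illustration):

import math

def false_positive_rate(b, k):
    # Heuristic error rate: each of x's k bits must be 1, with P[bit = 1]
    # estimated by 1 - e^(-k/b), and the k events treated as independent.
    return (1 - math.exp(-k / b)) ** k

print(false_positive_rate(8, 5))     # about 0.022, i.e. roughly 2%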
And again, as b goes to infinity, as we use more and more space, this exponent goes to zero, so 1 - e^(-k/b) goes to zero as well; and of course, once we raise it to the kth power, it gets even closer to zero. So the bigger b gets, the smaller this error rate epsilon gets.

So now let's get to the punch line. Remember, the question is: is this data structure actually useful? Can we set all of the parameters so that we use genuinely small space and still get a tolerable error rate epsilon? And of course we wouldn't be giving this video if the answer weren't yes. Now, one thing I've been vague about all along is how we set k, how we choose the number of hash functions. I told you at the very beginning to think of k as a small constant, like 2, 3, 4, or 5. Now that we have this really nice quantitative expression for how the error rate and the space trade off against each other, we can answer how to set k: namely, set k optimally. What do I mean? Well, fix the number of bits you're using per object: eight, sixteen, twenty-four, whatever. For fixed b, you can just choose the k that minimizes this green quantity, that minimizes the error rate epsilon. How do you minimize this quantity? Well, you do it just like you learned in calculus, and I'll leave that as an exercise for you to do in the privacy of your own home. But for fixed b, the way to make this green quantity epsilon as small as possible is to set the number of hash functions k to be roughly (ln 2) times b; notice that ln 2 is a number less than one, about 0.693. In other words, the number of hash functions in the optimal implementation of the bloom filter scales linearly with the number of bits you're using per object: it's about 0.693 times the bits per object. Of course, this is generally not going to be an integer, so you just pick k to be this number rounded up or rounded down.

Continuing the heuristic analysis, now that we know how to set k optimally to minimize the error for a given amount of space, we can plug that value of k back in and see how the space and the error rate trade off against each other, and we get a very nice answer. Specifically, we get that the error rate epsilon, under this optimal choice of the number of hash functions k, decreases exponentially in the number of bits you use per object: it's roughly one half raised to the power (ln 2)·b, that is, roughly (1/2)^(0.693·b).
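If you don't want to do the calculus, here is a brute-force check, again with an illustrative value of b, that for fixed b the best integer k is close to 0.693·b and that the resulting error rate matches (1/2)^(0.693·b):

import math

def false_positive_rate(b, k):
    # Heuristic false positive rate for b bits per object and k hash functions.
    return (1 - math.exp(-k / b)) ** k

b = 8                                            # bits per object (illustrative)
best_k = min(range(1, 40), key=lambda k: false_positive_rate(b, k))
print("best integer k:", best_k, "   (ln 2) * b =", round(math.log(2) * b, 3))
print("minimized error rate:", false_positive_rate(b, best_k))
print("(1/2)^((ln 2) * b):  ", 0.5 ** (math.log(2) * b))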
But again, the key qualitative point here is that epsilon goes down really quickly as you scale b: if you double the number of bits you're allocating per object, you square the error rate, and for small error rates, squaring makes them much, much smaller. And of course this is just one equation in two variables, so if you prefer, you can solve it to express b, the space requirement, as a function of the error requirement. If you know that the tolerance for false positives in your application is, say, one percent, you can solve for b and figure out how many bits per object you need to allocate. Rewriting, what you get is that the number of bits per object you need is roughly 1.44 times log base 2 of 1/epsilon. So, as expected, as epsilon gets smaller and smaller, as you want fewer and fewer errors, the space requirement increases.

So the final question is: is this a useful data structure? Can you set all the parameters so that you get a really interesting space-error trade-off? And the answer is: totally. Let me give you an example. Let's go back to having eight bits of storage per object, which corresponds to b = 8. Then what this formula for k indicates is that we should use five or six hash functions, and already you have an error probability of roughly two percent, which for a lot of the motivating applications we talked about is already good enough. And if you double the number of bits to, say, sixteen per object, then this error probability becomes really small, something like one in 2,000.

So, to conclude: at least in this idealized analysis, which again you should check against any real-world implementation, but which empirically is definitely achievable with a well-implemented bloom filter on non-pathological data, you can get this kind of performance with a ridiculously small amount of space per object, generally much less than it would take to store the objects themselves. You get fast inserts and fast lookups; you do have to live with false positives, but at a very controllable error rate. And that's what makes bloom filters a win in a number of applications.
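Finally, a small numerical check of the parameter-setting rules just quoted: the error rate with an optimally chosen k, and the bits per object needed for a target error rate. The function names are mine, chosen for illustration.

import math

def error_rate(b):
    # Heuristic false positive rate with k chosen optimally, k = (ln 2) * b.
    return 0.5 ** (math.log(2) * b)

def bits_per_object(eps):
    # Invert the relationship: b = log2(1/eps) / ln 2, roughly 1.44 * log2(1/eps).
    return math.log2(1 / eps) / math.log(2)

for b in (8, 16):
    print(f"b = {b:2d}: error rate = {error_rate(b):.4%}")
print("bits per object for a 1% error rate:", round(bits_per_object(0.01), 1))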