So, in this video, we're going to discuss Bloom filters, a data structure developed, appropriately enough, by Burton Bloom back in 1970. Bloom filters are a variant on hash tables, and you'll recognize a lot of the ideas from our hash table discussion. The win you get with Bloom filters is that they're more space efficient than run-of-the-mill hash tables. The catch is that they do allow for errors: there's a nonzero false positive probability when you do lookups. But that's still a win for some applications. So, it's a very cool idea, a very cool data structure, and you see it used quite a bit in practice, so let's start talking about it. We'll go through the usual topics that we do whenever we discuss a new data structure. First, I want to tell you what operations Bloom filters support and what kind of performance you should expect from those operations; in other words, what is the API corresponding to the data structure. Second, I'll talk a little bit about what they're good for, that is, some potential applications. Then we'll take a peek under the hood: I'll tell you some of the implementation details, with an emphasis on explaining why you get the kinds of performance trade-offs that you do with Bloom filters. So, to first order, the raison d'ĂȘtre of a Bloom filter is exactly the same as that of a hash table: it supports super fast inserts and super fast lookups. You can put stuff in there, and it remembers what you put in earlier. Now, of course, what you should be wondering is: we already know a data structure that supports super fast inserts and lookups, the hash table, so why am I bothering to tell you about yet another data structure with exactly those same operations? So, let me tell you about the pros and cons of Bloom filters relative to the run-of-the-mill hash tables we've already discussed. The big win is that Bloom filters are more space efficient than hash tables.
No matter whether the hash table is implemented with chaining or with open addressing, a Bloom filter uses much less space per object; in fact, as we'll see, less space than the size of an object itself. As for the cons: first of all, Bloom filters are really for applications where you just want to remember which values you've seen. You're not storing pointers to the objects themselves; you're just remembering values. So, the first drawback of the Bloom filter is that, because we want to be so space efficient, we don't even remember the object itself, just whether or not we've seen it before. We're not going to be able to store the objects, or even pointers to the objects, in a Bloom filter; we're just going to remember what we've seen and what we haven't. Some of you might know the terminology hash set for this kind of variant of a hash table, as opposed to a full-blown hash table or hash map. The second con is that, at least in the vanilla implementation of Bloom filters I'm going to describe here, deletions are not allowed. You can only insert; you can't delete. The situation with deletions is very much like that of hash tables implemented with open addressing. It's not that you can't have a Bloom filter that accommodates deletions; you can, there are variants that do, but they require significantly more work, and we're not going to discuss them here. So, to first order, at least for vanilla Bloom filters, you want to think of them as suitable for applications where deletion is not a first-order operation. Now, the third con, and this is a drawback we have not seen previously in any data structure, is that Bloom filters can actually make mistakes. Now, what kind of mistake could this kind of data structure possibly make, when all you're really doing is looking something up?
Well, one kind of mistake would be a false negative: you inserted something previously, then you look it up, and the hash table or the Bloom filter says it's not there. Bloom filters will not have false negatives of this form. If you insert something and look it up later, it's definitely going to confirm that you inserted it in the past. But Bloom filters will have false positives. That means that, despite the fact that you have never inserted, say, a given IP address into the Bloom filter, if you look it up later, it will say that you have. So, there will sometimes be, in some sense, phantom objects in Bloom filters: objects which it thinks have been inserted even though they haven't been. So, given that, I'm now showing you two data structures with essentially the same functionality, hash tables and Bloom filters, at least if we ignore the deletion issue. You might wonder which one is more appropriate, which one is more useful. And because there are these trade-offs between the two, the answer, as you'd expect, is that it depends on the application. If it's an application where space is really at a premium, you might want to turn to Bloom filters, especially if a small chance of a false positive is not a deal breaker. If you have some kind of application where false positives are absolutely out of the question, of course you should not use a Bloom filter, and you want to think about a hash table. So, what are some situations where people actually do use Bloom filters, where you either really care about space and/or you don't really care about this false positive probability? One of the earliest applications of Bloom filters, and this was a long time ago, something like 40 years, was spell checkers. So, how would you implement a spell checker using a Bloom filter?
Well, first you have an insert phase, where you basically just go through the entire dictionary word by word and insert every valid word into the Bloom filter. Then, afterwards, when you're presented with a new document that somebody has written, you go through the document word by word, and for each word you ask: is this in the Bloom filter? That is, is this one of the legitimate words from the dictionary which we previously inserted? If the Bloom filter says yes, this word is in the dictionary, as in we've stored and seen it before, then you treat it as a correctly spelled word, and if it's not in the Bloom filter, you treat it as an incorrectly spelled word. Now, the false positive probability means this isn't a perfect spell checker. Sometimes you're going to look up a misspelled word and the Bloom filter won't catch it; with small probability, it will actually say yes, this is a legitimate word. So, you know, it's not ideal, but the English language is pretty big, and space was definitely at a premium 40-plus years ago. So, it was a win, for that application at that time, to use Bloom filters to implement a spell checker. Another application, which remains relevant today, is to keep track of a list of forbidden passwords. Now, why would you have forbidden passwords? Well, maybe you want to keep track of passwords which are too weak, too easy to guess, or too common. You may yourself have used a piece of software or a website at some point which asked you for a password and, if you typed in something too simple or too easy, rejected it and asked you to type in another one. So, one way to implement a list of forbidden passwords is just with a Bloom filter, and the idea is similar to the spell checker. You first insert into the Bloom filter all of the passwords that you don't want anybody to use, for whatever reason.
Then, when a client comes along and tries to type in a new password, you look it up in the Bloom filter, and if you get a positive lookup, you tell the user: no, that's no good, you can't use that password, choose another one. And this is an application where you really don't care about the errors, about the fact that there's a false positive rate. Let's say the error rate is something like one percent, or 0.1 percent. What would that mean in context? It would just mean that once in a while, one in a hundred or one in a thousand clients actually types in a perfectly strong password that gets rejected by the Bloom filter, and they have to type in a second one. Okay, big deal. And if space is at a premium, it's definitely a win to use this super lightweight data structure to keep track of these blocked passwords. These days, certainly one of the killer applications of Bloom filters is in software deployed on network routers, the machinery out in the Internet responsible for transmitting packets from one place to another. So, what are the reasons Bloom filters have found fertile application in network routers? Well, first of all, you typically do have a budget on space on a network router. There are a lot of things the router has to do, and you don't want to waste too much of that space on a data structure for one specific task. And you also need super, super fast data structures, right? These packets are coming in at a torrential rate, and you want to process each packet in real time, sending it off to the next hop. Bloom filters are the workhorse behind a lot of the different tasks done in a network router. You can imagine wanting to keep track of blocked IP addresses; you can imagine keeping track of the contents of some cache, so you don't do spurious lookups.
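The forbidden-password workflow described above can be sketched in a few lines of Python. Everything here, the function names, the salted-SHA-256 hashing, and the sizing constants, is an illustrative choice of ours, not anything prescribed by the lecture:

```python
import hashlib

NUM_BITS, K = 1 << 16, 3    # sizing picked arbitrarily for illustration
bits = bytearray(NUM_BITS)  # one byte per bit, to keep the sketch simple

def bf_indices(s):
    """K array indices for string s, by salting SHA-256 with the hash number."""
    for i in range(K):
        d = hashlib.sha256(f"{i}:{s}".encode()).digest()
        yield int.from_bytes(d[:8], "big") % NUM_BITS

def bf_insert(s):
    """Set all K bits for s to one, ignoring their previous values."""
    for j in bf_indices(s):
        bits[j] = 1

def password_allowed(pw):
    """A positive lookup means forbidden (or, rarely, a false positive)."""
    return not all(bits[j] for j in bf_indices(pw))

# Insert phase: load the blocked passwords.
for pw in ("123456", "password", "qwerty"):
    bf_insert(pw)
```

Every blocked password is now guaranteed to be rejected; the only cost is that, once in a great while, a legitimate password will collide with bits set by the blocked ones and get rejected too.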
You can imagine maintaining statistics to check for denial-of-service attacks, and so on and so forth. So, summarizing: as an expert programmer, what is it that you should remember about Bloom filters? What purpose does this tool serve in your toolbox? Well, as far as the operations supported, it's the same as a hash table: the point is to have super fast inserts and super fast lookups. But Bloom filters are a more lightweight version of a hash table. They're more space efficient, but they have the drawback of a small error probability. Those are the key features you should remember when deciding whether or not you're working on an application that could make good use of this data structure. So, having discussed the operations and what these data structures are good for, let's take it to the next level; let's peer under the hood and see how they're actually implemented, because this is really a quite simple, quite cool idea. So, like hash tables, Bloom filters have essentially two ingredients. First of all, there's an array, and second of all, there's a hash function, or in fact, several hash functions. We're going to have a random access array, except that instead of having n buckets, or n slots as we've been calling them, each entry in this array is just going to be a single bit. Each entry can take on only two values, zero or one. And the way to think about the space occupied by a Bloom filter is in terms of the number of bits per object that has been inserted into it. So, if you have inserted the data set capital S, then the total number of bits is n, and the number of objects that have been inserted is the cardinality of S. So, n / |S| is the number of bits this data structure is using per entry in the data set.
Now, you can tune a Bloom filter so that this ratio is any number of different quantities, but for now, I encourage you to think of this ratio as being eight; that is, for each object stored in the Bloom filter, you're using only eight bits of memory. That will help you appreciate just how amazing these data structures are, right? Because maybe our data set is something like IP addresses, which are 32 bits, so if this ratio is eight, we are definitely not actually storing the IP address. We have this 32-bit object we're inserting, and we're only using eight bits of memory; that's how we're going to remember whether it's there or not. And certainly, eight bits per object is way less than keeping a pointer to some associated memory somewhere. So, this is a really impressive, minimal use of space to keep track of what we've seen and what we haven't. Secondly, we need mappings from a given object, say a given IP address, to the relevant bits for checking whether we've seen this IP address before or not. So, in a Bloom filter, it's important to have not one hash function but several. k is going to denote the number of hash functions in the Bloom filter, and you should think of k as some small constant, you know, three, four, five, something like that. Now, obviously it's a little more complicated to use multiple hash functions as opposed to just one, but it's really not that big a deal. So, recall from our discussion of, say, universal hashing: we have identified entire families of hash functions which will work well on average. So, instead of choosing just one hash function at random from a universal family, you make k independent random choices from the family. In fact, in practice, it seems to typically be enough to just use two different hash functions and then generate k different linear combinations of those two hash functions.
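The two-hash-function trick mentioned above (often attributed to Kirsch and Mitzenmacher) can be sketched as follows. Deriving both base hashes from different halves of one SHA-256 digest is our own shortcut for illustration, not part of the lecture:

```python
import hashlib

def two_base_hashes(x, num_bits):
    """Two base hash values for x, taken from halves of one SHA-256 digest."""
    d = hashlib.sha256(str(x).encode()).digest()
    h1 = int.from_bytes(d[:8], "big") % num_bits
    h2 = int.from_bytes(d[8:16], "big") % num_bits
    return h1, h2

def k_indices(x, num_bits, k):
    """k linear combinations of the two base hashes: h1(x) + i*h2(x) mod num_bits."""
    h1, h2 = two_base_hashes(x, num_bits)
    return [(h1 + i * h2) % num_bits for i in range(k)]

print(k_indices("171.64.64.64", num_bits=1024, k=3))
```

The payoff is that you only ever evaluate two real hash functions per operation, no matter how large k is.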
But for the purposes of this video, let's just assume that we've done enough work to come up with k different good hash functions, and that's what we're going to use in our Bloom filter. Now, the code for both insert and look up is very elegant. Let's start with insertion. Suppose we have some new IP address and we want to stick it into this Bloom filter; what do we do? Well, we just evaluate each of our k hash functions on this new object. Each of those gives us an index into our array of bits, and we just set those k bits equal to one. And when we do this insert, we don't even bother to look at what the previous values of these bits were. Zero or one, we don't care; we just blithely go in and set these k bits equal to one, whatever they were before. So, what about looking up? How are we going to implement that? Well, all you have to do is check for the footprint that was inevitably left by a prior insertion. If we're looking up an IP address that we know was inserted sometime in the past, what happened back then? When we evaluated the k hash functions, we went to the appropriate positions in the array and set all of those bits to one. So now, we just check that that indeed happened. That is, when we get a new IP address and we're looking it up, we evaluate the hash functions, all k of them, we look at the corresponding k positions, and we verify that indeed those k bits have been set to one. So, what I hope is clear fairly quickly from inspecting this very elegant code is that we will never have false negatives, yet we might have false positives. Let's discuss those one at a time. Remember, a false negative would mean that the Bloom filter says something isn't there when in fact it is; that is, we insert something, we look it up later, and the Bloom filter rejects us. Well, that's not going to happen, because when we insert something, we set the relevant k bits to one, and notice that once a bit is one, it remains one forevermore.
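Putting the insert and lookup descriptions above together, a sketch of the whole data structure might look like this in Python. The class name, the bit-packing, and the salted-SHA-256 stand-in for "k good hash functions" are all our own choices:

```python
import hashlib

class BloomFilter:
    """Illustrative vanilla Bloom filter: inserts and lookups only, no deletions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # n bits, packed 8 per byte

    def _indices(self, x):
        # k hash values via salted SHA-256, standing in for k good hash functions.
        for i in range(self.k):
            d = hashlib.sha256(f"{i}|{x}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.num_bits

    def insert(self, x):
        # Set all k bits to one, regardless of their previous values.
        for j in self._indices(x):
            self.bits[j // 8] |= 1 << (j % 8)

    def lookup(self, x):
        # Report "seen" only if every one of the k bits is set.
        return all(self.bits[j // 8] & (1 << (j % 8)) for j in self._indices(x))
```

Note that there is no delete method: clearing a bit could create false negatives for other objects whose insertions share that bit, which is exactly why the vanilla version forbids deletions.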
Bits are never reset back to zero. So, if anything was ever inserted, then when we subsequently look it up, we will definitely confirm that all those bits are one; we're never going to be rejected on something we inserted before. On the other hand, it is totally possible that we will have a false positive, that there will be a phantom object: we do a lookup and the Bloom filter returns yes, even though we never inserted that object. Suppose, for example, that k = 3, so we're using three different hash functions. Consider some fixed IP address, and say the three hash functions tell us the relevant bits are 17, 23, and 36. Maybe we never inserted this IP address, but we did insert IP address number two, and in its insertion the 17th bit got set to one. We inserted some other IP address, IP address number three, and the 23rd bit got set to one. And then we inserted IP address number four, and the 36th bit got set to one. So, three different IP addresses were responsible for setting these three different bits, but whatever, it's not like we're remembering that. And now, when we look up this IP address we really care about, what do we do? We inspect bit 17: it's one. We inspect bit 23: it's one. We inspect bit 36: it's also one. For all we know, this thing really was inserted, and the Bloom filter is going to say yes, it's in the table. So, that's how we get false positives: all of the bits indicating whether or not a given object is in the Bloom filter were previously set by insertions of other objects. So, there are two points that I hope are clear at this stage of the discussion. First of all, this Bloom filter idea does suggest the possibility of a super space efficient variant of a hash table.
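The phantom-object scenario above can be replayed deterministically with hand-picked hash values. The addresses and their hash values below are made up purely to mirror the lecture's example of bits 17, 23, and 36:

```python
# Pretend hash values for k = 3 hash functions (entirely made up):
toy_hashes = {
    "addr1": (17, 23, 36),  # never inserted!
    "addr2": (17, 4, 50),
    "addr3": (8, 23, 41),
    "addr4": (36, 2, 60),
}

bits = [0] * 64
for addr in ("addr2", "addr3", "addr4"):  # insert three *other* addresses
    for j in toy_hashes[addr]:
        bits[j] = 1

# addr1 was never inserted, but bits 17, 23, and 36 were each set
# by a different one of the other insertions:
lookup = all(bits[j] for j in toy_hashes["addr1"])
print(lookup)  # True: a false positive
```

The filter has no way to know that the three bits were set by three different insertions; all it can see is that the footprint matches.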
So, we've been talking about setting the number of bits to be, say, roughly eight times the number of objects you're storing, so you're only using eight bits per object, and for most objects, that's going to be radically smaller than even a simple array storing the objects themselves. Again, if they're IP addresses, we're only using 25 percent as much space as if we actually stored those IP addresses in just an array, with no extra bells and whistles. The second point is that we're inevitably going to have errors in a Bloom filter; we will have false positives, where we look something up and it says it's there when in fact it's not. So, those two points, I hope, are clear. What's actually not clear is the bottom line: is this actually a useful idea? For this to be useful, it had better be the case that the error probability can be quite small even while the space per object is quite small. If we can't get those two things small simultaneously, this is a bad idea, and we should just always use a hash table instead. So, to evaluate the quality of this idea, we're going to have to do a little bit of mathematical analysis, and that's what I'm going to show you in the next couple of slides.