1 00:00:00,438 --> 00:00:06,276 Let's begin by building up some intuition about what we would want from a hash 2 00:00:06,276 --> 00:00:10,488 function, now that we know how hash functions are usually implemented. So 3 00:00:10,488 --> 00:00:13,789 let's start with a hash function which is implemented by chaining. So what's going 4 00:00:13,789 --> 00:00:18,370 to be the running time of our lookup, insert, and delete operations in a hash 5 00:00:18,370 --> 00:00:23,224 table with chaining? Well, the happy operation in a hash table with chaining is 6 00:00:23,224 --> 00:00:27,195 insertion. Insertion, we can just say without any qualifications, is constant 7 00:00:27,195 --> 00:00:31,778 time. This requires the sort of obvious optimization that when you do insert, you 8 00:00:31,778 --> 00:00:36,072 insert something at the front of the list in its bucket. Like, there's no reason to 9 00:00:36,072 --> 00:00:40,698 insert at the end. That would be silly. So the plot thickens when we think about the 10 00:00:40,698 --> 00:00:44,077 other two operations, deletion and the lookup. So let's just think about lookup. 11 00:00:44,077 --> 00:00:47,945 Deletion's basically the same. So how do we implement lookup? Well, remember when 12 00:00:47,945 --> 00:00:52,765 we get some key x, we invoke the hash function. We call h(x). That tells us 13 00:00:52,765 --> 00:00:57,620 what bucket to look in. So if it tells us 17, we know that, you know, x may 14 00:00:57,620 --> 00:01:00,287 or may not be in the hash table. But at this point we know that if it's in the 15 00:01:00,287 --> 00:01:03,659 hash table, it's got to be in the linked list that's in the seventeenth bucket. So 16 00:01:03,659 --> 00:01:08,043 now we descend into this bucket. We find ourselves a linked list. And now, we have 17 00:01:08,043 --> 00:01:11,681 to resort to just an exhaustive search through this list in the seventeenth 18 00:01:11,681 --> 00:01:16,308 bucket, to see whether or not X is there. So we know how long it takes to search a 19 00:01:16,308 --> 00:01:21,338 regular list for some element. It's just linear and the list length. And now we're 20 00:01:21,338 --> 00:01:26,679 starting to see why the hash function might matter. Right, so suppose we insert 21 00:01:26,679 --> 00:01:32,352 100 objects into a hash table with 100 buckets. If we have a super lucky hash 22 00:01:32,352 --> 00:01:37,330 function, then perhaps each bucket will get its own object. There'll be one object 23 00:01:37,330 --> 00:01:42,111 in each of the lists, in each of the buckets, so. Theta of the list length 24 00:01:42,111 --> 00:01:45,926 is just theta of one. We're doing great. Okay? So, a constant, constant link to lists, 25 00:01:45,926 --> 00:01:50,919 means constant time insert delete. A really stupid hash function would map 26 00:01:50,919 --> 00:01:55,828 every single object to bucket number zero. Okay, so then if you insert 100 27 00:01:55,828 --> 00:02:00,182 objects, they're all in bucket number zero. The other 99 buckets are empty. And 28 00:02:00,182 --> 00:02:04,154 so every time you do insert or delete, it's just resorting to the naive linked list 29 00:02:04,154 --> 00:02:07,116 solution. And the running time is going to be linear and the number of 30 00:02:07,116 --> 00:02:12,149 objects in the hash table. So the largest list length could vary anywhere from m/n, 31 00:02:12,149 --> 00:02:16,822 where m is the number of objects, this is when you have equal linked lists, 32 00:02:16,822 --> 00:02:22,208 to if you use this ridiculous constant hash function, m, all the objects in the 33 00:02:22,208 --> 00:02:26,631 same list. And so the main point I'm trying to make here, is that you know, 34 00:02:26,631 --> 00:02:30,066 first of all, at least with chaining, where the running time is governed by the 35 00:02:30,066 --> 00:02:34,986 list length, the running time depends crucially on the hash function. How well 36 00:02:34,986 --> 00:02:39,981 the hash function distributes the data across the different buckets. And 37 00:02:39,981 --> 00:02:43,485 something analogous is true for hash tables that use open addressing. Alright 38 00:02:43,485 --> 00:02:46,515 so here there aren't any lists. So you don't, there's no linked lists to keep 39 00:02:46,515 --> 00:02:49,228 track of. So the running time is going to be governed by the length of the probe 40 00:02:49,228 --> 00:02:52,165 sequence. So the question is how many times do you have to look at different 41 00:02:52,165 --> 00:02:55,410 buckets in the hash table before you either find the thing you're looking for, 42 00:02:55,410 --> 00:02:58,638 or if you're doing insertion, before you find an empty bucket in which to insert 43 00:02:58,638 --> 00:03:02,179 this new object. So the performance is governed by the length of the probe 44 00:03:02,179 --> 00:03:06,827 sequence. And again, the probe sequence is going to depend on the hash function. For 45 00:03:06,827 --> 00:03:10,809 really good hash functions in some sense, stuff that spreads data out evenly, you 46 00:03:10,809 --> 00:03:14,987 expect probe sequences to be not too bad. At least intuitive, And for say the 47 00:03:14,987 --> 00:03:19,087 constant function you are going to expect these probe sequences to grow linearly 48 00:03:19,087 --> 00:03:22,582 with the numbers of object you insert into the table. So again this point remains 49 00:03:22,582 --> 00:03:27,195 true, the performance of a hash table in either implementation really depends on 50 00:03:27,195 --> 00:03:32,915 what hash function you use. So, having built up this intuition, we can now say 51 00:03:32,915 --> 00:03:37,324 what it is what we want from a hash function. So first we want it to lead to 52 00:03:37,324 --> 00:03:42,831 good performance. I'm using the chaining implementations as a guiding example. We 53 00:03:42,831 --> 00:03:47,000 see that if we have a size of a hash table, n, that's comparable to the number 54 00:03:47,000 --> 00:03:51,245 of objects, m, it would be really cool if all of the lists had a length that was 55 00:03:51,245 --> 00:03:55,641 basically constant; therefore we had our constant time operations. So equal length 56 00:03:55,641 --> 00:03:59,837 lists is way better than unequal length lists in a hash table with chaining. So 57 00:03:59,837 --> 00:04:04,382 we, we want the hash function to do is to spread the data out as equally as possible 58 00:04:04,382 --> 00:04:07,848 amongst the different buckets. And something similar is true with open 59 00:04:07,848 --> 00:04:11,335 addressing; in some sense you want hash functions to spread the data uniformly 60 00:04:11,335 --> 00:04:16,438 across the possible places you might probe, as much as possible. And in 61 00:04:16,438 --> 00:04:21,564 hashing, usually the gold standard for spreading data out is the performance of a 62 00:04:21,564 --> 00:04:26,131 completely random function. So you can imagine for each object that shows up you 63 00:04:26,131 --> 00:04:30,060 flip some coins. With each of the n buckets equally likely, you put this 64 00:04:30,060 --> 00:04:33,495 object in one of the n buckets. And you flip independent coins for every different 65 00:04:33,495 --> 00:04:36,620 object. So this, you would expect, you know, because you're just throwing darts 66 00:04:36,620 --> 00:04:39,991 at the buckets independently, you'd expect this to spread the data out quite well. 67 00:04:39,991 --> 00:04:47,355 But, of course, it's not good enough just to spread data out. It's also important 68 00:04:47,355 --> 00:04:49,857 that we don't have to work too hard to remember what our hash function is and to 69 00:04:49,857 --> 00:04:54,785 evaluate it. Remember, every time we do any of these operations, an insert or a 70 00:04:54,785 --> 00:04:59,915 delete or a lookup, we're going to be applying our hash function to some key x. 71 00:04:59,915 --> 00:05:04,252 So every operation includes a hash function evaluation. So if we want our 72 00:05:04,252 --> 00:05:08,372 operations to run a constant time, evaluating the hash function also better 73 00:05:08,372 --> 00:05:13,394 run in constant time. And this second property is why we can't actually 74 00:05:13,394 --> 00:05:17,750 implement completely random hashing. So there's no way we can actually adjust when 75 00:05:17,750 --> 00:05:21,678 say we wanted to insert Alice's phone number, flip a new batch of random coins. 76 00:05:21,678 --> 00:05:25,010 Suppose we did. Suppose we flipped some random coins and it tells us to put 77 00:05:25,010 --> 00:05:29,148 Alice's phone number into the 39th bucket, while. Later on, we might do a 78 00:05:29,148 --> 00:05:32,430 lookup for Alice's phone number, and we better remember the fact that we're 79 00:05:32,430 --> 00:05:35,688 supposed to look in the 39th bucket for Alice's phone number. But what does that 80 00:05:35,688 --> 00:05:38,998 mean? That means we have to explicitly remember what choice we made. We have to 81 00:05:38,998 --> 00:05:44,107 write down. You get a list of effects that Alice is in bucket number 39. In every 82 00:05:44,107 --> 00:05:47,921 single insertion, if they're all from in the point of coin flips, you have to 83 00:05:47,921 --> 00:05:50,672 remember all of the different random choices independently. And this really 84 00:05:50,672 --> 00:05:55,172 just devolves back to the naive list base solution that we discussed before. So, 85 00:05:55,172 --> 00:05:59,050 evaluating the hash function is gonna take us linear time and that defeats the 86 00:05:59,050 --> 00:06:03,422 purpose of a hash table. So we again we want the best of both worlds. We want a 87 00:06:03,422 --> 00:06:08,042 hash function which we can store in ideally constant space, evaluate in 88 00:06:08,042 --> 00:06:13,525 constant time, but it should spread the data out just as well as if we had this 89 00:06:13,525 --> 00:06:18,115 gold standard of completely random hashing. So I want to touch briefly on the 90 00:06:18,115 --> 00:06:21,691 topic of how you might design hash functions. And in particular good hash 91 00:06:21,691 --> 00:06:26,600 functions that have the two properties we identified on the previous slide. But I 92 00:06:26,600 --> 00:06:29,694 have to warn you, if you ask ten different, you know, serious hardcore 93 00:06:29,694 --> 00:06:32,443 programmers, you know, about their approach to designing hash functions, 94 00:06:32,443 --> 00:06:36,252 you're likely to get ten somewhat different answers. So the design of hash 95 00:06:36,252 --> 00:06:42,199 functions is a tricky topic, and, it's as much art as science at this point. Despite 96 00:06:42,199 --> 00:06:44,888 the fact that there's a ton of science, there's actually a very beautiful theory, 97 00:06:44,888 --> 00:06:48,521 about what makes good hash functions. We'll touch on a little bit of that in a, in a 98 00:06:48,521 --> 00:06:54,709 different optional video. And if you only remember one thing of, you know, from this 99 00:06:54,709 --> 00:06:58,872 video or from these next couple slides, the thing to remember is the following. 100 00:06:58,872 --> 00:07:05,712 Remember that it's really easy to inadvertently design bad hash functions 101 00:07:05,712 --> 00:07:11,086 and bad hash functions lead to poor hash table performance. Much poorer than you 102 00:07:11,086 --> 00:07:14,582 would expect given the other discussion we've had in this video. So if you have to 103 00:07:14,582 --> 00:07:18,734 design your own hash function, do your homework, get some examples, learn what 104 00:07:18,734 --> 00:07:22,940 other experts are doing and use your best judgment. If you do just something without 105 00:07:22,940 --> 00:07:26,163 thinking about it, it's quite possible to lead to quite poor performance, much 106 00:07:26,163 --> 00:07:30,483 poorer than you were expecting. So to drive home this point, suppose that you're 107 00:07:30,483 --> 00:07:35,268 thinking about keys being phone numbers. So let's say, you know, I'm gonna just be 108 00:07:35,268 --> 00:07:39,766 very kinda United States centric here. I'm just gonna focus on the, the ten digit 109 00:07:39,766 --> 00:07:43,799 phone numbers inside the US. So the universe size here is ten to the ten, 110 00:07:43,799 --> 00:07:47,467 which is quite big. That's probably not something you really want to implement 111 00:07:47,467 --> 00:07:50,697 explicitly and let's consider an application where, you know, you're only 112 00:07:50,697 --> 00:07:54,735 say, keeping track of at most, you know, 100 or 1,000 phone numbers or something 113 00:07:54,735 --> 00:07:58,412 like that. So we need to choose a number of buckets, let's say we choose 1,000 114 00:07:58,412 --> 00:08:02,616 buckets. Let's say we're expecting no more than 500 phone numbers or so. So we double 115 00:08:02,616 --> 00:08:07,219 that, we get a number of buckets to be equal 1,000. And now I got to come up with 116 00:08:07,219 --> 00:08:10,528 a hash function. And remember, you know, a hash functions by definition. All it does 117 00:08:10,528 --> 00:08:13,983 is map anything in the universe to a bucket number. So that means it has to 118 00:08:13,983 --> 00:08:18,763 take as input a ten digit phone number and spit as output some number between zero 119 00:08:18,763 --> 00:08:22,795 and 999. And, beyond that we have flexibility of how to define this mapping. 120 00:08:22,795 --> 00:08:26,448 Now, when you are dealing with things that have all these digits it's very tempting 121 00:08:26,448 --> 00:08:31,822 to just project on to a subset of the digits. And, if you want a really terrible 122 00:08:31,822 --> 00:08:37,261 hash function, just use the most significant digits of a phone number to 123 00:08:37,261 --> 00:08:41,839 define a mapping from phone numbers to buckets. Alright, so I hope it's clear why 124 00:08:41,839 --> 00:08:46,343 this is a terrible choice of a hash function. Alright, so maybe you're a 125 00:08:46,343 --> 00:08:51,177 company based in the San Francisco Bay area. The area code for San Francisco is 126 00:08:51,177 --> 00:08:57,254 415. So if you're storing phone numbers from customers in your area. You know 127 00:08:57,254 --> 00:09:02,717 maybe twenty, 30, 40 percent of them are gonna have area codes 415. All of those 128 00:09:02,717 --> 00:09:08,040 are going to hash to exactly the same bucket, bucket number 415 in this hash 129 00:09:08,040 --> 00:09:12,662 table. So you're gonna get an overwhelming percentage of the data mapping just to 130 00:09:12,662 --> 00:09:17,644 this one bucket. Meanwhile you know not all 1000 possibilities of, of these three 131 00:09:17,644 --> 00:09:22,163 digits are even legitimate area codes. Not all three digit numbers are area codes in 132 00:09:22,163 --> 00:09:25,805 the United States. So there'll be buckets of your hash table which are totally 133 00:09:25,805 --> 00:09:31,937 guaranteed to be empty. So you waste a ton of space in your hash table, you have a 134 00:09:31,937 --> 00:09:35,550 huge list in the bucket corresponding to 415, you have a huge list in the 135 00:09:35,550 --> 00:09:39,704 bucket corresponding to 650, the area code at Stanford. You're gonna have a very slow 136 00:09:39,704 --> 00:09:43,668 look up time for everything that hashes to those two buckets and there's gonna be a 137 00:09:43,668 --> 00:09:49,156 lot of stuff that hashes to those two buckets, So terrible idea. So a better but 138 00:09:49,156 --> 00:09:54,066 still mediocre hash function would be to do the same trick but using the last three 139 00:09:54,066 --> 00:09:58,450 digits instead of the first three digits. This is better than our terrible hash 140 00:09:58,450 --> 00:10:03,056 function because there aren't ridiculous clumps of phone numbers that have exactly 141 00:10:03,056 --> 00:10:07,587 the same last three digits. But still, this is sort of assuming you're using this 142 00:10:07,587 --> 00:10:11,640 hash function as tantamount to thinking that the last three digits of phone 143 00:10:11,640 --> 00:10:16,190 numbers are uniformly distributed among all of the 1,000 possibilities. And really 144 00:10:16,190 --> 00:10:19,128 there's no evidence if that's true. Okay? And so there's going to be patterns and 145 00:10:19,128 --> 00:10:23,506 phone numbers that are maybe a little subtle to see with the naked eye, but 146 00:10:23,506 --> 00:10:26,667 which will be exposed if you try to use a mediocre hash function like this. So let's 147 00:10:26,667 --> 00:10:30,727 look at another example. Perhaps you are keeping track of objects just based by 148 00:10:30,727 --> 00:10:35,175 where they are laid out in memory. So in other words the key for an object is just 149 00:10:35,175 --> 00:10:40,194 gonna be its memory location and if these things are in bytes, then you are 150 00:10:40,194 --> 00:10:43,289 guaranteed that every memory location will be a multiple of four. So for a second 151 00:10:43,289 --> 00:10:49,492 example let's think about a universe where the possible keys are the possible memory 152 00:10:49,492 --> 00:10:53,754 locations, So here you're just associating objects with where they're laid in memory, 153 00:10:53,754 --> 00:10:57,984 and a hash function is responsible for taking in as input some memory location of 154 00:10:57,984 --> 00:11:02,730 some object and spitting out some bucket number. Now generally, because of, you 155 00:11:02,730 --> 00:11:06,957 know the structure of bytes and so on, our memory locations are going to be multiples 156 00:11:06,957 --> 00:11:13,482 of some power of two. In particular, memory locations are going to be even, And 157 00:11:13,482 --> 00:11:17,715 so a bad choice of a hash function. Would be to take, remember, the hash function 158 00:11:17,715 --> 00:11:20,743 takes the input of the memory location, which is, you know, some possibly really 159 00:11:20,743 --> 00:11:24,373 big number, and we wanna compress it, we want to output a bucket number. Now let's 160 00:11:24,373 --> 00:11:29,050 think of a hash table where we choose N equals 10^3, or 1000 buckets. 161 00:11:29,050 --> 00:11:32,481 So then the question is, you know, how is this hash function going to take this big 162 00:11:32,481 --> 00:11:37,121 number, which is the memory location, and squeeze it down to a small number. Which 163 00:11:37,121 --> 00:11:40,656 is one of the buckets and so let's just use the same idea as in the mediocre hash 164 00:11:40,656 --> 00:11:44,333 function, which is we're gonna look at the least significant bits so we can express 165 00:11:44,333 --> 00:11:49,519 that using the mod operator. So let's just think about we pick the hash function h(x) 166 00:11:49,519 --> 00:11:57,652 where h is the memory location to be x mod 1000 There again, you know, the 167 00:11:57,652 --> 00:12:01,040 meaning of 1,000 is that's the number of buckets we've chosen to put in our hash 168 00:12:01,040 --> 00:12:04,470 table because, you know, we're gonna remember roughly at most 500 different 169 00:12:04,470 --> 00:12:08,197 objects. So don't forget what the mod operation means, this means you just, 170 00:12:08,197 --> 00:12:13,142 essentially subtract multiples of 1,000 until you get down to a number less than 171 00:12:13,142 --> 00:12:18,636 1,000. So in this case, it means if you write out x base ten, then you just take 172 00:12:18,636 --> 00:12:22,891 the last three digits. So in that sense, this is the same hash function as our 173 00:12:22,891 --> 00:12:27,075 mediocre hash function when we were talking about phone numbers. So we 174 00:12:27,075 --> 00:12:29,940 discussed how the keys here are all going to be memory locations; in particular 175 00:12:29,940 --> 00:12:34,344 they'll be even numbers. And here we're taking their modulus with respect to an 176 00:12:34,344 --> 00:12:39,658 even number. And what does that mean? That means every single output of this hash 177 00:12:39,658 --> 00:12:43,010 function will itself be an even number. Right, you take an even number, you 178 00:12:43,010 --> 00:12:46,075 subtract a bunch of multiples of a 1000, you're still going to have an even number. 179 00:12:46,075 --> 00:12:53,490 So this hash function is incapable of outputting an odd number. So what does 180 00:12:53,490 --> 00:12:58,499 that mean? That means at least half of the locations in the hash table will be 181 00:12:58,499 --> 00:13:02,603 completely empty, guaranteed, no matter what the keys you're hashing is. And 182 00:13:02,603 --> 00:13:05,851 that's ridiculous. It's ridiculous to have this hash table 50 percent of which is 183 00:13:05,851 --> 00:13:10,210 guaranteed to be empty. And again, what I want you to remember, hopefully long after 184 00:13:10,210 --> 00:13:14,547 this class is completed is not so much these specific examples, but more the 185 00:13:14,547 --> 00:13:18,757 general point that I'm making. Which is, it's really easy to design bad hash 186 00:13:18,757 --> 00:13:22,839 functions. And bad hash functions lead to hash table performance much poorer than 187 00:13:22,839 --> 00:13:27,719 what you're probably counting on. Now that we're equipped with examples of bad hash 188 00:13:27,719 --> 00:13:31,478 functions. It's natural to ask about, you know, what are some good hash functions? 189 00:13:31,478 --> 00:13:35,464 Well it's actually quite tricky to answer that question. What are the good hash 190 00:13:35,464 --> 00:13:38,848 functions, and I'm not really going to answer that on this slide. I don't promise 191 00:13:38,848 --> 00:13:42,367 about hash functions that I'm going to tell you about right now, are good in a 192 00:13:42,367 --> 00:13:45,657 very strong sense of the word. I will say these are not obliviously bad hash 193 00:13:45,657 --> 00:13:50,076 functions, they're let's say, somewhat better hash functions. And in particular 194 00:13:50,076 --> 00:13:53,874 if you just need a hash function, and you need a quick and dirty one, you don't want 195 00:13:53,874 --> 00:13:57,078 to spend too much time on it. The method that I'm going to talk about on this slide 196 00:13:57,078 --> 00:14:01,333 is a common way of doing it. On the other hand, if you're designing a hash function 197 00:14:01,333 --> 00:14:05,570 for some really mission-critical code, you should learn more than what I'm gonna tell 198 00:14:05,570 --> 00:14:08,370 you about on this slide. So you, you should do more research about what are the 199 00:14:08,370 --> 00:14:11,930 best hash functions, what's the state of the art, if you have a super important 200 00:14:11,930 --> 00:14:15,716 hash function. But if you just need a quick one what's, what we say on this 201 00:14:15,716 --> 00:14:20,737 slide will do in many, in most situations. So the design of a hash function can be 202 00:14:20,737 --> 00:14:25,840 thought of as two separate parts. So remember by definition a hash function 203 00:14:25,840 --> 00:14:29,724 takes as input something from the universe. An IP address, a name, whatever 204 00:14:29,724 --> 00:14:34,610 and spits out a bucket number. But, it can be useful to factor that into two separate 205 00:14:34,610 --> 00:14:39,645 parts. So first you take an object which is not intrinsically numeric. So, 206 00:14:39,645 --> 00:14:44,055 something like s string or something more abstract. And you somehow turn an object 207 00:14:44,055 --> 00:14:51,860 into a number, possibly a very big number. And then you take a possibly big number and you map it to a much smaller number, 208 00:14:51,860 --> 00:14:56,124 namely the index of a bucket. So in some cases I've seen these two steps given the 209 00:14:56,124 --> 00:15:01,673 names like the first step is formulating the hash code For an object, and then the 210 00:15:01,673 --> 00:15:06,826 second step is applying a compression function. In some cases, you can skip the 211 00:15:06,826 --> 00:15:09,923 first step. So, for example, if your keys are social security numbers, they're 212 00:15:09,923 --> 00:15:14,152 already integers. If they're phone number, they're already integers. Of course, there 213 00:15:14,152 --> 00:15:17,763 are applications where the objects are not numeric. You know, for example, maybe 214 00:15:17,763 --> 00:15:20,973 they're strings, maybe you're remembering names. And so then, the production of this 215 00:15:20,973 --> 00:15:24,285 hash code basically boils down to writing a subroutine that takes, as input, a 216 00:15:24,285 --> 00:15:29,386 string, and outputs some possibly very big number. There are standard methods for 217 00:15:29,386 --> 00:15:33,210 doing that, it's easy to find resources to, to give you example code for 218 00:15:33,210 --> 00:15:37,326 converting strings to integers you know, I'll just say one or two sentences about 219 00:15:37,326 --> 00:15:41,307 it. So you know each character in a string it is easy to regard as a number in 220 00:15:41,307 --> 00:15:45,551 various ways. Either you know just say it is ASCII, well ASCII code then you just 221 00:15:45,551 --> 00:15:48,494 have to aggregate all of the different numbers, one number per character into 222 00:15:48,494 --> 00:15:52,078 some overall number and so one thing you can do is you can iterate over the 223 00:15:52,078 --> 00:15:56,334 characters one at a time. You can keep a running sum. And with each character, you 224 00:15:56,334 --> 00:16:00,447 can multiply the running sum by some constant, and then add the new letter to 225 00:16:00,447 --> 00:16:04,707 it, and then, if you need to, take a module list to prevent overflow. And the 226 00:16:04,707 --> 00:16:08,144 point of me giving you this one to two sentence of the subroutine is just to give 227 00:16:08,144 --> 00:16:11,563 you a flavor of what they're like, and to make sure th at you're just not scared of 228 00:16:11,563 --> 00:16:14,906 it at all. Okay? So it's very simple programs you can write for doing things 229 00:16:14,906 --> 00:16:18,510 like converting from strings to integers. But again, you know, I do encourage you to 230 00:16:18,510 --> 00:16:21,665 look it up on the Web or in a programming textbook, to actually look at those 231 00:16:21,665 --> 00:16:24,914 examples. Okay? But there are standard methods for doing it. So that leaves the 232 00:16:24,914 --> 00:16:26,691 quest, question of how to design this compression function. So you take as input 233 00:16:26,691 --> 00:16:29,709 this huge integer. Maybe your keys are already numeric, like Social Security 234 00:16:29,709 --> 00:16:33,376 numbers or IP addresses. Or maybe you've already some subroutine to convert a 235 00:16:33,376 --> 00:16:37,004 string, like your friend's name, into. Some big number, but the point is you have 236 00:16:37,004 --> 00:16:40,246 a number in the millions or the billions, and you need to somehow take that and 237 00:16:40,246 --> 00:16:43,517 output one of these buckets. And again think of there being maybe a thousand or 238 00:16:43,517 --> 00:16:47,772 so buckets. So the easiest way to do that is something we already saw in a previous 239 00:16:47,772 --> 00:16:54,106 slide, which is just to take the modulus, With respect to the number of buckets. Now 240 00:16:54,106 --> 00:16:57,121 certainly one positive thing you can say about this compression function is its 241 00:16:57,121 --> 00:17:01,050 super simple, Both in the sense that it's simple to code and in the sense that it's 242 00:17:01,050 --> 00:17:04,117 simple to evaluate. Now remember that was our second goal for a hash function. It 243 00:17:04,117 --> 00:17:08,545 should be simple to store, it is actually nothing to store. And it should be quick 244 00:17:08,545 --> 00:17:13,151 to evaluate. And this certainly fits the bill. Now the problem is, remember the 245 00:17:13,151 --> 00:17:16,995 first. Property of a hash function that we wanted is that it should spread the data 246 00:17:16,995 --> 00:17:19,833 out equally. And what we saw in the previous slide is that at least if you 247 00:17:19,833 --> 00:17:24,598 choose the number of buckets N poorly, then you can fail to have the first 248 00:17:24,598 --> 00:17:28,680 property. And in that respect you can fail to be a good hash function. So if for 249 00:17:28,680 --> 00:17:32,331 example if N is even and all of your objects are even, then it's a disaster, 250 00:17:32,331 --> 00:17:37,765 all of the odd buckets go completely empty. And honestly, you know, this is a 251 00:17:37,765 --> 00:17:40,821 pretty simplistic method. Like I said, this is a quick and dirty hash function. 252 00:17:40,821 --> 00:17:44,999 So, no matter how you choose the number of buckets N, it's not gonna be a perfect 253 00:17:44,999 --> 00:17:48,744 hash function in any sense of the word. That said, there are some rules of thumb 254 00:17:48,744 --> 00:17:52,991 for how to pick the number of buckets, how to pick this modulus, so that you don't 255 00:17:52,991 --> 00:17:56,583 get the problems that we saw on the previous slide. So, I'll conclude this 256 00:17:56,583 --> 00:17:59,928 video just with some standard rules of thumb. You know, if you just need a quick 257 00:17:59,928 --> 00:18:03,434 and dirty hash function, you're gonna use the, the modulus compression function, how 258 00:18:03,434 --> 00:18:08,767 do you choose N? Well, the first thing is we definitely don't want to have the 259 00:18:08,767 --> 00:18:11,395 problem we had in the last slide, where we're guaranteed to have these empty 260 00:18:11,395 --> 00:18:14,918 buckets no matter what the data is. So what went wrong in the previous slide? 261 00:18:14,918 --> 00:18:21,992 Well. The problem is that all of the data elements were divisible by two. And the 262 00:18:21,992 --> 00:18:26,838 hash function modulus, the number of buckets, was also divisible by two. So 263 00:18:26,838 --> 00:18:32,338 because they shared a common factor, namely two, that guaranteed that all of 264 00:18:32,338 --> 00:18:36,915 the odd buckets remained empty. And this is a problem, more generally, if the data 265 00:18:36,915 --> 00:18:39,858 shares any common factors with N, the number of buckets. So, in other words, if 266 00:18:39,858 --> 00:18:43,058 all of your data elements are multiples of three, and the number of buckets is also a 267 00:18:43,058 --> 00:18:47,224 multiple of three, you got a big problem. Then everything's gonna hash into bucket 268 00:18:47,224 --> 00:18:50,656 numbers which are multiples of three, too, that's if your hash table will go 269 00:18:50,656 --> 00:18:56,337 unfilled. So the upshot is, you really want the number of buckets to not share 270 00:18:56,337 --> 00:19:02,287 any factors With the data that you're hashing. So, how do you reduce the number 271 00:19:02,287 --> 00:19:05,287 of common factors? Well, you just make sure the number of buckets has very few 272 00:19:05,287 --> 00:19:10,692 factors, which means you should choose N to be a prime number, 'kay? A number that 273 00:19:10,692 --> 00:19:15,794 has no nontrivial factors, And let me remind you, the number of buckets should 274 00:19:15,794 --> 00:19:19,247 also be comparable to the size of the set that you're planning on storing. Again, at 275 00:19:19,247 --> 00:19:23,456 no point did we need "N" to be, you know, very closely connected to the number of 276 00:19:23,456 --> 00:19:26,782 elements that you're storing just within, say some small constant factor. And you 277 00:19:26,782 --> 00:19:30,337 can always find a prime within a small constant factor of a target number of 278 00:19:30,337 --> 00:19:34,974 elements to store. If the number of buckets in your hash table isn't too big, 279 00:19:35,024 --> 00:19:39,007 if it's just in say the thousands or maybe the tens of thousands, then, you know, you 280 00:19:39,007 --> 00:19:42,273 can just look up a list of all known primes up to some point, and you can just 281 00:19:42,273 --> 00:19:45,465 sort of pick out a prime which is about the magnitude that you're looking for. If 282 00:19:45,465 --> 00:19:49,480 you're gonna use a really huge number of buckets in the millions or more, then 283 00:19:49,480 --> 00:19:53,537 there are algorithms you can use for primarily testing which will help you find 284 00:19:53,537 --> 00:19:58,524 a prime in about the range that you're looking for. >> So that's the first order 285 00:19:58,524 --> 00:20:02,354 rule of thumb you should remember if you're using the modulus compression 286 00:20:02,354 --> 00:20:05,762 function, which is set the number of buckets equal to a prime. So you're 287 00:20:05,762 --> 00:20:10,486 guaranteed to not have non-trivial common factors of the modulus shared by all of 288 00:20:10,486 --> 00:20:14,269 your data. So there's also a couple of second order optimizations, which people 289 00:20:14,269 --> 00:20:19,514 often mention. And you also don't want the prime; you want the prime to be not too 290 00:20:19,514 --> 00:20:23,163 close to patterns in your data. So what does that mean, Patterns in your data? 291 00:20:23,163 --> 00:20:27,760 Well, in the phone number example we saw that patterns emerged in the data when we 292 00:20:27,760 --> 00:20:32,305 expressed it base ten. So for example, there is crazy amounts of Lumping in the 293 00:20:32,305 --> 00:20:35,557 first three digits when we expressed a phone number-based ten, Because that 294 00:20:35,557 --> 00:20:40,109 corresponded with the area code. And then, with. Memory locations when we express, 295 00:20:40,109 --> 00:20:45,062 express it base two, there are crazy correlations in the low orbits. And these 296 00:20:45,062 --> 00:20:48,762 are the two most common examples. Either there's some digit, to the base ten 297 00:20:48,762 --> 00:20:52,955 representation or digits in the base two representation where you have, you know, 298 00:20:52,955 --> 00:20:58,133 patterns that is non-uniformity. So that. Suggests that the prime, that, N that you 299 00:20:58,133 --> 00:21:02,364 choose, you know, all else being equal, shouldn't be too close to a power of two, 300 00:21:02,364 --> 00:21:06,187 and shouldn't be too close to a power of ten. The thinking being that, that will 301 00:21:06,187 --> 00:21:11,566 spread more evenly data sets that do have these patterns in terms of base two 302 00:21:11,566 --> 00:21:15,667 representation, or base ten representations. So in closing, this is a 303 00:21:15,667 --> 00:21:20,029 recipe I recommend for coding of hash functions if what you're looking to do is 304 00:21:20,029 --> 00:21:24,917 sort of minimize program ming, programmer time, subject to not coming up with a hash 305 00:21:24,917 --> 00:21:29,333 function, which is completely broken. But I want to reiterate, this is not the state 306 00:21:29,333 --> 00:21:34,080 of the art in hash function design. There are hash functions which are in some sense 307 00:21:34,080 --> 00:21:37,647 better than the ones that expand on this slide. If you're responsible for some 308 00:21:37,647 --> 00:21:41,573 really mission critical code that involves a hash function, you should really study 309 00:21:41,573 --> 00:21:44,582 more deeply than we've been able to do here. We'll touch on some issues in, of 310 00:21:44,582 --> 00:21:47,984 the different optional video, but really you should do additional homework. You 311 00:21:47,984 --> 00:21:50,903 should find out about the state-of-the-art about hash function design. You should 312 00:21:50,903 --> 00:21:54,367 also look into implementations of open addressing in those probing strategies. 313 00:21:54,367 --> 00:21:58,646 And above all you really should consider cold, coding up multiple prototypes and 314 00:21:58,646 --> 00:22:03,206 seeing which one works the best. There's no silver bullet, there's no panacea in 315 00:22:03,206 --> 00:22:07,063 the design of hash tables. I've given you some high-level guidance about the 316 00:22:07,063 --> 00:22:10,404 different approaches. But ultimately it's gonna be up to you to find the optimal 317 00:22:10,404 --> 00:22:13,134 implementation for your own application.