1 00:00:00,000 --> 00:00:04,182 So in this video we'll take a peek under the hood of hash functions. And I'll 2 00:00:04,182 --> 00:00:09,451 discuss some of the high level principles by which their implemented. So let's 3 00:00:09,451 --> 00:00:13,338 briefly review the raison d'etre of a hash table. So the purpose in life for a hash 4 00:00:13,338 --> 00:00:17,038 table is to support super-fast lookups. So maybe you're keeping track of the 5 00:00:17,038 --> 00:00:20,925 transactions that happened on your website yesterday. Maybe you're keeping track of 6 00:00:20,925 --> 00:00:24,671 your employees; maybe you're keeping track of IP addresses in an Internet router. 7 00:00:24,671 --> 00:00:28,184 Maybe you're keeping track of chess configurations in a, in a chess-playing 8 00:00:28,184 --> 00:00:31,837 program, Whatever, The point is, you want to be able to insert stuff into a hash 9 00:00:31,837 --> 00:00:35,302 table, and later remember whether something's there or whether something's 10 00:00:35,302 --> 00:00:38,815 not there. So the implementations we'll discuss will generally also support 11 00:00:38,815 --> 00:00:42,608 deletions. But that's pretty much it. It's a very restricted set of operations. But 12 00:00:42,608 --> 00:00:46,627 the hash, It was going to execute them at very, very well. So, basically in constant 13 00:00:46,627 --> 00:00:50,574 time, again subject to some fine print, which we'll discuss a little bit in this 14 00:00:50,574 --> 00:00:54,671 video but, then more deeply in a separate optional video. So, the two caveats are 15 00:00:54,671 --> 00:00:59,018 first of all the hash table had better be properly implemented. It's actually pretty 16 00:00:59,018 --> 00:01:03,015 easy to screw up a hash table to screw up hash functions. We'll talk a bit about 17 00:01:03,015 --> 00:01:06,462 that in a few minutes and, then, also, the data should, in some sense, be 18 00:01:06,462 --> 00:01:10,310 non-pathological, and that, we will discuss more deeply in a separate video. 19 00:01:10,310 --> 00:01:14,163 Alright, so let me give you an initial glimpse of some of the magic that's 20 00:01:14,163 --> 00:01:18,017 happening under the hood in hash functions. So, at first let me say exactly 21 00:01:18,017 --> 00:01:22,079 what the setup is. The first step is to identify all the things that you might 22 00:01:22,079 --> 00:01:26,141 want to be storing. So, in other words, the universe of your application, So this 23 00:01:26,141 --> 00:01:30,463 would be something like, all possible I-P addresses, of which there's 2^32 . 24 00:01:30,463 --> 00:01:34,981 All possible names you might encounter, perhaps with a maximum of, say, 25 00:01:34,981 --> 00:01:40,190 30 characters. All possible configurations of a chessboard, and so on, And one thing 26 00:01:40,190 --> 00:01:44,935 I hope you can appreciate from these examples is that, in many cases, this 27 00:01:44,935 --> 00:01:49,260 universe is really big. Sothe number of ] IP address is, quote unquote, only two 2^32. 28 00:01:49,260 --> 00:01:52,811 The number of all names, you're probably talking more like 26 raised to the 30. 29 00:01:52,811 --> 00:01:56,697 All chessboard configurations I don't even wanna think about. And what you 30 00:01:56,697 --> 00:02:00,535 wanna accomplish is, you wanna maintain an evolving subset of this universe U. So 31 00:02:00,535 --> 00:02:04,470 maybe you wanna keep track of all the IP addresses you've seen on your website in 32 00:02:04,470 --> 00:02:08,452 the last 24 hours. You wanna keep track of the phone numbers of all of your friends. 33 00:02:08,452 --> 00:02:12,386 You wanna keep track of the chessboard configurations that you've explored in the 34 00:02:12,386 --> 00:02:16,665 past three seconds, whatever. And again I hope what is clear from the applications 35 00:02:16,665 --> 00:02:20,929 we've been discussing, is that the set S is usually of a reasonable size. It's, 36 00:02:20,929 --> 00:02:25,194 it's something you could store in main memory. You know, it maybe it's tens of 37 00:02:25,194 --> 00:02:29,735 thousands of IP addresses. Maybe it's, you know, a few hundred names of your various 38 00:02:29,735 --> 00:02:33,446 friends. You know, maybe it's in the, you know, millions of chessboard 39 00:02:33,446 --> 00:02:37,760 configurations, but still way, way, way smaller than the size of the universe. So 40 00:02:37,760 --> 00:02:41,978 without data structures, you'd have to resort to other unsatisfactory solutions 41 00:02:41,978 --> 00:02:46,035 to maintaining this set. So the first thing you could try, as we discussed in 42 00:02:46,035 --> 00:02:50,040 the previous video, would be just have an array with one position for every 43 00:02:50,040 --> 00:02:54,068 imaginable thing you might want to store in your set. So this is the solution 44 00:02:54,068 --> 00:02:58,015 that's going to work well if all of your friends happen to have names that are 45 00:02:58,015 --> 00:03:01,611 integers between 1 and 10,000, but doesn't scale when the universe size 46 00:03:01,761 --> 00:03:05,658 becomes really big, as in most of these applications. So, the good news is, is of 47 00:03:05,658 --> 00:03:09,754 course, is an array of and it's of course fast random access so you can access any 48 00:03:09,754 --> 00:03:13,800 position in constant time. So, if you have an array base solution index by all the 49 00:03:13,800 --> 00:03:17,547 elements of the universe, you can do constant time, insert, delete and look up. 50 00:03:17,547 --> 00:03:20,893 The bad news is, is the space requirement is proportional to the 51 00:03:20,893 --> 00:03:24,590 universe. And again, forget about being unsatisfactory. That's just literary 52 00:03:24,590 --> 00:03:28,695 impossible. Infeasible in many applications in which you'd use hash 53 00:03:28,695 --> 00:03:32,539 tables. Now of course to get the memory proportional to the size of the set stuff 54 00:03:32,539 --> 00:03:35,982 that you're storing, an easy solution would just be to use a list. You know, say 55 00:03:35,982 --> 00:03:39,557 a doubly-linked list. Something like that. Now with a list-based solution the good 56 00:03:39,557 --> 00:03:42,956 news is, is your memory is certainly proportional to the size of the set that 57 00:03:42,956 --> 00:03:46,311 you're storing, and independent of the size of the universe from which these 58 00:03:46,311 --> 00:03:49,887 elements are drawn. The bad news is that to figure out whether something is, or is 59 00:03:49,887 --> 00:03:53,330 not, in a list you generally have to traverse through most of that list. And 60 00:03:53,330 --> 00:03:57,063 that's gonna take up time proportional to the length of the list. So, really the 61 00:03:57,063 --> 00:04:02,096 question we're faced in implementing cache table is, can we get the best Of both 62 00:04:02,096 --> 00:04:06,211 worlds, of these two naive solutions. And the one hand, we want to have the constant 63 00:04:06,211 --> 00:04:10,429 time operations enjoyed by the array based solution. But on the other hand, we wanna 64 00:04:10,429 --> 00:04:14,951 have the, linear space in the size of the set that we're storing; that we get in the 65 00:04:14,951 --> 00:04:19,168 list based solution. So to get the best of both worlds, we are going to use an array 66 00:04:19,168 --> 00:04:23,283 based solution. But the array will not be big. It'll not be with size proportional 67 00:04:23,283 --> 00:04:27,297 to the universe. The array will only have size, you know, roughly the same as the 68 00:04:27,297 --> 00:04:31,367 set that we're storing, So somewhere in the ball park of the cardinality of S. So 69 00:04:31,367 --> 00:04:35,737 the first thing we do is we decide on how big we want our array to be. So that, that 70 00:04:35,737 --> 00:04:40,002 length is gonna be called n. We're gonna have an array of length n. And n is gonna 71 00:04:40,002 --> 00:04:44,319 be in the ballpark of the size of S. It's gonna depend on a few things. Exactly how 72 00:04:44,319 --> 00:04:48,766 n compares to S, but for now think of n as like double the size of S. We're gonna be 73 00:04:48,766 --> 00:04:53,557 calling each entry of the array a bucket, so there's n buckets, and then, the size 74 00:04:53,557 --> 00:04:57,633 of S is about 50 percent of the number of buckets, let's say. So one objection you 75 00:04:57,633 --> 00:05:01,720 might legitimately raise at this point is, you know I thought, I said the set was 76 00:05:01,720 --> 00:05:05,604 dynamic. The set S. Right? Stuff can be added, stuff can be deleted. So the size 77 00:05:05,604 --> 00:05:09,691 isn't always the same. It can fluctuate over time. So what does it mean to define 78 00:05:09,691 --> 00:05:13,472 an array which is the, roughly the same length as this changing set. So for 79 00:05:13,472 --> 00:05:17,714 simplicity, for the purposes of this video to focus on the key points I am going to 80 00:05:17,714 --> 00:05:21,955 assume that the set size S. While S itself can be changing, I'm going to assume that 81 00:05:21,955 --> 00:05:26,033 the size of S doesn't fluctuate too much. So there are additional bells and whistles 82 00:05:26,033 --> 00:05:29,032 you can add to a hash table implementation, and they're all quite 83 00:05:29,032 --> 00:05:32,446 natural. I think most of you could probably figure them out on your own, to 84 00:05:32,446 --> 00:05:35,813 deal with the fact that S might be changing sizes. So for example, you can 85 00:05:35,813 --> 00:05:39,596 just keep track of how many elements are in your hash table. And when it exceeds a 86 00:05:39,596 --> 00:05:43,287 big, a certain threshold, so when it's too big relative to the size of your array, 87 00:05:43,287 --> 00:05:47,208 you just double the array. And then you reinsert all of the elements into this new 88 00:05:47,208 --> 00:05:50,945 doubled array. Similarly, if you want to, if the set shrinks, you can have tricks 89 00:05:50,945 --> 00:05:54,682 for shrinking the array dynamically as well. So I'm not gonna discuss these bells 90 00:05:54,682 --> 00:05:58,556 and whistles for resizing your hash table dynamically. They are, of course Important 91 00:05:58,556 --> 00:06:02,458 for a real implementation, and they are part of the implementations in the 92 00:06:02,458 --> 00:06:06,358 standard programming libraries. But I view those as sort of a, a second order point 93 00:06:06,358 --> 00:06:10,102 in the implementation of a hash table. And I wanna focus on the first order points, 94 00:06:10,239 --> 00:06:13,571 in this video. So, summarizing, think of the set S. There are insertions and 95 00:06:13,571 --> 00:06:17,315 deletions we have to accommodate. But, you know, S is gonna be roughly the same size. 96 00:06:17,315 --> 00:06:20,967 And the number of buckets will be, you know, within a constant factor of the size 97 00:06:20,967 --> 00:06:25,978 of the set. All right so there we have our array with totally reasonable space, space 98 00:06:25,978 --> 00:06:31,103 proportional to the size of the set that we are storing. And now what we want is we 99 00:06:31,103 --> 00:06:35,920 want is some way of translating between the things that we care about, say our 100 00:06:35,920 --> 00:06:41,108 friends names or whatever the elements in the universe are to the positions in this 101 00:06:41,108 --> 00:06:46,172 array. So the object responsible for that translation from keys drawn from this universe to 102 00:06:46,172 --> 00:06:52,060 positions in this array is called a hash function. So formally, a hash function 103 00:06:52,060 --> 00:06:56,844 Takes as input a key. So this is gonna be an IP address or the name of somebody or a 104 00:06:56,844 --> 00:07:01,628 chessboard configuration or whatever. And it's going to spit out an position in this 105 00:07:01,628 --> 00:07:06,013 array. So I'm gonna label the array entries from 0 to n-1 for this lecture. 106 00:07:06,013 --> 00:07:10,057 Obviously at the moment this is super underspecified. There's a zillion 107 00:07:10,057 --> 00:07:14,556 functions you could choose. Which one you use, we'll talk about that, but for now 108 00:07:14,556 --> 00:07:19,226 there's just gonna be some hash function mapping from elements of the universe to 109 00:07:19,226 --> 00:07:23,977 buckets, to positions in this array. Now, as far as the semantics of this hash 110 00:07:23,977 --> 00:07:29,223 function, what the hash function is doing, it's telling us in which position we 111 00:07:29,223 --> 00:07:34,132 should store a given key from the universe. So, if we have some new friend 112 00:07:34,132 --> 00:07:38,737 named Alice. And we run Alice, we key Alice through the hash function and it 113 00:07:38,737 --> 00:07:43,226 gives us a 17. It says we should store Alice's phone number in position 114 00:07:43,226 --> 00:07:47,886 17 of the array. If we have some crazy chessboard configuration, we feed it 115 00:07:47,886 --> 00:07:52,777 into a hash function and it spits out 172, it says we should remember this chessboard 116 00:07:52,777 --> 00:07:57,788 configuration in the 172nd bucket of this array. So again, given x, which is some 117 00:07:57,788 --> 00:08:03,662 key from this universe, we invoke a hash function to get a position in this array, 118 00:08:03,662 --> 00:08:09,537 to get a bucket. And then that is where we try to store this x and any associated 119 00:08:09,537 --> 00:08:13,366 data with it. So that's the high leveled idea of how you implement a hash 120 00:08:13,366 --> 00:08:16,986 table, but we're quite far from done, And in particular there is a serious issue, 121 00:08:16,986 --> 00:08:20,560 that we're going to have to deal with, that's fundamental to implementing hash 122 00:08:20,560 --> 00:08:23,997 tables, and that's the notion of a collision. So probably many of you may 123 00:08:23,997 --> 00:08:27,709 have already noticed that this problem might occur. Which is well what happens if 124 00:08:27,709 --> 00:08:31,558 we're storing our friend's phone numbers, and you know Alice shows up and we ask our 125 00:08:31,558 --> 00:08:35,178 hash function where to store Alice's phone number, and it says oh bucket number 126 00:08:35,178 --> 00:08:38,889 17, And then our friend Bob shows up, and we ask our hash function where to 127 00:08:38,889 --> 00:08:42,509 store Bob's phone number, and what if the hash function also says bucket number 128 00:08:42,509 --> 00:08:45,865 17 for Bob? What do we put in bucket at 17? Do we put Alice 129 00:08:45,865 --> 00:08:49,590 there, do we put Bob there, do we put them both there? How do we deal with these 130 00:08:49,590 --> 00:08:53,459 so-called collisions? So, the next quiz is meant to give, to get you thinking about 131 00:08:53,459 --> 00:08:57,900 collisions, and in some sense, how truly unavoidable they really are. [sound], 132 00:08:57,900 --> 00:09:02,587 [sound] All right. So the correct answer to this question is the first answer, 133 00:09:02,587 --> 00:09:06,605 believe it or not. All you need is 23 people in a room before you're equally 134 00:09:06,605 --> 00:09:10,834 likely to have two people with the same birthday as not. So if you're looking to, 135 00:09:10,834 --> 00:09:14,851 to skim a little money off of your non-mathematical friends, this is one way 136 00:09:14,851 --> 00:09:19,345 you can do it. Go to cocktail parties with about 40 people and place bets with people 137 00:09:19,345 --> 00:09:23,710 that there are two people in the room with the same birthday. So if you have 367 138 00:09:23,710 --> 00:09:27,910 people, well there's only 366 distinct birthdays, I'm counting February 29th 139 00:09:27,910 --> 00:09:32,770 here as one of them. So by the pigeonhole principle, certainly the 140 00:09:32,770 --> 00:09:38,910 probability is 100%. By the time you get to 367. Now, by the time you're at 57. 141 00:09:38,910 --> 00:09:45,418 You're already at 99%. So you already have overwhelming probability to have a 142 00:09:45,418 --> 00:09:51,112 duplicate birthday with 57 people. So of course, with 184 you're gonna be almost at 143 00:09:51,112 --> 00:09:56,389 100%, 99.99. Who knows? Some large number of 9's, And at 23, you're at 50%. So many 144 00:09:56,389 --> 00:10:02,083 people find this quite counter-intuitive that you only need 23 people to get a 145 00:10:02,083 --> 00:10:08,193 duplicate birthday on average. And so this is a, this is a quite famous example and 146 00:10:08,193 --> 00:10:12,699 it sometimes goes by the birthday paradox. Calling it a paradox is sort of a 147 00:10:12,699 --> 00:10:15,691 misnomer. A paradox, you know, often suggests some kind of logical 148 00:10:15,691 --> 00:10:18,822 inconsistency. There's no logical inconsistency here. It's just that 149 00:10:18,822 --> 00:10:22,562 people's brains are not really wired to have this intuition, for whatever reason. 150 00:10:22,562 --> 00:10:26,208 So, but it's really just math. You can work out the math, and, and, and you can 151 00:10:26,208 --> 00:10:29,808 just solve it. So, more generally, the principle behind the birthday paradox is 152 00:10:29,808 --> 00:10:33,126 the following. So suppose you have a calendar, perhaps on some different 153 00:10:33,126 --> 00:10:36,913 planet, which has K days. Where each, everybody's equally likely to have each of 154 00:10:36,913 --> 00:10:40,746 the K days as their birthday. Then it's about the square root of k people that you 155 00:10:40,746 --> 00:10:44,392 need in a room before you're equally likely to have a duplicate, or not have a 156 00:10:44,392 --> 00:10:48,006 duplicate. Okay, and the reason that you get the square root effect is because if 157 00:10:48,006 --> 00:10:52,572 you think about it. There's a quadratic number of pairs of people in the room, so 158 00:10:52,572 --> 00:10:56,997 that's a quadratic, and the number of people Opportunities to have a duplicate. 159 00:10:56,997 --> 00:11:01,110 Right? So, each pair of people could be a duplicate, there's a quadratic number of 160 00:11:01,110 --> 00:11:05,326 pairs. And so, that's why, once the number of pairs starts reaching about the number 161 00:11:05,480 --> 00:11:09,644 of different days, you're, you're about, you're likely to see a duplicate around 162 00:11:09,644 --> 00:11:13,542 that point. So you might be wondering why I'm telling you about the birthday paradox 163 00:11:13,542 --> 00:11:17,361 in the middle of a lecture about hashing, but really it's quite relevant. So imagine 164 00:11:17,361 --> 00:11:21,324 for example you defined a hash function in the following way. Now to be clear, this 165 00:11:21,324 --> 00:11:25,239 is not a practical hash function, but just for the purposes of discussion, imagine 166 00:11:25,239 --> 00:11:29,202 you have a hash function which randomly assigned every single key to a uniform 167 00:11:29,202 --> 00:11:33,068 bucket. 'Kay, so for each, each of the 1/n buckets equally likely. Then what 168 00:11:33,068 --> 00:11:36,983 the birthday paradox says is, even for a very small dataset, you are already gonna 169 00:11:36,983 --> 00:11:40,439 have a pair of things colliding. All right, So if you have an n buckets, so 170 00:11:40,439 --> 00:11:44,355 maybe your n is like, 10,000, all you need is roughly 100 elements in your data set, 171 00:11:44,355 --> 00:11:47,793 and despite the fact that the table is only going to be one percent full, you're 172 00:11:47,793 --> 00:11:51,278 already going to see a collision, okay? So 99 percent of them are empty, but you're 173 00:11:51,278 --> 00:11:55,146 going to have one bucket that has two, so that's sort of annoying. So the birthday 174 00:11:55,146 --> 00:11:58,918 paradox says, you start getting collisions with the hash function, even with the 175 00:11:58,918 --> 00:12:02,499 really tiny data sets. So in this sense, if you're going to have hash tables, 176 00:12:02,499 --> 00:12:06,175 you've got to deal with collisions. There's going to be a fair number of them, 177 00:12:06,175 --> 00:12:10,375 and you need some method for resolving them So, collisions are a fact of life 178 00:12:10,375 --> 00:12:15,292 when you're talking about hashing. Where again, by collision, what I mean is two 179 00:12:15,292 --> 00:12:20,832 different keys. So two different elements x and y from the universe that hash to the 180 00:12:20,832 --> 00:12:25,977 same bucket, Who have the same hash value, So in general we can think of a hash 181 00:12:25,977 --> 00:12:31,386 function as doing a compression of sorts. So we have a huge universe U and we have 182 00:12:31,386 --> 00:12:36,272 this very modest size array A with the only n buckets. Where n, we're thinking 183 00:12:36,272 --> 00:12:40,318 of as being much, much, much smaller than U. So, of course, this hash function has 184 00:12:40,318 --> 00:12:44,519 to map various elements of U to the same bucket. So what are we gonna do about it? 185 00:12:44,519 --> 00:12:48,254 How are we going to resolve these collisions? Well, there's two different 186 00:12:48,254 --> 00:12:52,351 solutions which are both quite prevalent in practice. So solution number one is 187 00:12:52,351 --> 00:12:56,604 called chaining, or sometimes you'll also see it called separate chaining. And this 188 00:12:56,604 --> 00:13:00,702 is a very natural solution; it's also the one that's relatively easy to analyze 189 00:13:00,702 --> 00:13:05,139 mathematically. What you do is just for elements that hash to the same bucket, you 190 00:13:05,139 --> 00:13:09,740 just revert to the list-based solution that we talked about in a previous slide. 191 00:13:09,740 --> 00:13:13,954 So, each of the n buckets will not necessarily contain just merely 0 or 1 element 192 00:13:13,954 --> 00:13:18,555 , it will contain a list within a principle unbounded number of elements. 193 00:13:18,555 --> 00:13:22,991 Okay, so when we use chaining, it's done quite straight-forward to figure out how 194 00:13:22,991 --> 00:13:26,927 to implement all of the hash table operations, namely, insert, delete and 195 00:13:26,927 --> 00:13:31,307 look-up, you just hash something to the appropriate bucket and then you just do 196 00:13:31,307 --> 00:13:36,075 insert, delete or look-up, as appropriate, in the list that's in that bucket. So just 197 00:13:36,075 --> 00:13:42,400 to make clear that everything is type checking, so here h(x), this is the bucket 198 00:13:42,400 --> 00:13:48,950 for x. That's what's specified by the hash function. And then, in the h(x) position 199 00:13:48,950 --> 00:13:54,631 of this array A, in the h (x), the bucket is where we find the linked list that is 200 00:13:54,631 --> 00:13:59,627 going to contain x. So just to give a cartoon example, if you had, say, four 201 00:13:59,627 --> 00:14:05,188 buckets, Maybe, you know, the first bucket has exactly one record. Corresponding to 202 00:14:05,188 --> 00:14:11,285 Alice, maybe the second bucket just has a null pointer. No one's been inserted in the 203 00:14:11,285 --> 00:14:17,015 second bucket. And then the third bucket we have, let's say, both Bob as well as 204 00:14:17,015 --> 00:14:22,960 Daniel. And then maybe in the fourth bucket we have Carol. Okay, so because we 205 00:14:22,960 --> 00:14:26,964 have a collision between Bob and Daniel, both map to the third bucket, and we 206 00:14:26,964 --> 00:14:31,469 resolve that just by having a linked list, with Bob and Daniel in some order. So the 207 00:14:31,469 --> 00:14:36,539 second solution which is trickier to talk about mathematically but still quite 208 00:14:36,539 --> 00:14:41,221 important practically is called open addressing. And the principal in open 209 00:14:41,221 --> 00:14:45,546 addressing is you're not going to use any space for pointers. You're not gonna 210 00:14:45,546 --> 00:14:49,936 have lists. So you're only gonna have one object per bucket of the array. So another 211 00:14:49,936 --> 00:14:53,812 question is what happens if, you know, you try and insert Daniel and you go, you 212 00:14:53,812 --> 00:14:57,639 invoke the hash function on Daniel and it takes you to a bucket that already 213 00:14:57,639 --> 00:15:01,664 contains Bob? That means there's no room for Daniel. So what you're going to do is 214 00:15:01,664 --> 00:15:05,739 you're gonna probe the hash table in some other position. So a hash function is, is 215 00:15:05,739 --> 00:15:09,864 now gonna be replaced by a hash sequence, where you try, the hash function tells you 216 00:15:09,864 --> 00:15:13,939 the first bucket to try to insert Daniel; failing that, a second bucket in which to 217 00:15:13,939 --> 00:15:17,765 try to insert Daniel; failing that, a third bucket to try to insert Daniel; and 218 00:15:17,765 --> 00:15:21,691 so on. And you just keep trying till you find an open position somewhere in the 219 00:15:23,822 --> 00:15:25,953 array. So there's various strategies for trying to figure out the probe 220 00:15:25,953 --> 00:15:30,698 sequence. One strategy is if you fail and save bucket 17, which is where the 221 00:15:30,698 --> 00:15:35,109 hash function tells you to go first. You just try bucket 18, then 19, 222 00:15:35,109 --> 00:15:39,351 then, 20, then 21 and so on, until you find your first open slot. So that's 223 00:15:39,351 --> 00:15:43,626 called linear probing. And another approach is double hashing. So this is a 224 00:15:43,626 --> 00:15:47,525 solution where you actually have two hash functions, hash function 1 and hash 225 00:15:47,525 --> 00:15:51,325 function 2. And the idea is, suppose you're trying to insert, say, Daniel, into 226 00:15:51,325 --> 00:15:55,372 a hash table with open addressing, and you evaluate both of the hash functions. And 227 00:15:55,372 --> 00:15:59,270 the first one comes up 17, and the fir-, the second one comes up 23. So, as 228 00:15:59,270 --> 00:16:03,120 usual, the has-, first hash function will specify where you look first. So if it 229 00:16:03,120 --> 00:16:07,216 evaluates on Daniel to 17, you look in the seventeenth position of the array, 230 00:16:07,216 --> 00:16:11,066 And if, if it's empty, that's where you insert Daniel. Now, if it's full, what you 231 00:16:11,066 --> 00:16:16,513 do is you use the second Hash value to be an additive shift, so. Unlike linear 232 00:16:16,513 --> 00:16:20,488 probing where after seventeen, you look at eighteen. With double hashing, if the 233 00:16:20,488 --> 00:16:24,617 second hash function gives you 23, that's gonna be your offset. So after seventeen, 234 00:16:24,617 --> 00:16:28,643 you look at bucket 40. If 40 is already full, you look at bucket 63. If bucket 63 235 00:16:28,643 --> 00:16:32,670 is already full then you look at bucket 86. So you keep adding increments of 23 236 00:16:32,670 --> 00:16:36,849 until you finally find a bucket where, that's empty and that's where you insert 237 00:16:36,849 --> 00:16:40,518 Daniel. Now of course, if you try to insert some other name, if you try to 238 00:16:40,518 --> 00:16:44,494 insert Elizabeth, you're gonna get two totally different numbers in general. So 239 00:16:44,494 --> 00:16:48,469 maybe you'll get 42 and 27, and so here the probed sequence will be 42, failing 240 00:16:48,469 --> 00:16:54,781 that 69, failing that 96 failing that 123 and so on, So a question you should have 241 00:16:54,781 --> 00:16:58,215 at this point, is, you know. I've told you two solutions to resolving collisions in a 242 00:16:58,215 --> 00:17:01,359 hash table. And you're probably asking, well, which ones should you use if you 243 00:17:01,359 --> 00:17:04,545 have to implement your own hash table? And, you know, as usual, if I present you 244 00:17:04,545 --> 00:17:07,978 with two different solutions for the same problem. You can probably rest assured 245 00:17:07,978 --> 00:17:11,081 that neither one dominates the other, right? Otherwise I wouldn't waste your 246 00:17:11,081 --> 00:17:14,267 time by presenting both of them to you. So, sometimes chaining's gonna perform 247 00:17:14,267 --> 00:17:17,453 better, and sometimes, open addressing's gonna perform better. And of course, it 248 00:17:17,453 --> 00:17:20,994 also depends on what kind of metric that you care about. So there are a couple of 249 00:17:20,994 --> 00:17:24,963 rules of thumb that I can tell you. So first of all if space is at a real premium 250 00:17:24,963 --> 00:17:29,127 you might want to consider open addressing instead of chaining, and that's cause with 251 00:17:29,127 --> 00:17:33,095 chaining you do have this excess, not huge but you do have this little bit of 252 00:17:33,095 --> 00:17:37,210 space overhead and dealing with all these pointers in this link list. So if you want 253 00:17:37,210 --> 00:17:41,081 to avoid that, you might want to think about open addressing. The second rule of 254 00:17:41,081 --> 00:17:44,706 thumb is deletion is trickier with open addressing than with chaining, but 255 00:17:44,706 --> 00:17:48,919 deletion is clearly not difficult at all, either to code or understand when you use 256 00:17:48,919 --> 00:17:53,035 chaining cause it just reduces chaining to a linked list which of course you all know 257 00:17:53,035 --> 00:17:56,887 how to do. Open Addressing is, it's not impossible to implement deletion but it's 258 00:17:56,887 --> 00:18:00,340 much trickier. So if deletion's a, a crucial operation for you, that might 259 00:18:00,340 --> 00:18:04,082 steer you towards thinking about chaining. But ultimately, if it's really kinda 260 00:18:04,082 --> 00:18:07,871 mission critical code, probably the best thing to do is implement both kinds of 261 00:18:07,871 --> 00:18:11,708 solutions and just see which one works better. It's a little hard to predict how 262 00:18:11,708 --> 00:18:15,065 they're gonna interact with memory hierarchies and that kind of thing. 263 00:18:15,065 --> 00:18:18,746 They're both useful in their own contexts. Alright so we've covered the two most 264 00:18:18,746 --> 00:18:22,386 prevalent ways of handling collisions. And we argued that collisions are inevitable 265 00:18:22,386 --> 00:18:25,894 no matter you design you hash function. You're stuck with collisions and you can 266 00:18:25,894 --> 00:18:29,490 do chaining or linked lists per bucket, or you can do addressing, where you actually 267 00:18:29,490 --> 00:18:32,911 have a probe sequence in order of which you look at buckets until you find an 268 00:18:32,911 --> 00:18:37,694 empty one. And the elephant in the room at this point is, you know, what is this hash 269 00:18:37,694 --> 00:18:42,514 function? I have told you nothing about hash functions. All I told you is there is 270 00:18:42,514 --> 00:18:47,394 some mapping from the set of the universe, so IP addresses, or names, or whatever to 271 00:18:47,394 --> 00:18:51,441 a bucket number. Well what kind of function should you use? Excellent 272 00:18:51,441 --> 00:18:55,845 question, Tons of research on that question, And to this day as much art as 273 00:18:55,845 --> 00:18:58,285 science. But let's start talking about it.