1 00:00:00,000 --> 00:00:03,028 So now that we understand why we can't have a single hash function which always 2 00:00:03,028 --> 00:00:08,042 does well at every single data set, that is every hash function is subject to a 3 00:00:08,042 --> 00:00:11,074 pathological data sets. We'll discuss the randomize solution of how we can have a 4 00:00:11,074 --> 00:00:16,019 family of hash functions and if you make a real time decision about which hash 5 00:00:16,019 --> 00:00:20,055 function to use, you're guaranteed to do well on average, no matter what the data 6 00:00:20,055 --> 00:00:25,427 is. So let me remind you of the three prong plan that I have for this part of 7 00:00:25,427 --> 00:00:29,539 the material. So in this video, we'll be covering the first two. So part one, which 8 00:00:29,539 --> 00:00:33,936 we'll accomplish in the next slide, will be to propose a mathematical definition of 9 00:00:33,936 --> 00:00:38,898 a good random hash function. So formerly, we're going to define a universal family 10 00:00:38,898 --> 00:00:43,168 of hash functions. Now, what makes this definition useful, well, two things. And 11 00:00:43,168 --> 00:00:48,617 so, part two, we'll show that there are examples of simple and easy to compute 12 00:00:48,617 --> 00:00:52,906 hash functions that meet this definition, that are universal in the sense described 13 00:00:52,906 --> 00:00:56,999 on the next slide. So that's important. And in the third part, which we'll do in 14 00:00:56,999 --> 00:01:01,271 the next video, will be the mathematical analysis of the performance of hashing, 15 00:01:01,271 --> 00:01:05,549 specifically with chaining when you use universal hashing. And we'll show that if 16 00:01:05,549 --> 00:01:08,957 you pick a random function from a universal family, then, the expected 17 00:01:08,957 --> 00:01:12,764 performance of all of the operations are constant. Assuming, of course, of the 18 00:01:12,764 --> 00:01:17,172 number of buckets is comparable to the number of objects in the hash table which 19 00:01:17,172 --> 00:01:21,939 we saw earlier is a necessary condition for good performance. So let's go ahead 20 00:01:21,939 --> 00:01:26,157 and get started and let's say what we mean by a good random hash function. So for 21 00:01:26,157 --> 00:01:30,965 this definition, we'll assume that the universe is fixed. So maybe it's IP 22 00:01:30,965 --> 00:01:35,707 addresses, maybe it's our friends names. Maybe it's configurations of a chessboard, 23 00:01:35,707 --> 00:01:40,143 whatever. But there's some fixed universe u and we'll also assume we've decided on 24 00:01:40,143 --> 00:01:47,322 the number of buckets n. And we call the set H universal if and only if it meets 25 00:01:47,322 --> 00:01:52,977 the following condition. In English, the condition says that for each pair of 26 00:01:52,977 --> 00:01:58,163 distinct elements, the probab ility that they collide should be no larger than with 27 00:01:58,163 --> 00:02:03,520 the gold standard of perfectly uniform random hashing. So for all distinct keys 28 00:02:03,520 --> 00:02:09,854 from the universe, call them x and y, what we want is that the probability if we 29 00:02:09,854 --> 00:02:16,382 choose a random hash function, h, from the set script h, the probability that x and y 30 00:02:16,382 --> 00:02:22,176 collide. And again, just to be clear, what that means is that, x and y hash to 31 00:02:22,176 --> 00:02:30,961 exactly the same bucket under this hash function h, this should be no more than 32 00:02:30,961 --> 00:02:38,334 1/n, and don't forget n is the number of buckets. Again, to interpret this, you 33 00:02:38,334 --> 00:02:42,835 know, 1/n, where does this come from? So, we said earlier that an impractical but in 34 00:02:42,835 --> 00:02:47,820 some sense, gold standard hash function would be to just independently for each 35 00:02:47,820 --> 00:02:53,007 key, assign it bucket uniformly and random with different keys being assigned 36 00:02:53,007 --> 00:02:56,055 independently. Remember the reason this is not a practical hash function is because, 37 00:02:56,055 --> 00:02:59,006 you'd have to remember where everybody went. And then that would basically 38 00:02:59,006 --> 00:03:02,033 require maintaining a list which would devolve to the list solution, so you don't 39 00:03:02,033 --> 00:03:05,070 want that. You want hash functions where you have to store almost nothing and we 40 00:03:05,070 --> 00:03:10,044 can evaluate them in constant time. But, if we throw out those requirements of 41 00:03:10,044 --> 00:03:13,093 small space and small time then, random function should spread stuff out pretty 42 00:03:13,093 --> 00:03:17,054 evenly, right? I mean that's what they are doing. They're throwing darts completely 43 00:03:17,054 --> 00:03:21,003 at random at these n buckets. So what would be the collision probability of two 44 00:03:21,003 --> 00:03:24,034 given keys say, of Alice and of Bob if you are doing everything independently and 45 00:03:24,034 --> 00:03:28,084 uniformly at random. Well, you know, first Alice shows up and it goes to some totally 46 00:03:28,084 --> 00:03:32,062 random bucket, say, bucket number seventeen. Now, Bob shows up. So, what's 47 00:03:32,062 --> 00:03:37,218 the probability that it collides with Alice? Well, we have these n buckets that 48 00:03:37,218 --> 00:03:40,765 Bob could go to, each is equally likely, and there's a collision between Alice and 49 00:03:40,765 --> 00:03:45,551 Bob if and only if Bob goes to bucket seventeen. Since, each bucket's equally 50 00:03:45,551 --> 00:03:49,522 likely, that's only a one in n probability. So really what this condition 51 00:03:49,522 --> 00:03:53,696 is saying, is that, for each pair of elements, the collision probability should 52 00:03:53,696 --> 00:03:59,899 be as small, as good as with the holy grail of perfectly random hashing. So this 53 00:03:59,899 --> 00:04:04,704 is a pretty subtle definition, perhaps the most subtle one that we see in this entire 54 00:04:04,704 --> 00:04:09,443 course. So, to help you get some facility with this, and to force you to think about 55 00:04:09,443 --> 00:04:13,684 it a little deeply, the next quiz which is probably harder than a typical in class 56 00:04:13,684 --> 00:04:17,535 quiz, asks you to compare this definition of universal hashing with another 57 00:04:17,535 --> 00:04:21,719 definition and ask you to figure out to what extent they're the same definition. 58 00:04:21,719 --> 00:04:26,676 So the correct answer to this quiz question is the third one that there are 59 00:04:26,676 --> 00:04:32,211 hash function families h, that satisfy the condition on this slide that are not 60 00:04:32,211 --> 00:04:37,138 universal. On the other hand, there are hash function families H, which satisfy 61 00:04:37,138 --> 00:04:42,708 this property and are universal. So, I'm going to give you an example of each. I'd 62 00:04:42,708 --> 00:04:47,618 encourage you to think carefully about why this is an example and a non-example 63 00:04:47,798 --> 00:04:52,547 offline. So an easy example to show that sometimes the answer is yes, you have 64 00:04:52,547 --> 00:04:57,659 universal hash function families h, which also satisfy this property of the slide, 65 00:04:57,659 --> 00:05:02,832 would be to just take H to be the set of all functions from mapping the universe to 66 00:05:02,832 --> 00:05:08,030 the number of buckets. So that's an awful lot of functions, that's a huge set, but 67 00:05:08,030 --> 00:05:12,448 it's a set nonetheless. And, by symmetry of having all of the functions, it both 68 00:05:12,448 --> 00:05:16,504 satisfies the property of this slide. It is indeed true that exactly a one on one 69 00:05:16,504 --> 00:05:21,604 refraction of all functions, map arbitrary key k to arbitrary bucket i. And by the 70 00:05:21,604 --> 00:05:26,715 same reasoning, by the same symmetry properties, this is universal. So really, 71 00:05:26,715 --> 00:05:31,414 if you think about it choosing a function at random function from H, is now just 72 00:05:31,414 --> 00:05:34,707 choosing a completely random function. So it's exactly what we've been calling 73 00:05:34,707 --> 00:05:38,302 perfect, random hashing. And as we discussed in the last slide, that would 74 00:05:38,302 --> 00:05:44,039 indeed have a collision probability of exactly 1/n for each pair of distinct 75 00:05:44,039 --> 00:05:47,639 keys. So, this shows sometimes you can have both this property and be universal. 76 00:05:47,639 --> 00:05:52,049 An example where you have the property in this slide, but you're not universa l, 77 00:05:52,049 --> 00:05:56,937 would be to take h to be a quite small family, a family of exactly n functions, 78 00:05:56,937 --> 00:06:01,269 each of which is a constant function. So it's going to be one function which always 79 00:06:01,269 --> 00:06:05,020 maps everything to bucket zero, that's a totally stupid hash function. There's 80 00:06:05,020 --> 00:06:09,034 going to be another hash function which always maps everything to bucket number 81 00:06:09,034 --> 00:06:13,504 one, that's a different but also totally stupid hash function, and so on. And then 82 00:06:13,504 --> 00:06:18,031 the end function will be the constant function that always maps everything to 83 00:06:18,031 --> 00:06:23,488 bucket n-1. And if you think about it, this very silly set H does indeed satisfy 84 00:06:23,488 --> 00:06:28,750 this very reasonable looking property on this slide. Fix any key, fix any bucket, 85 00:06:28,750 --> 00:06:33,133 you know say bucket number 31 what's the probability that you pick a hash function 86 00:06:33,133 --> 00:06:37,193 that maps this key to bucket number 31? Well, independent of what the key is, it's 87 00:06:37,193 --> 00:06:40,669 going to be the probability that you pick the constant hash function whose output is 88 00:06:40,669 --> 00:06:43,544 always 31. Since there's n different constant functions, there's a one in n 89 00:06:43,544 --> 00:06:48,192 probability. So, that's an example showing that in some sense, this is not as useful 90 00:06:48,192 --> 00:06:52,107 a property as the property of universal hashing. So this is really not what you 91 00:06:52,107 --> 00:06:55,759 wanted. This is not strong enough. Universal hashing, that's what you want 92 00:06:55,759 --> 00:06:59,862 for strong guarantees. So now that we've spent some time trying to assimilate 93 00:06:59,862 --> 00:07:03,413 probably the subtlest definition we've seen so far in this class, let me let you 94 00:07:03,413 --> 00:07:07,423 in on a little secret about the role of definitions in mathematics. So on the one 95 00:07:07,423 --> 00:07:12,415 hand, I think mathematical definitions often get short shrift, especially in, you 96 00:07:12,415 --> 00:07:16,206 know, the popular discussion of mathematical research. That said, you 97 00:07:16,206 --> 00:07:20,572 know, it's easy to come up with one reason why that's true, which is that any schmo 98 00:07:20,572 --> 00:07:24,800 can come up and write down an mathematical definition. Nobody's stopping you. So, 99 00:07:24,800 --> 00:07:29,626 what you really need to do is you need to prove that a mathematical definition is 100 00:07:29,626 --> 00:07:33,715 useful. So how do you indicate usefulness of a definition? Well you gotta do two 101 00:07:33,715 --> 00:07:38,301 things. First of all, you have to show that the definition is satisfied by 102 00:07:38,301 --> 00:07:42,532 objects of interest. For us right now, objects of interest, are hash functions, 103 00:07:42,532 --> 00:07:46,594 we might imagine implementing. So they should be easy to store, easy to evaluate. 104 00:07:46,594 --> 00:07:50,596 So there better be such hash functions meaning, that complicated universal hash 105 00:07:50,596 --> 00:07:55,678 function definition. The second thing is, is something good better happen if you 106 00:07:55,678 --> 00:07:59,072 meet the definition. And in the context of hashing, what good thing do we want to 107 00:07:59,072 --> 00:08:02,092 have happen? We want to have good performance. So those are the two things 108 00:08:02,092 --> 00:08:06,035 that I owe you in these lectures. First of all, a construction of practical hash 109 00:08:06,035 --> 00:08:09,056 functions that meet that definition, that's what we'll start on right now. 110 00:08:09,056 --> 00:08:13,078 Second of all, why meeting that definition is a sufficient condition for good hash 111 00:08:13,078 --> 00:08:18,096 table performance. That will be the next video. So in this example, I'm going to 112 00:08:18,096 --> 00:08:24,023 focus on IP addresses although the hash function construction is general, as I 113 00:08:24,023 --> 00:08:30,025 hope will be reasonably clear. And as many of you know, an IP address is 32 bit 114 00:08:30,025 --> 00:08:35,009 integer consisting of four different eight bit parts. So let's just go ahead and 115 00:08:35,009 --> 00:08:39,081 think of an IP address as a four two fold, the way you often see it. And since each 116 00:08:39,081 --> 00:08:45,386 of the four parts is eight bits, it's going to be a number between zero and 255. 117 00:08:45,386 --> 00:08:49,072 And the hash function that we're going to construct, it's really not going to be so 118 00:08:49,072 --> 00:08:51,855 different than the quick and dirty functions as we talked about in the last 119 00:08:51,855 --> 00:08:56,843 video although in this case we'll be able to prove that the hash function family is 120 00:08:56,843 --> 00:08:59,819 in fact, universal. And we're again going to use the same compression function. 121 00:08:59,819 --> 00:09:02,879 We're going to take the modulas with respect to a prime number of buckets. The 122 00:09:02,879 --> 00:09:08,088 only difference is we're going to multiply these xi's by a random set of 123 00:09:08,088 --> 00:09:11,851 coefficients. We're going to take a, a random linear combination of x1, x2, x3 124 00:09:11,851 --> 00:09:16,002 and x4. So I'm going to be a little more precise. So we're going to choose a number 125 00:09:16,002 --> 00:09:19,589 of buckets, n, and as we say over and over, the number of buckets should be 126 00:09:19,589 --> 00:09:24,482 chosen so it's in the same ball park of the number of objects you are storing. So 127 00:09:24,482 --> 00:09:27,499 you know, let's say that n should be roughly double the number of objects that 128 00:09:27,499 --> 00:09:32,449 you are storing as initial rule of thumb. So, for example, maybe we only want to 129 00:09:32,449 --> 00:09:37,049 maintain something in the ball park of 500 IP addresses and we can choose n to be a 130 00:09:37,049 --> 00:09:43,003 prime like 997. So here's the construction. Remember, we want to produce 131 00:09:43,003 --> 00:09:47,091 not just one hash function, but the definition is about a universal family of 132 00:09:47,091 --> 00:09:52,074 hash functions. So we need a whole set of hash functions that we're ultimately going 133 00:09:52,074 --> 00:09:58,004 to chose one member from, at random. So, how do we construct a whole bunch of hash 134 00:09:58,004 --> 00:10:04,064 functions in a simple way? Here's how we do it. So you define one hash function, 135 00:10:04,064 --> 00:10:11,081 which I'm going to note by h sub a, a here is a four tuple. The components of which 136 00:10:11,081 --> 00:10:19,089 I'm going to call a1, a2, a3 and a4. And, all of the components of a are integers 137 00:10:19,089 --> 00:10:26,089 between zero and n-1. So they're exactly in correspondence with the indices of the 138 00:10:26,089 --> 00:10:32,009 buckets. So if we have 997 buckets, then each of these ai's is an integer between 139 00:10:32,009 --> 00:10:37,014 zero and 996. So it's clear that this defines, you know, a whole bunch of 140 00:10:37,014 --> 00:10:42,062 functions. So in fact, for each of the four coefficients, that's four independent 141 00:10:42,062 --> 00:10:47,278 choices, you have n options. Okay so each of the integers between zero and n-1 for 142 00:10:47,278 --> 00:10:52,062 each of the four coefficients. So that's fine, not giving a name to end of the four 143 00:10:52,062 --> 00:10:56,084 different functions, but what is any given function? How do you actually evaluate one 144 00:10:56,084 --> 00:11:00,048 of these functions? Just remember what a hash function is supposed to do. Remember 145 00:11:00,048 --> 00:11:04,035 you know, how it type checks it takes as input something from the universe in this 146 00:11:04,035 --> 00:11:09,013 case an IP address, and outputs a bucket number. And the way we evaluate the hash 147 00:11:09,013 --> 00:11:14,558 function h sub a, and remember a here is a 4-tuple. And remember IP address is also a 148 00:11:14,558 --> 00:11:19,258 4-tuple, okay, so each component of the IP address is between zero and 255. Each 149 00:11:19,258 --> 00:11:25,095 component of a is between zero and n-1, so for example, between zero and 996. And 150 00:11:25,095 --> 00:11:29,469 what we do is just take the dot products or the inner products of the vector a and 151 00:11:29,469 --> 00:11:34,369 the vector x, and then we take the modulus with respect the number of buckets. So 152 00:11:34,369 --> 00:11:45,773 that is we take a1 x1 + a2 x2 + a3 x3 +a4 x4. Now of course, remember the 153 00:11:45,773 --> 00:11:49,982 x's lie be tween zero and 255, the ai's lie between the zero and n-1, so say zero 154 00:11:49,982 --> 00:11:54,743 and 996, you know, so you do these form of multiplications now make it a pretty big 155 00:11:54,743 --> 00:11:59,734 number, you might well over shoot the number of buckets n. So to get back in the 156 00:11:59,734 --> 00:12:04,163 range of what the buckets are actually indexed that in the end we take the 157 00:12:04,163 --> 00:12:07,628 module, modulus the number of buckets. So in the end we do output, a number between 158 00:12:07,628 --> 00:12:13,294 zero and n-1 as desired. So that's a set of a whole bunch of hash functions, n to 159 00:12:13,294 --> 00:12:17,310 the fourth hash functions. And each one meets the criteria of being a good hash 160 00:12:17,310 --> 00:12:21,053 function from an implementation perspective, right? So remember, we don't 161 00:12:21,053 --> 00:12:25,097 want to have to store much to evaluate a function. And for a given hash function in 162 00:12:25,097 --> 00:12:30,013 this family, all we gotta remember are the coefficients, a1, a2, a3 and a4. So you 163 00:12:30,013 --> 00:12:34,051 just gotta remember these four numbers. And then to evaluate a hash function on an 164 00:12:34,051 --> 00:12:38,041 IP address, we clearly do a constant amount of work. We just do these four 165 00:12:38,041 --> 00:12:42,065 multiplications, the three additions, and then taking the modulus by the number of 166 00:12:42,065 --> 00:12:47,008 buckets n. So it's constant time to evaluate, constant space to store. And 167 00:12:47,008 --> 00:12:51,084 what's cool is, using just these very simple hash functions which are constant 168 00:12:51,084 --> 00:12:56,291 time to evaluate and constant space to store, this is already enough to meet the 169 00:12:56,291 --> 00:13:01,020 definition of a universal family of hash functions. So this fulfills the first 170 00:13:01,020 --> 00:13:05,005 promise that I owed you, after subjecting you to that definition of universal 171 00:13:05,005 --> 00:13:08,071 hashing. Remember the first promise was, there are simple, there are useful 172 00:13:08,071 --> 00:13:12,037 examples that meet the definition, and then of course, I still owe you why. 173 00:13:12,037 --> 00:13:16,032 Meaning this definition is useful, why does it leave the good performance. But I 174 00:13:16,032 --> 00:13:20,038 want to conclude, this video of actually proving this theorem to you, arguing that 175 00:13:20,038 --> 00:13:24,117 this is, in fact, a universal family of hash functions. Right. So this should be a 176 00:13:24,117 --> 00:13:28,010 mostly complete proof and certainly will have all of the conceptual ingredients of 177 00:13:28,010 --> 00:13:31,482 why the proof works There will be one spot where I'm a little hand-wavy because we 178 00:13:31,482 --> 00:13:34,438 need a little number theory, and I don't want to have a big detour into number 179 00:13:34,438 --> 00:13:38,363 theory. And if you think about it, you shouldn't be surprised that basic number 180 00:13:38,363 --> 00:13:42,153 theory plays at least some role. Like as I said, we should choose the number of 181 00:13:42,153 --> 00:13:46,212 buckets to be prime. So that means at some point in the proof, you should expect us 182 00:13:46,212 --> 00:13:50,046 to use the assumption that n is prime. And pretty much always you're going to use 183 00:13:50,046 --> 00:13:53,099 that assumption will involve at least elementary number theory, okay? But I'll 184 00:13:53,099 --> 00:13:57,693 be clear about where I'm being hand-wavy. So what do we have to prove? Let's just 185 00:13:57,693 --> 00:14:02,420 quickly review a definition of a universal hash function. So we have our set h that 186 00:14:02,420 --> 00:14:05,450 we, that we know exactly what it is. What does it mean that it's universal? It means 187 00:14:05,450 --> 00:14:10,090 for each pair of distinct keys, so in our context it's for each pair of IP 188 00:14:10,090 --> 00:14:15,083 addresses, the probability that a random hash function from our family script h 189 00:14:15,083 --> 00:14:19,077 causes a collision, maps these two IP addresses to the same bucket should be no 190 00:14:19,077 --> 00:14:24,811 worse than with perfectly random hashing. So no worse than 1/n where n is the number 191 00:14:24,811 --> 00:14:29,500 of buckets, say like 997. So, the definition we need to meet is a condition 192 00:14:29,500 --> 00:14:33,692 for every pair of distinct keys. So let's just start by fixing two distinct keys. So 193 00:14:33,692 --> 00:14:39,461 I'm going to assume for this proof that these two IP addresses differ in their 194 00:14:39,461 --> 00:14:43,580 fourth component. That is that I'm going to assume that x4 is different than y4. So 195 00:14:43,580 --> 00:14:47,409 I hope that it's intuitively clear that, you know, it shouldn't matter, you know, 196 00:14:47,409 --> 00:14:50,834 which, which set of 8-bits I'm looking at. So they're different IP addresses. They 197 00:14:50,834 --> 00:14:54,724 differ somewhere. If I really wanted, I could have four cases that were totally 198 00:14:54,724 --> 00:14:57,975 identical depending on whether they differ in the first eight bits, the next 8-bits, 199 00:14:57,975 --> 00:15:01,543 the next 8-bits, or the last 8-bits. I'm going to show you one case, because the 200 00:15:01,543 --> 00:15:04,871 other three are the same. So let's just think of the last 8-bits as being 201 00:15:04,871 --> 00:15:08,123 different. And now, remember what the definition asked us to prove. It asked us 202 00:15:08,123 --> 00:15:12,784 to prove that the probability that these two IP addresses are going to collide is 203 00:15:12,784 --> 00:15:17,891 at most, 1/n. So we need an upper bound on the collision probability w ith respect to 204 00:15:17,891 --> 00:15:22,336 a random hash function from our set of n to the fourth hash functions. So I want to 205 00:15:22,336 --> 00:15:27,023 be clear on the quantifiers. We're thinking about two fixed IP addresses. So 206 00:15:27,023 --> 00:15:32,008 for example, the IP address for the New York Times website and the IP address for 207 00:15:32,008 --> 00:15:37,216 the CNN website. We're asking for these two fixed IP addresses, what fraction of 208 00:15:37,216 --> 00:15:42,610 our hash functions cause them to collide, right? We'll have some hash functions 209 00:15:42,610 --> 00:15:47,188 which map the New York Times and CNN IP addresses to the same bucket, and we'll 210 00:15:47,188 --> 00:15:51,590 have other hash functions which do not map those two IP addresses to the same bucket. 211 00:15:51,590 --> 00:15:55,135 And we're trying to say, that the overwhelming majority, sends them to 212 00:15:55,135 --> 00:16:00,025 different buckets, only a 1/n fraction at most, sends them to the same bucket. So 213 00:16:00,025 --> 00:16:06,049 we're asking about the probability for the choice of a random hash function from our 214 00:16:06,049 --> 00:16:11,047 set h that the function maps the two IP addresses to the same place. So the next 215 00:16:11,047 --> 00:16:16,032 step is just algebra. I'm just going to take this equation which indicates when 216 00:16:16,032 --> 00:16:20,086 the two IP addresses collide over a hash function. I'm going to expand the 217 00:16:20,086 --> 00:16:24,150 definition of a hash function, remember it's just this inner product modulo the 218 00:16:24,150 --> 00:16:28,919 number of buckets n, and I am going to rewrite this condition in a more 219 00:16:28,919 --> 00:16:33,266 convenient way. Alright, so after the algebra, and the dust has settled. We're 220 00:16:33,266 --> 00:16:38,249 left with this equation being equivalent to the two IP addresses colliding. So 221 00:16:38,249 --> 00:16:42,817 again, we're interested in the fraction of choices of a1, a2, a3, and a4, such that 222 00:16:42,817 --> 00:16:47,329 this condition holds, right? Sometimes it'll hold for some choices of the ai's, 223 00:16:47,329 --> 00:16:51,460 sometimes it won't hold for other choices and we're going to show that it almost 224 00:16:51,460 --> 00:16:56,814 never holds. Okay, so it fails for all but a 1/n fraction of the choices of the ai's. 225 00:16:56,814 --> 00:16:59,676 So next we're going to do something a little sneaky. This trick is sometimes 226 00:16:59,676 --> 00:17:03,273 called the Principle of Deferred Decisions. And the idea is when you have a 227 00:17:03,273 --> 00:17:08,504 bunch of random coin flips, it's sometimes convenient to flip some but not all of 228 00:17:08,504 --> 00:17:12,552 them. So sometimes fixing parts of the randomness clarifies the role that the 229 00:17:12,552 --> 00:17:16,347 remaining randomness is going to play . That's what's going to happen here. So 230 00:17:16,347 --> 00:17:21,266 let's go ahead and flip the coins, which tell us the random choice of a1, a2, and 231 00:17:21,266 --> 00:17:25,560 a3. So again remember, in the definition of a universal hash function, you analyze 232 00:17:25,560 --> 00:17:30,090 collision probability under a random choice of a hash function. What does it 233 00:17:30,090 --> 00:17:35,172 mean to choose a random hash function for us? It means a random choice of a1, and 234 00:17:35,172 --> 00:17:39,104 a2, and a3, and a4. So we're making four random choices. And what I'm saying is, 235 00:17:39,104 --> 00:17:44,893 let's condition on the outcomes of the first three. Suppose we knew, that a1 236 00:17:44,893 --> 00:17:52,422 turns up 173, a2 shows up 122 and a3 shows up 723, but we don't know what a4 is. A4 237 00:17:52,422 --> 00:17:56,908 is still equally likely to be any of zero, one, two all the way up to n-1. So 238 00:17:56,908 --> 00:18:02,552 remember that what we want to prove is that at most 1/n fraction of the choices 239 00:18:02,552 --> 00:18:07,001 of a1, a2, a3, and a4, cause this underlined equation to be true, cause a 240 00:18:07,001 --> 00:18:12,003 collision. So what we're going to show is that for each fixed choice of a1, a2, and 241 00:18:12,003 --> 00:18:18,062 a3, at most a 1/n fraction of the choices of a4 cause this equation to hold. And if 242 00:18:18,062 --> 00:18:23,543 we can show that for every single choice of a1, a2, and a3, no matter how those 243 00:18:23,543 --> 00:18:28,378 random coin flips come out, almost a 1/n fraction of the remaining outcomes satisfy 244 00:18:28,378 --> 00:18:32,364 the equation, then we're done. That means that at most of 1/n fraction of the 245 00:18:32,364 --> 00:18:36,403 overall outcomes can cause the equation to be true. So if you haven't seen the 246 00:18:36,403 --> 00:18:39,718 principle of, for these decisions before, you might want to think about this a 247 00:18:39,718 --> 00:18:43,022 little bit offline, but it's easily justified by just say two lines of 248 00:18:43,022 --> 00:18:47,261 algebra. Okay, so we're done with the setup and we're ready for the meat of the 249 00:18:47,261 --> 00:18:51,934 argument. So we have done is, we've identified an equation which is now in 250 00:18:51,934 --> 00:18:57,637 green, which occurs if and only if we have a collision between the two IP addresses. 251 00:18:57,637 --> 00:19:02,938 And the question we need to ask is, for a fixed choices of a1, a2 and a3, how 252 00:19:02,938 --> 00:19:08,033 frequently will the choice of a4 cause this equation to be satisfied? Cause a 253 00:19:08,033 --> 00:19:12,973 collision? Now, here is why we did this trick of the Principle of Deferred 254 00:19:12,973 --> 00:19:19,196 Decisions. By fixing a1, a2, and a3, the right hand side of this equation is now 255 00:19:19,196 --> 00:19:25,619 just some fixed number b etween zero and n-1. So maybe this is 773, right? The xi's 256 00:19:25,619 --> 00:19:30,891 were fixed upfront, the yi's were fixed upfront. We fixed a1, a2, a3 at the 257 00:19:30,891 --> 00:19:35,613 beginning, at the end of the last slide, and those were the only ones involved in 258 00:19:35,613 --> 00:19:41,512 the right hand side. So this is 773 and over on the left hand side, x4 is fixed, 259 00:19:41,512 --> 00:19:48,239 y4 is fixed but a4 is still random. This is an integer equally likely to be any 260 00:19:48,239 --> 00:19:53,653 value between zero and n-1. Now here's the key claim, which is that the left-hand 261 00:19:53,653 --> 00:20:00,028 side of this green equation is equally likely to be any number between zero and 262 00:20:00,028 --> 00:20:05,554 n-1. And I'll tell you the reasons why this key claim is true. Although this is 263 00:20:05,554 --> 00:20:08,525 the point where we need a little bit of number theory, so I'll be kind of 264 00:20:08,525 --> 00:20:12,061 hand-wavy about it. So there's three things we have going for us, the first is 265 00:20:12,061 --> 00:20:16,044 that x4 and y4 are different. Remember our assumption at the beginning of the proof 266 00:20:16,044 --> 00:20:19,373 was that, you know, the IP addresses differ somewhere so why not just assume 267 00:20:19,373 --> 00:20:23,115 that they differ in the last 8-bits of the proof. Again this is not important if you 268 00:20:23,115 --> 00:20:26,648 really wanted to be pedantic you could have three other cases depending on the 269 00:20:26,648 --> 00:20:30,418 other possible bits in which the IP addresses might differ. But anyway, so, 270 00:20:30,418 --> 00:20:36,409 because x4 and y4 are different, what that means is that x4 - y4 is not zero. And in 271 00:20:36,409 --> 00:20:41,662 fact, now that I write this, it's jogging my memory of something that I should have 272 00:20:41,662 --> 00:20:45,988 told you earlier, and forgot, which is that the number of buckets n should be at 273 00:20:45,988 --> 00:20:50,777 least as large as the maximum coeffcient value. So for example, we definitely want 274 00:20:50,777 --> 00:20:53,574 the number of buckets n in this equation to be bigger than x4, and bigger than y4. 275 00:20:53,574 --> 00:20:57,752 And the reason is, otherwise you could have x4 and y4 being different from each 276 00:20:57,752 --> 00:21:03,597 other, but they still, the difference still winds up being zero mod-n. So for 277 00:21:03,597 --> 00:21:08,319 example, suppose n was four, and x4 was six and y4 was ten. Then x4-10 would be 278 00:21:08,319 --> 00:21:11,563 -four and that's actually zero modulo four. So that's getting now what you want. 279 00:21:11,563 --> 00:21:15,614 You want to make sure that if x4 and y4 are different, then they're difference is 280 00:21:15,614 --> 00:21:19,721 non-zero modulo n. And the way you ensure that is that you just make sure n is 281 00:21:19,721 --> 00:21:23,669 bigger than each. So you should choose a number of buckets bigger than the maximum 282 00:21:23,669 --> 00:21:26,577 value of the coefficient. So in our IP address example, remember that the 283 00:21:26,577 --> 00:21:31,802 coefficient don't get bigger than 255. And I was suggesting a number of buckets equal 284 00:21:31,802 --> 00:21:35,681 the same 997. Now, in general, this is never a big deal in practice, if you only 285 00:21:35,681 --> 00:21:39,752 wanted to use say, 100 buckets, you didn't want to use 1000, you wanted 100, well, 286 00:21:39,752 --> 00:21:42,660 then you could just use smaller coefficients, right, you could just break 287 00:21:42,660 --> 00:21:45,694 up the IP address, instead of into 8-bit chunks, you could break it into 6-bit 288 00:21:45,694 --> 00:21:49,721 chunks, or 4-bit chunks, and that would keep the coefficient size smaller than the 289 00:21:49,721 --> 00:21:54,004 number of buckets, okay? So you could choose the buckets first, and then you 290 00:21:54,004 --> 00:21:57,080 choose how many bits to chunk your data into, and that's how you make sure this is 291 00:21:57,080 --> 00:22:00,093 satisfied. So back to the three things we have going for us in trying to prove this 292 00:22:00,093 --> 00:22:06,025 key claim. So x4 and y4 are different, so their difference is non-zero modulo n. So 293 00:22:06,025 --> 00:22:10,243 second of all, n is prime, that was part of the definition, part of the 294 00:22:10,243 --> 00:22:15,458 construction. And then third, a4, this final coefficient is equally likely to 295 00:22:15,458 --> 00:22:19,908 take on any value between zero and n-1. So, just as a plausibility argument, let 296 00:22:19,908 --> 00:22:24,053 me give you a proof by example. Again, I don't want to detour into elementary 297 00:22:24,053 --> 00:22:28,045 number theory, although it's beautiful stuff, so you know, I encourage those of 298 00:22:28,045 --> 00:22:32,052 you who are interested to go learn some and figure out exactly how you prove it. 299 00:22:32,052 --> 00:22:36,045 You really only need the barest elementary number theory to give a formal proof of 300 00:22:36,045 --> 00:22:39,739 this. But just to show you that is true in some simple examples, so let's think about 301 00:22:39,739 --> 00:22:44,486 a very small prime. Let's just say there's seven buckets and let's suppose that the 302 00:22:44,486 --> 00:22:48,667 difference between x4 and y4 is two. Okay, so having chosen the parameters of set n = 303 00:22:48,667 --> 00:22:52,984 seven, I've set the difference equal to two. What I want to do is I want to step 304 00:22:52,984 --> 00:22:57,319 through the seven possible choices of a4, and look at what we get in this blue 305 00:22:57,319 --> 00:23:01,910 circle quantity, on the left hand side of the green equation. So, we want to say the 306 00:23:01,910 --> 00:23:05,371 left hand's equally likely to be any of the seven numbers between zero and six, so 307 00:23:05,371 --> 00:23:10,454 that means that as we try our seven different choices for a4, we better get 308 00:23:10,454 --> 00:23:15,359 the seven different possible numbers as output. So, for example, if we set a4 = 309 00:23:15,359 --> 00:23:20,421 zero, then the blue circled quantity is certainly itself zero. If we set it equal 310 00:23:20,421 --> 00:23:24,602 to one, then it's one two, so we hit two. For two, we get two two which is 311 00:23:24,602 --> 00:23:30,529 four. For three, we get three two which is six. Now, when we set a4 = four, we get 312 00:23:30,529 --> 00:23:30,731 four two which is eight, modulo seven is one. Five two - seven is three. Six 313 00:23:30,731 --> 00:23:31,438 two - seven is five. So as we cycle through a4, zero through six, we get the 314 00:23:31,438 --> 00:23:31,438 value zero, two, four, six, one, three, five. So indeed we cycle through the seven 315 00:23:31,438 --> 00:23:31,640 possible outcomes one by one. So if a4 is chosen uniformly and random, then indeed 316 00:23:31,640 --> 00:23:31,640 this blue circled quantity will also be uniformly random. So just to give another 317 00:23:31,640 --> 00:00:00,000 quick example, we could also keep n = seven, and think about the difference of 318 00:00:00,000 --> 00:00:00,000 x4 and y4. Again, we have no idea what it is, other than that its non-zero. So, you 319 00:00:00,000 --> 00:00:00,000 know, maybe instead of three, maybe, maybe instead of two, it's three. So now again, 320 00:00:00,000 --> 00:00:00,000 let's stop through the seven choices of a4, and see what we get. So now we're 321 00:00:00,000 --> 00:00:00,000 going to get zero, then three, then six, then two, then five, and then one, and 322 00:00:00,000 --> 00:00:00,000 then four. So again, stepping through the seven choices of a4, we get all of the 323 00:00:00,000 --> 00:00:00,000 seven different possibilities of this left hand side. And it's not an accident that 324 00:00:00,000 --> 00:00:00,000 these choices are parameters. As long as n is prime, x4 and y4 are different, and y 325 00:00:00,000 --> 00:00:00,000 ranges over all possibilities, so will the value on the left-hand side. So by 326 00:00:00,000 --> 00:00:00,000 choosing a four uniformly random, indeed, the left-hand side is equally likely to be 327 00:00:00,000 --> 00:00:00,000 any of its possible values, zero, one, two up to n-1. And so, what does that mean? 328 00:00:00,000 --> 00:00:00,000 Well, basically it means that we're done with our proof cuz remember, the 329 00:00:00,000 --> 00:00:00,000 right-hand side that circled in pink is fixed. We fixed a1, a2, and the a3. The 330 00:00:00,000 --> 00:00:00,000 x's and y's have been fixed all along so this is just some number, like 773. And 331 00:00:00,000 --> 00:00:00,000 so, we know that there's exactly one choice of a4 that will cause the left-hand 332 00:00:00,000 --> 00:00:00,000 side to also be equal to 773. Now a4 has n different possible values and it's equally 333 00:00:00,000 --> 00:00:00,000 likely to take one on becaus e only a one-end chance that we're going to get the 334 00:00:00,000 --> 00:00:00,000 unlucky choice of a4 that causes the left-hand side to be equal to 773 and of 335 00:00:00,000 --> 00:00:00,000 course, there's nothing special about 773. Doesn't matter how the right-hand side 336 00:00:00,000 --> 00:00:00,000 comes out. We have only one-hand chance of being unlucky and having a collision and 337 00:00:00,000 --> 00:00:00,000 that is exactly the condition we are trying to prove and that establishes the 338 00:00:00,000 --> 00:00:00,000 universality of this function each of n^4, very simple, very easy to evaluate hash