So, in this video, we're going to start reasoning about the performance of hash tables. In particular, we'll make precise this idea that, properly implemented, they guarantee constant time lookup. So, let me just briefly remind you of the road map that we're in the middle of. So, we observed that every fixed hash function is subject to a pathological data set, and so we're exploring the solution of making a real time decision of what hash function to use. And we've already gone over this really quite interesting definition of universal hash functions, and that's the proposed definition of a good random hash function. Moreover, in the previous video I showed you that there are simple such families of hash functions. They don't take much space to store, they don't take much time to evaluate. And the plan for this video is to prove formally that if you draw a hash function uniformly at random from a universal family of hash functions, like the one we saw in the previous video, then you're guaranteed expected constant time for all of the supported operations. So, here's the definition once more of a universal family of hash functions. We'll be using this definition, of course, when we prove that these hash functions give good performance. So, remember, we're talking now about a set of hash functions. These are all of the conceivable real time decisions you might make about which hash function to use. So, the universe is fixed, this is something like IP addresses, and the number of buckets is fixed. You know that's going to be something like 10,000, say. And what it means for a family to be universal is that the probability that any given pair of items collides is as good, meaning as small, as with the gold standard of completely perfect uniform random hashing. That is, for each pair x, y of distinct elements of the universe, so for example, for each distinct pair of IP addresses, the probability of a collision over the choice of the random hash function little h from the family script H is no more than one over n, where n is the number of buckets. So, if you have 10,000 buckets, the probability that any given pair of IP addresses winds up getting mapped to the same bucket is at most one in 10,000. Let me now spell out the precise guarantee we're going to prove if you use a randomly chosen hash function from a universal family. So, for this video, we're going to only talk about hash tables implemented with chaining, with one linked list per bucket.
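As a worked equation, the universality condition just described can be written as follows, where U is the universe, script H is the family, and n is the number of buckets:

```latex
% Universal family: for every pair of distinct keys, a randomly chosen h
% collides them with probability at most 1/n.
\forall\, x \neq y \in U:\qquad
\Pr_{h \in \mathcal{H}}\bigl[\,h(x) = h(y)\,\bigr] \;\le\; \frac{1}{n}.
```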
We'll be able to give a completely precise mathematical analysis with this collision resolution scheme. We're going to analyze the performance of this hash table in expectation over the choice of a hash function little h drawn uniformly from a universal family script H. So, for example, for the family that we constructed in the previous video, this just amounts to choosing each of the four coefficients uniformly at random. That's how you select a hash function uniformly at random from that family. So, this theorem, and also the definition of universal hash functions, dates back to a 1979 research paper by Carter and Wegman. The idea of hashing dates back quite a bit before that, certainly to the 50s. So, this just kind of shows us sometimes ideas have to be percolating for a while before you find the right way to explain what's going on. So, Carter and Wegman provided this very clean and modular way of thinking about the performance of hashing through this universal hashing definition. And the guarantee is exactly the one that I promised way back when we talked about what operations are supported by hash tables and what kind of performance you should expect: you should expect constant time performance. As always with hashing, there is some fine print, so let me just be precise about what the caveats of this guarantee are. So, first of all, necessarily, this guarantee is in expectation. So, it's on average over the choice of the hash function little h. But I want to reiterate that this guarantee does hold for an arbitrary data set. So, this guarantee is quite reminiscent of the one we had for the randomized QuickSort algorithm. In QuickSort, we made no assumptions about the data. It was a completely arbitrary input array, and the guarantee said, on average over the real time randomized decisions of the algorithm, no matter what the input is, the expected running time was n log n. Here we're saying, again, no assumptions about the data. It doesn't matter what you're storing in the hash table; in expectation over the real time random decision of what hash function you use, you should expect constant time performance, no matter what that data set is. So, the second caveat is something we've talked about before. Remember, the key to having good hash table performance: not only do you need a good hash function, which is what this universality property is providing us, but you also need to control the load of the hash table.
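Here is a minimal sketch of what "choosing the four coefficients uniformly at random" could look like in code. The transcript only says the family from the previous video has four random coefficients; the specific details assumed here (keys treated as four 8-bit parts of an IP address, an inner product taken modulo a prime number of buckets) are illustrative, not the lecture's exact construction.

```python
import random

N_BUCKETS = 10007  # assumed: a prime number of buckets, roughly 10,000


def draw_random_hash(n=N_BUCKETS):
    # The "real time decision": pick the four coefficients uniformly at random.
    a = [random.randrange(n) for _ in range(4)]

    def h(ip):
        # ip is a 4-part key, e.g. (173, 194, 44, 90) for an IP address.
        return sum(ai * xi for ai, xi in zip(a, ip)) % n

    return h


h = draw_random_hash()
print(h((173, 194, 44, 90)))  # some bucket index in [0, N_BUCKETS)
```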
So, of course, to get constant time performance, as we've discussed, a necessary condition is that you have enough buckets to hold more or less the stuff that you're storing. That is, the load alpha, the number of objects in the table divided by the number of buckets, should be constant for this theorem to hold. And finally, whenever you do a hash table operation, you have to in particular invoke the hash function on whatever key you're given. So, in order to have constant time performance, it had better be the case that it only takes constant time to evaluate your hash function, and that's, of course, something we also discussed in the previous video, when we emphasized the importance of having simple universal hash functions like those random linear combinations. In general, the mathematical analysis of hash table performance is a quite deep field, and there are some quite mathematically interesting results that are well outside the scope of this course. But what's cool is that for this theorem I will be able to provide you a full and rigorous proof. So, for hash tables with chaining, using randomly chosen universal hash functions, I'm going to now prove that you do get constant time performance. Right, so hash tables support various operations: Insert, Delete and Lookup. But really, if we can just bound the running time of an unsuccessful lookup, that's going to be enough to bound the running time of all of these operations. So, remember, in a hash table with chaining, you first hash to the appropriate bucket and then you do the appropriate Insert, Delete or Lookup in the linked list in that bucket. So, the worst case, as far as traversing through this linked list goes, is if you're looking for something but it's not there, because you have to look at every single element in the list, all the way to the end, before you can conclude that the lookup has failed. Of course, insertion, as we discussed, is always constant time; for deletion and successful searches, well, you might get lucky and stop early before you hit the end of the list. So, all we're going to bother to analyze is unsuccessful lookups, and that will carry over to all of the other operations. So, a little more precisely, let's let S be the data set. These are the objects that we are storing in our hash table. And remember that to get constant time lookup, it really needs to be the case that the load is constant. So, we are assuming that the size of S is big-O of the number of buckets n.
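In symbols, the two standing assumptions just described are constant load and constant-time hash evaluation:

```latex
% Standing assumptions for the theorem: constant load, and O(1) evaluation of h.
\alpha \;=\; \frac{|S|}{n} \;=\; O(1),
\qquad
\text{time to evaluate } h \;=\; O(1).
```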
And let's suppose we are searching for some other object which is not in S; call it x. And again, I want to emphasize we are making no assumptions about what this data set S is, other than that its size is comparable to the number of buckets. So, conceptually, doing a lookup in a hash table with chaining is a very simple thing. You just hash to the appropriate bucket and then you scan through the linked list in that bucket. So, conceptually, it's very easy to write down what the running time of this unsuccessful lookup is. It's got two parts. So, the first thing you do is you evaluate the hash function to figure out the right bucket. And again, remember we're assuming that we have a simple hash function and it takes constant time. Now, of course, the magic of hash functions is that, given this hash value, we can zip right to where the linked list is to search for x, using random access into our array of buckets. So, we go straight to the appropriate place in our array of buckets and we just search through the list, ultimately failing to find what we're looking for, x. Traversing a linked list, as you all know, takes time proportional to the length of the list. So, we find something that we discussed informally in the past, which is that the running time of hash table operations implemented with chaining is governed by the list lengths. So, that's really the key quantity we have to understand. So, let's give this a name, capital L; it's important enough to deserve a name. So, what I want you to appreciate at this point is that this key quantity, capital L, the length of the list in x's bucket, is a random variable. It is a function of which hash function little h we wind up selecting in real time. So, for some choices of our hash function little h, this list length might be as small as zero, but for other choices of this hash function h, this list length could be bigger. So, this is exactly analogous to QuickSort, where, depending on the real time decision of which random pivot element you use, you're going to get a different number of comparisons, a different running time. So, the list length, and hence the time for the unsuccessful search, depends on the hash function little h, which we're choosing at random. So, let's recall what it is we're trying to prove. We're trying to prove an upper bound on the running time of an unsuccessful lookup on average, where the on average is over the choice of the hash function little h.
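Here is a minimal sketch, not the lecture's reference code, of a hash table with chaining that makes the two parts of an unsuccessful lookup visible: one hash evaluation, then a scan of the list in the key's bucket. All names here are illustrative.

```python
class ChainedHashTable:
    def __init__(self, n_buckets, hash_fn):
        self.n = n_buckets
        self.h = hash_fn                      # assumed O(1) to evaluate
        self.buckets = [[] for _ in range(n_buckets)]

    def insert(self, key):
        # Insertion is always constant time: append to the bucket's list.
        self.buckets[self.h(key) % self.n].append(key)

    def lookup(self, key):
        # Part 1: evaluate the hash function -- O(1) by assumption.
        bucket = self.buckets[self.h(key) % self.n]
        # Part 2: scan the list in that bucket -- O(L), where L is the
        # (random) length of the list in key's bucket.
        for stored in bucket:
            if stored == key:
                return True
        return False  # unsuccessful lookup: we traversed the entire list


# Usage sketch: any constant-time hash function can be plugged in,
# e.g. Python's built-in hash for illustration.
table = ChainedHashTable(10007, hash)
table.insert("some key")
print(table.lookup("missing key"))  # False, after scanning that bucket's list
```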
We've expressed this lookup time in terms of the average list length in x's bucket, where again the average is over the random choice of h. Summarizing, we've reduced what we care about, the expected time for a lookup, to understanding the expected value of this random variable capital L, the list length in x's bucket. So, that's what we've got to do: we've got to compute the expected value of this random variable, capital L. Now, to do that, I want to jog your memory of a general technique for analyzing expectations which you haven't seen in a while. The last time we saw it, I believe, was when we were doing the analysis of randomized QuickSort and counting its comparisons. So, here's a general decomposition principle, which we're now going to use in exactly the same way as we did in QuickSort to analyze the performance of hashing with chaining. So, this is for when you want to understand the expectation of a random variable which is complicated, but which you can express as the sum of much simpler random variables, ideally 0-1 or indicator random variables. So, the first step is to figure out the random variable, capital Y as I'm calling it here, that you really care about. Now, we finished the last slide by completing step one. What we really care about is capital L, the list length in x's bucket. That governs the running time of an unsuccessful lookup, so clearly that's what we really care about. Step two of the decomposition principle is, well, you know, the random variable you care about is complicated, hard to analyze directly, so decompose it as a sum of 0-1 indicator random variables. That's what we're going to do at the beginning of the next slide. Why is it useful to decompose a complicated random variable into the sum of 0-1 random variables? Well, then you're in the wheelhouse of linearity of expectation. You get that the expected value of the random variable that you care about is just the sum of the expected values of the different indicator random variables, and those expectations are generally much easier to understand. And that will again be the case here, in this theorem about the performance of hash tables with chaining. So, let's apply this three-step decomposition principle to complete the proof of the Carter-Wegman theorem. So, for the record, let me just remind you about the random variable that we actually really care about; that's capital L.
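In symbols, the decomposition principle just described is simply linearity of expectation applied to a sum of indicator variables:

```latex
% Express the complicated random variable Y as a sum of 0-1 indicators
% X_1, ..., X_m, then apply linearity of expectation.
Y = \sum_{i=1}^{m} X_i
\quad\Longrightarrow\quad
\mathbb{E}[Y] = \sum_{i=1}^{m} \mathbb{E}[X_i].
```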
The reason that's a random variable is because it depends on the choice of the hash function little h. This number could vary between zero and something much, much bigger than zero, depending on which h you choose. So, this is complicated, hard to analyze directly, so let's try to express it as the sum of 0-1 random variables. So, here are the 0-1 random variables that are going to be the constituents of capital L. We're going to have one such variable for each object y in the data set. Now, remember, this is an unsuccessful search; x is not in the data set capital S. So, if y is in the data set, x and y are necessarily different. And we will define each random variable z sub y as follows. We'll define it as one if y collides with x under h, and zero otherwise. So, for a given zy, we have fixed objects x and y. So, x is some IP address, say, y is some distinct IP address, x is not in our hash table, y is in our hash table. And now, depending on which hash function we wind up using, these two distinct IP addresses may or may not get mapped to the same bucket of our hash table. So, this is an indicator random variable, just indicating whether or not they collide, whether or not we unluckily choose a hash function little h that sends these distinct IP addresses x and y to exactly the same bucket. Okay, so that's zy; clearly, by definition, it's a 0-1 random variable. Now, here's what's cool about these random variables: capital L, the list length that we care about, decomposes precisely into the sum of the zy's. That is, we can write capital L as being equal to the sum, over the objects y in the hash table, of zy. So, if you think about it, this equation is always true, no matter what the hash function h is. That is, if we choose some hash function that maps this IP address x to, say, bucket number seventeen, capital L is just counting how many other objects in the hash table wind up getting mapped to bucket number seventeen. So, maybe ten different objects got mapped to bucket number seventeen. Those are exactly the ten different values of y that will have their zy equal to 1, right? So, L is just counting the number of objects in the data set S that collide with x. A given zy is just counting whether or not a given object y in the hash table is colliding with x.
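In symbols, the indicator variables and the decomposition of L just described are:

```latex
% Z_y indicates whether the stored object y collides with the lookup key x,
% and L is the sum of these indicators over the data set S.
Z_y =
\begin{cases}
1 & \text{if } h(y) = h(x),\\
0 & \text{otherwise,}
\end{cases}
\qquad
L \;=\; \sum_{y \in S} Z_y .
```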
So, summing over all the possible things that could collide with x, summing over the zy's, we of course get the total number of things that collide with x, which is exactly equal to the population of x's bucket in the hash table. Alright, so we've got all of our ducks lined up in a row. Now, if we just remember all of the things we have going for us, we can just follow our nose and nail the proof of this theorem. So, what is it that we have going for us? Well, in addition to this decomposition of the list length into indicator random variables, we've got linearity of expectation going for us, and we've got the fact that our hash function is drawn from a universal family going for us. And we've got the fact that we've chosen the number of buckets n to be comparable to the data set size. So, we want to use all of those assumptions and finish the proof that the expected performance is constant. So, we're going to have a few inequalities, and we're going to begin with the thing that we really care about. We care about the average list length in x's bucket. Remember, we saw in the previous slide, this is what governs the expected performance of the lookup. If we can prove that the expected value of capital L is constant, we're done; we've finished the theorem. So, the whole point of this decomposition principle is to apply linearity of expectation, which says that the expected value of a sum of random variables equals the sum of the expected values. So, because L can be expressed as the sum of these zy's, we can reverse the summation and the expectation, and we can first sum over the y's, over the objects in the hash table, and then take the expected value of zy. Now, something which came up in our QuickSort analysis, but which you might have forgotten, is that 0-1 random variables have particularly simple expectations. So, the next quiz is just going to jog your memory about why 0-1 random variables are so nice in this context. Okay, so the answer to this quiz is the third one: the expected value of zy is simply the probability that x and y collide. That just follows from the definition of the random variable zy and the definition of expectation. Namely, recall how we define zy: it's just one if this object y in the hash table happens to collide, under the hash function h, with the object x that we are looking for, and zero otherwise. Again, this will be one for some hash functions and zero for other hash functions.
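In symbols, the quiz answer is just the standard fact about expectations of 0-1 indicator variables:

```latex
% Expectation of an indicator variable equals the probability of the event
% it indicates -- here, a collision between x and y under the random h.
\mathbb{E}[Z_y]
  \;=\; 0 \cdot \Pr[Z_y = 0] \;+\; 1 \cdot \Pr[Z_y = 1]
  \;=\; \Pr[Z_y = 1]
  \;=\; \Pr[h(x) = h(y)].
```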
And then we just have to compute expectations. So, one way to compute the expected value of a 0-1 random variable is, you just say, well, you know, there are the cases where the random variable evaluates to zero, and then there are the cases where the random variable evaluates to one, and of course we can cancel the zero. So, this just equals the probability that zy equals one. And since zy being one is exactly the same thing as x and y colliding, that's what gives us the answer. Okay, so it's the probability that x collides with y. So, let's plug that into our derivation. Now, we again have the sum, over all the objects y in our hash table, of the expected value of zy, and let's write that in the more interpretable form: the probability that this particular object y in the hash table collides with the thing we are looking for, x. Now, we know something pretty cool about the probability that a given pair of distinct elements like x and y collide with each other. What is it that we know? Okay, so I hope you answered the second response to this quiz. This is really, in some sense, the key point of the analysis. This is the role that being a universal family of hash functions plays in this performance guarantee. What does it mean to be universal? It means for any pair of distinct objects, like the x and y in this proof, if you make a random choice of a hash function, the probability of collision is as good as with perfectly random hashing, namely at most 1/n, where n is the number of buckets. So, now I can return to the derivation. What that quiz reminds you is that the definition of a universal family of hash functions guarantees that this probability, for each y, is at most 1/n, where n is the number of buckets in the hash table. So, let's just rewrite that. So, we've upper bounded the expected list length by a sum, over the objects in the data set, of 1/n. And this, of course, is just equal to the number of objects in the data set, the size of S, divided by n. And what is this? This is simply the load; this is the definition of the load alpha, which we are assuming is constant. Remember, that was one of the caveats in the theorem.
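Putting the whole derivation together in one chain of equalities and inequalities:

```latex
% Linearity of expectation, then universality (each term at most 1/n),
% then the constant-load assumption.
\mathbb{E}[L]
  \;=\; \sum_{y \in S} \mathbb{E}[Z_y]
  \;=\; \sum_{y \in S} \Pr[h(x) = h(y)]
  \;\le\; \sum_{y \in S} \frac{1}{n}
  \;=\; \frac{|S|}{n}
  \;=\; \alpha
  \;=\; O(1).
```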
So, that's why, as long as you have a hash function which you can compute quickly, in constant time, and as long as you keep the load under control so that the number of buckets is commensurate with the size of the data set that you're storing, universal hash functions in a hash table with chaining guarantee expected constant time performance.