1
00:00:00,438 --> 00:00:06,276
Let's begin by building up some intuition
about what we would want from a hash

2
00:00:06,276 --> 00:00:10,488
function, now that we know how hash
functions are usually implemented. So

3
00:00:10,488 --> 00:00:13,789
let's start with a hash function which is
implemented by chaining. So what's going

4
00:00:13,789 --> 00:00:18,370
to be the running time of our lookup,
insert, and delete operations in a hash

5
00:00:18,370 --> 00:00:23,224
table with chaining? Well, the happy
operation in a hash table with chaining is

6
00:00:23,224 --> 00:00:27,195
insertion. Insertion, we can just say
without any qualifications, is constant

7
00:00:27,195 --> 00:00:31,778
time. This requires the sort of obvious
optimization that when you do insert, you

8
00:00:31,778 --> 00:00:36,072
insert something at the front of the list
in its bucket. Like, there's no reason to

9
00:00:36,072 --> 00:00:40,698
insert at the end. That would be silly. So
the plot thickens when we think about the

10
00:00:40,698 --> 00:00:44,077
other two operations, deletion and the
lookup. So let's just think about lookup.

11
00:00:44,077 --> 00:00:47,945
Deletion's basically the same. So how do
we implement lookup? Well, remember when

12
00:00:47,945 --> 00:00:52,765
we get some key x, we invoke the hash
function. We call h(x). That tells us

13
00:00:52,765 --> 00:00:57,620
what bucket to look in. So if it tells us
17, we know that, you know, x may

14
00:00:57,620 --> 00:01:00,287
or may not be in the hash table. But at
this point we know that if it's in the

15
00:01:00,287 --> 00:01:03,659
hash table, it's got to be in the linked
list that's in the seventeenth bucket. So

16
00:01:03,659 --> 00:01:08,043
now we descend into this bucket. We find
ourselves a linked list. And now, we have

17
00:01:08,043 --> 00:01:11,681
to resort to just an exhaustive search
through this list in the seventeenth

18
00:01:11,681 --> 00:01:16,308
bucket, to see whether or not X is there.
So we know how long it takes to search a

19
00:01:16,308 --> 00:01:21,338
regular list for some element. It's just
linear and the list length. And now we're

20
00:01:21,338 --> 00:01:26,679
starting to see why the hash function
might matter. Right, so suppose we insert

21
00:01:26,679 --> 00:01:32,352
100 objects into a hash table with 100
buckets. If we have a super lucky hash

22
00:01:32,352 --> 00:01:37,330
function, then perhaps each bucket will
get its own object. There'll be one object

23
00:01:37,330 --> 00:01:42,111
in each of the lists, in each of the
buckets, so. Theta of the list length

24
00:01:42,111 --> 00:01:45,926
is just theta of one. We're doing great. Okay?
So, a constant, constant link to lists,

25
00:01:45,926 --> 00:01:50,919
means constant time insert delete. A
really stupid hash function would map

26
00:01:50,919 --> 00:01:55,828
every single object to bucket number
zero. Okay, so then if you insert 100

27
00:01:55,828 --> 00:02:00,182
objects, they're all in bucket number
zero. The other 99 buckets are empty. And

28
00:02:00,182 --> 00:02:04,154
so every time you do insert or delete,
it's just resorting to the naive linked list

29
00:02:04,154 --> 00:02:07,116
solution. And the running time is
going to be linear and the number of

30
00:02:07,116 --> 00:02:12,149
objects in the hash table. So the largest
list length could vary anywhere from m/n,

31
00:02:12,149 --> 00:02:16,822
where m is the number of objects,
this is when you have equal linked lists,

32
00:02:16,822 --> 00:02:22,208
to if you use this ridiculous constant
hash function, m, all the objects in the

33
00:02:22,208 --> 00:02:26,631
same list. And so the main point I'm
trying to make here, is that you know,

34
00:02:26,631 --> 00:02:30,066
first of all, at least with chaining,
where the running time is governed by the

35
00:02:30,066 --> 00:02:34,986
list length, the running time depends
crucially on the hash function. How well

36
00:02:34,986 --> 00:02:39,981
the hash function distributes the data
across the different buckets. And

37
00:02:39,981 --> 00:02:43,485
something analogous is true for hash
tables that use open addressing. Alright

38
00:02:43,485 --> 00:02:46,515
so here there aren't any lists. So you
don't, there's no linked lists to keep

39
00:02:46,515 --> 00:02:49,228
track of. So the running time is going to
be governed by the length of the probe

40
00:02:49,228 --> 00:02:52,165
sequence. So the question is how many
times do you have to look at different

41
00:02:52,165 --> 00:02:55,410
buckets in the hash table before you
either find the thing you're looking for,

42
00:02:55,410 --> 00:02:58,638
or if you're doing insertion, before you
find an empty bucket in which to insert

43
00:02:58,638 --> 00:03:02,179
this new object. So the performance is
governed by the length of the probe

44
00:03:02,179 --> 00:03:06,827
sequence. And again, the probe sequence is
going to depend on the hash function. For

45
00:03:06,827 --> 00:03:10,809
really good hash functions in some sense,
stuff that spreads data out evenly, you

46
00:03:10,809 --> 00:03:14,987
expect probe sequences to be not too bad.
At least intuitive, And for say the

47
00:03:14,987 --> 00:03:19,087
constant function you are going to expect
these probe sequences to grow linearly

48
00:03:19,087 --> 00:03:22,582
with the numbers of object you insert into
the table. So again this point remains

49
00:03:22,582 --> 00:03:27,195
true, the performance of a hash table in
either implementation really depends on

50
00:03:27,195 --> 00:03:32,915
what hash function you use. So, having
built up this intuition, we can now say

51
00:03:32,915 --> 00:03:37,324
what it is what we want from a hash
function. So first we want it to lead to

52
00:03:37,324 --> 00:03:42,831
good performance. I'm using the chaining
implementations as a guiding example. We

53
00:03:42,831 --> 00:03:47,000
see that if we have a size of a hash
table, n, that's comparable to the number

54
00:03:47,000 --> 00:03:51,245
of objects, m, it would be really cool if
all of the lists had a length that was

55
00:03:51,245 --> 00:03:55,641
basically constant; therefore we had our
constant time operations. So equal length

56
00:03:55,641 --> 00:03:59,837
lists is way better than unequal length
lists in a hash table with chaining. So

57
00:03:59,837 --> 00:04:04,382
we, we want the hash function to do is to
spread the data out as equally as possible

58
00:04:04,382 --> 00:04:07,848
amongst the different buckets. And
something similar is true with open

59
00:04:07,848 --> 00:04:11,335
addressing; in some sense you want hash
functions to spread the data uniformly

60
00:04:11,335 --> 00:04:16,438
across the possible places you might
probe, as much as possible. And in

61
00:04:16,438 --> 00:04:21,564
hashing, usually the gold standard for
spreading data out is the performance of a

62
00:04:21,564 --> 00:04:26,131
completely random function. So you can
imagine for each object that shows up you

63
00:04:26,131 --> 00:04:30,060
flip some coins. With each of the n
buckets equally likely, you put this

64
00:04:30,060 --> 00:04:33,495
object in one of the n buckets. And you
flip independent coins for every different

65
00:04:33,495 --> 00:04:36,620
object. So this, you would expect, you
know, because you're just throwing darts

66
00:04:36,620 --> 00:04:39,991
at the buckets independently, you'd expect
this to spread the data out quite well.

67
00:04:39,991 --> 00:04:47,355
But, of course, it's not good enough just
to spread data out. It's also important

68
00:04:47,355 --> 00:04:49,857
that we don't have to work too hard to
remember what our hash function is and to

69
00:04:49,857 --> 00:04:54,785
evaluate it. Remember, every time we do
any of these operations, an insert or a

70
00:04:54,785 --> 00:04:59,915
delete or a lookup, we're going to be
applying our hash function to some key x.

71
00:04:59,915 --> 00:05:04,252
So every operation includes a hash
function evaluation. So if we want our

72
00:05:04,252 --> 00:05:08,372
operations to run a constant time,
evaluating the hash function also better

73
00:05:08,372 --> 00:05:13,394
run in constant time. And this second
property is why we can't actually

74
00:05:13,394 --> 00:05:17,750
implement completely random hashing. So
there's no way we can actually adjust when

75
00:05:17,750 --> 00:05:21,678
say we wanted to insert Alice's phone
number, flip a new batch of random coins.

76
00:05:21,678 --> 00:05:25,010
Suppose we did. Suppose we flipped some
random coins and it tells us to put

77
00:05:25,010 --> 00:05:29,148
Alice's phone number into the 39th
bucket, while. Later on, we might do a

78
00:05:29,148 --> 00:05:32,430
lookup for Alice's phone number, and we
better remember the fact that we're

79
00:05:32,430 --> 00:05:35,688
supposed to look in the 39th bucket for
Alice's phone number. But what does that

80
00:05:35,688 --> 00:05:38,998
mean? That means we have to explicitly
remember what choice we made. We have to

81
00:05:38,998 --> 00:05:44,107
write down. You get a list of effects that
Alice is in bucket number 39. In every

82
00:05:44,107 --> 00:05:47,921
single insertion, if they're all from in
the point of coin flips, you have to

83
00:05:47,921 --> 00:05:50,672
remember all of the different random
choices independently. And this really

84
00:05:50,672 --> 00:05:55,172
just devolves back to the naive list base
solution that we discussed before. So,

85
00:05:55,172 --> 00:05:59,050
evaluating the hash function is gonna take
us linear time and that defeats the

86
00:05:59,050 --> 00:06:03,422
purpose of a hash table. So we again we
want the best of both worlds. We want a

87
00:06:03,422 --> 00:06:08,042
hash function which we can store in
ideally constant space, evaluate in

88
00:06:08,042 --> 00:06:13,525
constant time, but it should spread the
data out just as well as if we had this

89
00:06:13,525 --> 00:06:18,115
gold standard of completely random
hashing. So I want to touch briefly on the

90
00:06:18,115 --> 00:06:21,691
topic of how you might design hash
functions. And in particular good hash

91
00:06:21,691 --> 00:06:26,600
functions that have the two properties we
identified on the previous slide. But I

92
00:06:26,600 --> 00:06:29,694
have to warn you, if you ask ten
different, you know, serious hardcore

93
00:06:29,694 --> 00:06:32,443
programmers, you know, about their
approach to designing hash functions,

94
00:06:32,443 --> 00:06:36,252
you're likely to get ten somewhat
different answers. So the design of hash

95
00:06:36,252 --> 00:06:42,199
functions is a tricky topic, and, it's as
much art as science at this point. Despite

96
00:06:42,199 --> 00:06:44,888
the fact that there's a ton of science,
there's actually a very beautiful theory,

97
00:06:44,888 --> 00:06:48,521
about what makes good hash functions. We'll
touch on a little bit of that in a, in a

98
00:06:48,521 --> 00:06:54,709
different optional video. And if you only
remember one thing of, you know, from this

99
00:06:54,709 --> 00:06:58,872
video or from these next couple slides,
the thing to remember is the following.

100
00:06:58,872 --> 00:07:05,712
Remember that it's really easy to
inadvertently design bad hash functions

101
00:07:05,712 --> 00:07:11,086
and bad hash functions lead to poor hash
table performance. Much poorer than you

102
00:07:11,086 --> 00:07:14,582
would expect given the other discussion
we've had in this video. So if you have to

103
00:07:14,582 --> 00:07:18,734
design your own hash function, do your
homework, get some examples, learn what

104
00:07:18,734 --> 00:07:22,940
other experts are doing and use your best
judgment. If you do just something without

105
00:07:22,940 --> 00:07:26,163
thinking about it, it's quite possible to
lead to quite poor performance, much

106
00:07:26,163 --> 00:07:30,483
poorer than you were expecting. So to
drive home this point, suppose that you're

107
00:07:30,483 --> 00:07:35,268
thinking about keys being phone numbers.
So let's say, you know, I'm gonna just be

108
00:07:35,268 --> 00:07:39,766
very kinda United States centric here. I'm
just gonna focus on the, the ten digit

109
00:07:39,766 --> 00:07:43,799
phone numbers inside the US. So the
universe size here is ten to the ten,

110
00:07:43,799 --> 00:07:47,467
which is quite big. That's probably not
something you really want to implement

111
00:07:47,467 --> 00:07:50,697
explicitly and let's consider an
application where, you know, you're only

112
00:07:50,697 --> 00:07:54,735
say, keeping track of at most, you know,
100 or 1,000 phone numbers or something

113
00:07:54,735 --> 00:07:58,412
like that. So we need to choose a number
of buckets, let's say we choose 1,000

114
00:07:58,412 --> 00:08:02,616
buckets. Let's say we're expecting no more
than 500 phone numbers or so. So we double

115
00:08:02,616 --> 00:08:07,219
that, we get a number of buckets to be
equal 1,000. And now I got to come up with

116
00:08:07,219 --> 00:08:10,528
a hash function. And remember, you know, a
hash functions by definition. All it does

117
00:08:10,528 --> 00:08:13,983
is map anything in the universe to a
bucket number. So that means it has to

118
00:08:13,983 --> 00:08:18,763
take as input a ten digit phone number and
spit as output some number between zero

119
00:08:18,763 --> 00:08:22,795
and 999. And, beyond that we have
flexibility of how to define this mapping.

120
00:08:22,795 --> 00:08:26,448
Now, when you are dealing with things that
have all these digits it's very tempting

121
00:08:26,448 --> 00:08:31,822
to just project on to a subset of the
digits. And, if you want a really terrible

122
00:08:31,822 --> 00:08:37,261
hash function, just use the most
significant digits of a phone number to

123
00:08:37,261 --> 00:08:41,839
define a mapping from phone numbers to
buckets. Alright, so I hope it's clear why

124
00:08:41,839 --> 00:08:46,343
this is a terrible choice of a hash
function. Alright, so maybe you're a

125
00:08:46,343 --> 00:08:51,177
company based in the San Francisco Bay
area. The area code for San Francisco is

126
00:08:51,177 --> 00:08:57,254
415. So if you're storing phone numbers
from customers in your area. You know

127
00:08:57,254 --> 00:09:02,717
maybe twenty, 30, 40 percent of them are
gonna have area codes 415. All of those

128
00:09:02,717 --> 00:09:08,040
are going to hash to exactly the same
bucket, bucket number 415 in this hash

129
00:09:08,040 --> 00:09:12,662
table. So you're gonna get an overwhelming
percentage of the data mapping just to

130
00:09:12,662 --> 00:09:17,644
this one bucket. Meanwhile you know not
all 1000 possibilities of, of these three

131
00:09:17,644 --> 00:09:22,163
digits are even legitimate area codes. Not
all three digit numbers are area codes in

132
00:09:22,163 --> 00:09:25,805
the United States. So there'll be buckets
of your hash table which are totally

133
00:09:25,805 --> 00:09:31,937
guaranteed to be empty. So you waste a ton
of space in your hash table, you have a

134
00:09:31,937 --> 00:09:35,550
huge list in the bucket corresponding to
415, you have a huge list in the

135
00:09:35,550 --> 00:09:39,704
bucket corresponding to 650, the area code
at Stanford. You're gonna have a very slow

136
00:09:39,704 --> 00:09:43,668
look up time for everything that hashes to
those two buckets and there's gonna be a

137
00:09:43,668 --> 00:09:49,156
lot of stuff that hashes to those two
buckets, So terrible idea. So a better but

138
00:09:49,156 --> 00:09:54,066
still mediocre hash function would be to
do the same trick but using the last three

139
00:09:54,066 --> 00:09:58,450
digits instead of the first three digits.
This is better than our terrible hash

140
00:09:58,450 --> 00:10:03,056
function because there aren't ridiculous
clumps of phone numbers that have exactly

141
00:10:03,056 --> 00:10:07,587
the same last three digits. But still,
this is sort of assuming you're using this

142
00:10:07,587 --> 00:10:11,640
hash function as tantamount to thinking
that the last three digits of phone

143
00:10:11,640 --> 00:10:16,190
numbers are uniformly distributed among
all of the 1,000 possibilities. And really

144
00:10:16,190 --> 00:10:19,128
there's no evidence if that's true. Okay?
And so there's going to be patterns and

145
00:10:19,128 --> 00:10:23,506
phone numbers that are maybe a little
subtle to see with the naked eye, but

146
00:10:23,506 --> 00:10:26,667
which will be exposed if you try to use a
mediocre hash function like this. So let's

147
00:10:26,667 --> 00:10:30,727
look at another example. Perhaps you are
keeping track of objects just based by

148
00:10:30,727 --> 00:10:35,175
where they are laid out in memory. So in
other words the key for an object is just

149
00:10:35,175 --> 00:10:40,194
gonna be its memory location and if these
things are in bytes, then you are

150
00:10:40,194 --> 00:10:43,289
guaranteed that every memory location will
be a multiple of four. So for a second

151
00:10:43,289 --> 00:10:49,492
example let's think about a universe where
the possible keys are the possible memory

152
00:10:49,492 --> 00:10:53,754
locations, So here you're just associating
objects with where they're laid in memory,

153
00:10:53,754 --> 00:10:57,984
and a hash function is responsible for
taking in as input some memory location of

154
00:10:57,984 --> 00:11:02,730
some object and spitting out some bucket
number. Now generally, because of, you

155
00:11:02,730 --> 00:11:06,957
know the structure of bytes and so on, our
memory locations are going to be multiples

156
00:11:06,957 --> 00:11:13,482
of some power of two. In particular,
memory locations are going to be even, And

157
00:11:13,482 --> 00:11:17,715
so a bad choice of a hash function. Would
be to take, remember, the hash function

158
00:11:17,715 --> 00:11:20,743
takes the input of the memory location,
which is, you know, some possibly really

159
00:11:20,743 --> 00:11:24,373
big number, and we wanna compress it, we
want to output a bucket number. Now let's

160
00:11:24,373 --> 00:11:29,050
think of a hash table where we choose N
equals 10^3, or 1000 buckets.

161
00:11:29,050 --> 00:11:32,481
So then the question is, you know, how is
this hash function going to take this big

162
00:11:32,481 --> 00:11:37,121
number, which is the memory location, and
squeeze it down to a small number. Which

163
00:11:37,121 --> 00:11:40,656
is one of the buckets and so let's just
use the same idea as in the mediocre hash

164
00:11:40,656 --> 00:11:44,333
function, which is we're gonna look at the
least significant bits so we can express

165
00:11:44,333 --> 00:11:49,519
that using the mod operator. So let's just
think about we pick the hash function h(x)

166
00:11:49,519 --> 00:11:57,652
where h is the memory location to be x mod 1000
There again, you know, the

167
00:11:57,652 --> 00:12:01,040
meaning of 1,000 is that's the number of
buckets we've chosen to put in our hash

168
00:12:01,040 --> 00:12:04,470
table because, you know, we're gonna
remember roughly at most 500 different

169
00:12:04,470 --> 00:12:08,197
objects. So don't forget what the mod
operation means, this means you just,

170
00:12:08,197 --> 00:12:13,142
essentially subtract multiples of 1,000
until you get down to a number less than

171
00:12:13,142 --> 00:12:18,636
1,000. So in this case, it means if you
write out x base ten, then you just take

172
00:12:18,636 --> 00:12:22,891
the last three digits. So in that sense,
this is the same hash function as our

173
00:12:22,891 --> 00:12:27,075
mediocre hash function when we were
talking about phone numbers. So we

174
00:12:27,075 --> 00:12:29,940
discussed how the keys here are all going
to be memory locations; in particular

175
00:12:29,940 --> 00:12:34,344
they'll be even numbers. And here we're
taking their modulus with respect to an

176
00:12:34,344 --> 00:12:39,658
even number. And what does that mean? That
means every single output of this hash

177
00:12:39,658 --> 00:12:43,010
function will itself be an even number.
Right, you take an even number, you

178
00:12:43,010 --> 00:12:46,075
subtract a bunch of multiples of a 1000,
you're still going to have an even number.

179
00:12:46,075 --> 00:12:53,490
So this hash function is incapable of
outputting an odd number. So what does

180
00:12:53,490 --> 00:12:58,499
that mean? That means at least half of the
locations in the hash table will be

181
00:12:58,499 --> 00:13:02,603
completely empty, guaranteed, no matter
what the keys you're hashing is. And

182
00:13:02,603 --> 00:13:05,851
that's ridiculous. It's ridiculous to have
this hash table 50 percent of which is

183
00:13:05,851 --> 00:13:10,210
guaranteed to be empty. And again, what I
want you to remember, hopefully long after

184
00:13:10,210 --> 00:13:14,547
this class is completed is not so much
these specific examples, but more the

185
00:13:14,547 --> 00:13:18,757
general point that I'm making. Which is,
it's really easy to design bad hash

186
00:13:18,757 --> 00:13:22,839
functions. And bad hash functions lead to
hash table performance much poorer than

187
00:13:22,839 --> 00:13:27,719
what you're probably counting on. Now that
we're equipped with examples of bad hash

188
00:13:27,719 --> 00:13:31,478
functions. It's natural to ask about, you
know, what are some good hash functions?

189
00:13:31,478 --> 00:13:35,464
Well it's actually quite tricky to answer
that question. What are the good hash

190
00:13:35,464 --> 00:13:38,848
functions, and I'm not really going to
answer that on this slide. I don't promise

191
00:13:38,848 --> 00:13:42,367
about hash functions that I'm going to
tell you about right now, are good in a

192
00:13:42,367 --> 00:13:45,657
very strong sense of the word. I will say
these are not obliviously bad hash

193
00:13:45,657 --> 00:13:50,076
functions, they're let's say, somewhat
better hash functions. And in particular

194
00:13:50,076 --> 00:13:53,874
if you just need a hash function, and you
need a quick and dirty one, you don't want

195
00:13:53,874 --> 00:13:57,078
to spend too much time on it. The method
that I'm going to talk about on this slide

196
00:13:57,078 --> 00:14:01,333
is a common way of doing it. On the other
hand, if you're designing a hash function

197
00:14:01,333 --> 00:14:05,570
for some really mission-critical code, you
should learn more than what I'm gonna tell

198
00:14:05,570 --> 00:14:08,370
you about on this slide. So you, you
should do more research about what are the

199
00:14:08,370 --> 00:14:11,930
best hash functions, what's the state of
the art, if you have a super important

200
00:14:11,930 --> 00:14:15,716
hash function. But if you just need a
quick one what's, what we say on this

201
00:14:15,716 --> 00:14:20,737
slide will do in many, in most situations.
So the design of a hash function can be

202
00:14:20,737 --> 00:14:25,840
thought of as two separate parts. So
remember by definition a hash function

203
00:14:25,840 --> 00:14:29,724
takes as input something from the
universe. An IP address, a name, whatever

204
00:14:29,724 --> 00:14:34,610
and spits out a bucket number. But, it can
be useful to factor that into two separate

205
00:14:34,610 --> 00:14:39,645
parts. So first you take an object which
is not intrinsically numeric. So,

206
00:14:39,645 --> 00:14:44,055
something like s string or something more
abstract. And you somehow turn an object

207
00:14:44,055 --> 00:14:51,860
into a number, possibly a very big number. And then you take a possibly big number and you map it to a much smaller number,

208
00:14:51,860 --> 00:14:56,124
namely the index of a bucket. So in some
cases I've seen these two steps given the

209
00:14:56,124 --> 00:15:01,673
names like the first step is formulating
the hash code For an object, and then the

210
00:15:01,673 --> 00:15:06,826
second step is applying a compression
function. In some cases, you can skip the

211
00:15:06,826 --> 00:15:09,923
first step. So, for example, if your keys
are social security numbers, they're

212
00:15:09,923 --> 00:15:14,152
already integers. If they're phone number,
they're already integers. Of course, there

213
00:15:14,152 --> 00:15:17,763
are applications where the objects are not
numeric. You know, for example, maybe

214
00:15:17,763 --> 00:15:20,973
they're strings, maybe you're remembering
names. And so then, the production of this

215
00:15:20,973 --> 00:15:24,285
hash code basically boils down to writing
a subroutine that takes, as input, a

216
00:15:24,285 --> 00:15:29,386
string, and outputs some possibly very big
number. There are standard methods for

217
00:15:29,386 --> 00:15:33,210
doing that, it's easy to find resources
to, to give you example code for

218
00:15:33,210 --> 00:15:37,326
converting strings to integers you know,
I'll just say one or two sentences about

219
00:15:37,326 --> 00:15:41,307
it. So you know each character in a string
it is easy to regard as a number in

220
00:15:41,307 --> 00:15:45,551
various ways. Either you know just say it
is ASCII, well ASCII code then you just

221
00:15:45,551 --> 00:15:48,494
have to aggregate all of the different
numbers, one number per character into

222
00:15:48,494 --> 00:15:52,078
some overall number and so one thing you
can do is you can iterate over the

223
00:15:52,078 --> 00:15:56,334
characters one at a time. You can keep a
running sum. And with each character, you

224
00:15:56,334 --> 00:16:00,447
can multiply the running sum by some
constant, and then add the new letter to

225
00:16:00,447 --> 00:16:04,707
it, and then, if you need to, take a
module list to prevent overflow. And the

226
00:16:04,707 --> 00:16:08,144
point of me giving you this one to two
sentence of the subroutine is just to give

227
00:16:08,144 --> 00:16:11,563
you a flavor of what they're like, and to
make sure th at you're just not scared of

228
00:16:11,563 --> 00:16:14,906
it at all. Okay? So it's very simple
programs you can write for doing things

229
00:16:14,906 --> 00:16:18,510
like converting from strings to integers.
But again, you know, I do encourage you to

230
00:16:18,510 --> 00:16:21,665
look it up on the Web or in a programming
textbook, to actually look at those

231
00:16:21,665 --> 00:16:24,914
examples. Okay? But there are standard
methods for doing it. So that leaves the

232
00:16:24,914 --> 00:16:26,691
quest, question of how to design this
compression function. So you take as input

233
00:16:26,691 --> 00:16:29,709
this huge integer. Maybe your keys are
already numeric, like Social Security

234
00:16:29,709 --> 00:16:33,376
numbers or IP addresses. Or maybe you've
already some subroutine to convert a

235
00:16:33,376 --> 00:16:37,004
string, like your friend's name, into.
Some big number, but the point is you have

236
00:16:37,004 --> 00:16:40,246
a number in the millions or the billions,
and you need to somehow take that and

237
00:16:40,246 --> 00:16:43,517
output one of these buckets. And again
think of there being maybe a thousand or

238
00:16:43,517 --> 00:16:47,772
so buckets. So the easiest way to do that
is something we already saw in a previous

239
00:16:47,772 --> 00:16:54,106
slide, which is just to take the modulus,
With respect to the number of buckets. Now

240
00:16:54,106 --> 00:16:57,121
certainly one positive thing you can say
about this compression function is its

241
00:16:57,121 --> 00:17:01,050
super simple, Both in the sense that it's
simple to code and in the sense that it's

242
00:17:01,050 --> 00:17:04,117
simple to evaluate. Now remember that was
our second goal for a hash function. It

243
00:17:04,117 --> 00:17:08,545
should be simple to store, it is actually
nothing to store. And it should be quick

244
00:17:08,545 --> 00:17:13,151
to evaluate. And this certainly fits the
bill. Now the problem is, remember the

245
00:17:13,151 --> 00:17:16,995
first. Property of a hash function that we
wanted is that it should spread the data

246
00:17:16,995 --> 00:17:19,833
out equally. And what we saw in the
previous slide is that at least if you

247
00:17:19,833 --> 00:17:24,598
choose the number of buckets N poorly,
then you can fail to have the first

248
00:17:24,598 --> 00:17:28,680
property. And in that respect you can fail
to be a good hash function. So if for

249
00:17:28,680 --> 00:17:32,331
example if N is even and all of your
objects are even, then it's a disaster,

250
00:17:32,331 --> 00:17:37,765
all of the odd buckets go completely
empty. And honestly, you know, this is a

251
00:17:37,765 --> 00:17:40,821
pretty simplistic method. Like I said,
this is a quick and dirty hash function.

252
00:17:40,821 --> 00:17:44,999
So, no matter how you choose the number of
buckets N, it's not gonna be a perfect

253
00:17:44,999 --> 00:17:48,744
hash function in any sense of the word.
That said, there are some rules of thumb

254
00:17:48,744 --> 00:17:52,991
for how to pick the number of buckets, how
to pick this modulus, so that you don't

255
00:17:52,991 --> 00:17:56,583
get the problems that we saw on the
previous slide. So, I'll conclude this

256
00:17:56,583 --> 00:17:59,928
video just with some standard rules of
thumb. You know, if you just need a quick

257
00:17:59,928 --> 00:18:03,434
and dirty hash function, you're gonna use
the, the modulus compression function, how

258
00:18:03,434 --> 00:18:08,767
do you choose N? Well, the first thing is
we definitely don't want to have the

259
00:18:08,767 --> 00:18:11,395
problem we had in the last slide, where
we're guaranteed to have these empty

260
00:18:11,395 --> 00:18:14,918
buckets no matter what the data is. So
what went wrong in the previous slide?

261
00:18:14,918 --> 00:18:21,992
Well. The problem is that all of the data
elements were divisible by two. And the

262
00:18:21,992 --> 00:18:26,838
hash function modulus, the number of
buckets, was also divisible by two. So

263
00:18:26,838 --> 00:18:32,338
because they shared a common factor,
namely two, that guaranteed that all of

264
00:18:32,338 --> 00:18:36,915
the odd buckets remained empty. And this
is a problem, more generally, if the data

265
00:18:36,915 --> 00:18:39,858
shares any common factors with N, the
number of buckets. So, in other words, if

266
00:18:39,858 --> 00:18:43,058
all of your data elements are multiples of
three, and the number of buckets is also a

267
00:18:43,058 --> 00:18:47,224
multiple of three, you got a big problem.
Then everything's gonna hash into bucket

268
00:18:47,224 --> 00:18:50,656
numbers which are multiples of three, too,
that's if your hash table will go

269
00:18:50,656 --> 00:18:56,337
unfilled. So the upshot is, you really
want the number of buckets to not share

270
00:18:56,337 --> 00:19:02,287
any factors With the data that you're
hashing. So, how do you reduce the number

271
00:19:02,287 --> 00:19:05,287
of common factors? Well, you just make
sure the number of buckets has very few

272
00:19:05,287 --> 00:19:10,692
factors, which means you should choose N
to be a prime number, 'kay? A number that

273
00:19:10,692 --> 00:19:15,794
has no nontrivial factors, And let me
remind you, the number of buckets should

274
00:19:15,794 --> 00:19:19,247
also be comparable to the size of the set
that you're planning on storing. Again, at

275
00:19:19,247 --> 00:19:23,456
no point did we need "N" to be, you know,
very closely connected to the number of

276
00:19:23,456 --> 00:19:26,782
elements that you're storing just within,
say some small constant factor. And you

277
00:19:26,782 --> 00:19:30,337
can always find a prime within a small
constant factor of a target number of

278
00:19:30,337 --> 00:19:34,974
elements to store. If the number of
buckets in your hash table isn't too big,

279
00:19:35,024 --> 00:19:39,007
if it's just in say the thousands or maybe
the tens of thousands, then, you know, you

280
00:19:39,007 --> 00:19:42,273
can just look up a list of all known
primes up to some point, and you can just

281
00:19:42,273 --> 00:19:45,465
sort of pick out a prime which is about
the magnitude that you're looking for. If

282
00:19:45,465 --> 00:19:49,480
you're gonna use a really huge number of
buckets in the millions or more, then

283
00:19:49,480 --> 00:19:53,537
there are algorithms you can use for
primarily testing which will help you find

284
00:19:53,537 --> 00:19:58,524
a prime in about the range that you're
looking for. >> So that's the first order

285
00:19:58,524 --> 00:20:02,354
rule of thumb you should remember if
you're using the modulus compression

286
00:20:02,354 --> 00:20:05,762
function, which is set the number of
buckets equal to a prime. So you're

287
00:20:05,762 --> 00:20:10,486
guaranteed to not have non-trivial common
factors of the modulus shared by all of

288
00:20:10,486 --> 00:20:14,269
your data. So there's also a couple of
second order optimizations, which people

289
00:20:14,269 --> 00:20:19,514
often mention. And you also don't want the
prime; you want the prime to be not too

290
00:20:19,514 --> 00:20:23,163
close to patterns in your data. So what
does that mean, Patterns in your data?

291
00:20:23,163 --> 00:20:27,760
Well, in the phone number example we saw
that patterns emerged in the data when we

292
00:20:27,760 --> 00:20:32,305
expressed it base ten. So for example,
there is crazy amounts of Lumping in the

293
00:20:32,305 --> 00:20:35,557
first three digits when we expressed a
phone number-based ten, Because that

294
00:20:35,557 --> 00:20:40,109
corresponded with the area code. And then,
with. Memory locations when we express,

295
00:20:40,109 --> 00:20:45,062
express it base two, there are crazy
correlations in the low orbits. And these

296
00:20:45,062 --> 00:20:48,762
are the two most common examples. Either
there's some digit, to the base ten

297
00:20:48,762 --> 00:20:52,955
representation or digits in the base two
representation where you have, you know,

298
00:20:52,955 --> 00:20:58,133
patterns that is non-uniformity. So that.
Suggests that the prime, that, N that you

299
00:20:58,133 --> 00:21:02,364
choose, you know, all else being equal,
shouldn't be too close to a power of two,

300
00:21:02,364 --> 00:21:06,187
and shouldn't be too close to a power of
ten. The thinking being that, that will

301
00:21:06,187 --> 00:21:11,566
spread more evenly data sets that do have
these patterns in terms of base two

302
00:21:11,566 --> 00:21:15,667
representation, or base ten
representations. So in closing, this is a

303
00:21:15,667 --> 00:21:20,029
recipe I recommend for coding of hash
functions if what you're looking to do is

304
00:21:20,029 --> 00:21:24,917
sort of minimize program ming, programmer
time, subject to not coming up with a hash

305
00:21:24,917 --> 00:21:29,333
function, which is completely broken. But
I want to reiterate, this is not the state

306
00:21:29,333 --> 00:21:34,080
of the art in hash function design. There
are hash functions which are in some sense

307
00:21:34,080 --> 00:21:37,647
better than the ones that expand on this
slide. If you're responsible for some

308
00:21:37,647 --> 00:21:41,573
really mission critical code that involves
a hash function, you should really study

309
00:21:41,573 --> 00:21:44,582
more deeply than we've been able to do
here. We'll touch on some issues in, of

310
00:21:44,582 --> 00:21:47,984
the different optional video, but really
you should do additional homework. You

311
00:21:47,984 --> 00:21:50,903
should find out about the state-of-the-art
about hash function design. You should

312
00:21:50,903 --> 00:21:54,367
also look into implementations of open
addressing in those probing strategies.

313
00:21:54,367 --> 00:21:58,646
And above all you really should consider
cold, coding up multiple prototypes and

314
00:21:58,646 --> 00:22:03,206
seeing which one works the best. There's
no silver bullet, there's no panacea in

315
00:22:03,206 --> 00:22:07,063
the design of hash tables. I've given you
some high-level guidance about the

316
00:22:07,063 --> 00:22:10,404
different approaches. But ultimately it's
gonna be up to you to find the optimal

317
00:22:10,404 --> 00:22:13,134
implementation for your own application.