This sequence of videos is going to take our discussion of hash tables to the next level and understand more deeply the conditions under which they perform well, amazingly well in fact, meaning constant-time performance for all of the supported operations. The main point of this first video is to explain the sense in which every hash function has its own kryptonite, a pathological data set for it, which then motivates the need to tread carefully with the mathematics in the subsequent videos. So, a quick review. Remember that the whole purpose of a hash table is to enable extremely fast lookups, ideally constant-time lookups. Now, of course, to have anything to look up, you have to also allow insertions, so all hash tables are going to export those two operations. And then sometimes a hash table also allows you to delete elements from it; that depends a little bit on the underlying implementation. Certainly when you implement it using chaining, that's when you have one linked list per bucket, it's very easy to implement deletion. Sometimes, with open addressing, it's tricky enough that you're just going to wind up punting on deletion. When we first started talking about hash tables, I encouraged you to think about them logically, much the way you do an array, except that instead of being indexed by the positions of an array, a hash table is indexed by the keys that you're storing. So just as an array, via random access, supports constant-time lookup, so does a hash table. There was some fine print, however, with hash tables; remember there were these two caveats. The first is that the hash table had better be properly implemented, and this means a couple of things. One thing it means is that the number of buckets had better be commensurate with the number of things that you're storing in the hash table; we'll talk more about that in a second. The second thing it means is that you'd better be using a decent hash function. We discussed in a previous video the perils of bad hash functions, and we'll be even more stringent in our demands on hash functions in the videos to come. The second caveat, which I'll try to demystify in a few minutes, is that you'd better have non-pathological data. In some sense, for every hash function there is kryptonite, a pathological data set that will render its performance quite poor. In the video on implementation details we also discussed how hash tables inevitably have to deal with collisions.
You start seeing collisions well before your hash table starts filling up, so you need some method for handling two different keys that map to exactly the same bucket. Which one do you put there? Do you put both there, or what? There are two popular approaches; let me remind you what they are. The first is called chaining. This is a very natural idea, where you just keep all of the elements that hash to a common bucket in that bucket. How do you keep track of all of them? Well, you just use a linked list. So, in the seventeenth bucket, you will find all of the elements that ever hashed to bucket number 17. The second approach, which also has plenty of applications in practice, is open addressing. Here the constraint is that you are only going to store one item, one key, in each bucket. So if two things map to bucket number seventeen, you've got to find a separate place for one of them, and the way you handle that is by demanding from your hash function not merely one bucket but rather a whole probe sequence, a sequence of buckets. If you try to hash something into bucket number seventeen and 17 is already occupied, you go on to the next bucket in the probe sequence and try to insert it there; if you fail again, you go to the third bucket in the sequence and try to insert it there, and so on. We mentioned briefly two ways you can specify probe sequences. One is called linear probing: if you fail in bucket seventeen, you move on to eighteen, then nineteen, then twenty, then twenty-one, and you stop once you find an empty one; that's where you insert the new element. Another is double hashing, where you use a combination of two hash functions: the first hash function specifies the initial bucket that you probe, and the second hash function specifies the offset for each subsequent probe. So, for example, if you have a given element, say the name Alice, and the two hash functions give you the numbers 17 and 23, then the corresponding probe sequence is going to be initially 17; failing that, we'll try 40; still failing that, we'll try 63; failing that, we'll try 86; and so on. In a course on the design and analysis of algorithms like this one, you typically talk a little bit more about chaining than you do about open addressing. That's not to imply that chaining is somehow the more important one; both of these are important. But chaining is a little easier to talk about mathematically.
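To make the probe-sequence idea concrete, here is a minimal Python sketch of double hashing, assuming two already-chosen hash functions h1 and h2; the two stand-in functions below are hypothetical and hard-coded only to reproduce the Alice example from the lecture.

```python
def probe_sequence(key, num_buckets, h1, h2):
    """Yield the buckets to try under double hashing: start at h1(key),
    then advance by h2(key) on each subsequent probe (all mod table size)."""
    start = h1(key) % num_buckets
    step = h2(key) % num_buckets
    for i in range(num_buckets):
        yield (start + i * step) % num_buckets

# Hypothetical stand-ins chosen only to mirror the Alice example:
h1 = lambda key: 17   # initial bucket
h2 = lambda key: 23   # offset between probes
print(list(probe_sequence("Alice", 100, h1, h2))[:4])   # [17, 40, 63, 86]
```

Linear probing is just the special case in which the offset between consecutive probes is always 1.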
So we will talk about chaining a little bit more, because I'll be able to give you complete proofs for chaining, whereas complete proofs for open addressing would be outside the scope of this course. I'll mention open addressing in passing, but the details will be more about chaining, just for mathematical ease. Now, there is one very important parameter which plays a big role in governing the performance of a hash table, and that's known as the load factor, or simply the load, of a hash table. It's a very simple definition; it just captures how populated a typical bucket of the hash table is. It's often denoted by alpha, and it is the number of things that have been inserted, and not subsequently deleted, in the hash table, divided by the number of buckets in the hash table. As you would expect, as you insert more and more things into the hash table, the load grows; keeping the number of items in the hash table fixed while you scale up the number of buckets, the load drops. Just to make sure that the notion of the load is clear, and that you're also clear on the different strategies for resolving collisions, the next quiz will ask you about the range of relevant alphas for the chaining and open addressing implementations of a hash table. All right, so the correct answer to this quiz question is the third answer. Load factors bigger than one do make sense; they may not be optimal, but they at least make sense for hash tables implemented with chaining, and they don't make sense for hash tables with open addressing. The reason is simple: remember, in open addressing you are only allowed to store one object per bucket, so as soon as the number of objects exceeds the number of buckets, there is nowhere to put the remaining objects. The hash table will simply crash if the load factor is bigger than one. On the other hand, for a hash table with chaining there is no obvious problem with a load factor bigger than one. You can imagine a load factor equal to two: say you insert 2,000 objects into a hash table with 1,000 buckets. That means, hopefully, at least in the best case, each bucket is just going to have a linked list with two objects in it. So there is no big deal with having load factors bigger than one in hash tables with chaining.
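Restating the definition symbolically, with the quiz's chaining example plugged in:

```latex
\alpha \;=\; \frac{\#\{\text{objects inserted and not subsequently deleted}\}}{\#\{\text{buckets}\}},
\qquad\text{e.g.}\quad \alpha = \frac{2000}{1000} = 2 \ \text{ for the chaining example above.}
```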
All right, so let's now make a quite easy but also very important observation about a necessary condition for hash tables to have good performance. This goes back to the first caveat: you had better properly implement the hash table if you expect good performance. The first point is that you're only going to have constant-time lookups if you keep the load constant. For a hash table with open addressing this is really obvious, because you need alpha not just O(1) but less than one, less than 100 percent full; otherwise the hash table is just going to crash, because you don't have enough room for all of the items. But even for hash tables that you implement using chaining, where load factors bigger than one at least make sense, you had better keep the load not too much bigger than one if you want constant-time operations. If you have, say, a hash table with N buckets and you hash in N log N objects, then the average number of objects in a given bucket is going to be logarithmic. And remember, when you do a lookup, after you hash to the bucket, you have to do an exhaustive search through the linked list in that bucket. So if you have N log N objects hashed into N buckets, you're expecting more like logarithmic lookup time, not constant lookup time. And then, as we discussed, with open addressing you need not just alpha = O(1) but alpha less than one; in fact, alpha had better be well below one. You don't want to let an open addressing table get to a 90 percent load or something like that, so I'm going to write that we need alpha much less than one. That just means you don't want to let the load grow too close to 100 percent, or you will see performance degrade. So again, I hope the point of this slide is clear: if you want good hash table performance, one of the things you're responsible for is keeping the load factor under control. Keep it at most a small constant with chaining, and keep it well below 100 percent with open addressing. Now, you might wonder what I mean by controlling the load. After all, you're writing this hash table with no idea what some client is going to do with it; they can insert or delete whatever they want. So how do you control alpha? Well, what you can control, under the hood of your hash table implementation, is the number of buckets. You can control the denominator of this alpha, so if the numerator starts growing, at some point the denominator is going to grow as well.
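As a quick worked instance of the N log N example just mentioned: under chaining, hashing N log N objects into N buckets gives

```latex
\alpha \;=\; \frac{N\log N}{N} \;=\; \log N ,
```

so a lookup, which scans the linked list in the relevant bucket, takes on the order of log N steps on average rather than constant time (assuming the hash function spreads the keys evenly).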
What actual implementations of hash tables do is keep track of the population of the hash table, how many objects are being stored, and as this numerator grows, as more and more stuff gets inserted, the implementation ensures that the denominator grows at the same rate, so that the number of buckets also increases. So if alpha exceeds some target, which could be, say, 75 percent, 0.75, or maybe 0.5, then what you can do is double the number of buckets in your hash table. You define a new hash table, with a new hash function with double the range, and now, having doubled the denominator, the load has dropped by a factor of two. That's how you can keep it under control. Optionally, if space is at a real premium, you can also shrink the hash table if there are a bunch of deletions, say, in a chaining implementation. So that's the first takeaway point about what has to happen correctly under the hood in order to get the desired guarantees for hash table performance: you've got to control the load, so you have to have a hash table whose number of buckets is roughly the same as the number of objects that you are storing. The second thing you've got to get right, and this is something we touched on in the implementation videos, is that you had better use a good enough hash function. So what's a good hash function? It's something that spreads the data out evenly amongst the buckets. What would really be awesome would be a hash function which works well independent of the data, and that's really been the theme of this whole course so far: algorithmic solutions which work independent of any domain assumptions, so that no matter what the input is, the algorithm is guaranteed to, for example, run blazingly fast. And I can appreciate that this is exactly the sort of thing you would want to learn from a course like this, right? Take a class in the design and analysis of algorithms and you learn the secret hash function which always works well. Unfortunately, I'm not going to tell you such a hash function, and the reason is not because I didn't prepare this lecture, and not because people just haven't been clever enough to discover such a function. The problem is much more fundamental: such a function cannot exist. That is, every hash function has its own kryptonite; there is a pathological data set under which the performance of that hash function will be as bad as that of the most miserable constant hash function you've ever seen.
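Here is a minimal sketch, assuming a chaining implementation and using Python's built-in hash() as a stand-in for the course's hash function, of the load-tracking and bucket-doubling policy just described; real implementations differ in many details, and shrinking on deletions is omitted.

```python
class ChainedHashTable:
    """Toy hash table with chaining that doubles its bucket count
    whenever the load factor alpha exceeds a fixed target (here 0.75)."""

    MAX_LOAD = 0.75

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]
        self.count = 0   # numerator of alpha: objects inserted and not deleted

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # key already present: overwrite
                return
        bucket.append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.MAX_LOAD:
            self._resize(2 * len(self.buckets))   # doubling halves the load

    def lookup(self, key):
        for k, v in self._bucket(key):   # exhaustive search within the bucket
            if k == key:
                return v
        return None

    def _resize(self, new_num_buckets):
        old_items = [item for bucket in self.buckets for item in bucket]
        self.buckets = [[] for _ in range(new_num_buckets)]
        for k, v in old_items:           # rehash everything into the wider table
            self.buckets[hash(k) % new_num_buckets].append((k, v))
```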
The reason is quite simple; it's really an inevitable consequence of the compression that hash functions are effectively implementing, from some massive universe down to some relatively modest number of buckets. Let me elaborate. Fix any hash function, as clever as you could possibly imagine. This hash function maps some universe U to the buckets, indexed from 0 to n-1. Remember, in all of the interesting applications of hash functions the universe size is huge, so the cardinality of U should be much, much bigger than n; that's what I'm going to assume here. For example, maybe you're remembering people's names, so the universe is strings of, say, at most 30 characters, and n, I assure you, in any application is going to be much, much smaller than, say, 26 raised to the thirtieth power. Now let's use a variant of the pigeonhole principle and claim that at least one of these n buckets must receive at least a 1/n fraction of the keys in this universe. That is, there exists a bucket i, somewhere between 0 and n-1, such that at least |U|/n keys hash to i, get mapped to i, under this hash function h. The way to see this is just to remember the picture of a hash function mapping, in principle, every key from the universe to one of these buckets. The hash function has to put each key somewhere in one of the n buckets, so one of the buckets has to receive at least a 1/n fraction of all of the possible keys. A more concrete way of thinking about it is to imagine a hash table implemented with chaining and to imagine, just in your mind, hashing every single key of the universe into that hash table. This hash table is going to be insanely over-populated; you'll never be able to store it on a computer, because it will contain all |U| objects. But it has |U| objects and only n buckets, so one of those buckets has to contain at least |U|/n of them. The point is that no matter what the hash function is, no matter how cleverly you build it, there is going to be some bucket, say bucket number 31, which gets at least its fair share of the universe mapped to it.
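Stated symbolically, the pigeonhole claim from this paragraph is:

```latex
\text{For every } h : U \to \{0,1,\dots,n-1\} \text{ there exists a bucket } i \in \{0,\dots,n-1\}
\text{ with } \bigl|\{\, x \in U : h(x) = i \,\}\bigr| \;\ge\; \frac{|U|}{n}.
```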
So, having identified this bucket, bucket number 31, which gets at least its fair share of the universe mapped to it, to construct our pathological data set all we do is pick from amongst the elements that get mapped to bucket number 31. We can make this data set basically as large as we want, because |U|/n is unimaginably big, because U itself is unimaginably big. In this data set, everything collides: the hash function maps every single element to bucket number 31, and that's going to lead to terrible hash table performance, performance which is really no better than the naive linked list solution. For example, in a hash table with chaining, in bucket 31 you'll just find a linked list containing every single thing that has ever been inserted into the hash table; for open addressing it may be a little harder to see what's going on, but again, if everything collides, you're basically going to wind up with linear-time performance, a far cry from constant-time performance. Now, for those of you to whom this seems like pointless, abstract mathematics, I want to make two points. The first is that, at the very least, these pathological data sets indicate that we will have to discuss hash functions differently from how we've been discussing algorithms so far. When we talked about merge sort, we just said it runs in n log n time, no matter what the input is. When we discussed Dijkstra's algorithm, it runs in m log n time, no matter what the input is. Breadth-first search runs in linear time, no matter what the input is. We're going to have to say something different about hash functions; we will not be able to say that a hash table has good performance no matter what the input is. This slide shows that's false. The second point I want to make is that while these pathological data sets are of course not likely to arise just by random chance, sometimes you are concerned about somebody deliberately constructing a pathological data set for your hash function, for example in a denial-of-service attack. There is a very clever illustration of exactly this point in a research paper from 2003 by Crosby and Wallach. The main point of Crosby and Wallach was that there are a number of real-world systems, maybe the most interesting being a network intrusion detection system, which you could bring to their knees by exploiting badly designed hash functions.
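As a toy illustration of this construction (not an attack on any real system), the sketch below brute-forces a small universe, finds the most crowded bucket under a deliberately weak, hypothetical hash function, and keeps only the keys that land there; every key in the returned data set collides with every other one.

```python
from collections import defaultdict
from itertools import product
import string

def pathological_data_set(h, universe, size):
    """Group an enumerable universe by bucket under hash function h, then
    return `size` keys from the most crowded bucket; all of them collide."""
    by_bucket = defaultdict(list)
    for key in universe:
        by_bucket[h(key)].append(key)
    worst_bucket = max(by_bucket.values(), key=len)   # pigeonhole: >= |U|/n keys
    return worst_bucket[:size]

# Hypothetical, deliberately weak hash function: sum of character codes mod 100.
def weak_hash(s, num_buckets=100):
    return sum(ord(c) for c in s) % num_buckets

# Universe: all 26^3 = 17,576 three-letter lowercase strings. Only 76 distinct
# character sums occur, so the fullest bucket holds at least ~230 keys.
universe = (''.join(t) for t in product(string.ascii_lowercase, repeat=3))
bad_keys = pathological_data_set(weak_hash, universe, size=200)
# Inserting bad_keys into a chaining table with 100 buckets puts all 200 of them
# into one bucket, so lookups degrade to scanning a 200-element linked list.
```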
These were all applications that made crucial use of hash tables, and the feasibility of these systems completely depended on getting constant-time performance from the hash tables. So if you could exhibit a pathological data set for these hash tables and make the performance devolve to linear, devolve to that of a simple linked list solution, the systems would be broken; they would just crash or fail to function. Now, we saw on the last slide that every hash table does have its own kryptonite, a pathological data set. But the question is: how can you come up with such a pathological data set if you're trying to mount a denial-of-service attack on one of these systems? The systems that Crosby and Wallach looked at generally shared two properties. First of all, they were open source: you could inspect the code and see which hash function they were using. Second of all, the hash function was often very simplistic, engineered for speed more than anything else, and as a result it was easy, just by inspecting the code, to reverse-engineer a data set that really did break the hash table, that devolved its performance to linear. For example, in the network intrusion detection application, there was a hash table that was remembering the IP addresses of packets that were going through, because the system was looking for patterns of data packets that seemed to indicate some sort of intrusion. Crosby and Wallach showed that sending a bunch of data packets to this system with cleverly chosen sender IPs really did crash the system, because the hash table performance blew up to an unacceptable degree. So how should we address this fact of life, that every hash function has a pathological data set? That question is meaningful both practically, namely what hash functions should we use if we are concerned about someone constructing pathological data sets, for example to mount a denial-of-service attack, and also mathematically: if we can't give the kinds of data-independent guarantees we've given so far, how can we mathematically say that hash functions have good performance? Let me mention two solutions. The first solution speaks more to the practical question: what hash function should you implement if you are concerned with someone creating pathological data sets?
There are these things called cryptographic hash functions. One example of a cryptographic hash function, or really a family of hash functions for different numbers of buckets, is SHA-2. These are really outside the scope of this course; you'll learn more about them in a course on cryptography, and obviously, using these keywords, you can look them up on the web and read more. The one point I want to make is that these cryptographic hash functions, like SHA-2, do themselves have pathological data sets; they have their own version of kryptonite. The reason they work well in practice is that it is infeasible to figure out what the pathological data set is. Unlike the very simplistic hash functions which Crosby and Wallach found in the source code of their applications, where it was easy to reverse-engineer bad data sets, for something like SHA-2 nobody knows how to reverse-engineer a bad data set. And when I say infeasible, I mean it in the usual cryptographic sense, in a way similar to how one would say it's infeasible to break the RSA cryptosystem if you implement it properly, or that it's infeasible to factor large numbers except in very special cases, and so on. So that's all I'm going to say about cryptographic hash functions. I also want to mention a second solution, which is reasonable both in practice and also one about which we can say a number of things mathematically, and that is to use randomization. More specifically, we are not going to design a single clever hash function, because, again, a single hash function, we know, must have a pathological data set. Instead we're going to design a very clever family of hash functions, and then, at run time, we are going to pick one of these hash functions at random. The kind of guarantee we're going to be able to prove about such a family of hash functions will be very much in the spirit of quicksort. You'll recall that in the quicksort algorithm, for pretty much any fixed pivot sequence there is a pathological input on which quicksort will devolve to quadratic running time. Our solution was randomized quicksort: rather than committing up front to any particular method of choosing pivots, at run time we pick pivots randomly. What did we prove about quicksort?
We proved that for any possible input, for any possible array, the average running time of quicksort is O(n log n), where the average is over the run-time random choices made by quicksort. Here we're going to do the same thing: we will be able to say that, for any data set, on average with respect to our run-time choice of a hash function, the hash function will perform well, in the sense that it will spread that data set out evenly. So we flip the quantifiers from the previous slide. There, we said that if we pre-commit to a single hash function, if we fix one hash function, then there is a data set that breaks that hash function. Here we're flipping it: for each fixed data set, a random choice of a hash function is going to do well on average on that data set, just like in quicksort. Notice this doesn't mean that we can't make our program open source. We can still publish code which says, here is our family of hash functions, and in the code we make a random choice from this set of hash functions. The point is that by inspecting the code you'll have no idea what the real-time random choices made by the algorithm were, so you'll know nothing about what the actual hash function is, and you won't be able to reverse-engineer a pathological data set for the real-time choice of hash function. The next couple of videos are going to elaborate on this second solution, using a real-time random choice of a hash function as a way of doing well on every data set, at least on average. So let me just give you a road map of where we're going from here. I'm going to break the discussion of the details of the randomized solution into three parts, spread over two videos. In the next video we're going to begin with the definition of what I mean by a family of hash functions from which, if you pick one at random, you're likely to do pretty well; that definition is called a universal family of hash functions. Now, a mathematical definition by itself is worth approximately nothing. For it to be valuable, it has to satisfy two properties. First, there have to be interesting and useful examples that meet the definition; that is, there had better be useful-looking hash functions that meet this definition of a universal family, so the second part will be to show you that they do indeed exist. And then the other thing a mathematical definition needs is applications, so that if you can meet the definition, then good things happen.
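As a minimal sketch of the run-time randomization idea only (this is not the universal family defined in the next video, and all names here are hypothetical): the family below is parameterized by a seed drawn at construction time, so publishing the code reveals the family but not which member is actually in use.

```python
import random

class RandomlySeededTable:
    """Toy chaining table whose hash function is chosen at run time by drawing
    a random seed; the code (the family) can be public, while the seed, and
    hence the actual function, remains unknown to an adversary."""

    def __init__(self, num_buckets=1024):
        self.seed = random.getrandbits(64)            # run-time random choice
        self.buckets = [[] for _ in range(num_buckets)]

    def _h(self, key):
        # One member of the published family, selected by self.seed.
        # (Illustrative only: this is not a proven universal family.)
        return hash((self.seed, key)) % len(self.buckets)

    def insert(self, key, value):
        self.buckets[self._h(key)].append((key, value))

    def lookup(self, key):
        for k, v in self.buckets[self._h(key)]:
            if k == key:
                return v
        return None
```

Incidentally, Python itself randomizes the hashing of strings at interpreter start-up for essentially this reason, to make hash-flooding attacks harder.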
So that'll be part