In the last video, we discussed the performance of hash tables implemented using chaining, with one linked list per bucket. In fact, we proved mathematically that if you use a hash function chosen uniformly at random from a universal family, and if you keep the number of buckets comparable to the size of your data set, then you're guaranteed constant-time expected performance. But recall that chaining is not the only implementation of hash tables. There's a second paradigm, which is also very important, called open addressing. This is where you're only allowed to store one object in each slot, and you keep searching for an empty slot when you need to insert a new object into your hash table. Now, it turns out to be much harder to mathematically analyze hash tables implemented using open addressing, but I do want to say a few words about it, to give you the gist of what kind of performance you might hope to expect from those sorts of hash tables.

So recall how open addressing works. We're only permitted to store one object in each slot, unlike the case with chaining, where we can have an arbitrarily long list in a given bucket of the hash table. With at most one object per slot, open addressing obviously only makes sense when the load factor alpha is less than one, that is, when the number of objects you're storing in your table is less than the number of slots available. Because of this requirement that we have at most one object per slot, we need to demand more of our hash function. Our hash function might ask us to put a given object, say one with some IP address, into, say, bucket number 17, but bucket number 17 might already be populated. In that case, we go back to our hash function and ask it where to look for an empty slot next. Maybe it tells us to next look in bucket 41; if 41 is full, it tells us to look in bucket number 7, and so on. Two specific strategies for producing such a probe sequence that we mentioned earlier were double hashing and linear probing. Double hashing is where you use two different hash functions, h1 and h2: h1 tells you which slot to search first, and then every time you find a full slot, you add an increment specified by the second hash function, h2.
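To make the probe-sequence idea concrete, here is a minimal sketch in Python. It's my own illustration, not code from the lecture: Python's built-in `hash` stands in for h1 and h2, and the names and parameters are made up for the example.

```python
def double_hash_probes(key, num_slots):
    """Yield the probe sequence for `key` under double hashing.
    Python's built-in hash is a placeholder for h1 and h2 here,
    chosen for illustration, not as a recommendation."""
    h1 = hash(key) % num_slots                      # first slot to try
    h2 = 1 + hash((key, "step")) % (num_slots - 1)  # increment after each full slot
    # With a prime number of slots, every increment is coprime to the
    # table size, so the sequence eventually visits every slot.
    for i in range(num_slots):
        yield (h1 + i * h2) % num_slots

def insert(table, key):
    """Put `key` in the first empty slot along its probe sequence;
    `table` is a list in which None marks an empty slot."""
    for slot in double_hash_probes(key, len(table)):
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")

table = [None] * 101            # a prime table size
insert(table, "173.194.34.99")  # an IP-address key, as in the lecture
```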
Linear probing is even simpler: you just have one hash function that tells you where to search first, and then you just add one to the slot until you find an empty slot. As I mentioned at the beginning, it is quite nontrivial to mathematically analyze the performance of hash tables using these various open addressing strategies. It's not impossible; there is some quite beautiful and quite informative theoretical work that does tell us how these hash tables perform, but that's well outside the scope of this course. So instead, what I want to do is give you a quick and dirty calculation that suggests, at least in an idealized world, what kind of performance we should expect from a well-implemented hash table with open addressing, as a function of the load factor alpha.

Precisely, I'm going to introduce a heuristic assumption. It's certainly not true, but we'll adopt it just for this quick and dirty calculation: we're using a hash function in which each of the n-factorial possible probe sequences is equally likely. Now, no hash function you're ever going to use will actually satisfy this assumption, and if you think about it for a little bit, you'll realize that with double hashing or linear probing you're certainly not going to satisfy it. Still, this gives us a kind of best-case scenario against which you can compare the performance of your own hash table implementations. If you implement a hash table and you're seeing performance as good as what's suggested by this idealized analysis, then you're home free; you know your hash table is performing great.
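Here is a quick sketch in Python of the idealized model the heuristic assumption describes, again my own illustration rather than anything from the lecture: the probe sequence is a uniformly random ordering of all the slots, and the insertion cost is the number of probes until the first empty one.

```python
import random

def probes_to_insert(table):
    """Number of probes to insert one object under the heuristic
    assumption: the probe sequence is a uniformly random ordering of
    all the slots (no real hash function actually behaves this way)."""
    order = random.sample(range(len(table)), len(table))  # random probe sequence
    for probes, slot in enumerate(order, start=1):
        if table[slot] is None:
            table[slot] = True            # occupy the slot
            return probes

def average_probes(num_slots=200, load=0.7, trials=500):
    """Average probes for the insertion that happens at the given load."""
    total = 0
    for _ in range(trials):
        table = [None] * num_slots
        for _ in range(int(load * num_slots)):  # fill to the target load
            probes_to_insert(table)
        total += probes_to_insert(table)        # measure the next insertion
    return total / trials

# Comes out near (in fact slightly below) 1/(1 - 0.7) ≈ 3.3 probes,
# matching the bound derived in this video.
print(average_probes())
```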
So what is the line in the sand that gets drawn under this heuristic assumption? What is this idealized hash table performance, as a function of the load alpha? Well, here it is. What I'm going to argue next is that, under this heuristic assumption, the expected amount of time to insert a new object into the hash table is essentially one over one minus alpha, where alpha is the load. Remember, the load is the number of objects in the hash table divided by the number of available slots. So if the hash table is half full, then alpha is going to be 0.5; if it's 75 percent full, then alpha is going to be three-fourths. What this means is that, in this idealized scenario, if you keep the load under control, say at 50 percent, then the insertion time is going to be great: if alpha is 0.5, then 1/(1 - alpha) = 2, so you expect just two probes before the successful insertion of the new object. And of course, if you're thinking about lookup, that's going to be at least as good as insertion: if you're lucky, a lookup might terminate early when you find what you're looking for, and in the worst case, an unsuccessful search goes all the way until an empty slot, which is the same as an insertion. So if alpha is small, bounded away from one, you're getting constant-time performance.

On the other hand, as the hash table gets full, as alpha gets close to one, this operation time blows up; it goes to infinity as alpha approaches one. So if you have a 90 percent full hash table with open addressing, you're going to start seeing ten probes. You really want to keep the load of a hash table with open addressing under control, certainly no more than about 0.7, maybe even less than that. To refresh your memory, with chaining, hash tables are perfectly well defined even with load factors bigger than one, and what we derived is that under universal hashing, a weaker assumption, we had an operation time of one plus alpha for a load of alpha. So with chaining, you just have to keep alpha at most some reasonably small constant; with open addressing, you really have to keep it well bounded below one.
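To see the contrast concretely, here are the two bounds from the lecture evaluated at a 90 percent load (just arithmetic on the formulas above):

$$\text{chaining: } \Theta(1+\alpha) \;\rightarrow\; 1.9 \text{ probes}, \qquad \text{open addressing: } \approx \frac{1}{1-\alpha} \;\rightarrow\; 10 \text{ probes} \qquad (\alpha = 0.9).$$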
So next, let's understand why this observation is true. Why, under the assumption that every probe sequence is equally likely, do we expect a one over one minus alpha running time for hash tables with open addressing? The reason is pretty simple, and we can derive it by analogy with a simple coin-flipping experiment. To motivate the experiment, think just about the very first probe that we do. We get some new object, some new IP address, that we want to insert into our hash table. Let's say our hash table is currently 70 percent full: say there are 100 slots, and 70 are already taken by objects. When we look at this first probe, by assumption it's equally likely to be any one of the 100 slots, 70 of which are full and 30 of which are empty. So with probability one minus alpha, or in this case 30 percent, our first probe will luckily find an empty slot and we'll be done; we just insert the new object into that slot. If we get unlucky, with probability 70 percent, we find a slot that's already occupied, and then we have to try again. So we try a new slot, drawn at random, and we again check: is it full, or is it not? And again, with 30 percent probability, essentially, it's going to be empty and we can stop; if it's already full, we try yet again. So doing random probes, looking for an empty slot, is tantamount to flipping a coin with probability of heads 1 - alpha, or in this example 30 percent, and the number of probes you need until you successfully insert is just the number of times you flip this biased coin until you see heads.

In fact, this biased coin-flipping experiment slightly overestimates the expected insertion time under the heuristic assumption, and that's because during an insertion we're never going to try the same slot twice: we're going to try all n buckets in some order, with each of the n-factorial orderings equally likely. So back to our example, where we have a hash table with 100 slots, 70 of which are full. On the first probe, indeed, we have a 30-in-100 chance of getting an empty slot. If that one fails, we're not going to try the same slot again, so there are only 99 residual possibilities, 30 of which are empty, since the one we checked last time was full. So we actually have a 30-in-99 chance of getting an empty slot on the second try, 30 in 98 on the third try if the second one fails, and so on. But a valid upper bound is to just assume a 30 percent success probability on every single probe, and that's precisely what this coin-flipping experiment gives us. So the next quiz asks you to actually compute the expected value of capital N, the number of coin flips needed to get heads, when you have a probability of heads of one minus alpha. As a hint, we actually analyzed this exact same coin-flipping experiment, with alpha equal to one half, back when we discussed the expected running time of randomized linear-time selection. All right, so the correct answer is the first one: one over one minus alpha.
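As a cross-check on that answer, you can also sum the series directly; this is a standard geometric-distribution computation, not one done in the lecture. With heads probability $p = 1 - \alpha$,

$$\mathbb{E}[N] \;=\; \sum_{k=1}^{\infty} k\,(1-p)^{k-1}\,p \;=\; \frac{1}{p} \;=\; \frac{1}{1-\alpha}.$$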
To see why, let's return to our derivation, where we reduced analyzing the expected insertion time to this random variable: the expected number of coin flips until we see heads. I'm going to solve this exactly the same way we did it back when we analyzed the randomized selection algorithm. It's quite a sneaky way, but very effective: we're going to express the expected value of capital N in terms of itself, and then solve. So how do we do that? Well, on the left-hand side, let's write the expected number of coin flips, the expected value of capital N, and then let's just notice that there are two different cases: either the first coin flip is heads, or it's not. In any case, you're certainly going to use one coin flip, so let's separate that out and count it separately. With probability alpha, the first coin flip is going to be tails, and then you start all over again. And because it's a memoryless process, the expected number of further coin flips required, given that the first coin flip was tails, is just the same as the expected number of coin flips in the first place. So now it's a simple matter to solve this one linear equation for the expected value of N, and we find that it is indeed one over one minus alpha, as claimed.
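Written out as equations, the derivation just described is:

$$\mathbb{E}[N] \;=\; \underbrace{1}_{\text{first flip}} \;+\; \underbrace{\alpha \cdot \mathbb{E}[N]}_{\text{tails: start over}} \quad\Longrightarrow\quad (1-\alpha)\,\mathbb{E}[N] = 1 \quad\Longrightarrow\quad \mathbb{E}[N] = \frac{1}{1-\alpha}.$$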
Summarizing, under our idealized heuristic assumption that every single probe sequence is equally likely, the expected insertion time is upper bounded by the expected number of coin flips, which by this argument is at most one over one minus alpha. So as long as your load alpha is well bounded below one, you're good; at least in this idealized analysis, your hash table will work extremely quickly.

Now, I hope you're regarding this idealized analysis with a bit of skepticism. Right, from a false hypothesis you can literally derive anything you want, and we started with this heuristic assumption, that all probe sequences are equally likely, which is not satisfied by hash functions you're actually going to use in practice. So should you expect this one over one minus alpha bound to hold in practice or not? Well, that depends, to some extent. It depends on what open addressing strategy you're using, on how good a hash function you're using, and on whether the data is pathological or not. Just to give coarse rules of thumb: if you're using double hashing and you have non-pathological data, I would go ahead and look for this 1/(1 - alpha) bound in practice. So implement your hash table, check its performance as a function of the load factor alpha, and shoot for the 1/(1 - alpha) curve; that's really what you'd like to see. With linear probing, on the other hand, you should not expect to see this performance guarantee of 1/(1 - alpha), even in a totally idealized scenario. Remember, linear probing is the strategy where the initial probe, given by the hash function, tells you where to look first, and then you just scan linearly through the hash table until you find what you're looking for: an empty slot, or the key that you're looking up. So linear probing, even in a best-case scenario, is going to be subject to clumping: you're going to have contiguous groups of slots which are all full, and that's because of the linear probing strategy. I encourage you to do some experiments with implementations to see this for yourself.
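Here is a minimal sketch of such an experiment, under the assumption that Python's built-in `hash` on random integer keys is a reasonable stand-in for the initial-probe hash function; the table size, load, and sample count are made-up parameters for illustration.

```python
import random

def linear_probe_insert(table, key):
    """Insert `key` with linear probing; return the number of probes."""
    n = len(table)
    slot = hash(key) % n                 # initial probe
    for probes in range(1, n + 1):
        if table[slot] is None:
            table[slot] = key
            return probes
        slot = (slot + 1) % n            # scan right, wrapping around
    raise RuntimeError("table is full")

def probes_at_load(num_slots=10_000, load=0.9, samples=100):
    """Fill a table to `load`, then report the average probes over a few
    further insertions (the marginal cost of inserting at that load)."""
    table = [None] * num_slots
    keys = random.sample(range(10**12), int(load * num_slots) + samples)
    for key in keys[:int(load * num_slots)]:
        linear_probe_insert(table, key)
    return sum(linear_probe_insert(table, k) for k in keys[-samples:]) / samples

# At a 90 percent load, this comes out several times larger than the ten
# probes the idealized 1/(1 - alpha) analysis suggests -- that's clumping.
print(probes_at_load())
```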
So because of clumping, with linear probing, even in the idealized scenario, you're not going to see one over one minus alpha. You're going to see something worse, but still, in idealized situations, quite reasonable, and that's the last thing I want to tell you about in this video. Now, needless to say, with linear probing the heuristic assumption is badly false. The heuristic assumption is pretty much always false, no matter what hashing strategy you're using, but with linear probing it's, quote unquote, really false. To see that, recall that the heuristic assumption says that all n-factorial probe sequences are equally likely, so your next probe is uniformly random among everything you haven't probed so far. But with linear probing, it's totally the opposite: once you know the first slot that you're looking in, say bucket 17, you know the rest of the sequence, because it's just a linear scan through the table. So it's pretty much the polar opposite of each successive probe being independent of the previous ones (except for not exploring things twice). So to state a conjectured, idealized performance guarantee for hash tables with linear probing, we're going to replace the blatantly false heuristic assumption with a still false, but more reasonable, heuristic assumption. What do we know? We know that the initial probe, with linear probing, determines the rest of the sequence. So let's assume that these initial probes are uniform at random, and independent for different keys. Of course, once you have the initial probe, you know everything else, but let's assume independence and uniformity among the initial probes. Now, this is a strong assumption, way stronger than assuming you have a universal family of hash functions.

This assumption is not satisfied in practice, but performance guarantees we can derive under it are typically satisfied in practice by well-implemented hash tables that use linear probing. So the assumption is still useful for deriving the correct, idealized performance of this type of hash table. So what is that performance? Well, this is an utterly classic result, from exactly 50 years ago, from 1962, by my colleague, the living legend Don Knuth, author of The Art of Computer Programming. What Knuth proved is that, under this weaker heuristic assumption suitable for linear probing, the expected time to insert an object into a hash table with load factor alpha, when you're using linear probing, is worse than one over one minus alpha, but it is still a function of the load alpha only, and not a function of the number of objects in the hash table. That is, with linear probing you will not get as good a performance guarantee, but it is still the case that if you keep the load factor bounded away from one, if you make sure the hash table doesn't get too full, you will enjoy constant-time operations on average. So, for example, with linear probing, if your hash table is 50 percent full, then you're going to get an expected insertion time of roughly four probes. Note, however, that this quantity blows up pretty rapidly as the hash table grows full: if it's 90 percent full, this is already going to be something like a hundred probes on average. So you really don't want to let hash tables get too full when you're using linear probing.

You might well wonder if it's ever worth implementing linear probing, given that it has a worse performance curve, one over (one minus alpha) squared, than the performance curve you'd hope for from something like double hashing, one over one minus alpha. It's a tricky cost-benefit analysis, between linear probing and more complicated but better-performing strategies, and it really depends on the application. There are reasons you do want to use linear probing sometimes; it is actually quite common in practice. For example, it often interacts very well with memory hierarchies. So again, as with all of this hashing discussion, the costs and benefits are very subtle trade-offs between the different approaches. If you have mission-critical code that's using a hash table and you really want to optimize it, try a bunch of prototypes, and just test.
Figure out which one is the best for your particular type of application. Let me conclude the video with a quote from Knuth himself, where he talks about the rapture of proving this one over (one minus alpha) squared theorem, and how it was life changing. He says: "I first formulated the following derivation" (meaning the proof of that last theorem) "in 1962. Ever since that day, the analysis of algorithms has, in fact, been one of the major themes in my life."