So I know you're probably catching your breath from that last computation, so let's zoom out and make sure we don't lose the forest for the trees, because we're actually mostly there. We're just one lemma away from finishing the proof of the theorem, the proof of correctness of Huffman's algorithm.

So what do we have going for us? Well, first of all, we've got the inductive hypothesis. Remember, in any proof by induction, you'd better rely heavily on the inductive hypothesis. What does it say? It says that when Huffman's algorithm recurses on the smaller subproblem, with the smaller alphabet Sigma prime, it solves it optimally: it returns a tree that minimizes the average encoding length with respect to the alphabet Sigma prime and the corresponding frequencies. We're going to call that tree, the output of the recursive call, T hat prime.

Let me amplify the power of this inductive hypothesis by combining it with the computation we did on the last slide. What do we know so far? We know that for this smaller subproblem (which, frankly, we don't care about in its own right), with alphabet Sigma prime, the recursive call solves it optimally. It returns to us this tree T hat prime, and among all trees with leaves labeled according to Sigma prime, this tree is as good as it gets: it minimizes the average encoding length.

And what did we learn on the last slide? We learned that there's a one-to-one correspondence between feasible solutions (trees) for the smaller subproblem with the alphabet Sigma prime and feasible solutions to the original problem, the one we care about, that have a certain form: those in which it just so happens that A and B are siblings, sharing a common parent. Moreover, this correspondence preserves the objective function value, the average encoding length, or at least preserves it up to a constant, which is good enough for our purposes; the identity below makes this precise.

So the upshot is that in minimizing the average encoding length over all feasible solutions to the smaller subproblem, the recursive call is actually doing more for us. It's actually minimizing the average encoding length for the original problem, with the original alphabet Sigma, over a subset of the feasible solutions: the feasible solutions in which A and B are siblings.
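To pin down "up to a constant," here is the identity the correspondence gives. This is a sketch in notation I'm introducing here (p_Z for the frequency of a symbol Z, d_T(Z) for the depth of Z's leaf in a tree T); the slides may write it differently. If T is the tree for Sigma obtained from a tree T' for Sigma prime by splitting the meta-symbol ab back into the sibling leaves A and B, then:

```latex
% Average encoding length of a tree T over alphabet \Sigma:
%   L(T) = \sum_{Z \in \Sigma} p_Z \, d_T(Z).
% Splitting the meta-symbol ab (which has frequency p_A + p_B) into the
% sibling leaves A and B pushes exactly those two symbols one level
% deeper than the meta-symbol's leaf, so
\[
  L_{\Sigma}(T) \;=\; L_{\Sigma'}(T') \;+\; p_A \;+\; p_B .
\]
% The shift p_A + p_B is the same constant for every tree in the
% correspondence, so minimizing either side minimizes the other.
```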
Now, does this do us any good? Well, it depends: what was our original goal? Our goal was to get the smallest average encoding length possible over all feasible solutions in the world. And what have we just figured out? We've figured out that we're getting the best-possible outcome amongst some of the solutions, namely those in which A and B happen to be siblings.

So we're in trouble if there is no optimal solution in which A and B are siblings; then it does us no good to optimize only over these crappy solutions. On the other hand, if there is an optimal solution lying in this set X_AB, in which A and B are siblings, then we're golden: there's an optimal solution in the set, and the recursive call finds the best solution in the set, so we're going to find an optimal solution.

So that's really the big question. Can we guarantee that it was safe to merge A and B in the very first iteration? Can we guarantee there is an optimal solution, a best-possible tree, that also has that property, that also has A and B as siblings?

The key lemma, which I'll prove on the next slide, and with which we'll complete the proof of correctness of Huffman's algorithm, says that indeed there is always an optimal solution in the set X_AB: an optimal solution in which A and B, the two lowest-frequency symbols, are siblings.

I'll give you a full-blown proof of this key lemma on the next slide, but let me first orient you and develop your intuition about why you'd hope it would be true. The intuition is really the same as when we developed the greedy algorithm in the first place. Namely, which symbols do you want to incur the longest encoding lengths? You want them to be the lowest-frequency symbols, since you want to minimize the average. So if you look at any old tree, there's going to be some bottommost layer, some deepest possible leaves, and you'd really hope that A and B, the lowest-frequency symbols, end up at that lowest level. Now, they might both be at the lowest level and still not be siblings. But if you think about it, amongst symbols at the same level, they're all suffering exactly the same encoding length anyway, so you can permute them any way you want without affecting your average encoding length.
So once you put A and B at the bottommost level, you may as well make them siblings; it's not going to make any difference. The final punchline: take any tree, say an optimal tree, and swap the lowest-frequency symbols A and B down to the bottom level to be siblings. You can't make the tree worse; you can only make it better. And that's going to show that there is an optimal tree with the desired property, in which A and B are in fact siblings. All right, so that's a reasonably accurate intuition, but it's not a proof. Here is the proof.

We're going to use an exchange argument. Take an arbitrary optimal tree, a tree minimizing the average encoding length, and call it T star. There may be many such trees; just pick any one, I don't care which. The plan is to exhibit a tree at least as good that satisfies the property we want, with A and B showing up as siblings.

So let's figure out where we're going to swap the symbols A and B to. We've got our tree T star. Look at the deepest level of T star, and find any pair of siblings at that deepest level; call them X and Y. In the green tree on the right there are two choices of sibling pairs at the lowest level. It doesn't matter which one we pick, so let's say I pick the leftmost pair and call them X and Y. Now, A and B have to be somewhere in this tree, so let me just pick a couple of the leaves as examples of where A and B might actually sit in T star.

And now we execute the exchange. We obtain a new tree T hat from T star merely by swapping the labels of the leaves X and A, and similarly swapping the labels of the leaves Y and B. After this exchange we get a new valid tree: once again, the leaves are labeled with the various symbols of Sigma. Moreover, by our choice of X and Y, the exchange has forced the tree to conform to the desired property: we've forced A and B to be siblings, since they take the original positions of X and Y, which were chosen to be siblings.

So to finish the proof, we're going to show that the average encoding length of T hat can only be less than that of T star.
Since T star was optimal, that means T hat has to be optimal as well. And since T hat also satisfies the desired property that A and B are siblings, that's going to complete the proof of the key lemma, and therefore the proof of correctness of Huffman's algorithm.

The reason, intuitively, is that when we do the swap, A and B inherit the previous depths of X and Y, and conversely X and Y inherit the old depths of A and B. So A and B can only get worse, and X and Y can only get better; and X and Y are the ones with the higher frequencies. So the overall cost-benefit analysis is a net win, and the tree only improves.

But to make that precise, let me just show you the exact computation. Let's look at the difference between the average encoding length given by the tree T star and that after the swap, given by the tree T hat. Analogously to our computation on the previous slide, most of the terms cancel, because the trees T star and T hat frankly just aren't all that different: the only differences are the positions of the symbols A, B, X, and Y. So we get four positive terms contributed by T star for these four symbols, and four negative terms contributed by the tree T hat for the exact same four symbols.

Let me show you what happens after the dust settles and you carry out all the cancellation. You can group the eight remaining terms in a slightly sneaky but clarifying way: one product collecting the four terms involving the symbols A and X, and an analogous product collecting the four terms involving B and Y. The first product is a product of two differences: the difference between the frequencies of X and A (remember, those are the two symbols that got swapped with each other) times the difference between their depths in the original tree T star. And because B and Y also got swapped, there's an entirely analogous term for them.
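Written out, in the same assumed notation as before (with d_{T*} denoting depths in T star; the lecture's slides may arrange this differently), the computation is:

```latex
% Every leaf other than A, B, X, Y has the same depth in T* and \hat{T},
% so its term cancels. The swap exchanges depths pairwise:
%   d_{\hat{T}}(A) = d_{T^*}(X),  d_{\hat{T}}(X) = d_{T^*}(A),
% and likewise for B and Y. Grouping the eight surviving terms:
\[
  L(T^*) - L(\hat{T})
  \;=\; \bigl(p_X - p_A\bigr)\bigl(d_{T^*}(X) - d_{T^*}(A)\bigr)
  \;+\; \bigl(p_Y - p_B\bigr)\bigl(d_{T^*}(Y) - d_{T^*}(B)\bigr).
\]
```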
The reason for grouping the difference in this slightly sneaky way is that it now becomes completely obvious what role the various hypotheses play. Specifically, why is it important that A and B have the lowest possible frequencies? We actually haven't used that yet in our proof, but it's fundamental to our greedy algorithm. And secondly, why did we choose X and Y to be at the deepest level of the original tree T star?

Let's start with the first pair of factors, the differences between frequencies: the frequency of X minus the frequency of A, and similarly the frequency of Y minus that of B. Because A and B have the smallest frequencies of all, these differences have to be non-negative; the frequencies of X and Y can only be bigger. The second pair of differences must also be non-negative, simply because we chose X and Y to be at the deepest level: their depths are at least as large as those of A and B in the original tree T star.

Since all four differences are non-negative, the sum of the two products is also non-negative. That means the average encoding length of T star is at least as big as that of T hat, so T hat must be optimal as well, and in addition it satisfies the desired property that A and B are siblings. So even though our recursive call committed to merging A and B, committed to returning a tree in which they're siblings, that was without loss; it was a safe merge. That's why Huffman's algorithm, at the end of the day, returns an optimal tree, one that minimizes the average encoding length over all possible binary prefix-free codes.

So let me wrap up with a few notes about how to implement Huffman's algorithm and what kind of running time you should be able to achieve. If you just naively implement the pseudocode I showed you, you're going to get a not particularly great quadratic-time algorithm that, moreover, uses a lot of recursion. Why quadratic? Well, there's one recursive call each time, and each call reduces the number of symbols by one, so the total number of recursive calls is linear in the number of symbols n in your alphabet. And how much work is done in each recursive call, not counting the work done by later calls? You have to search for the two lowest-frequency symbols A and B, which is basically a couple of minimum computations, and that takes linear time in each of the linearly many recursive calls.
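Here is a minimal sketch of that naive recursive implementation, assuming the input is a Python dict from symbols (say, strings) to frequencies. The function name and the nested-tuple tree representation are choices I'm making for illustration, not something from the lecture.

```python
def huffman_naive(freqs):
    """Naive recursive Huffman: a linear scan per call and a linear
    number of calls, hence quadratic time overall.

    freqs: dict mapping symbols (assumed to be strings) to frequencies.
           Note: this sketch mutates freqs in place.
    Returns the code tree: a symbol (leaf) or a (left, right) tuple.
    """
    if len(freqs) == 1:
        (tree,) = freqs  # one "symbol" left; it is the whole tree
        return tree
    # Two linear scans to find the lowest-frequency symbols a and b.
    a = min(freqs, key=freqs.get)
    f_a = freqs.pop(a)
    b = min(freqs, key=freqs.get)
    f_b = freqs.pop(b)
    # Merge a and b into a meta-symbol whose frequency is the sum.
    # Representing the meta-symbol as the pair (a, b) means the final
    # remaining "symbol" is itself the nested-tuple code tree.
    freqs[(a, b)] = f_a + f_b
    return huffman_naive(freqs)
```

For example, huffman_naive({"A": 3, "B": 2, "C": 6, "D": 8}) returns a nested pair whose leaves are the four symbols; reading 0 for each left branch and 1 for each right branch recovers the codewords.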
Now, the fact that we keep redoing these minimum computations in every recursive call should, I hope, trigger a light bulb: maybe a data structure would help. And which data structure's raison d'etre is to speed up repeated minimum computations? As we've seen, for example, in Dijkstra's and Prim's algorithms, that would be the heap. Of course, to use a heap you have to say what the keys are, and since we always want to know which symbols have the smallest frequencies, the obvious choice is to use the frequencies as keys. The only question, then, is what happens when you do a merge. The obvious answer: add the frequencies of the two symbols and reinsert the new meta-symbol into the heap, with its key equal to the sum of the keys of the two elements you just extracted. Not only does that work great, giving an n log n-time algorithm, it's also very easy to implement iteratively (see the sketch below). So that's starting to be a really blazingly fast implementation of Huffman's algorithm.

In fact, you can do even better. You can boil an implementation of Huffman's algorithm down to a single invocation of a sorting subroutine followed by a merely linear amount of work. Now, in the worst case you're still stuck, asymptotically, with an n log n limitation, because you can't sort better than n log n if you know nothing about what you're sorting. But in this implementation the constants are even better, and as we discussed in Part 1, there are special cases where you can beat n log n for sorting; for example, if all of your data is representable with a small number of bits, you can do better than n log n.

So I'm going to leave this even faster implementation as a somewhat nontrivial exercise, but I'll give you a hint: if you do the sorting as a preprocessing step, you don't even need a heap data structure for this implementation. You can get away with an even more primitive data structure, a queue, though you might want to use two queues. So next time you find yourself in line for coffee, that's a nice thing to think about: how do you implement Huffman's algorithm with merely one invocation of sorting, so that the symbols are sorted by frequency, followed by linear work using two queues?
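And here is a minimal iterative sketch of the heap-based version just described, using Python's standard heapq module. The names are again mine, and the tiebreaking counter is my addition; it keeps heapq from ever having to compare two tree objects when frequencies are equal.

```python
import heapq
import itertools

def huffman_heap(freqs):
    """Iterative Huffman with a min-heap keyed on frequency: O(n log n).

    freqs: dict mapping symbols to frequencies.
    Returns the code tree: a symbol (leaf) or a (left, right) tuple.
    """
    tiebreak = itertools.count()  # keeps heap entries totally ordered
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Extract the two entries with the smallest keys (frequencies)...
        f_a, _, a = heapq.heappop(heap)
        f_b, _, b = heapq.heappop(heap)
        # ...and reinsert the merged meta-symbol, keyed on the sum of
        # the two keys just extracted.
        heapq.heappush(heap, (f_a + f_b, next(tiebreak), (a, b)))
    return heap[0][2]
```

Each of the n - 1 merges does two pops and one push, each costing O(log n), which is where the n log n bound comes from.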