So I know you're probably catching your breath from that last computation, so let's zoom out and make sure we don't lose the forest for the trees, because we're actually mostly there. We're just one lemma away from finishing the proof of the theorem, the proof of correctness of Huffman's algorithm.

So what do we have going for us? Well, first of all, we've got the inductive hypothesis. Remember, in any proof by induction, you'd better rely heavily on the inductive hypothesis. What does it say? It says that when Huffman's algorithm recurses on the smaller subproblem, with the smaller alphabet Sigma prime, it solves it optimally: it returns a tree that minimizes the average encoding length with respect to the alphabet Sigma prime and the corresponding frequencies. We're going to call that tree, the output of the recursive call, T hat prime.

Let me amplify the power of this inductive hypothesis by combining it with the computation we did on the last slide. What do we know so far? We know that for this smaller subproblem (which, frankly, we don't care about in its own right), with alphabet Sigma prime, the recursive call solves it optimally. It returns to us this tree T hat prime, and among all trees with leaves labeled according to Sigma prime, this tree is as good as it gets: it minimizes the average encoding length.

And what did we learn on the last slide? We learned that there's a one-to-one correspondence between feasible solutions (trees) for the smaller subproblem with the alphabet Sigma prime and feasible solutions to the original problem, the one we care about, that have a certain form: those in which it just so happens that A and B are siblings, sharing a common parent. Moreover, this correspondence preserves the objective function value, the average encoding length, or at least preserves it up to a constant, which is good enough for our purposes; the identity below makes this precise.

So the upshot is that in minimizing the average encoding length over all feasible solutions to the smaller subproblem, the recursive call is actually doing more for us. It's actually minimizing the average encoding length for the original problem, with the original alphabet Sigma, over a subset of the feasible solutions: the feasible solutions in which A and B are siblings.
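To pin down "up to a constant," here is the identity the correspondence gives. This is a sketch in notation I'm introducing here (p_Z for the frequency of a symbol Z, d_T(Z) for the depth of Z's leaf in a tree T); the slides may write it differently. If T is the tree for Sigma obtained from a tree T' for Sigma prime by splitting the meta-symbol ab back into the sibling leaves A and B, then:

```latex
% Average encoding length of a tree T over alphabet \Sigma:
%   L(T) = \sum_{Z \in \Sigma} p_Z \, d_T(Z).
% Splitting the meta-symbol ab (which has frequency p_A + p_B) into the
% sibling leaves A and B pushes exactly those two symbols one level
% deeper than the meta-symbol's leaf, so
\[
  L_{\Sigma}(T) \;=\; L_{\Sigma'}(T') \;+\; p_A \;+\; p_B .
\]
% The shift p_A + p_B is the same constant for every tree in the
% correspondence, so minimizing either side minimizes the other.
```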
Now, does this do us any good? Well, it depends: what was our original goal? Our goal was to get the smallest average encoding length possible over all feasible solutions in the world. And what have we just figured out? We've figured out that we're getting the best-possible outcome amongst some of the solutions, namely those in which A and B happen to be siblings.

So we're in trouble if there is no optimal solution in which A and B are siblings; then it does us no good to optimize only over these crappy solutions. On the other hand, if there is an optimal solution lying in this set X_AB, in which A and B are siblings, then we're golden: there's an optimal solution in the set, and the recursive call finds the best solution in the set, so we're going to find an optimal solution.

So that's really the big question. Can we guarantee that it was safe to merge A and B in the very first iteration? Can we guarantee there is an optimal solution, a best-possible tree, that also has that property, that also has A and B as siblings?

The key lemma, which I'll prove on the next slide, and with which we'll complete the proof of correctness of Huffman's algorithm, says that indeed there is always an optimal solution in the set X_AB: an optimal solution in which A and B, the two lowest-frequency symbols, are siblings.

I'll give you a full-blown proof of this key lemma on the next slide, but let me first orient you and develop your intuition about why you'd hope it would be true. The intuition is really the same as when we developed the greedy algorithm in the first place. Namely, which symbols do you want to incur the longest encoding lengths? You want them to be the lowest-frequency symbols, since you want to minimize the average. So if you look at any old tree, there's going to be some bottommost layer, some deepest possible leaves, and you'd really hope that A and B, the lowest-frequency symbols, end up at that lowest level. Now, they might both be at the lowest level and still not be siblings. But if you think about it, amongst symbols at the same level, they're all suffering exactly the same encoding length anyway, so you can permute them any way you want without affecting your average encoding length.
So once you put A and B at the bottommost level, you may as well make them siblings; it's not going to make any difference. The final punchline: take any tree, say an optimal tree, and swap the lowest-frequency symbols A and B down to the bottom level to be siblings. You can't make the tree worse; you can only make it better. And that's going to show that there is an optimal tree with the desired property, in which A and B are in fact siblings. All right, so that's a reasonably accurate intuition, but it's not a proof. Here is the proof.

We're going to use an exchange argument. Take an arbitrary optimal tree, a tree minimizing the average encoding length, and call it T star. There may be many such trees; just pick any one, I don't care which. The plan is to exhibit a tree at least as good that satisfies the property we want, with A and B showing up as siblings.

So let's figure out where we're going to swap the symbols A and B to. We've got our tree T star. Look at the deepest level of T star, and find any pair of siblings at that deepest level; call them X and Y. In the green tree on the right there are two choices of sibling pairs at the lowest level. It doesn't matter which one we pick, so let's say I pick the leftmost pair and call them X and Y. Now, A and B have to be somewhere in this tree, so let me just pick a couple of the leaves as examples of where A and B might actually sit in T star.

And now we execute the exchange. We obtain a new tree T hat from T star merely by swapping the labels of the leaves X and A, and similarly swapping the labels of the leaves Y and B. After this exchange we get a new valid tree: once again, the leaves are labeled with the various symbols of Sigma. Moreover, by our choice of X and Y, the exchange has forced the tree to conform to the desired property: we've forced A and B to be siblings, since they take the original positions of X and Y, which were chosen to be siblings.

So to finish the proof, we're going to show that the average encoding length of T hat can only be less than that of T star.
Since T star was optimal, that means T hat has to be optimal as well. And since T hat also satisfies the desired property that A and B are siblings, that's going to complete the proof of the key lemma, and therefore the proof of correctness of Huffman's algorithm.

The reason, intuitively, is that when we do the swap, A and B inherit the previous depths of X and Y, and conversely X and Y inherit the old depths of A and B. So A and B can only get worse, and X and Y can only get better; and X and Y are the ones with the higher frequencies. So the overall cost-benefit analysis is a net win, and the tree only improves.

But to make that precise, let me just show you the exact computation. Let's look at the difference between the average encoding length given by the tree T star and that after the swap, given by the tree T hat. Analogously to our computation on the previous slide, most of the terms cancel, because the trees T star and T hat frankly just aren't all that different: the only differences are the positions of the symbols A, B, X, and Y. So we get four positive terms contributed by T star for these four symbols, and four negative terms contributed by the tree T hat for the exact same four symbols.

Let me show you what happens after the dust settles and you carry out all the cancellation. You can group the eight remaining terms in a slightly sneaky but clarifying way: one product collecting the four terms involving the symbols A and X, and an analogous product collecting the four terms involving B and Y. The first product is a product of two differences: the difference between the frequencies of X and A (remember, those are the two symbols that got swapped with each other) times the difference between their depths in the original tree T star. And because B and Y also got swapped, there's an entirely analogous term for them.
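Written out, in the same assumed notation as before (with d_{T*} denoting depths in T star; the lecture's slides may arrange this differently), the computation is:

```latex
% Every leaf other than A, B, X, Y has the same depth in T* and \hat{T},
% so its term cancels. The swap exchanges depths pairwise:
%   d_{\hat{T}}(A) = d_{T^*}(X),  d_{\hat{T}}(X) = d_{T^*}(A),
% and likewise for B and Y. Grouping the eight surviving terms:
\[
  L(T^*) - L(\hat{T})
  \;=\; \bigl(p_X - p_A\bigr)\bigl(d_{T^*}(X) - d_{T^*}(A)\bigr)
  \;+\; \bigl(p_Y - p_B\bigr)\bigl(d_{T^*}(Y) - d_{T^*}(B)\bigr).
\]
```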
The reason for grouping the difference in this slightly sneaky way is that it now becomes completely obvious what role the various hypotheses play. Specifically, why is it important that A and B have the lowest possible frequencies? We actually haven't used that yet in our proof, but it's fundamental to our greedy algorithm. And secondly, why did we choose X and Y to be at the deepest level of the original tree T star?

Let's start with the first pair of factors, the differences between frequencies: the frequency of X minus the frequency of A, and similarly the frequency of Y minus that of B. Because A and B have the smallest frequencies of all, these differences have to be non-negative; the frequencies of X and Y can only be bigger. The second pair of differences must also be non-negative, simply because we chose X and Y to be at the deepest level: their depths are at least as large as those of A and B in the original tree T star.

Since all four differences are non-negative, the sum of the two products is also non-negative. That means the average encoding length of T star is at least as big as that of T hat, so T hat must be optimal as well, and in addition it satisfies the desired property that A and B are siblings. So even though our recursive call committed to merging A and B, committed to returning a tree in which they're siblings, that was without loss; it was a safe merge. That's why Huffman's algorithm, at the end of the day, returns an optimal tree, one that minimizes the average encoding length over all possible binary prefix-free codes.

So let me wrap up with a few notes about how to implement Huffman's algorithm and what kind of running time you should be able to achieve. If you just naively implement the pseudocode I showed you, you're going to get a not particularly great quadratic-time algorithm that, moreover, uses a lot of recursion. Why quadratic? Well, there's one recursive call each time, and each call reduces the number of symbols by one, so the total number of recursive calls is linear in the number of symbols n in your alphabet. And how much work is done in each recursive call, not counting the work done by later calls? You have to search for the two lowest-frequency symbols A and B, which is basically a couple of minimum computations, and that takes linear time in each of the linearly many recursive calls.
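Here is a minimal sketch of that naive recursive implementation, assuming the input is a Python dict from symbols (say, strings) to frequencies. The function name and the nested-tuple tree representation are choices I'm making for illustration, not something from the lecture.

```python
def huffman_naive(freqs):
    """Naive recursive Huffman: a linear scan per call and a linear
    number of calls, hence quadratic time overall.

    freqs: dict mapping symbols (assumed to be strings) to frequencies.
           Note: this sketch mutates freqs in place.
    Returns the code tree: a symbol (leaf) or a (left, right) tuple.
    """
    if len(freqs) == 1:
        (tree,) = freqs  # one "symbol" left; it is the whole tree
        return tree
    # Two linear scans to find the lowest-frequency symbols a and b.
    a = min(freqs, key=freqs.get)
    f_a = freqs.pop(a)
    b = min(freqs, key=freqs.get)
    f_b = freqs.pop(b)
    # Merge a and b into a meta-symbol whose frequency is the sum.
    # Representing the meta-symbol as the pair (a, b) means the final
    # remaining "symbol" is itself the nested-tuple code tree.
    freqs[(a, b)] = f_a + f_b
    return huffman_naive(freqs)
```

For example, huffman_naive({"A": 3, "B": 2, "C": 6, "D": 8}) returns a nested pair whose leaves are the four symbols; reading 0 for each left branch and 1 for each right branch recovers the codewords.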
Now, the fact that we keep redoing these minimum computations in every recursive call should, I hope, trigger a light bulb: maybe a data structure would help. And which data structure's raison d'etre is to speed up repeated minimum computations? As we've seen, for example, in Dijkstra's and Prim's algorithms, that would be the heap. Of course, to use a heap you have to say what the keys are, and since we always want to know which symbols have the smallest frequencies, the obvious choice is to use the frequencies as keys. The only question, then, is what happens when you do a merge. The obvious answer: add the frequencies of the two symbols and reinsert the new meta-symbol into the heap, with its key equal to the sum of the keys of the two elements you just extracted. Not only does that work great, giving an n log n-time algorithm, it's also very easy to implement iteratively (see the sketch below). So that's starting to be a really blazingly fast implementation of Huffman's algorithm.

In fact, you can do even better. You can boil an implementation of Huffman's algorithm down to a single invocation of a sorting subroutine followed by a merely linear amount of work. Now, in the worst case you're still stuck, asymptotically, with an n log n limitation, because you can't sort better than n log n if you know nothing about what you're sorting. But in this implementation the constants are even better, and as we discussed in Part 1, there are special cases where you can beat n log n for sorting; for example, if all of your data is representable with a small number of bits, you can do better than n log n.

So I'm going to leave this even faster implementation as a somewhat nontrivial exercise, but I'll give you a hint: if you do the sorting as a preprocessing step, you don't even need a heap data structure for this implementation. You can get away with an even more primitive data structure, a queue, though you might want to use two queues. So next time you find yourself in line for coffee, that's a nice thing to think about: how do you implement Huffman's algorithm with merely one invocation of sorting, so that the symbols are sorted by frequency, followed by linear work using two queues?
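And here is a minimal iterative sketch of the heap-based version just described, using Python's standard heapq module. The names are again mine, and the tiebreaking counter is my addition; it keeps heapq from ever having to compare two tree objects when frequencies are equal.

```python
import heapq
import itertools

def huffman_heap(freqs):
    """Iterative Huffman with a min-heap keyed on frequency: O(n log n).

    freqs: dict mapping symbols to frequencies.
    Returns the code tree: a symbol (leaf) or a (left, right) tuple.
    """
    tiebreak = itertools.count()  # keeps heap entries totally ordered
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Extract the two entries with the smallest keys (frequencies)...
        f_a, _, a = heapq.heappop(heap)
        f_b, _, b = heapq.heappop(heap)
        # ...and reinsert the merged meta-symbol, keyed on the sum of
        # the two keys just extracted.
        heapq.heappush(heap, (f_a + f_b, next(tiebreak), (a, b)))
    return heap[0][2]
```

Each of the n - 1 merges does two pops and one push, each costing O(log n), which is where the n log n bound comes from.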