1 00:00:03,940 --> 00:00:07,906 In this video we'll establish the correctness of Huffman's algorithm. 2 00:00:07,906 --> 00:00:12,160 Meaning that, that greedy algorithm always computes the prefix-free binary 3 00:00:12,160 --> 00:00:14,920 code that minimizes the average encoding length. 4 00:00:16,600 --> 00:00:21,930 Let me also remind you of our expression for the average encoding length in terms 5 00:00:21,930 --> 00:00:25,420 of a tree capital T. We'll be working with this expression 6 00:00:25,420 --> 00:00:29,281 quite a bit in this proof. So we average over the symbols so we sum 7 00:00:29,281 --> 00:00:32,221 over the symbols I and the alpha bit, capital sigma. 8 00:00:32,221 --> 00:00:36,544 We weight the various terms by the frequencies which were given as part of 9 00:00:36,544 --> 00:00:41,212 the input and remember that the encoding length that we're producing for a given 10 00:00:41,212 --> 00:00:45,823 symbol I is exactly the same as the depth of the corresponding leaf in the tree, 11 00:00:45,823 --> 00:00:48,324 capital T. So I quite like the proof of Huffman's 12 00:00:48,324 --> 00:00:50,594 Theorem. It's a cool proof and we will, it will 13 00:00:50,594 --> 00:00:54,196 give us an opportunity to revisit the themes that we've been studying in 14 00:00:54,196 --> 00:00:56,812 proving the correctness of various greedy algorithms. 15 00:00:56,812 --> 00:01:00,562 At a high level, we're going to proceed by induction, induction on the size N of 16 00:01:00,562 --> 00:01:03,424 the alpha bit sigma. So there's a vague parallel you could 17 00:01:03,424 --> 00:01:07,372 draw with how we proved, say Dijkstra's Algorithm correct back in part one, where 18 00:01:07,372 --> 00:01:11,320 by induction we show that each successive iteration of the algorithm is correct 19 00:01:11,320 --> 00:01:15,021 assuming that the previous ones were. But in the inductive step we're going to 20 00:01:15,021 --> 00:01:18,228 use an exchange argument. So to justify each of our mergers we're 21 00:01:18,228 --> 00:01:20,400 going to argue that any optimal solution could. 22 00:01:20,400 --> 00:01:24,727 You massage into one that looks more like ours without making it any worse, and 23 00:01:24,727 --> 00:01:28,506 that's how we argue that each of our individual mergers is in fact a 24 00:01:28,506 --> 00:01:32,134 reasonable idea, a good idea. So more precisely we're going to go by 25 00:01:32,134 --> 00:01:34,390 induction on the size-end of the alphabet. 26 00:01:34,390 --> 00:01:38,902 I'm going to assume to make the problem non-trivial that the alphabet size is at 27 00:01:38,902 --> 00:01:41,819 least two. So as in any proof by induction we begin 28 00:01:41,819 --> 00:01:45,471 with a base case and then we do the inductive step where in we can assume the 29 00:01:45,471 --> 00:01:48,280 inducted hypothesis. So the base case is when we just have a 30 00:01:48,280 --> 00:01:50,948 two letter alphabet. If you go back to Huffmans algorithm 31 00:01:50,948 --> 00:01:54,787 you'll see that it does the obvious thing in the base case or just outputs a tree 32 00:01:54,787 --> 00:01:58,580 that encodes one of the symbols with the letter zero, and one of the symbols with 33 00:01:58,580 --> 00:02:00,827 just the bit one. And that's the best you can do. 34 00:02:00,827 --> 00:02:04,619 You need a bit to encode every symbol in that tree uses exactly one bit for each 35 00:02:04,619 --> 00:02:07,850 symbol, so Huffman's algorithm is optimal in that trivial special case. 36 00:02:07,850 --> 00:02:11,289 So for the inductive set, we focus on an arbitrary problem instance 37 00:02:11,289 --> 00:02:13,446 where the alphabet size is at least, three. 38 00:02:13,446 --> 00:02:17,245 Now of course, whenever you're doing a proof by induction the thing you've 39 00:02:17,245 --> 00:02:19,966 really got going for you is the inductive hypothesis. 40 00:02:19,966 --> 00:02:22,943 You assume, that the assertion that you're trying to prove. 41 00:02:22,943 --> 00:02:25,408 In this case, correctness of Huffman's algorithm. 42 00:02:25,408 --> 00:02:29,207 Is true for all smaller values of N. That is, we're going to assume that if we 43 00:02:29,207 --> 00:02:31,363 invoke the algorithm on any smaller input. 44 00:02:31,363 --> 00:02:33,776 As of course we do, in this recursive algorithm. 45 00:02:33,776 --> 00:02:37,524 We get to assume that the algorithm returns the correct solution to that 46 00:02:37,524 --> 00:02:41,820 smaller subproblem. So to understand how we're going to pass 47 00:02:41,820 --> 00:02:45,885 from the inductive hypothesis. So this assumption we're correct on all 48 00:02:45,885 --> 00:02:50,060 smaller inputs to the inductive step. The assertion that we're correct on the 49 00:02:50,060 --> 00:02:53,041 current input, we need to take a closer look at how the 50 00:02:53,041 --> 00:02:57,378 original input with it's alphabet sigma relates to the smallest of problem with 51 00:02:57,378 --> 00:03:01,118 the smaller alphabet sigma prime with the two letters fused into one. 52 00:03:01,118 --> 00:03:04,480 Which is the one we solve correctly by assumption recursively. 53 00:03:07,520 --> 00:03:12,100 So, recall our notation from the pseudo code for Huffman's algarithm. 54 00:03:12,100 --> 00:03:16,393 What the algorithm does is take the two symbols with the smallest frequencies, 55 00:03:16,393 --> 00:03:20,906 and let's call those two symbols A and B, and it replaces those two symbols with a 56 00:03:20,906 --> 00:03:25,309 single symbol AB, a sort of meta-symbol representing the presence of either A or 57 00:03:25,309 --> 00:03:27,632 B. We also in the quiz discussed how the 58 00:03:27,632 --> 00:03:32,348 sensible frequency for this medi-symbol AB is the sum of the frequencies A and B. 59 00:03:32,348 --> 00:03:36,947 Last video when we're developing our intuition for what a greedy algorithm for 60 00:03:36,947 --> 00:03:41,546 successive bottom of mergers might look like, we notice that when you merge two 61 00:03:41,546 --> 00:03:46,262 symbols A and B which are really doing is committing to outputting a tree at the 62 00:03:46,262 --> 00:03:50,512 end of the day, at the end of your algorithm in which the symbols A and B 63 00:03:50,512 --> 00:03:54,122 appear as siblings in which they have exactly the same parent. 64 00:03:54,122 --> 00:03:58,197 Therefore, there is an exact correspondence 101 correspondence between 65 00:03:58,197 --> 00:04:01,645 on the one hand true. Trees whose leaves are labeled with the 66 00:04:01,645 --> 00:04:05,985 symbols in sigma prime, so that is there is no leaf labeled A there is no leaf 67 00:04:05,985 --> 00:04:08,322 labeled B there is instead one labeled AB. 68 00:04:08,322 --> 00:04:12,773 So there's a correspondence between those kinds of trees, labels with the leaves 69 00:04:12,773 --> 00:04:17,935 corresponding to sigma prime and. Trees for the original alphabet sigma in 70 00:04:17,935 --> 00:04:23,811 which the symbols A and B are siblings, in which they have exactly the same 71 00:04:23,811 --> 00:04:27,618 parent. Given a tree as shown in the left 72 00:04:27,618 --> 00:04:31,996 picture, that is given a tree for T prime, the leaves labeled according to 73 00:04:31,996 --> 00:04:36,313 sigma prime, you can, and this is in fact exactly what we do in Huffman's 74 00:04:36,313 --> 00:04:40,990 Algorithm, split the leaf with the meta symbol AB, make it an internal node, and 75 00:04:40,990 --> 00:04:45,908 give it two leaves with labels A and B. That produces a tree of the form on the 76 00:04:45,908 --> 00:04:48,606 right. Conversely, given a tree of the form on 77 00:04:48,606 --> 00:04:53,103 the right, that is its leaves are labeled according to sigma, and it just so 78 00:04:53,103 --> 00:04:57,841 happens that A and B show up as leaves of that tree, you can produce a tree for 79 00:04:57,841 --> 00:05:00,535 sigma prime just by contract. A and B together. 80 00:05:00,535 --> 00:05:04,472 So, sucking up A and B into their parent, and labeling the parent, AB. 81 00:05:04,472 --> 00:05:07,958 So you can go back and forth between these two types of trees. 82 00:05:07,958 --> 00:05:12,456 One, arbitrary trees for sigma prime, and trees of a special kind for sigma, trees 83 00:05:12,456 --> 00:05:16,903 in which A and B happen to be siblings. This subset of trees for sigma are 84 00:05:16,903 --> 00:05:19,864 important enough to be given its own notation. 85 00:05:19,864 --> 00:05:24,885 So, let me denotes by capital X, sub AB. The trees with leaves labeled according 86 00:05:24,885 --> 00:05:29,327 to sigma in which, it just so happens, that A and B appear as siblings. 87 00:05:29,327 --> 00:05:32,867 So, that'll be some trees for sigma, but not all of them. 88 00:05:32,867 --> 00:05:35,700 Only the ones in which A and B are siblings. 89 00:05:35,700 --> 00:05:40,377 So this correspondence between solutions to the smallest sub-problem and solutions 90 00:05:40,377 --> 00:05:44,772 of a particular form for the original problem has a important property, namely 91 00:05:44,772 --> 00:05:47,139 it preserves the objective function value. 92 00:05:47,139 --> 00:05:51,478 It preserves the average encoding length. Okay, that's not quite true but it's 93 00:05:51,478 --> 00:05:54,070 almost true. It's good enough for our purposes. 94 00:05:54,070 --> 00:05:57,677 It preserves the average encoding length up to a fixed constant. 95 00:05:57,677 --> 00:06:01,340 So let me show you the computation which demonstrates that point. 96 00:06:01,340 --> 00:06:04,192 So consider any pair of matched trees, t prime and t. 97 00:06:04,192 --> 00:06:08,525 And by matched I just mean t prime is any old tree who's leaves are, leaves are 98 00:06:08,525 --> 00:06:12,694 labelled according to sigma prime. And t is the tree you would obtain if you 99 00:06:12,694 --> 00:06:15,491 split the leaf with meta label ab in the usual way. 100 00:06:15,491 --> 00:06:19,331 You replace it with an internal node and children with labels a and b. 101 00:06:19,331 --> 00:06:21,745 So that's going to be the corresponding tree T. 102 00:06:21,745 --> 00:06:25,968 So take any such matched pair and let's look at the difference between their 103 00:06:25,968 --> 00:06:30,128 average encoding lengths. Now remember the average encoding length 104 00:06:30,128 --> 00:06:33,779 of a tree is just a sum over the symbols of the relevant alphabet. 105 00:06:33,779 --> 00:06:38,370 And what we got going for us here is that sigma and sigma prime are almost exactly 106 00:06:38,370 --> 00:06:39,200 the same. Right? 107 00:06:39,200 --> 00:06:43,791 So the only difference is sigma prime has the meta symbol AB, whereas sigma has the 108 00:06:43,791 --> 00:06:47,497 individual symbols A and B. Furthermore the two trees T and T prime 109 00:06:47,497 --> 00:06:51,702 are almost exactly the same, the only difference being that T prime has this 110 00:06:51,702 --> 00:06:56,072 leaf with a meta label AB, whereas T has two corresponding nodes one level down 111 00:06:56,072 --> 00:06:59,612 with the labels A and B. So when we take the difference of these 112 00:06:59,612 --> 00:07:04,071 two sums, everything cancels out except. The, the tree T contributes two sum ans, 113 00:07:04,071 --> 00:07:08,496 one for the leaf with label A, one for the leaf with label B, and the tree T 114 00:07:08,496 --> 00:07:13,456 prime contributes with a minus sign. One sum and that corresponding to the 115 00:07:13,456 --> 00:07:18,083 leaf width label AB. So when the dust clears and we cancel out 116 00:07:18,083 --> 00:07:24,422 all the terms, what we're left with is. The term for the leaf A, and with the 117 00:07:24,422 --> 00:07:29,114 frequency of A times the depth of the leaf A in the tree T. 118 00:07:29,114 --> 00:07:33,567 A similar term for the leaf would label B in the tree T. 119 00:07:33,567 --> 00:07:40,088 And then with a minus sign the frequency of the meda label AB times it's depth in 120 00:07:40,088 --> 00:07:43,929 the tree T prime. But we are certainly not done with our 121 00:07:43,929 --> 00:07:47,016 simplifications. There is a strong relationship between 122 00:07:47,016 --> 00:07:51,395 the frequencies that we're staring at, and a strong relationship between these 123 00:07:51,395 --> 00:07:54,090 various depths. Let's begin with the frequencies. 124 00:07:54,090 --> 00:07:57,121 How did we define the frequency of the meta symbol AB? 125 00:07:57,121 --> 00:08:00,714 Well remember our quiz. It made sense to divide it as the sum of 126 00:08:00,714 --> 00:08:03,522 the frequencies of A and B. What about the depths? 127 00:08:03,522 --> 00:08:07,987 Well you know, this symbol AB is at whatever depth it is in the tree T prime, 128 00:08:07,987 --> 00:08:10,455 say it's at depth ten, something like that. 129 00:08:10,455 --> 00:08:14,743 Remember T is obtained from T prime simply by splitting this leaf AB and 130 00:08:14,743 --> 00:08:17,563 giving it two new children with symbols A and B. 131 00:08:17,563 --> 00:08:22,322 So if the meta symbol AB was at depth ten in T prime, then the leaves A and B are 132 00:08:22,322 --> 00:08:26,258 going to be at depth eleven in T. So the depth is simply one more than 133 00:08:26,258 --> 00:08:29,020 whatever it was previously in the tree T prime. 134 00:08:29,020 --> 00:08:34,276 So, these relationships are going to result in a second wave of cancellations. 135 00:08:34,276 --> 00:08:39,532 So, just to make that crystal clear, let's calls the depth of AB in T prime, 136 00:08:39,532 --> 00:08:42,444 D. So, D is what I was calling ten earlier. 137 00:08:42,444 --> 00:08:46,280 So, A and B are both at depth D plus one in the tree T. 138 00:08:46,280 --> 00:08:52,007 So the first term becomes the frequency of A times the depth D plus one. 139 00:08:52,007 --> 00:08:57,496 The second term becomes the frequency of B plus the depth D plus one. 140 00:08:57,496 --> 00:09:03,860 And then the third term becomes the sum of the frequencies of A and B times the 141 00:09:03,860 --> 00:09:07,360 depth D. And, when the dust settles, yet again, 142 00:09:07,360 --> 00:09:12,850 we're left with a constant PA plus PB. So, the sums of two frequencies. 143 00:09:12,850 --> 00:09:17,273 And one thing I want you to really understand is that the difference between 144 00:09:17,273 --> 00:09:20,432 these two average encoding links is just some constant. 145 00:09:20,432 --> 00:09:23,305 It does not depend on which trees we started with. 146 00:09:23,305 --> 00:09:27,900 So if we choose a perfectly balanced tree and we do this difference, we get some 147 00:09:27,900 --> 00:09:31,175 constant like.1. If we choose some totally different pair 148 00:09:31,175 --> 00:09:35,598 of trees that are really lopsided and we take this difference, we still get 1.. 149 00:09:35,598 --> 00:09:39,932 Exactly the same thing. So this fulfills the promise I gave you 150 00:09:39,932 --> 00:09:42,316 earlier. That not only do we have this natural 151 00:09:42,316 --> 00:09:46,462 correspondence between on the one hand trees labeled with sigma prime and trees 152 00:09:46,462 --> 00:09:48,846 of a certain type labelled according to sigma. 153 00:09:48,846 --> 00:09:51,385 Mainly those in which A and B appear as siblings. 154 00:09:51,385 --> 00:09:54,443 But the correspondence preserves averaging encoding length. 155 00:09:54,443 --> 00:09:56,775 Okay? It doesn't quite preserve the averaging 156 00:09:56,775 --> 00:10:00,040 encoding length but it preserves it up to a universal constant. 157 00:10:00,040 --> 00:10:03,720 And that's going to be good enough for our purposes as we'll see in a sec.