So I know you're probably catching your breath from that last computation. So let's zoom out, make sure we don't lose the forest for the trees, and see that we're actually mostly there. We're just one lemma away from finishing the proof of the theorem, the proof of correctness of Huffman's algorithm. So what do we have going for us? Well, first of all we've got the inductive hypothesis. Remember, in any proof by induction you'd better rely very heavily on the inductive hypothesis. What does it say? It says that when Huffman's algorithm recurses on the smaller subproblem with the smaller alphabet sigma prime, it solves it optimally. It returns a tree which minimizes the average encoding length with respect to the alphabet sigma prime and the corresponding frequencies. So we're going to call that tree, the output of the recursive call, T hat prime. So let me amplify the power of this inductive hypothesis by combining it with the computation we did on the last slide. So what do we know so far? We know that for this smaller subproblem, which frankly we don't care about in its own right, but nevertheless, for this smaller subproblem with alphabet sigma prime, the recursive call solves it optimally. It returns to us this tree T hat prime, and among all trees with leaves labeled according to sigma prime, this tree is as good as it gets; it minimizes the average encoding length. But what did we learn on the last slide? We learned that there's a one-to-one correspondence between feasible solutions for the smaller subproblem, trees for the alphabet sigma prime, and feasible solutions to the original problem, the one we care about, that have a certain form, in which it just so happens that A and B are siblings, that they share a common parent. Moreover, this correspondence preserves the objective function value, the average encoding length, or at least it preserves it up to a constant, which is good enough for our purposes. So the upshot is, in minimizing the average encoding length over all feasible solutions for the smaller subproblem, the recursive call is actually doing more for us. It's actually minimizing the average encoding length for the original problem, with the original alphabet sigma, over a subset of the feasible solutions, the feasible solutions in which A and B are siblings. Now, does this do us any good? Well, it depends. What was our original goal? Our goal was to get the smallest average encoding length possible, over all feasible solutions in the world. And so what have we just figured out? We've figured out that we're getting the best possible solution amongst some of the solutions, those in which A and B happen to be siblings. So we're in trouble if there is no optimal solution in which A and B are siblings; then it doesn't do us any good to optimize only over these crappy solutions. On the other hand, if there is an optimal solution lying in this set X_AB, in which A and B are siblings, then we're golden. Then there's an optimal solution in this set, while the recursive call is finding the best solution in that set, so we're going to find an optimal solution. So that's really the big question. Can we guarantee that it was safe to merge A and B in the very first iteration? Can we guarantee that there is an optimal solution, a best possible tree, that also has that property, that also has A and B as siblings?
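To make that "up to a constant" remark concrete, here's a hedged sketch of the accounting from the previous slide, in notation I'm introducing here: L is the average encoding length, p_a and p_b are the frequencies of A and B, and T is the tree for sigma obtained from a tree T prime for sigma prime by expanding the meta-symbol's leaf into a parent with children A and B.

```latex
% Sketch, assuming L(T) = \sum_i p_i \cdot \mathrm{depth}_T(i):
% every symbol other than a and b keeps its depth, while a and b end up
% one level deeper than the meta-symbol ab was in T', so
L(T) \;=\; L(T') \;+\; p_a \;+\; p_b .
% The extra term p_a + p_b is a constant, independent of the tree, so
% minimizing L(T') over trees for \Sigma' is the same as minimizing L(T)
% over trees for \Sigma in which a and b are siblings.
```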
So the Key Lemma, which I'll prove on the next slide and which will complete the proof of correctness of Huffman's algorithm, is that indeed there is always an optimal solution in the set X_AB, an optimal solution in which A and B, the two lowest-frequency symbols, are indeed siblings. So I'll give you a full-blown proof of this Key Lemma on the next slide, but let me first just orient you and, you know, develop your intuition about why you'd hope this Key Lemma would be true. The intuition is really the same as when we developed the greedy algorithm in the first place. Namely, which symbols do you want to incur the longest encoding lengths? Well, you want them to be the lowest-frequency symbols, given that you want to minimize the average. So if you look at any old tree, there's going to be, you know, some bottommost layer; there are going to be some deepest possible leaves. And you really hope that A and B, the lowest-frequency leaves, are going to be in that lowest level. Now they might both be in the lowest level and yet not be siblings. But if you think about it, amongst symbols at the same level, they're all suffering exactly the same encoding length anyway. So you can permute them in any way you want and it's not going to affect your average encoding length. So once you put A and B at the bottommost level, you may as well put them as siblings and it's not going to make any difference. So the final punchline being: you can take any tree, say an optimal tree, and by swapping, moving the lowest-frequency symbols A and B to the bottom to be siblings, you can't make the tree worse, you can only make it better. And that's going to show that there is an optimal tree with the desired property, where A and B are in fact siblings. All right, so that's a reasonably accurate intuition, but it's not a proof. Here is the proof. So we're going to use an exchange argument. We're going to take an arbitrary optimal tree, a tree minimizing the average encoding length. Let's call it T star; there may be many such trees, just pick any one, I don't care which. And the plan is to exhibit a tree that's at least as good and which satisfies the property that we want, where A and B show up as siblings. So let's figure out where we're going to swap the symbols A and B to. So we've got our tree T star, and let's look at the deepest level of T star, and let's find, you know, any pair of siblings at this deepest level; call them X and Y. So in this green tree on the right we have two choices of pairs of siblings at the lowest level. It doesn't matter which one we pick, so let's just say I pick the leftmost pair and call them X and Y. So A and B have to be somewhere in this tree. Let me just pick a couple of the leaves to be examples of where A and B might actually be in T star. And so now we're going to execute the exchange. We're going to obtain a new tree T hat from T star merely by swapping the labels of the leaves X and A, and similarly swapping the labels of the leaves Y and B. So after this exchange we get a new valid tree. So once again, you know, the leaves are labeled with the various symbols of sigma. And moreover, by our choice of X and Y, we've now, via this exchange, forced the tree to conform to the desired property. We've forced A and B to be siblings; they take the original positions of X and Y, which were chosen to be siblings. So to finish the proof, we're going to show that the average encoding length of T hat can only be less than or equal to that of T star.
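Written out in the notation I'm using here (which isn't from the slide itself), the exchange just amounts to swapping depths, and what we want to show is that L of T hat is at most L of T star.

```latex
% The exchange: \hat{T} is T^* with the leaf labels a <-> x and b <-> y
% swapped, so
\mathrm{depth}_{\hat{T}}(a) = \mathrm{depth}_{T^*}(x), \qquad
\mathrm{depth}_{\hat{T}}(x) = \mathrm{depth}_{T^*}(a),
% and similarly for b and y; every other symbol keeps its depth.
% Goal: L(\hat{T}) \le L(T^*).
```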
Since T star was optimal, that means T hat has to be optimal as well, and since it also satisfies the desired property that A and B are siblings, that's going to complete the proof of the Key Lemma and therefore the proof of the correctness of Huffman's algorithm. The reason, intuitively, is that when we do the swap, A and B inherit the previous depths of X and Y, and conversely X and Y inherit the old depths of A and B. So A and B can only get worse and X and Y can only get better, and X and Y are the ones with the higher frequencies. So the overall cost-benefit analysis is a net win and the tree only improves. But to make that precise, let me just show you the exact computation. So let's look at the difference between the average encoding length given by the tree T star and that after the swap, given by the tree T hat. So analogously to our computation on the previous slide, most of the stuff cancels, because the trees T star and T hat frankly just aren't all that different. The only things that are different are the positions of the symbols A, B, X, and Y. So we're going to have four positive terms contributed by T star for these four symbols, and we're going to have four negative terms contributed by the tree T hat, again for these exact same four symbols. So let me just show you what happens after the dust settles and after you take away all the cancellation. You can write the eight terms that remain in a slightly sneaky but, you know, clarifying way, as follows. So we're going to have one product representing the four terms involving the symbols A and X, and then an analogous product for the four terms involving B and Y. So in the first term we look at the product of two differences. On the one hand we look at the difference between the frequencies of X and A; remember, those are the two things that got swapped with each other. And then we also look at the difference between their depths in the original tree T star. And then, because B and Y got swapped, we have an entirely analogous term involving them. The reason I rewrote this difference in a slightly sneaky way is that it now becomes completely obvious what role the various hypotheses play. Specifically, why is it important that A and B have the lowest possible frequencies? We actually haven't used that yet in our proof, but it's fundamental to our greedy algorithm. And secondly, why did we choose X and Y to be in the deepest level of the original tree T star? Well, let's start with the first pair of differences, the differences between frequencies. So the frequency of X minus the frequency of A, and similarly the frequency of Y minus that of B. Well, because A and B have the smallest frequencies of them all, these differences have to be non-negative; the frequencies of X and Y can only be bigger. The second pair of differences also must be non-negative, and this is simply because we chose X and Y to be at the deepest level. Their depths have to be at least as large as those of A and B in the original tree T star. Since all four differences are non-negative, that means the sum of their products is also non-negative. That means the average encoding length of T star is at least as big as that of T hat. So that means T hat has to also be optimal, and it in addition satisfies the desired property that A and B are siblings. And so that means our recursive call, even though we've committed to merging A and B, we've committed to returning a tree in which they're siblings, that was without loss: that was a safe merge.
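Here is that computation written out, as a sketch in the same assumed notation: the p's are frequencies and all depths are taken in T star.

```latex
% Everything except the four symbols a, b, x, y cancels, leaving
L(T^*) - L(\hat{T})
  \;=\; (p_x - p_a)\bigl(\mathrm{depth}_{T^*}(x) - \mathrm{depth}_{T^*}(a)\bigr)
  \;+\; (p_y - p_b)\bigl(\mathrm{depth}_{T^*}(y) - \mathrm{depth}_{T^*}(b)\bigr)
  \;\ge\; 0,
% since a and b have the smallest frequencies (so p_x \ge p_a and p_y \ge p_b)
% and x and y sit at the deepest level of T^* (so their depths are at least
% those of a and b). Hence L(\hat{T}) \le L(T^*).
```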
That's why Huffman's algorithm, at the end of the day, can return an optimal tree, one that minimizes the average encoding length over all possible binary prefix-free codes. So let me just wrap up with a few notes about how to implement Huffman's algorithm and what kind of running time you should be able to achieve. So if you just naively implement the pseudocode I showed you, you're going to get a not particularly great quadratic-time algorithm that, moreover, uses a lot of recursion. So why is it going to be quadratic time? Well, you have one recursive call each time, and you always reduce the number of symbols by one. So the total number of recursive calls is going to be linear in the number of symbols in your alphabet, in N. And how much work do you do in each recursive call, not counting the work done by later calls? Well, you have to search to figure out the lowest-frequency symbols A and B, so that's basically a couple of minimum computations. So that's going to take linear time for each of the linearly many recursive calls. Now the fact that we keep doing these repeated minimum computations in each recursive call, I hope, triggers a light bulb: maybe a data structure would be helpful. Which data structure's raison d'etre is to speed up repeated minimum computations? Well, as we've seen, for example, in Dijkstra's and Prim's algorithms, that would be the heap. So of course, to use a heap, you have to say what the keys are. But since we always want to know which symbols have the smallest frequencies, the obvious thing to do is to use frequencies as keys. So the only question then is what happens when you do a merge. Well, the obvious thing is just to add the frequencies of those two symbols and reinsert into the heap the new meta-symbol, with its key equal to what? The sum of the keys of the two elements that you just extracted from the heap. So not only is that going to work great, it's going to be N log N time, and it's very easy to implement in an iterative manner. So that's starting to be a really blazing fast implementation of Huffman's algorithm. In fact, you can do even better. You can boil down an implementation of Huffman's algorithm to a single invocation of a sorting subroutine, followed by a merely linear amount of work. Now, in the worst case you're still going to be stuck asymptotically with an N log N limitation, because you can't sort better than N log N if you know nothing about what you're sorting. But in this implementation the constants are going to be even better, and as we discussed in Part One, there are sometimes special cases where you can beat N log N for sorting. So for example, if all of your data is representable with a small number of bits, then you can do better than N log N. So I'm going to leave this even faster implementation as a somewhat nontrivial exercise. I'll give you a hint, though, which is: if you do sorting as a preprocessing step, you actually don't even need to use a heap data structure for this implementation. You can get away with an even more primitive data structure, the queue. You might, however, want to use two queues. So next time you find yourself in line for coffee, that's a nice thing to think about. How do you implement Huffman's algorithm merely with one invocation to sorting, so that the symbols are sorted by frequency, followed by linear work using two queues?
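To make the heap-based implementation concrete, here's a minimal sketch in Python using the standard heapq module. The names (HuffmanNode, huffman_tree, freqs) are my own, not from the lecture, and it's written iteratively rather than recursively: each round it extracts the two lowest-frequency (meta-)symbols and reinserts the merged meta-symbol with key equal to the sum of the two extracted keys, for O(N log N) total time. The two-queue variant mentioned above is deliberately left out, since it's posed as an exercise.

```python
import heapq
from itertools import count

class HuffmanNode:
    def __init__(self, symbol=None, left=None, right=None):
        self.symbol = symbol      # None for internal (meta-symbol) nodes
        self.left = left
        self.right = right

def huffman_tree(freqs):
    """freqs: dict mapping symbol -> frequency. Returns the root of the tree."""
    tiebreak = count()  # breaks ties so nodes themselves are never compared
    heap = [(f, next(tiebreak), HuffmanNode(symbol=s)) for s, f in freqs.items()]
    heapq.heapify(heap)  # keys are the frequencies, as discussed above
    while len(heap) > 1:
        # extract the two lowest-frequency (meta-)symbols A and B ...
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        # ... merge them, and reinsert with key = sum of the two old keys
        heapq.heappush(heap, (fa + fb, next(tiebreak), HuffmanNode(left=a, right=b)))
    return heap[0][2]

# Example usage: the depth of each leaf is that symbol's encoding length.
# root = huffman_tree({'A': 0.6, 'B': 0.25, 'C': 0.1, 'D': 0.05})
```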