1 00:00:01,000 --> 00:00:03,801 So we clearly have something to be excited about. 2 00:00:03,801 --> 00:00:08,031 There's clearly this opportunity to design binary prefix-free codes which 3 00:00:08,031 --> 00:00:10,490 improve over the obvious fixed link solution. 4 00:00:10,490 --> 00:00:14,778 So, we'd like to have in some sense, optimal algorithm for this problem and 5 00:00:14,778 --> 00:00:17,865 for that, we of course need a crisp problem definition. 6 00:00:17,865 --> 00:00:22,095 So, to do that it turns out to be useful to think of codes as binary trees. 7 00:00:22,095 --> 00:00:26,040 So, this video will develop that connection concluding with the final 8 00:00:26,040 --> 00:00:29,838 formal problem statement. So, the last video introduced us to this 9 00:00:29,838 --> 00:00:34,574 very interesting computational problem, namely given characters from an alphabet 10 00:00:34,574 --> 00:00:38,068 with frequencies, find the best binary prefix-free encoding, 11 00:00:38,068 --> 00:00:42,567 find the code which minimizes the average number of bits needed to encode a 12 00:00:42,567 --> 00:00:45,917 character. Crucial, the reasoning about this problem 13 00:00:45,917 --> 00:00:50,770 is thinking of binary codes as binary trees. So to give you an idea about this 14 00:00:50,770 --> 00:00:55,204 correspondence, let's just revisit three of the binary codes we saw in the 15 00:00:55,204 --> 00:00:58,860 previous video and see what kind of trees they correspond to. 16 00:01:00,080 --> 00:01:05,360 So let's just continue with the four symbol alphabet A B C D. 17 00:01:07,560 --> 00:01:12,388 The obvious fixed length code where we encode A, B, C, D was 0, 0, 0, 1, 1, 0 18 00:01:12,388 --> 00:01:17,971 and 1, 1 is just going to correspond to the complete binary tree with four 19 00:01:17,971 --> 00:01:22,735 leaves. So let me label this complete binary tree 20 00:01:22,735 --> 00:01:26,891 as follows. I'm going to label the leaves A through D from left to right, and I'm 21 00:01:26,891 --> 00:01:31,150 going to label each edge of the tree with a 0 if it corresponds to a left-child 22 00:01:31,150 --> 00:01:34,383 relationship and with a 1 if it corresponds to a right-child 23 00:01:34,383 --> 00:01:36,795 relationship. And now what you see is there's a 24 00:01:36,795 --> 00:01:40,929 correspondence between the bits along root to leaf paths and the fixed length 25 00:01:40,929 --> 00:01:43,637 encoding. So for example for the symbol C, if we 26 00:01:43,637 --> 00:01:48,304 follow the path from the root to the leaf labeled C, we first encounter a 1 because 27 00:01:48,304 --> 00:01:52,338 it's a right-child, then we encounter a 0 because it's a left-child. 28 00:01:52,338 --> 00:01:56,371 That gives us a 1 and a 0, that's the same as our encoding of the 29 00:01:56,371 --> 00:02:00,865 symbol C in this fixed length encoding and the same of course is true for the 30 00:02:00,865 --> 00:02:04,888 other three leaves. Next, when we first started playing 31 00:02:04,888 --> 00:02:09,365 around with variable-length encodings to motivate the prefix-free property, we 32 00:02:09,365 --> 00:02:14,244 studied a code where we replaced the double 0 for an A with a single 0 and the 33 00:02:14,244 --> 00:02:18,033 double 1 for a D with a single 1. Now this code was not prefix-free, 34 00:02:19,840 --> 00:02:22,604 but we can still represent it as a binary tree. 35 00:02:22,604 --> 00:02:27,015 It's just not going to be a complete one. So I'm going to label the edges of this 36 00:02:27,015 --> 00:02:31,131 tree the same way as before. Left-child edges will be given a label of 37 00:02:31,131 --> 00:02:33,778 0, right-child edges will be given a label 38 00:02:33,778 --> 00:02:36,202 of 1. I'm going to label the left and right 39 00:02:36,202 --> 00:02:40,385 children of the root A and D respectively and the two leaves will be given the 40 00:02:40,385 --> 00:02:43,298 labels B and C. The reason I labeled the nodes in this 41 00:02:43,298 --> 00:02:47,372 way is, because then we have the same kind of correspondence between the 42 00:02:47,372 --> 00:02:51,555 encodings that we proposed for the various symbols and the bits that you see 43 00:02:51,555 --> 00:02:54,542 along a path from the root to nodes with those symbols. 44 00:02:54,542 --> 00:02:58,453 So for example, if you at the node labeled D, the path from the root only 45 00:02:58,453 --> 00:03:02,853 has a single bit 1 and that coincides with the proposed encoding of the symbol 46 00:03:02,853 --> 00:03:04,591 D. Now, remember, this code is not 47 00:03:04,591 --> 00:03:07,633 prefix-free and so therefore, as we saw, it had ambiguity. 48 00:03:07,633 --> 00:03:12,251 So if you're wondering what some encoded message means and you see a 0, you're not 49 00:03:12,251 --> 00:03:16,796 sure that 0 might be representing the symbol A or alternatively it might be the 50 00:03:16,796 --> 00:03:19,130 first half of an encoding of the symbol B. 51 00:03:19,130 --> 00:03:22,107 So, this ambiguity is also noticeable in the tree. 52 00:03:22,107 --> 00:03:26,845 And the property in the tree that tips you off to the ambiguity is that there 53 00:03:26,845 --> 00:03:29,397 are symbols at internal nodes of the tree. 54 00:03:29,397 --> 00:03:34,440 The symbols are not merely at the leaves as they were with the first tree with the 55 00:03:34,440 --> 00:03:38,146 fixed length in coding, but there are also two internal nodes 56 00:03:38,146 --> 00:03:41,613 that have symbols. So let's draw the tree for our final 57 00:03:41,613 --> 00:03:46,191 example which, was the variable length but prefix-free code that we looked at in 58 00:03:46,191 --> 00:03:49,713 the previous video. So this is going to correspond to a tree 59 00:03:49,713 --> 00:03:54,291 which is not perfectly balanced, but it will have labels only at the leaves of 60 00:03:54,291 --> 00:03:58,007 the tree. So, if you label the edges of this tree 61 00:03:58,007 --> 00:04:01,699 the way we've been doing, all the left-child edges get a 0, all the 62 00:04:01,699 --> 00:04:05,392 right-left edges get a 1, and we label the leaves of the tree from 63 00:04:05,392 --> 00:04:08,602 A to D going from left to right. You'll see we have the same 64 00:04:08,602 --> 00:04:10,957 correspondence as in the previous two trees. 65 00:04:10,957 --> 00:04:15,345 the sequence of bits from the root to a leaf coincide with the proposed encoding 66 00:04:15,345 --> 00:04:17,807 for it. So, for example, if you look at the leaf 67 00:04:17,807 --> 00:04:21,606 labeled C, because you traverse a right-child, another right-child and a 68 00:04:21,606 --> 00:04:25,352 left-child to get to it, that's the sequence 1, 1, 0 and that is precisely 69 00:04:25,352 --> 00:04:29,829 the proposed encoding for the symbol C. So in general, any binary code can be 70 00:04:29,829 --> 00:04:34,434 expressed as a tree in this way, with the left-child pointers being labeled with 71 00:04:34,434 --> 00:04:37,427 0's, the right child pointers being labeled with 1's, 72 00:04:37,427 --> 00:04:41,975 and the various nodes being labeled the symbols of the given alphabets, and the 73 00:04:41,975 --> 00:04:45,774 bits from the root down to the node labeled with the given symbol 74 00:04:45,774 --> 00:04:48,940 corresponding to the proposed encoding for that symbol. 75 00:04:50,000 --> 00:04:54,569 And what's cool about thinking about codes as trees is that the really 76 00:04:54,569 --> 00:04:59,459 important prefix-free condition, which seems like a nuisance to check in the 77 00:04:59,459 --> 00:05:02,999 abstract, shows up in a really clean way in these trees, 78 00:05:02,999 --> 00:05:08,211 namely the prefix-free condition is the same as leaves being the only nodes that 79 00:05:08,211 --> 00:05:11,558 have labels. No internal nodes are allowed to have a 80 00:05:11,558 --> 00:05:16,626 label in a prefix-free code. The reason for this is that we've set it 81 00:05:16,626 --> 00:05:20,853 up so that the encodings correspond to the bits along paths from the root to the 82 00:05:20,853 --> 00:05:23,706 labeled node. So being a prefix of another corresponds 83 00:05:23,706 --> 00:05:27,668 to one node being an ancestor of the other, and so, if all the labels are at 84 00:05:27,668 --> 00:05:31,684 the leaves, then of course nobody is an ancestor of the other and we have no 85 00:05:31,684 --> 00:05:35,734 prefixes. The other things that's really cool about 86 00:05:35,734 --> 00:05:39,318 this tree representation of codes is, it becomes pictorially obvious how you do 87 00:05:39,318 --> 00:05:43,047 the coding given this sequence of 0's and 1's from a prefix-free binary code, 88 00:05:43,047 --> 00:05:46,728 namely, you start at the beginning and you start at the root of your tree, 89 00:05:46,728 --> 00:05:50,215 whenever you see a 0 you go left, whenever you see a 1 you go right. 90 00:05:50,215 --> 00:05:53,992 At some point you'll hit a leaf, that leaf will have some label and that's the 91 00:05:53,992 --> 00:05:57,286 symbol that's being encoded, and after you hit a leaf, you just start all over 92 00:05:57,286 --> 00:06:01,759 again back to the root. So for example, if you were using our 93 00:06:01,759 --> 00:06:07,035 variable-length prefix-free code for the four letter alphabet, as in our running 94 00:06:07,035 --> 00:06:10,398 example, and you were given the sequence of 0s and 95 00:06:10,398 --> 00:06:13,827 1s, 0, 1, 1, 0, 1, 1, 1. What would you do? 96 00:06:13,827 --> 00:06:18,466 Well, you'd start at the root and you see a 0, so you follow the left-child 97 00:06:18,466 --> 00:06:20,851 pointer, and you immediately get to a leaf. 98 00:06:20,851 --> 00:06:24,713 It's labeled A, so you're going to output and A as the first symbol. 99 00:06:24,713 --> 00:06:28,461 Now you start all over. You return to the root, now what do you see? 100 00:06:28,461 --> 00:06:32,891 You see a 1, so you go right, you see another one, so you go right, and now you 101 00:06:32,891 --> 00:06:36,640 see a 0, so you go left and that gets you to the leaf labeled C. 102 00:06:38,260 --> 00:06:43,621 Now you start all over again. You see a 1, you go right, you see a 1, 103 00:06:43,621 --> 00:06:48,038 you go right again, and then, finally you see yet one more 1 and you wind up at the 104 00:06:48,038 --> 00:06:53,021 lead labelled D. So by repeated traversals through the 105 00:06:53,021 --> 00:06:56,063 tree, you decode the sequence of 0s and 1s as a C, D. 106 00:06:56,063 --> 00:07:00,769 There's never any ambiguity, because when you hit a label, you know you're going to 107 00:07:00,769 --> 00:07:04,672 leave, there's no where to go. And every internal note, it's unlabeled, 108 00:07:04,672 --> 00:07:08,920 you know to expect another bit, another traversal further down in the tree. 109 00:07:10,420 --> 00:07:14,430 So a final important point about this correspondence is that the encoding 110 00:07:14,430 --> 00:07:18,766 lengths of the symbols, the number of bits needed to encode the various 111 00:07:18,766 --> 00:07:22,723 symbols, are just the depths of the corresponding leaves in the tree that 112 00:07:22,723 --> 00:07:26,246 corresponds to the code. So for example, in our running example 113 00:07:26,246 --> 00:07:30,636 the symbol A is the only one that needs only one bit to encode and it's also the 114 00:07:30,636 --> 00:07:34,918 only leaf in level one of the tree. And similarly B needs two bits and shows 115 00:07:34,918 --> 00:07:38,116 up in the next level, the C and the D which need three bits 116 00:07:38,116 --> 00:07:43,756 show up in the third level. So this correspondence is, really just by 117 00:07:43,756 --> 00:07:46,477 construction, so how do you encode a given symbol. 118 00:07:46,477 --> 00:07:50,808 Well, it's just the bits on the path from the root, and that the number of such 119 00:07:50,808 --> 00:07:55,028 bits is just the number of pointer traversals you need to get from the root 120 00:07:55,028 --> 00:07:59,620 down to that leaf, and that's just the depth of that leaf in the tree. 121 00:07:59,620 --> 00:08:04,272 So we're now in a great position to really have a crisp definition of the 122 00:08:04,272 --> 00:08:06,786 problem. The input of course is just the 123 00:08:06,786 --> 00:08:11,627 frequencies for a bunch different symbols i from some alphabet capital sigma. 124 00:08:11,627 --> 00:08:16,720 I'm going to use P sub i as notation for the frequency of symbol i. 125 00:08:16,720 --> 00:08:19,037 So we know what it is we want to optimize. 126 00:08:19,037 --> 00:08:23,451 We want to minimize the expected number of bits needed to encode a symbol, where 127 00:08:23,451 --> 00:08:27,589 the average of the expectation is taken over the provided frequencies of the 128 00:08:27,589 --> 00:08:30,348 various symbols. Well, let's express this objective 129 00:08:30,348 --> 00:08:33,768 function using our newfound correspondence with binary trees. 130 00:08:33,768 --> 00:08:38,017 In particular, the fact that we can think about encoding lengths as depths of 131 00:08:38,017 --> 00:08:42,079 leaves in trees. So, given a tree, T, which corresponds to 132 00:08:42,079 --> 00:08:46,278 a prefix-free binary code. That is it should be a binary tree and 133 00:08:46,278 --> 00:08:50,283 the leaves of this tree should be in one-to-one correspondence with the 134 00:08:50,283 --> 00:08:54,029 symbols of sigma. We're going to define capital L(T) as the 135 00:08:54,029 --> 00:09:00,111 average encoding length. It's an average in the sense that we sum 136 00:09:00,111 --> 00:09:04,678 over all of the symbols of the alphabet, we weight each symbol by the frequency, 137 00:09:04,678 --> 00:09:09,126 and remember, this is part of the input, so whether the provided frequency P 138 00:09:09,126 --> 00:09:12,684 survived that symbol i. And then how many bits do we need to 139 00:09:12,684 --> 00:09:16,124 encode that symbol i? Well, it's just the depth of the leaf 140 00:09:16,124 --> 00:09:18,496 which is labelled i in the given tree, capital T. 141 00:09:18,496 --> 00:09:21,640 So this is what we want to make as small as possible. 142 00:09:22,680 --> 00:09:28,535 So, for instance, using the data from the previous video, the letters A, B, C, D 143 00:09:28,535 --> 00:09:34,741 with frequencies 60, 25, 10, and 5%. Then if we use the complete binary tree, 144 00:09:34,741 --> 00:09:39,240 that is the fixed length encoding, we just get two bits per character. 145 00:09:39,240 --> 00:09:43,918 While if we use the lopsided tree optimized, so that each A only takes one 146 00:09:43,918 --> 00:09:49,103 bit while suffering three bits for C and D, then the average encoding length drops 147 00:09:49,103 --> 00:09:55,191 to 1.55, as we saw in the last video. So what then is the goal, 148 00:09:55,191 --> 00:09:57,447 what's the responsibility of our algorithm? 149 00:09:57,447 --> 00:10:01,330 Well, amongst all binary trees, which have leaves and correspondence to the 150 00:10:01,330 --> 00:10:05,317 symbols of sigma, we want to compute the one which makes this average encoding 151 00:10:05,317 --> 00:10:09,252 length as small as possible which minimizes our objective function capital 152 00:10:09,252 --> 00:10:11,560 L. Turns out Huffman's greedy algorithm does 153 00:10:11,560 --> 00:10:12,820 it. More details to come. 154 00:10:15,080 --> 00:10:16,400 [SOUND]