So this set of lectures will be our final application of the greedy algorithm design paradigm. It's going to be an application in compression. Specifically, I'll show you a greedy algorithm for constructing a certain kind of prefix-free binary code known as Huffman codes.

We're going to spend this video just setting the stage. So let's begin by defining a binary code. A binary code is just a way to write down symbols from some general alphabet in a manner that computers can understand. That is, it's just a function mapping each symbol from an alphabet, capital sigma, to a binary string, a sequence of zeroes and ones.

This alphabet capital sigma could be any number of things, but as a simple example, you could imagine it's the letters a through z, say in lowercase, plus maybe the space character and some punctuation, for a set of size 32 overall. And if you have 32 symbols you need to encode in binary, an obvious way to do it is this: there happen to be exactly 32 different binary strings of length five, so why not just use one of those for each of your symbols?

This would be a fixed-length code, in the sense that we're using exactly the same number of bits, namely five, to encode each of the symbols of our alphabet.
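As a quick illustration, here is a minimal Python sketch of such a fixed-length code. The particular 32-symbol alphabet (lowercase letters plus space and five punctuation marks) is just a hypothetical choice matching the example above; any 32 symbols would do.

```python
import string

# Hypothetical 32-symbol alphabet: 26 lowercase letters, the space
# character, and five punctuation marks (any 32 symbols would work).
alphabet = list(string.ascii_lowercase) + [" ", ".", ",", "?", "!", ";"]
assert len(alphabet) == 32

# Fixed-length code: the i-th symbol gets the 5-bit binary
# representation of i. There are exactly 2**5 = 32 such strings.
code = {sym: format(i, "05b") for i, sym in enumerate(alphabet)}

print(code["a"])  # '00000'
print(code["f"])  # '00101'
```

Every codeword has length exactly five, so decoding is trivial: read five bits at a time.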
This is pretty similar to what's going on with ASCII codes. And of course, it's a mantra of this class to ask: when can we do better than the obvious solution? So in this context, when can we do better than fixed-length codes? Sometimes we can, in the important case when some symbols are much more likely to appear than others. In that case, we can encode information using fewer bits by deploying variable-length codes.

And this is, in fact, a very practical idea. Variable-length codes are used all the time in practice. One example is in encoding MP3 audio files. If you look up the standard for MP3 encoding, there's an initial phase in which you do analog-to-digital conversion. But then, once you're in the digital domain, you apply Huffman codes, which is what I'm going to teach you in these videos, to compress the length of the files even further. And of course, as you know, compression, especially lossless compression like Huffman codes, is a good thing. If you want to download a file, you want it to happen as fast as possible, so you want to make the file as small as possible.

Now, a new issue arises when you pass from fixed-length codes to variable-length codes. Let me illustrate that with a simple example.
Suppose our alphabet sigma is just four characters: A, B, C, D. The obvious fixed-length encoding of these characters would just be 00, 01, 10, and 11. Well, suppose you wanted to use fewer bits and opted for a variable-length encoding. An obvious idea would be to try to get away with only one bit for a couple of these characters. So suppose instead of using a double zero for A, we just use a single 0, and instead of using a double one for D, we just use a single 1.

That's strictly fewer bits, so it seems like it can only be better. But now, here's the question. Suppose someone handed you an encoded transmission consisting of the digits 001. What would have been the original sequence of symbols that led to that encoded version?

The answer is D: there is not enough information to know what 001 was supposed to be an encoding of. The reason is that, having passed to a variable-length encoding, there is now ambiguity. There is more than one sequence of original symbols that could have led, under this encoding, to the output 001. Specifically, answers A and C would both lead to 001. The letter A would give you a 0, and the letter B would give you 01.
So that would give you 001. On the other hand, AAD would also give you 001. So there's no way to know. Contrast this with fixed-length encoding. If you're given a sequence of bits with a fixed-length code, of course you know where one letter ends and the next one begins. For example, if every symbol was encoded with 5 bits, you would just read 5 bits, you would know what symbol it was, you would read the next 5 bits, and so on. With variable-length codes, without further precautions, it's unclear where one symbol ends and the next one begins. So that's an additional issue we have to make sure to take care of with variable-length codes.

To solve this problem, that with variable-length codes and without further precautions it's unclear where one symbol ends and the next one begins, we're going to insist that our variable-length codes are prefix-free. What this means is that when we encode a bunch of symbols, we're going to make sure that for each pair of distinct symbols i and j from the original alphabet sigma, the corresponding encodings have the property that neither one is a prefix of the other. Going back to the previous slide, you'll see that that example was not prefix-free.
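To make the ambiguity concrete, here is a small Python sketch using the code table from the example above, showing that two different symbol sequences encode to the same bit string:

```python
# The non-prefix-free code from the example:
# A and D were each shortened to a single bit.
ambiguous_code = {"A": "0", "B": "01", "C": "10", "D": "1"}

def encode(text, code):
    """Concatenate the codeword of each symbol in the text."""
    return "".join(code[ch] for ch in text)

# Two different inputs collide on the same output bits,
# so 001 cannot be decoded unambiguously.
print(encode("AB", ambiguous_code))   # '001'
print(encode("AAD", ambiguous_code))  # '001'
```

Since both AB and AAD produce 001, no decoder can recover the original message from the bits alone.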
For example, 0, the encoding of A, was a prefix of 01, the encoding of B, and that led to ambiguity. Similarly, 1, the encoding of D, was a prefix of 10, the encoding of C, and that also leads to ambiguity. So if no encoding is a prefix of another, and we'll develop this more shortly, then there's no ambiguity. Then there's a unique way to decode, to reconstruct what the original sequence of symbols was, given just the zeroes and ones.

Lest you think this is too strong a property, certainly interesting and useful variable-length codes exist that satisfy the prefix-free property. Here's one simple example, again just encoding the letters A, B, C, D. We can get away with encoding the symbol A using just a single bit, a 0. Now, of course, to be prefix-free, it had better be the case that our encodings of B, C, and D all start with the bit 1; otherwise we're not prefix-free. But we can get away with that: let's encode B with a one and then a zero. Now both C and D had better start neither with 0 nor with 10. That is, they'd better start with 11. So let's just encode C using 110 and D using 111. That is a variable-length code, with the number of bits varying between one and three, but it is prefix-free.
And again, the reason we might want to do this, the reason we might want to use a variable-length encoding, is to take advantage of non-uniform frequencies of symbols from a given alphabet. So let me show you a concrete example of the benefits you can get from these kinds of codes on the next slide.

Let's continue with just our four-symbol alphabet: A, B, C, and D. And let's suppose we have good statistics in our application domain about exactly how frequent each of these symbols is. In particular, let's assume we know that A is by far the most likely symbol: say 60% of the symbols are going to be As, whereas 25% are Bs, 10% are Cs, and 5% are Ds. So why would you know these statistics? Well, in some domains you're just going to have a lot of expertise. In genomics, you're going to know the usual frequencies of As, Cs, Gs, and Ts. For something like an MP3 file, you can literally just take an intermediate version of the file, after you've done the analog-to-digital transformation, and count the number of occurrences of each of the symbols. Then you know the exact frequencies, and you're good to go.
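Counting empirical symbol frequencies, as described for the MP3 case, is a one-liner in Python with `collections.Counter`. The sample stream below is made up purely for illustration; in practice it would be the intermediate digital version of the file being compressed.

```python
from collections import Counter

# Hypothetical symbol stream standing in for real file contents.
stream = "AAABAACABDAABACAABAA"

counts = Counter(stream)
total = len(stream)
frequencies = {sym: count / total for sym, count in counts.items()}
print(frequencies)
```

Once these frequencies are in hand, they become the input to the code-design problem discussed next.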
So let's compare the performance of the obvious fixed-length code, where we used 2 bits for each of the 4 characters, with that of the variable-length, prefix-free code that we mentioned on the previous slide. We're going to measure the performance of these codes by looking at how many bits, on average, are needed to encode a character, where the average is over the frequencies of the four different symbols.

For the fixed-length encoding, of course, it's two bits per symbol. We don't even need the average: whatever the symbol is, it uses exactly two bits. So what about the variable-length encoding, shown on the right in pink? How many bits, on average, given these frequencies of the different symbols, are needed to encode a character of the alphabet sigma?

The correct answer is the second one: on average, 1.55 bits per character. So what's the computation? Well, 60% of the time it's going to use only 1 bit, and that's where the big savings comes from. One bit is all that's needed whenever we see an A, and most of the characters are As. We don't do too badly when we see a B either, which is 25% of the time; we're only using 2 bits for each B.
Now, it is true that for Cs and Ds we're paying a price, having to use 3 bits for each of those, but there aren't very many: only 10% of the time is it a C, and 5% of the time is it a D. And if you add everything up, taking the average over the symbol frequencies, we get the result of 1.55.

So this example draws our attention to a very neat algorithmic opportunity. Namely, given an alphabet and frequencies of the symbols, which in general are not uniform, we now know that the obvious solution, fixed-length codes, need not be optimal. We can improve upon them using variable-length prefix-free codes. So the computational problem you probably want to solve is: which one is the best? How do we get optimal compression? Which variable-length code gives us the minimum average encoding length of a symbol from this alphabet? Huffman codes are the solution to that problem, and we'll start developing them in the next video.
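The 1.55 figure is just a weighted average of codeword lengths. A quick check in Python, using the frequencies and the two codes from the example:

```python
frequencies = {"A": 0.60, "B": 0.25, "C": 0.10, "D": 0.05}
prefix_free_code = {"A": "0", "B": "10", "C": "110", "D": "111"}
fixed_length_code = {"A": "00", "B": "01", "C": "10", "D": "11"}

def average_bits(code, freqs):
    """Expected number of bits per symbol under the given frequencies."""
    return sum(freqs[sym] * len(word) for sym, word in code.items())

# 0.60*1 + 0.25*2 + 0.10*3 + 0.05*3 = 1.55 bits per symbol,
# versus 2 bits per symbol for the fixed-length code.
print(round(average_bits(prefix_free_code, frequencies), 2))   # 1.55
print(round(average_bits(fixed_length_code, frequencies), 2))  # 2.0
```

Minimizing this average over all prefix-free codes is exactly the optimization problem that Huffman's algorithm, developed in the next videos, solves greedily.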