So this set of lectures will be our final application of the greedy algorithm design paradigm. It's going to be to an application in compression: specifically, I'll show you a greedy algorithm for constructing a certain kind of prefix-free binary code known as a Huffman code. We're going to spend this video just setting the stage.

So let's begin by defining a binary code. A binary code is just a way to write down symbols from some general alphabet in a manner that computers can understand. That is, it's just a function mapping each symbol of an alphabet capital sigma to a binary string, a sequence of 0s and 1s. This alphabet capital sigma could be any number of things, but as a simple example, you could imagine it's the letters a through z, say in lower case, plus maybe the space character and some punctuation, so perhaps a set of size 32 overall. And if you have 32 symbols you need to encode in binary, an obvious way to do it is to observe that there happen to be exactly 32 different binary strings of length five, so why not just use one of those for each of your symbols? This would be a fixed-length code, in the sense that we're using exactly the same number of bits, namely five, to encode each symbol of our alphabet. This is pretty similar to what's going on with ASCII codes.

Of course, it's a mantra of this class to ask when we can do better than the obvious solution. So in this context: when can we do better than fixed-length codes? Sometimes we can, in the important case where some symbols are much more likely to appear than others. In that case, we can encode information using fewer bits by deploying variable-length codes. And this is, in fact, a very practical idea; variable-length codes are used all the time in practice. One example is in encoding MP3 audio files. If you look at the standard for MP3 encoding, there's an initial phase in which you do analog-to-digital conversion, but then, once you're in the digital domain, you apply Huffman codes, which is what I'm going to teach you in these videos, to compress the length of the files even further. And of course, as you know, compression, especially lossless compression like Huffman codes, is a good thing: you want to download a file, you want it to happen as fast as possible, or you want to make the file as small as possible.
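To make this concrete, here is a minimal Python sketch of such a fixed-length code. The particular 32-symbol alphabet and the helper names are assumptions made for illustration; the lecture doesn't pin down a specific alphabet.

```python
import string

# A hypothetical 32-symbol alphabet: 26 lowercase letters, the space
# character, and five punctuation marks (chosen only for illustration).
alphabet = list(string.ascii_lowercase) + [' ', '.', ',', '!', '?', ';']
assert len(alphabet) == 32

# Fixed-length code: each symbol gets the 5-bit binary string of its index.
fixed_code = {sym: format(i, '05b') for i, sym in enumerate(alphabet)}

def encode(text):
    """Concatenate each symbol's 5-bit codeword."""
    return ''.join(fixed_code[ch] for ch in text)

def decode(bits):
    """Read exactly 5 bits at a time -- with a fixed-length code there
    is never any doubt where one symbol ends and the next begins."""
    inverse = {word: sym for sym, word in fixed_code.items()}
    return ''.join(inverse[bits[i:i + 5]] for i in range(0, len(bits), 5))

print(encode('ab'))           # 0000000001
print(decode(encode('ab')))   # ab
```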
So a new issue arises when you pass from fixed-length codes to variable-length codes. Let me illustrate that with a simple example. Suppose our alphabet sigma is just four characters: A, B, C, D. The obvious fixed-length encoding of these characters would just be 00, 01, 10, and 11. Well, suppose you wanted to use fewer bits and wanted to use a variable-length encoding. An obvious idea would be to try to get away with only one bit for a couple of these characters. So suppose, instead of using a double zero for A, we just use a single 0, and instead of using a double one for D, we just use a single 1. That's only fewer bits, so it seems like it can only be better. But now, here's the question: suppose someone handed you an encoded transmission consisting of the digits 001. What would have been the original sequence of symbols that led to that encoded version?

All right, so the answer is D: there is not enough information to know what 001 was supposed to be an encoding of. The reason is that, having passed to a variable-length encoding, there is now ambiguity; there is more than one sequence of original symbols that could have led, under this encoding, to the output 001. Specifically, the answers A and C would both lead to 001. The letter A would give you a 0 and the letter B would give you 01, so AB would give you 001; on the other hand, AAD would also give you 001. So there's no way to know.

Contrast this with fixed-length encoding. If you're given a sequence of bits with a fixed-length code, of course you know where one letter ends and the next one begins. For example, if every symbol was encoded with five bits, you would just read five bits, you'd know which symbol it was, you'd read the next five bits, and so on. With variable-length codes, without further precautions, it's unclear where one symbol starts and the next one begins, so that's an additional issue we have to take care of. To solve this problem, that with variable-length codes and without further precautions it's unclear where one symbol ends and where the next one begins, we're going to insist that our variable-length codes are prefix-free.
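Here is a small Python sketch, again just for illustration, that enumerates every symbol sequence consistent with a given bit string under this code, making the ambiguity explicit:

```python
# The non-prefix-free code from the example: A's codeword 0 is a
# prefix of B's 01, and D's codeword 1 is a prefix of C's 10.
code = {'A': '0', 'B': '01', 'C': '10', 'D': '1'}

def decodings(bits, code):
    """Return every symbol sequence whose encoding is exactly `bits`."""
    results = []
    def search(rest, acc):
        if not rest:                       # consumed all bits: a full decoding
            results.append(''.join(acc))
            return
        for sym, word in code.items():     # try every codeword that fits next
            if rest.startswith(word):
                search(rest[len(word):], acc + [sym])
    search(bits, [])
    return results

print(decodings('001', code))   # ['AAD', 'AB'] -- two valid decodings
```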
So what this means is that when we encode a bunch of symbols, we're going to make sure that for each pair of symbols i and j from the original alphabet sigma, the corresponding encodings have the property that neither one is a prefix of the other. Going back to the previous slide, you'll see that that example was not prefix-free. For example, 0, the encoding of A, was a prefix of 01, the encoding of B, and that led to ambiguity. Similarly, 1, the encoding of D, was a prefix of 10, the encoding of C, and that also leads to ambiguity. So if no codeword is a prefix of another, and we'll develop this more shortly, then there's no ambiguity: there's a unique way to decode, to reconstruct what the original sequence of symbols was, given just the 0s and 1s.

Lest you think this is too strong a property, certainly interesting and useful variable-length codes exist that satisfy the prefix-free property. As one simple example, again just to encode the letters A, B, C, D, we can get away with encoding the symbol A using a single bit, just a 0. Now, of course, to be prefix-free, it had better be the case that our encodings of B, C, and D all start with the bit 1; otherwise we're not prefix-free. But we can get away with that. So let's encode B with a 1 and then a 0. And now both C and D had better have the property that they start neither with 0 nor with 10; that is, they had better start with 11. Well, let's just encode C using 110 and D using 111. So that's a variable-length code; the number of bits varies between one and three, but it is prefix-free.
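Here is a short Python sketch, assuming this exact four-codeword code, that checks the prefix-free property and performs the unambiguous left-to-right decoding just described:

```python
code = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}

def is_prefix_free(code):
    """Check that no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

def decode(bits, code):
    """Greedy left-to-right decoding; because the code is prefix-free,
    the first complete codeword we read is always the right one."""
    inverse = {word: sym for sym, word in code.items()}
    out, current = [], ''
    for bit in bits:
        current += bit
        if current in inverse:        # a complete codeword has been read
            out.append(inverse[current])
            current = ''
    return ''.join(out)

print(is_prefix_free(code))          # True
print(decode('0101101110', code))    # 0|10|110|111|0 -> 'ABCDA'
```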
And again, the reason we might want to use a variable-length encoding is to take advantage of non-uniform frequencies of symbols from a given alphabet. So let me show you a concrete example of the benefits you can get from these kinds of codes on the next slide. Let's continue with just our four-symbol alphabet: A, B, C, and D. And let's suppose we have good statistics in our application domain about exactly how frequent each of these symbols is. In particular, let's assume we know that A is by far the most likely symbol; say 60% of the symbols are going to be As, whereas 25% are Bs, 10% are Cs, and 5% are Ds. So why would you know these statistics? Well, in some domains you're just going to have a lot of expertise. In genomics, you're going to know the usual frequencies of As, Cs, Gs, and Ts. For something like an MP3 file, you can literally just take an intermediate version of the file, after you've done the analog-to-digital transformation, and count the number of occurrences of each of the symbols. Then you know the exact frequencies, and you're good to go.

So let's compare the performance of the obvious fixed-length code, where we use two bits for each of the four characters, with that of the variable-length, prefix-free code we mentioned on the previous slide. We're going to measure the performance of these codes by looking at how many bits, on average, are needed to encode a character, where the average is taken over the frequencies of the four different symbols. For the fixed-length encoding, of course, it's two bits per symbol; we don't even need to average, since whatever the symbol is, it uses exactly two bits. So what about the variable-length encoding shown on the right in pink? How many bits, on average, given these frequencies of the different symbols, are needed to encode a character of the alphabet sigma?

OK, so the correct answer is the second one: on average, 1.55 bits per character. So what's the computation? Well, 60% of the time, it's going to use only one bit, and that's where the big savings comes from. One bit is all that's needed whenever we see an A, and most of the characters are As. We don't do too badly when we see a B either, which is 25% of the time; we're only using two bits for each B. Now, it is true that for Cs and Ds we're paying a price, having to use three bits for each of those, but there aren't very many: only 10% of the time is it a C, and 5% of the time is it a D. And if you add it all up, taking the average over the symbol frequencies, 0.6 × 1 + 0.25 × 2 + 0.1 × 3 + 0.05 × 3, we get the result of 1.55.
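As a quick sanity check on that arithmetic, here is a minimal Python sketch computing the average encoding length of both codes under these frequencies:

```python
# Symbol frequencies and the two codes compared in the lecture's example.
freqs = {'A': 0.60, 'B': 0.25, 'C': 0.10, 'D': 0.05}
fixed_code  = {'A': '00', 'B': '01', 'C': '10', 'D': '11'}
prefix_free = {'A': '0',  'B': '10', 'C': '110', 'D': '111'}

def average_bits(code, freqs):
    """Expected number of bits per symbol under the given frequencies."""
    return sum(freqs[sym] * len(word) for sym, word in code.items())

print(average_bits(fixed_code, freqs))   # 2.0
print(average_bits(prefix_free, freqs))  # 1.55 (up to float rounding)
```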
So this example draws our attention to a very neat algorithmic opportunity. Namely, given an alphabet and frequencies of the symbols, which in general are not uniform, we now know that the obvious solution, fixed-length codes, need not be optimal; we can improve upon them using variable-length prefix-free codes. So the computational problem we want to solve is: which one is the best? How do we get optimal compression? Which variable-length code gives us the minimum average encoding length of a symbol from this alphabet? Huffman codes are the solution to that problem, and we'll start developing them in the next video.