So this set of lectures will be our final application of the greedy algorithm design paradigm. It's going to be an application in compression; specifically, I'll show you a greedy algorithm for constructing a certain kind of prefix-free binary code known as Huffman codes. We're going to spend this video just setting the stage.

So let's begin by defining a binary code. A binary code is just a way to write down symbols from some general alphabet in a manner that computers can understand; that is, it's just a function mapping each symbol from an alphabet capital sigma to a binary string, a sequence of 0s and 1s. This alphabet capital sigma could be any number of things, but as a simple example you could imagine it's the letters a through z, say in lower case, plus maybe the space character and some punctuation, so say a set of size 32 overall. And if you have 32 symbols you need to encode in binary, an obvious way to do it is to notice that there happen to be 32 different binary strings of length five, so why not just use one of those for each of your symbols? This would be a fixed-length code, in the sense that we're using exactly the same number of bits, namely five, to encode each of the symbols of our alphabet. This is pretty similar to what's going on with ASCII codes.

Of course, it's a mantra of this class to ask when we can do better than the obvious solution. So in this context, when can we do better than fixed-length codes? Sometimes we can, in the important case where some symbols are much more likely to appear than others. In that case we can encode information using fewer bits by deploying variable-length codes. And this is, in fact, a very practical idea; variable-length codes are used all the time in practice. One example is in encoding MP3 audio files. If you look at the standard for MP3 encoding, there's an initial phase in which you do analog-to-digital conversion, but then once you're in the digital domain, you apply Huffman codes, what I'm going to teach you in these videos, to compress the length of the files even further. And of course, as you know, compression, especially lossless compression like Huffman codes, is a good thing: you want to download a file, you want it to happen as fast as possible, or you want the file to be as small as possible.

So a new issue arises when you pass from fixed-length codes to variable-length codes. Let me illustrate that with a simple example. Suppose our alphabet sigma is just four characters: A, B, C, D. The obvious fixed-length encoding of these characters would just be 00, 01, 10, and 11. Well, suppose you wanted to use fewer bits and wanted to use a variable-length encoding; an obvious idea would be to try to get away with only one bit for a couple of these characters. So suppose, instead of using a double zero for A, we just use a single zero, and instead of using a double one for D, we just use a single one. That's only fewer bits, so it seems like it can only be better. But now, here's the question. Suppose someone handed you an encoded transmission consisting of the digits 001. What would have been the original sequence of symbols that led to that encoded version? Well, the answer is that there is not enough information to know what 001 was supposed to be an encoding of. The reason is that, having passed to a variable-length encoding, there is now ambiguity: there is more than one sequence of original symbols that could have led, under this encoding, to the output 001.
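To see this ambiguity concretely, here is a minimal sketch in Python (my illustration, not part of the lecture; the dictionary and the helper name `encode` are just hypothetical choices) that encodes messages under this non-prefix-free code and shows two different messages producing the identical bit string:

```python
# Non-prefix-free code from the example: A -> 0, B -> 01, C -> 10, D -> 1.
code = {"A": "0", "B": "01", "C": "10", "D": "1"}

def encode(message, code):
    """Concatenate the codeword of each symbol in the message."""
    return "".join(code[symbol] for symbol in message)

# Two different messages collide on the same encoded bit string.
print(encode("AB", code))   # 001
print(encode("AAD", code))  # 001
```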
Specifically, the sequences AB and AAD would both lead to 001. The letter A would give you a 0, and the letter B would give you 01, so AB would give you 001. On the other hand, AAD would also give you 001. So there's no way to know. Contrast this with a fixed-length encoding: if you're given a sequence of bits with a fixed-length code, of course you know where one letter ends and the next one begins. For example, if every symbol was encoded with five bits, you would just read five bits, you'd know which symbol it was, you'd read the next five bits, and so on. With variable-length codes, without further precautions, it's unclear where one symbol ends and the next one begins, so that's an additional issue we have to make sure to take care of.

To solve this problem, we're going to insist that our variable-length codes are prefix-free. What this means is that when we encode a bunch of symbols, we're going to make sure that for each pair of symbols i and j from the original alphabet sigma, the corresponding encodings have the property that neither one is a prefix of the other. Going back to the previous slide, you'll see that that example was not prefix-free. For example, 0, the encoding for A, was a prefix of 01, the encoding for B, and that led to ambiguity. Similarly, 1, the encoding for D, was a prefix of 10, the encoding for C, and that also leads to ambiguity. If no encoding is a prefix of another, and we'll develop this more shortly, then there's no ambiguity: there's a unique way to decode, to reconstruct what the original sequence of symbols was, given just the 0s and 1s.

Lest you think this is too strong a property, certainly interesting and useful variable-length codes exist that satisfy the prefix-free property. As one simple example, again just to encode the letters A, B, C, D, we can get away with encoding the symbol A using just a single bit, just a zero. Now, of course, to be prefix-free, it had better be the case that our encodings of B, C, and D all start with the bit one; otherwise we're not prefix-free. But we can get away with that. So let's encode B with a one and then a zero. And now both C and D had better have the property that they start neither with 0 nor with 10; that is, they had better start with 11. Well, let's just encode C using 110 and D using 111. So that's a variable-length code; the number of bits varies between one and three, but it is prefix-free.

And again, the reason we might want to do this, the reason we might want to use a variable-length encoding, is to take advantage of non-uniform frequencies of symbols from a given alphabet. So let me show you a concrete example of the benefits you can get from these kinds of codes on the next slide. Let's continue with just our four-symbol alphabet A, B, C, and D, and let's suppose we have good statistics in our application domain about exactly how frequent each of these symbols is. In particular, let's assume we know that A is by far the most likely symbol: say 60% of the symbols are going to be As, whereas 25% are Bs, 10% are Cs, and 5% are Ds. So why would you know these statistics? Well, in some domains you're just going to have a lot of expertise. In genomics, you're going to know the usual frequencies of As, Cs, Gs, and Ts.
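To make the unique-decoding claim concrete, here is a minimal Python sketch (mine, not from the lecture; the names `decode` and `decode_table` are just illustrative) of a left-to-right decoder for the prefix-free code above. Because no codeword is a prefix of another, the moment the bits read so far match a codeword we can safely emit that symbol:

```python
# Prefix-free code from the example: A -> 0, B -> 10, C -> 110, D -> 111.
code = {"A": "0", "B": "10", "C": "110", "D": "111"}
decode_table = {bits: symbol for symbol, bits in code.items()}

def decode(bitstring):
    """Scan left to right, emitting a symbol as soon as a codeword matches.
    This is unambiguous precisely because the code is prefix-free."""
    symbols, current = [], ""
    for bit in bitstring:
        current += bit
        if current in decode_table:      # a complete codeword has been read
            symbols.append(decode_table[current])
            current = ""
    assert current == "", "input was not a valid encoding"
    return "".join(symbols)

print(decode("0101100"))   # ABCA, parsed as 0 | 10 | 110 | 0
```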
For something like an MP3 file, you can literally just take an intermediate version of the file, after you've done the analog-to-digital conversion, and count the number of occurrences of each of the symbols. Then you know the exact frequencies, and you're good to go.

So let's compare the performance of the obvious fixed-length code, where we use two bits for each of the four characters, with that of the variable-length, prefix-free code that we mentioned on the previous slide. We're going to measure the performance of these codes by looking at how many bits are needed, on average, to encode a character, where the average is over the frequencies of the four different symbols. For the fixed-length encoding, of course, it's two bits per symbol; we don't even need to average, since whatever the symbol is, it uses exactly two bits. So what about the variable-length encoding? How many bits, on average, given these frequencies of the different symbols, are needed to encode a character of the alphabet sigma?

The answer is 1.55 bits per character on average. So what's the computation? Well, 60% of the time it's going to use only one bit, and that's where the big savings comes from: one bit is all that's needed whenever we see an A, and most of the characters are As. We don't do too badly when we see a B either, which is 25% of the time; we're only using two bits for each B. Now, it is true that for Cs and Ds we're paying a price, having to use three bits for each of those, but there aren't very many: only 10% of the time is it a C and 5% of the time is it a D. Adding it up, taking the average over the symbol frequencies, we get 0.6·1 + 0.25·2 + 0.1·3 + 0.05·3 = 1.55.

So this example draws our attention to a very neat algorithmic opportunity. Namely, given an alphabet and frequencies of the symbols, which in general are not uniform, we now know that the obvious solution, fixed-length codes, need not be optimal; we can improve upon them using variable-length prefix-free codes. So the computational problem we want to solve is: which one is the best? How do we get optimal compression? Which variable-length code gives us the minimum average encoding length of a symbol from this alphabet? Huffman codes are the solution to that problem, and we'll start developing them in the next video.
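As a sanity check on that arithmetic, here is a minimal Python sketch (mine, not from the lecture; frequencies are written in percent just to keep the arithmetic exact) that computes the average encoding length for both codes under these frequencies:

```python
# Symbol frequencies from the example, in percent.
frequencies_pct = {"A": 60, "B": 25, "C": 10, "D": 5}
fixed_code    = {"A": "00", "B": "01", "C": "10", "D": "11"}
variable_code = {"A": "0",  "B": "10", "C": "110", "D": "111"}

def average_length(code, frequencies_pct):
    """Expected bits per symbol: sum over symbols of frequency * codeword length."""
    return sum(frequencies_pct[s] * len(bits) for s, bits in code.items()) / 100

print(average_length(fixed_code, frequencies_pct))     # 2.0
print(average_length(variable_code, frequencies_pct))  # 1.55
```

Minimizing this average over all prefix-free codes, for given symbol frequencies, is exactly the problem that Huffman codes solve.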