So this set of lectures will be our final application of the greedy algorithm design paradigm. It's going to be an application in compression; specifically, I'll show you a greedy algorithm for constructing a certain kind of prefix-free binary code known as Huffman codes. We're going to spend this video just setting the stage.

So let's begin by defining a binary code. A binary code is just a way to write down symbols from some general alphabet in a manner that computers can understand; that is, it's just a function mapping each symbol from an alphabet capital sigma to a binary string, a sequence of 0s and 1s. This alphabet capital sigma could be any number of things, but as a simple example you could imagine it's the letters a through z, say in lower case, plus maybe the space character and some punctuation, so say a set of size 32 overall. And if you have 32 symbols you need to encode in binary, an obvious way to do it is to notice that there happen to be 32 different binary strings of length five, so why not just use one of those for each of your symbols? This would be a fixed-length code, in the sense that we're using exactly the same number of bits, namely five, to encode each of the symbols of our alphabet. This is pretty similar to what's going on with ASCII codes.

Of course, it's a mantra of this class to ask when we can do better than the obvious solution. So in this context, when can we do better than fixed-length codes? Sometimes we can, in the important case where some symbols are much more likely to appear than others. In that case we can encode information using fewer bits by deploying variable-length codes. And this is, in fact, a very practical idea; variable-length codes are used all the time in practice. One example is in encoding MP3 audio files. If you look at the standard for MP3 encoding, there's an initial phase in which you do analog-to-digital conversion, but then once you're in the digital domain, you apply Huffman codes, what I'm going to teach you in these videos, to compress the length of the files even further. And of course, as you know, compression, especially lossless compression like Huffman codes, is a good thing: you want to download a file, you want it to happen as fast as possible, or you want the file to be as small as possible.

So a new issue arises when you pass from fixed-length codes to variable-length codes. Let me illustrate that with a simple example. Suppose our alphabet sigma is just four characters: A, B, C, D. The obvious fixed-length encoding of these characters would just be 00, 01, 10, and 11. Well, suppose you wanted to use fewer bits and wanted to use a variable-length encoding; an obvious idea would be to try to get away with only one bit for a couple of these characters. So suppose, instead of using a double zero for A, we just use a single zero, and instead of using a double one for D, we just use a single one. That's only fewer bits, so it seems like it can only be better. But now, here's the question. Suppose someone handed you an encoded transmission consisting of the digits 001. What would have been the original sequence of symbols that led to that encoded version? Well, the answer is that there is not enough information to know what 001 was supposed to be an encoding of. The reason is that, having passed to a variable-length encoding, there is now ambiguity: there is more than one sequence of original symbols that could have led, under this encoding, to the output 001.
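To see this ambiguity concretely, here is a minimal sketch in Python (my illustration, not part of the lecture; the dictionary and the helper name `encode` are just hypothetical choices) that encodes messages under this non-prefix-free code and shows two different messages producing the identical bit string:

```python
# Non-prefix-free code from the example: A -> 0, B -> 01, C -> 10, D -> 1.
code = {"A": "0", "B": "01", "C": "10", "D": "1"}

def encode(message, code):
    """Concatenate the codeword of each symbol in the message."""
    return "".join(code[symbol] for symbol in message)

# Two different messages collide on the same encoded bit string.
print(encode("AB", code))   # 001
print(encode("AAD", code))  # 001
```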
Specifically, the sequences AB and AAD would both lead to 001. The letter A would give you a 0, and the letter B would give you 01, so AB would give you 001. On the other hand, AAD would also give you 001. So there's no way to know. Contrast this with a fixed-length encoding: if you're given a sequence of bits with a fixed-length code, of course you know where one letter ends and the next one begins. For example, if every symbol was encoded with five bits, you would just read five bits, you'd know which symbol it was, you'd read the next five bits, and so on. With variable-length codes, without further precautions, it's unclear where one symbol ends and the next one begins, so that's an additional issue we have to make sure to take care of.

To solve this problem, we're going to insist that our variable-length codes are prefix-free. What this means is that when we encode a bunch of symbols, we're going to make sure that for each pair of symbols i and j from the original alphabet sigma, the corresponding encodings have the property that neither one is a prefix of the other. Going back to the previous slide, you'll see that that example was not prefix-free. For example, 0, the encoding for A, was a prefix of 01, the encoding for B, and that led to ambiguity. Similarly, 1, the encoding for D, was a prefix of 10, the encoding for C, and that also leads to ambiguity. If no encoding is a prefix of another, and we'll develop this more shortly, then there's no ambiguity: there's a unique way to decode, to reconstruct what the original sequence of symbols was, given just the 0s and 1s.

Lest you think this is too strong a property, certainly interesting and useful variable-length codes exist that satisfy the prefix-free property. As one simple example, again just to encode the letters A, B, C, D, we can get away with encoding the symbol A using just a single bit, just a zero. Now, of course, to be prefix-free, it had better be the case that our encodings of B, C, and D all start with the bit one; otherwise we're not prefix-free. But we can get away with that. So let's encode B with a one and then a zero. And now both C and D had better have the property that they start neither with 0 nor with 10; that is, they had better start with 11. Well, let's just encode C using 110 and D using 111. So that's a variable-length code; the number of bits varies between one and three, but it is prefix-free.

And again, the reason we might want to do this, the reason we might want to use a variable-length encoding, is to take advantage of non-uniform frequencies of symbols from a given alphabet. So let me show you a concrete example of the benefits you can get from these kinds of codes on the next slide. Let's continue with just our four-symbol alphabet A, B, C, and D, and let's suppose we have good statistics in our application domain about exactly how frequent each of these symbols is. In particular, let's assume we know that A is by far the most likely symbol: say 60% of the symbols are going to be As, whereas 25% are Bs, 10% are Cs, and 5% are Ds. So why would you know these statistics? Well, in some domains you're just going to have a lot of expertise. In genomics, you're going to know the usual frequencies of As, Cs, Gs, and Ts.
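To make the unique-decoding claim concrete, here is a minimal Python sketch (mine, not from the lecture; the names `decode` and `decode_table` are just illustrative) of a left-to-right decoder for the prefix-free code above. Because no codeword is a prefix of another, the moment the bits read so far match a codeword we can safely emit that symbol:

```python
# Prefix-free code from the example: A -> 0, B -> 10, C -> 110, D -> 111.
code = {"A": "0", "B": "10", "C": "110", "D": "111"}
decode_table = {bits: symbol for symbol, bits in code.items()}

def decode(bitstring):
    """Scan left to right, emitting a symbol as soon as a codeword matches.
    This is unambiguous precisely because the code is prefix-free."""
    symbols, current = [], ""
    for bit in bitstring:
        current += bit
        if current in decode_table:      # a complete codeword has been read
            symbols.append(decode_table[current])
            current = ""
    assert current == "", "input was not a valid encoding"
    return "".join(symbols)

print(decode("0101100"))   # ABCA, parsed as 0 | 10 | 110 | 0
```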
For something like an MP3 file, you can literally just take an intermediate version of the file, after you've done the analog-to-digital conversion, and count the number of occurrences of each of the symbols. Then you know the exact frequencies, and you're good to go.

So let's compare the performance of the obvious fixed-length code, where we use two bits for each of the four characters, with that of the variable-length, prefix-free code that we mentioned on the previous slide. We're going to measure the performance of these codes by looking at how many bits are needed, on average, to encode a character, where the average is over the frequencies of the four different symbols. For the fixed-length encoding, of course, it's two bits per symbol; we don't even need to average, since whatever the symbol is, it uses exactly two bits. So what about the variable-length encoding? How many bits, on average, given these frequencies of the different symbols, are needed to encode a character of the alphabet sigma?

The answer is 1.55 bits per character on average. So what's the computation? Well, 60% of the time it's going to use only one bit, and that's where the big savings comes from: one bit is all that's needed whenever we see an A, and most of the characters are As. We don't do too badly when we see a B either, which is 25% of the time; we're only using two bits for each B. Now, it is true that for Cs and Ds we're paying a price, having to use three bits for each of those, but there aren't very many: only 10% of the time is it a C and 5% of the time is it a D. Adding it up, taking the average over the symbol frequencies, we get 0.6·1 + 0.25·2 + 0.1·3 + 0.05·3 = 1.55.

So this example draws our attention to a very neat algorithmic opportunity. Namely, given an alphabet and frequencies of the symbols, which in general are not uniform, we now know that the obvious solution, fixed-length codes, need not be optimal; we can improve upon them using variable-length prefix-free codes. So the computational problem we want to solve is: which one is the best? How do we get optimal compression? Which variable-length code gives us the minimum average encoding length of a symbol from this alphabet? Huffman codes are the solution to that problem, and we'll start developing them in the next video.
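As a sanity check on that arithmetic, here is a minimal Python sketch (mine, not from the lecture; frequencies are written in percent just to keep the arithmetic exact) that computes the average encoding length for both codes under these frequencies:

```python
# Symbol frequencies from the example, in percent.
frequencies_pct = {"A": 60, "B": 25, "C": 10, "D": 5}
fixed_code    = {"A": "00", "B": "01", "C": "10", "D": "11"}
variable_code = {"A": "0",  "B": "10", "C": "110", "D": "111"}

def average_length(code, frequencies_pct):
    """Expected bits per symbol: sum over symbols of frequency * codeword length."""
    return sum(frequencies_pct[s] * len(bits) for s, bits in code.items()) / 100

print(average_length(fixed_code, frequencies_pct))     # 2.0
print(average_length(variable_code, frequencies_pct))  # 1.55
```

Minimizing this average over all prefix-free codes, for given symbol frequencies, is exactly the problem that Huffman codes solve.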