So this set of lectures will be our final application of the greedy algorithm design paradigm. It's going to be to an application in compression: specifically, I'll show you a greedy algorithm for constructing a certain kind of prefix-free binary code known as a Huffman code. We're going to spend this video just setting the stage.

So let's begin by defining a binary code. A binary code is just a way to write down symbols from some general alphabet in a manner that computers can understand. That is, it's just a function mapping each symbol of an alphabet capital sigma to a binary string, a sequence of 0s and 1s. This alphabet capital sigma could be any number of things, but as a simple example, you could imagine it's the letters a through z, say in lower case, plus maybe the space character and some punctuation, so perhaps a set of size 32 overall. And if you have 32 symbols you need to encode in binary, an obvious way to do it is to observe that there happen to be exactly 32 different binary strings of length five, so why not just use one of those for each of your symbols? This would be a fixed-length code, in the sense that we're using exactly the same number of bits, namely five, to encode each symbol of our alphabet. This is pretty similar to what's going on with ASCII codes.

Of course, it's a mantra of this class to ask when we can do better than the obvious solution. So in this context: when can we do better than fixed-length codes? Sometimes we can, in the important case where some symbols are much more likely to appear than others. In that case, we can encode information using fewer bits by deploying variable-length codes. And this is, in fact, a very practical idea; variable-length codes are used all the time in practice. One example is in encoding MP3 audio files. If you look at the standard for MP3 encoding, there's an initial phase in which you do analog-to-digital conversion, but then, once you're in the digital domain, you apply Huffman codes, which is what I'm going to teach you in these videos, to compress the length of the files even further. And of course, as you know, compression, especially lossless compression like Huffman codes, is a good thing: you want to download a file, you want it to happen as fast as possible, or you want to make the file as small as possible.
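To make this concrete, here is a minimal Python sketch of such a fixed-length code. The particular 32-symbol alphabet and the helper names are assumptions made for illustration; the lecture doesn't pin down a specific alphabet.

```python
import string

# A hypothetical 32-symbol alphabet: 26 lowercase letters, the space
# character, and five punctuation marks (chosen only for illustration).
alphabet = list(string.ascii_lowercase) + [' ', '.', ',', '!', '?', ';']
assert len(alphabet) == 32

# Fixed-length code: each symbol gets the 5-bit binary string of its index.
fixed_code = {sym: format(i, '05b') for i, sym in enumerate(alphabet)}

def encode(text):
    """Concatenate each symbol's 5-bit codeword."""
    return ''.join(fixed_code[ch] for ch in text)

def decode(bits):
    """Read exactly 5 bits at a time -- with a fixed-length code there
    is never any doubt where one symbol ends and the next begins."""
    inverse = {word: sym for sym, word in fixed_code.items()}
    return ''.join(inverse[bits[i:i + 5]] for i in range(0, len(bits), 5))

print(encode('ab'))           # 0000000001
print(decode(encode('ab')))   # ab
```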
So a new issue arises when you pass from fixed-length codes to variable-length codes. Let me illustrate that with a simple example. Suppose our alphabet sigma is just four characters: A, B, C, D. The obvious fixed-length encoding of these characters would just be 00, 01, 10, and 11. Well, suppose you wanted to use fewer bits and wanted to use a variable-length encoding. An obvious idea would be to try to get away with only one bit for a couple of these characters. So suppose, instead of using a double zero for A, we just use a single 0, and instead of using a double one for D, we just use a single 1. That's only fewer bits, so it seems like it can only be better. But now, here's the question: suppose someone handed you an encoded transmission consisting of the digits 001. What would have been the original sequence of symbols that led to that encoded version?

All right, so the answer is D: there is not enough information to know what 001 was supposed to be an encoding of. The reason is that, having passed to a variable-length encoding, there is now ambiguity; there is more than one sequence of original symbols that could have led, under this encoding, to the output 001. Specifically, the answers A and C would both lead to 001. The letter A would give you a 0 and the letter B would give you 01, so AB would give you 001; on the other hand, AAD would also give you 001. So there's no way to know.

Contrast this with fixed-length encoding. If you're given a sequence of bits with a fixed-length code, of course you know where one letter ends and the next one begins. For example, if every symbol was encoded with five bits, you would just read five bits, you'd know which symbol it was, you'd read the next five bits, and so on. With variable-length codes, without further precautions, it's unclear where one symbol starts and the next one begins, so that's an additional issue we have to take care of. To solve this problem, that with variable-length codes and without further precautions it's unclear where one symbol ends and where the next one begins, we're going to insist that our variable-length codes are prefix-free.
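Here is a small Python sketch, again just for illustration, that enumerates every symbol sequence consistent with a given bit string under this code, making the ambiguity explicit:

```python
# The non-prefix-free code from the example: A's codeword 0 is a
# prefix of B's 01, and D's codeword 1 is a prefix of C's 10.
code = {'A': '0', 'B': '01', 'C': '10', 'D': '1'}

def decodings(bits, code):
    """Return every symbol sequence whose encoding is exactly `bits`."""
    results = []
    def search(rest, acc):
        if not rest:                       # consumed all bits: a full decoding
            results.append(''.join(acc))
            return
        for sym, word in code.items():     # try every codeword that fits next
            if rest.startswith(word):
                search(rest[len(word):], acc + [sym])
    search(bits, [])
    return results

print(decodings('001', code))   # ['AAD', 'AB'] -- two valid decodings
```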
So what this means is that when we encode a bunch of symbols, we're going to make sure that for each pair of symbols i and j from the original alphabet sigma, the corresponding encodings have the property that neither one is a prefix of the other. Going back to the previous slide, you'll see that that example was not prefix-free. For example, 0, the encoding of A, was a prefix of 01, the encoding of B, and that led to ambiguity. Similarly, 1, the encoding of D, was a prefix of 10, the encoding of C, and that also leads to ambiguity. So if no codeword is a prefix of another, and we'll develop this more shortly, then there's no ambiguity: there's a unique way to decode, to reconstruct what the original sequence of symbols was, given just the 0s and 1s.

Lest you think this is too strong a property, certainly interesting and useful variable-length codes exist that satisfy the prefix-free property. As one simple example, again just to encode the letters A, B, C, D, we can get away with encoding the symbol A using a single bit, just a 0. Now, of course, to be prefix-free, it had better be the case that our encodings of B, C, and D all start with the bit 1; otherwise we're not prefix-free. But we can get away with that. So let's encode B with a 1 and then a 0. And now both C and D had better have the property that they start neither with 0 nor with 10; that is, they had better start with 11. Well, let's just encode C using 110 and D using 111. So that's a variable-length code; the number of bits varies between one and three, but it is prefix-free.
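Here is a short Python sketch, assuming this exact four-codeword code, that checks the prefix-free property and performs the unambiguous left-to-right decoding just described:

```python
code = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}

def is_prefix_free(code):
    """Check that no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

def decode(bits, code):
    """Greedy left-to-right decoding; because the code is prefix-free,
    the first complete codeword we read is always the right one."""
    inverse = {word: sym for sym, word in code.items()}
    out, current = [], ''
    for bit in bits:
        current += bit
        if current in inverse:        # a complete codeword has been read
            out.append(inverse[current])
            current = ''
    return ''.join(out)

print(is_prefix_free(code))          # True
print(decode('0101101110', code))    # 0|10|110|111|0 -> 'ABCDA'
```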
And again, the reason we might want to use a variable-length encoding is to take advantage of non-uniform frequencies of symbols from a given alphabet. So let me show you a concrete example of the benefits you can get from these kinds of codes on the next slide. Let's continue with just our four-symbol alphabet: A, B, C, and D. And let's suppose we have good statistics in our application domain about exactly how frequent each of these symbols is. In particular, let's assume we know that A is by far the most likely symbol; say 60% of the symbols are going to be As, whereas 25% are Bs, 10% are Cs, and 5% are Ds. So why would you know these statistics? Well, in some domains you're just going to have a lot of expertise. In genomics, you're going to know the usual frequencies of As, Cs, Gs, and Ts. For something like an MP3 file, you can literally just take an intermediate version of the file, after you've done the analog-to-digital transformation, and count the number of occurrences of each of the symbols. Then you know the exact frequencies, and you're good to go.

So let's compare the performance of the obvious fixed-length code, where we use two bits for each of the four characters, with that of the variable-length, prefix-free code we mentioned on the previous slide. We're going to measure the performance of these codes by looking at how many bits, on average, are needed to encode a character, where the average is taken over the frequencies of the four different symbols. For the fixed-length encoding, of course, it's two bits per symbol; we don't even need to average, since whatever the symbol is, it uses exactly two bits. So what about the variable-length encoding shown on the right in pink? How many bits, on average, given these frequencies of the different symbols, are needed to encode a character of the alphabet sigma?

OK, so the correct answer is the second one: on average, 1.55 bits per character. So what's the computation? Well, 60% of the time, it's going to use only one bit, and that's where the big savings comes from. One bit is all that's needed whenever we see an A, and most of the characters are As. We don't do too badly when we see a B either, which is 25% of the time; we're only using two bits for each B. Now, it is true that for Cs and Ds we're paying a price, having to use three bits for each of those, but there aren't very many: only 10% of the time is it a C, and 5% of the time is it a D. And if you add it all up, taking the average over the symbol frequencies, 0.6 × 1 + 0.25 × 2 + 0.1 × 3 + 0.05 × 3, we get the result of 1.55.
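As a quick sanity check on that arithmetic, here is a minimal Python sketch computing the average encoding length of both codes under these frequencies:

```python
# Symbol frequencies and the two codes compared in the lecture's example.
freqs = {'A': 0.60, 'B': 0.25, 'C': 0.10, 'D': 0.05}
fixed_code  = {'A': '00', 'B': '01', 'C': '10', 'D': '11'}
prefix_free = {'A': '0',  'B': '10', 'C': '110', 'D': '111'}

def average_bits(code, freqs):
    """Expected number of bits per symbol under the given frequencies."""
    return sum(freqs[sym] * len(word) for sym, word in code.items())

print(average_bits(fixed_code, freqs))   # 2.0
print(average_bits(prefix_free, freqs))  # 1.55 (up to float rounding)
```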
So this example draws our attention to a very neat algorithmic opportunity. Namely, given an alphabet and frequencies of the symbols, which in general are not uniform, we now know that the obvious solution, fixed-length codes, need not be optimal; we can improve upon them using variable-length prefix-free codes. So the computational problem we want to solve is: which one is the best? How do we get optimal compression? Which variable-length code gives us the minimum average encoding length of a symbol from this alphabet? Huffman codes are the solution to that problem, and we'll start developing them in the next video.