So this set of lectures will be our final application of the greedy algorithm design paradigm. It's going to be an application in compression. Specifically, I'll show you a greedy algorithm for constructing a certain kind of prefix-free binary code known as Huffman codes.

We're going to spend this video just setting the stage. So let's begin by defining a binary code. A binary code is just a way to write down symbols from some general alphabet in a manner that computers can understand. That is, it's just a function mapping each symbol from an alphabet, capital sigma, to a binary string, a sequence of zeroes and ones.

This alphabet capital sigma could be any number of things, but as a simple example, you could imagine it's the letters a through z, say in lowercase, plus maybe the space character and some punctuation, for a set of size 32 overall. And if you have 32 symbols you need to encode in binary, an obvious way to do it is this: there happen to be exactly 32 different binary strings of length five, so why not just use one of those for each of your symbols?

This would be a fixed-length code, in the sense that we're using exactly the same number of bits, namely five, to encode each of the symbols of our alphabet.
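As a quick illustration, here is a minimal Python sketch of such a fixed-length code. The particular 32-symbol alphabet (lowercase letters plus space and five punctuation marks) is just a hypothetical choice matching the example above; any 32 symbols would do.

```python
import string

# Hypothetical 32-symbol alphabet: 26 lowercase letters, the space
# character, and five punctuation marks (any 32 symbols would work).
alphabet = list(string.ascii_lowercase) + [" ", ".", ",", "?", "!", ";"]
assert len(alphabet) == 32

# Fixed-length code: the i-th symbol gets the 5-bit binary
# representation of i. There are exactly 2**5 = 32 such strings.
code = {sym: format(i, "05b") for i, sym in enumerate(alphabet)}

print(code["a"])  # '00000'
print(code["f"])  # '00101'
```

Every codeword has length exactly five, so decoding is trivial: read five bits at a time.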
This is pretty similar to what's going on with ASCII codes. And of course, it's a mantra of this class to ask: when can we do better than the obvious solution? So in this context, when can we do better than fixed-length codes? Sometimes we can, in the important case when some symbols are much more likely to appear than others. In that case, we can encode information using fewer bits by deploying variable-length codes.

And this is, in fact, a very practical idea. Variable-length codes are used all the time in practice. One example is in encoding MP3 audio files. If you look up the standard for MP3 encoding, there's an initial phase in which you do analog-to-digital conversion. But then, once you're in the digital domain, you apply Huffman codes, which is what I'm going to teach you in these videos, to compress the length of the files even further. And of course, as you know, compression, especially lossless compression like Huffman codes, is a good thing. If you want to download a file, you want it to happen as fast as possible, so you want to make the file as small as possible.

Now, a new issue arises when you pass from fixed-length codes to variable-length codes. Let me illustrate that with a simple example.
Suppose our alphabet sigma is just four characters: A, B, C, D. The obvious fixed-length encoding of these characters would just be 00, 01, 10, and 11. Well, suppose you wanted to use fewer bits and opted for a variable-length encoding. An obvious idea would be to try to get away with only one bit for a couple of these characters. So suppose instead of using a double zero for A, we just use a single 0, and instead of using a double one for D, we just use a single 1.

That's strictly fewer bits, so it seems like it can only be better. But now, here's the question. Suppose someone handed you an encoded transmission consisting of the digits 001. What would have been the original sequence of symbols that led to that encoded version?

The answer is D: there is not enough information to know what 001 was supposed to be an encoding of. The reason is that, having passed to a variable-length encoding, there is now ambiguity. There is more than one sequence of original symbols that could have led, under this encoding, to the output 001. Specifically, answers A and C would both lead to 001. The letter A would give you a 0, and the letter B would give you 01.
So that would give you 001. On the other hand, AAD would also give you 001. So there's no way to know. Contrast this with fixed-length encoding. If you're given a sequence of bits with a fixed-length code, of course you know where one letter ends and the next one begins. For example, if every symbol was encoded with 5 bits, you would just read 5 bits, you would know what symbol it was, you would read the next 5 bits, and so on. With variable-length codes, without further precautions, it's unclear where one symbol ends and the next one begins. So that's an additional issue we have to make sure to take care of with variable-length codes.

To solve this problem, that with variable-length codes and without further precautions it's unclear where one symbol ends and the next one begins, we're going to insist that our variable-length codes are prefix-free. What this means is that when we encode a bunch of symbols, we're going to make sure that for each pair of distinct symbols i and j from the original alphabet sigma, the corresponding encodings have the property that neither one is a prefix of the other. Going back to the previous slide, you'll see that that example was not prefix-free.
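To make the ambiguity concrete, here is a small Python sketch using the code table from the example above, showing that two different symbol sequences encode to the same bit string:

```python
# The non-prefix-free code from the example:
# A and D were each shortened to a single bit.
ambiguous_code = {"A": "0", "B": "01", "C": "10", "D": "1"}

def encode(text, code):
    """Concatenate the codeword of each symbol in the text."""
    return "".join(code[ch] for ch in text)

# Two different inputs collide on the same output bits,
# so 001 cannot be decoded unambiguously.
print(encode("AB", ambiguous_code))   # '001'
print(encode("AAD", ambiguous_code))  # '001'
```

Since both AB and AAD produce 001, no decoder can recover the original message from the bits alone.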
For example, 0, the encoding of A, was a prefix of 01, the encoding of B, and that led to ambiguity. Similarly, 1, the encoding of D, was a prefix of 10, the encoding of C, and that also leads to ambiguity. So if no encoding is a prefix of another, and we'll develop this more shortly, then there's no ambiguity. Then there's a unique way to decode, to reconstruct what the original sequence of symbols was, given just the zeroes and ones.

Lest you think this is too strong a property, certainly interesting and useful variable-length codes exist that satisfy the prefix-free property. Here's one simple example, again just encoding the letters A, B, C, D. We can get away with encoding the symbol A using just a single bit, a 0. Now, of course, to be prefix-free, it had better be the case that our encodings of B, C, and D all start with the bit 1; otherwise we're not prefix-free. But we can get away with that: let's encode B with a one and then a zero. Now both C and D had better start neither with 0 nor with 10. That is, they'd better start with 11. So let's just encode C using 110 and D using 111. That is a variable-length code, with the number of bits varying between one and three, but it is prefix-free.
And again, the reason we might want to do this, the reason we might want to use a variable-length encoding, is to take advantage of non-uniform frequencies of symbols from a given alphabet. So let me show you a concrete example of the benefits you can get from these kinds of codes on the next slide.

Let's continue with just our four-symbol alphabet: A, B, C, and D. And let's suppose we have good statistics in our application domain about exactly how frequent each of these symbols is. In particular, let's assume we know that A is by far the most likely symbol: say 60% of the symbols are going to be As, whereas 25% are Bs, 10% are Cs, and 5% are Ds. So why would you know these statistics? Well, in some domains you're just going to have a lot of expertise. In genomics, you're going to know the usual frequencies of As, Cs, Gs, and Ts. For something like an MP3 file, you can literally just take an intermediate version of the file, after you've done the analog-to-digital transformation, and count the number of occurrences of each of the symbols. Then you know the exact frequencies, and you're good to go.
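Counting empirical symbol frequencies, as described for the MP3 case, is a one-liner in Python with `collections.Counter`. The sample stream below is made up purely for illustration; in practice it would be the intermediate digital version of the file being compressed.

```python
from collections import Counter

# Hypothetical symbol stream standing in for real file contents.
stream = "AAABAACABDAABACAABAA"

counts = Counter(stream)
total = len(stream)
frequencies = {sym: count / total for sym, count in counts.items()}
print(frequencies)
```

Once these frequencies are in hand, they become the input to the code-design problem discussed next.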
So let's compare the performance of the obvious fixed-length code, where we used 2 bits for each of the 4 characters, with that of the variable-length, prefix-free code that we mentioned on the previous slide. We're going to measure the performance of these codes by looking at how many bits, on average, are needed to encode a character, where the average is over the frequencies of the four different symbols.

For the fixed-length encoding, of course, it's two bits per symbol. We don't even need the average: whatever the symbol is, it uses exactly two bits. So what about the variable-length encoding, shown on the right in pink? How many bits, on average, given these frequencies of the different symbols, are needed to encode a character of the alphabet sigma?

The correct answer is the second one: on average, 1.55 bits per character. So what's the computation? Well, 60% of the time it's going to use only 1 bit, and that's where the big savings comes from. One bit is all that's needed whenever we see an A, and most of the characters are As. We don't do too badly when we see a B either, which is 25% of the time; we're only using 2 bits for each B.
Now, it is true that for Cs and Ds we're paying a price, having to use 3 bits for each of those, but there aren't very many: only 10% of the time is it a C, and 5% of the time is it a D. And if you add everything up, taking the average over the symbol frequencies, we get the result of 1.55.

So this example draws our attention to a very neat algorithmic opportunity. Namely, given an alphabet and frequencies of the symbols, which in general are not uniform, we now know that the obvious solution, fixed-length codes, need not be optimal. We can improve upon them using variable-length prefix-free codes. So the computational problem you probably want to solve is: which one is the best? How do we get optimal compression? Which variable-length code gives us the minimum average encoding length of a symbol from this alphabet? Huffman codes are the solution to that problem, and we'll start developing them in the next video.
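The 1.55 figure is just a weighted average of codeword lengths. A quick check in Python, using the frequencies and the two codes from the example:

```python
frequencies = {"A": 0.60, "B": 0.25, "C": 0.10, "D": 0.05}
prefix_free_code = {"A": "0", "B": "10", "C": "110", "D": "111"}
fixed_length_code = {"A": "00", "B": "01", "C": "10", "D": "11"}

def average_bits(code, freqs):
    """Expected number of bits per symbol under the given frequencies."""
    return sum(freqs[sym] * len(word) for sym, word in code.items())

# 0.60*1 + 0.25*2 + 0.10*3 + 0.05*3 = 1.55 bits per symbol,
# versus 2 bits per symbol for the fixed-length code.
print(round(average_bits(prefix_free_code, frequencies), 2))   # 1.55
print(round(average_bits(fixed_length_code, frequencies), 2))  # 2.0
```

Minimizing this average over all prefix-free codes is exactly the optimization problem that Huffman's algorithm, developed in the next videos, solves greedily.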