So in this video, we're going to delve into information theory and focus on the new, exciting results that Shannon published in 1948, which we're still trying to understand and appreciate today. This video is going to talk about representing digital sources with bits. That may seem kind of contradictory: isn't a digital source already expressed in bits? Well, it turns out there are some interesting things you can do to make that representation more efficient.

So the first thing I want to do is talk about the digital channel model. It's a little bit different than what we're used to, a little bit simpler, and it's used all the time in information theory. We're going to talk about Shannon's source coding theorem, learn to appreciate what it is, and I'll show you how it is useful today in representing sources with as few bits as possible. And this leads us to compression, where you're going to discover that you've been using, all along, compression ideas with which you actually cannot get back the signal you started with. But we use these all the time, they're still very useful, and I'll show you how that works in just a second.

Okay, so here's my digital communication model. We've seen this many times. There's one little change, and that's going to be important today: there's something I call the source coder. The idea is that this source is spitting out a digital signal like text. It's a sequence of letters, not necessarily bits, but a sequence of letters. The job of the source coder is to convert that to bits, hopefully efficiently, and to do it in such a way that the source decoder can figure out what the original signal was without error. So that's what we're trying to do. We want it as efficient as possible, but not so efficient that we can't figure out what happened.

And as you know, once you have bits you have to go through assigning a signal set, send it through a wireline or wireless channel, receive it with the correlation receiver, and put out the estimated bits. Well, the simplification in the digital communication model is to ignore all the details about the analog channel, signal sets, correlation receivers, and so on. The bottom line is that when you put in a bit, out comes a bit, but it may be wrong.
And this little x-shaped thing is called a transition diagram, which describes how the errors can occur. So we start with a 0. We send a 0. What this diagram says is that it will come out a 0 with probability 1 minus pe, and an error will be created with probability pe, and we know that pe is related to all the details in the analog channel. Similarly, if you send a 1, it will come out a 1 with probability 1 minus pe and will be converted to a 0 with probability pe. So this is called a symmetric channel, for obvious reasons. pe summarizes all the design choices and channel characteristics that are needed in information theory. So we're just going to assume we have a pe: a bit sequence goes in, another bit sequence comes out, and those bits may be flipped, changed, with probability pe.

All right. So let's go over the assumptions that information theory makes. We're again going to assume the source emits letters, and this is now a technical term, that are drawn from an alphabet A. In this case there are capital K letters. It could literally be an alphabet; in other words, it could be text that you type. This alphabet could also consist of A-to-D converter output values: every sample takes on some quantized value, and you could consider each of those possible values a letter. What Shannon did, which was a really important assumption, is that he assumed these letters come out of the source in a random, probabilistic fashion. He said each letter has some probability, and it depends on what the source is what those probabilities turn out to be. And as you know, probabilities have to be nonnegative, they have to be at most 1, and their sum has to equal 1. So it's a probabilistic model for sources. You may say that the things I type in my email to people are not random. Well, it turns out that more complicated models related to this one, which are probabilistic, do a pretty good job of describing actual text as it's produced in many languages.

The next thing we have to assume is that the alphabet is known by both the source and the sink. Because I'm about to convert those letters into bits, I have to know what those bits are trying to express, so I have to know what the alphabet is or else I'm in big trouble: once I convert to bits, I won't be able to get back to the original letters that were sent.
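To make these two ingredients concrete, here is a minimal sketch, not something from the lecture slides, of a memoryless source drawing letters from an alphabet with fixed probabilities, and a binary symmetric channel that flips each bit independently with probability pe. The alphabet, the probability values, and pe = 0.05 are placeholders chosen for illustration.

```python
import random

# Illustrative alphabet and letter probabilities (placeholders; they must sum to 1).
alphabet = ["a1", "a2", "a3", "a4"]
probs    = [0.5, 0.25, 0.125, 0.125]

def emit_letters(n, rng=random):
    """Memoryless source: each letter is drawn independently with the given probabilities."""
    return rng.choices(alphabet, weights=probs, k=n)

def binary_symmetric_channel(bits, pe, rng=random):
    """Flip each bit independently with probability pe; otherwise pass it through."""
    return [b ^ 1 if rng.random() < pe else b for b in bits]

letters  = emit_letters(10)
sent     = [random.randint(0, 1) for _ in range(16)]   # some bit sequence to transmit
received = binary_symmetric_channel(sent, pe=0.05)
print(letters)
print(sent)
print(received)
```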
Anyway, knowing the alphabet is not a very hard assumption to make.

So there are two issues in information theory. The first is how letters should be represented by bits; that's what this video is about. The second issue is how to mitigate, reduce, the errors that the channel introduces. The channel is going to introduce errors whether you like it or not; pe is going to be the probability of a bit being changed. We can't do anything about that, but is there something we can do before we send the bits, and afterwards, to reduce the effective errors? It turns out the answer is yes, and I think the answer can be pretty surprising, but that's going to be in a later video. This video is about how you represent letters by bits.

So we're going to start by talking about what Shannon called the source coding theorem. To state it, we first need to define something called the entropy. The entropy, capital H, of an alphabet is simply the negative of the sum, over the letters, of the probability of each letter times the log of that probability: H(A) = -sum_k P(a_k) log2 P(a_k). It's just a definition for right now. We use log base 2 so that the answer comes out in bits. I'm going to do a sample calculation in just a second, but it's simply a quantity related to the probabilities of the letters, and essentially it summarizes what those probabilities are in one number.

Now, the entropy has a couple of important properties. What set of probabilities maximizes the entropy? That occurs when the letters are equally likely. There are K letters, so the probability of each is 1 over K. You can show by taking derivatives, it's just a calculus exercise, that the entropy is maximized in this case, and the entropy equals the log base 2 of the number of letters in the alphabet. So that's the biggest the entropy can be. The smallest it can be is 0, and that occurs when one of the letters has a probability of 1, which means all the other letters have a probability of 0, since probabilities have to add to 1. In that case, the entropy turns out to be 0. So the entropy is a nonnegative number for any alphabet, lying between 0 and the log base 2 of the number of letters in the alphabet.

There's a little detail here in doing this calculation.
And that is, in this expression, I may need to compute 0 times log base 2 of 0. In information theory, whenever you see that term, it is just taken to be 0. That's the result you get when you take limits, but to simplify things: 0 log 0 is just 0. That's how you can easily confirm that the case of one certain letter results in an entropy of 0.

All right, so let's calculate the entropy for an example. I have here a four-letter source, a1 through a4, with these probabilities: 1/2, 1/4, 1/8, and 1/8. I think it's pretty easy to see these add up to 1, so this is a good description of a source, and I want to calculate its entropy. If I just plug into the expression, this isn't difficult to do; this is the raw expression for entropy. Now the question is, what are these logs? I point out that one half is 2 to the minus 1, so its log base 2 is just minus 1. One quarter is 1 over 2 squared, which makes it 2 to the minus 2, so its log base 2 is minus 2. The other two, as you can easily see, work the same way: one eighth is 2 to the minus 3, so its log base 2 is minus 3. When you multiply this all out, all said and done, the entropy of this alphabet is 1.75 bits. We don't know what the entropy is good for yet, but it's going to be important, as I'll show you in just a second. This is how you do entropy calculations for a given alphabet: pretty straightforward, not difficult to do.
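Here is a minimal sketch of that calculation in code, just to check the arithmetic; the probabilities are the ones from the example, and the extreme cases confirm the maximum and minimum discussed above.

```python
import math

def entropy(probs):
    """H(A) = sum p_k * log2(1/p_k), i.e. -sum p_k log2 p_k; terms with p_k = 0 contribute 0."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

probs = [1/2, 1/4, 1/8, 1/8]     # the four-letter example
print(entropy(probs))            # 1.75
print(entropy([1/4] * 4))        # equally likely letters: log2(4) = 2.0, the maximum
print(entropy([1, 0, 0, 0]))     # one certain letter: 0.0, the minimum
```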
So what is the source coding theorem? I'm finally going to state it here. I need to define one more thing, and that is B(ak), the number of bits assigned to a given letter, and B-bar, the average number of bits required per letter to represent the alphabet, which is just the probability-weighted average of these B(ak)'s. So the quality, how efficient the representation by bits is, is going to be summarized by B-bar. If B-bar is small, then we use very few bits on average to represent the alphabet. The question is: how small can it be?

Well, here's the result, published in Shannon's very famous 1948 paper. He said the following: there exists a coder and decoder pair that can represent our source in bits, so that you can retrieve it, as long as B-bar is within a certain range. The lower limit is the entropy of the alphabet, and you don't need more bits than the entropy plus 1 bit: H(A) <= B-bar < H(A) + 1. So there exists some coder-decoder pair that will do this, with the average number of bits, however you assign bits to letters, within this range. Clearly, the most efficient code you could have is one that gets as close to the entropy as possible.

There's more to the theorem. Furthermore, if you try to use fewer bits than the entropy, you cannot recover what the original letters were; there are going to be errors. So this lower limit, the entropy, is a hard limit for what's called lossless compression. I'll show you some compression schemes in a second, but what this means is: if I use this number of bits to represent my source, the entropy, and I can find that code, then I can retrieve my original letters with no problem, and that's lossless compression. However, it turns out there are what are called lossy compression algorithms that we use all the time. They compress with fewer bits than the entropy, so you cannot get back the original letters. You may initially think, well, that's not good; it depends on what you mean by good.

Now I'm going to point out that Shannon's theorem has a little problem, and it's all buried in the little phrase right there: "there exists." He doesn't tell you what coder and decoder will satisfy this condition. He just says they're there, and he can prove there is one, but he doesn't know what it is; at the time he wrote the paper in 1948, he didn't. Well, in the 1950s a procedure for finding such a source code was developed that does satisfy this bound all the time, and I'm going to show you that in just a second.

Now let me talk about lossless and lossy compression a little bit. Here are compression approaches that you may have heard of, and how they fall into the lossless and lossy categories. LZW is a lossless compression algorithm; it's the same style of algorithm that's used when you zip a file. We're going to talk about the Huffman code in just a second. These represent a source, a sequence of letters, in a very efficient way with bits, and you can undo that assignment and get back the original file.
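As an aside, here is a minimal sketch of that lossless round-trip property using Python's built-in zlib module. zlib implements DEFLATE rather than LZW, so it's a stand-in for the idea, not the exact algorithm named above: compress, decompress, and you get exactly the original bytes back.

```python
import zlib

original   = b"the quick brown fox jumps over the lazy dog " * 100
compressed = zlib.compress(original)

print(len(original), "->", len(compressed), "bytes")     # far fewer bytes
assert zlib.decompress(compressed) == original            # lossless: exact recovery
```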
So I think that would be pretty important if you were trying to compress your entire file system on your computer, say by zipping it up. You want that compression algorithm to use as few bytes as possible, but you want to get your original files back. That's what a lossless compression algorithm does for you. As for lossy compression, you may be surprised to know that MP3, MPEG, and JPEG are all lossy. If you convert a CD to a set of MP3 files, it turns out you're using fewer bits than were used in the original sampling, in the A-to-D converter used to make the CD, and you've introduced errors. The whole game with lossy compression is to reduce the effect of those errors. They're going to be there, since you're using fewer bits than the entropy, but you're trying to make it so that, in the case of MP3, you don't hear the errors, or in the case of MPEG and JPEG, you don't see them very much. So we use lossy compression algorithms all the time, but there are certain cases in which you want a compression scheme that is not lossy. It depends on the application.

So let me go through our little example that we've already done. We calculated that the entropy was 1 and 3 quarters bits. The source coding theorem says we should be able to come up with a lossless compression code whose average number of bits is between 1 and 3 quarters and 2 and 3 quarters.

Well, let's try what's called a simple code. The simple code merely assigns the same number of bits to each letter, so it depends on how many letters you have, and that number is given by this expression: the ceiling of log base 2 of K. I don't know if you've seen this notation before, these little brackets on the sides; that's called the ceiling function, and what it means is round up. So the ceiling of 2.5 is 3, but the ceiling of 2 is 2, and the ceiling of 2.01 is again 3. It just means the smallest integer that is at least the log base 2 of K. We had 4 letters in the alphabet, and the log base 2 of 4 is 2, so we can represent our letters with 2 bits each in what's called a simple code; you just assign the bit patterns to the letters in whatever way you want. Well, B-bar for that, of course, is 2. And you'd say, okay, that's great: it falls within the limits of the source coding theorem.
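Here is a minimal sketch of that fixed-length assignment, with the codeword mapping chosen arbitrarily since any assignment of distinct 2-bit patterns works; it also checks that B-bar lands inside the range from the source coding theorem.

```python
import math

probs = {"a1": 1/2, "a2": 1/4, "a3": 1/8, "a4": 1/8}

# Fixed-length "simple code": every letter gets ceil(log2 K) bits.
bits_per_letter = math.ceil(math.log2(len(probs)))             # 2 for K = 4
simple_code = {a: format(i, f"0{bits_per_letter}b")             # a1 -> 00, a2 -> 01, ...
               for i, a in enumerate(probs)}

H     = sum(p * math.log2(1 / p) for p in probs.values())       # entropy: 1.75 bits
b_bar = sum(p * len(simple_code[a]) for a, p in probs.items())  # average length: 2.0

print(simple_code)
print(H, b_bar, H <= b_bar < H + 1)                             # True: inside the range
```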
So here's a code that exists and works fine. Well, is there something better? Is there something that can do better than 2 bits, that can get closer to what's called the entropy limit, the lower bound? Let me give you an example of one, and it may look a little screwy at first, a little interesting. What I'm going to do is represent a1 with 1 bit, a 0; represent a2 with 2 bits; and a3 and a4 with 3 bits each. The idea is that I'm using fewer bits to represent the letters that have higher probabilities, and more bits to represent the ones with lower probabilities, in other words, the ones that will not occur as often. This is an example of what's called a variable-length code. The simple binary code, if you will, is an equal-length code: the number of bits is the same for every letter. In a variable-length code, you use a different number of bits. Well, if you go through the calculation to figure out what B-bar is, guess what? It's 1.75. So this is as good as it gets: you can compress down to the entropy limit. What you have to do is use a different number of bits for each letter, depending on what their probabilities are. It's also true, of course, that the decoder has to know what this code is in order to figure out what's going on. And this is not the word "code" as in a secret; this is a code in the standard sense used in information theory: a way of representing our letters with bits.

All right, so I'm happy about that. But there's a problem that comes up with variable-length codes: I have to know where the letter boundaries are. So let's suppose the source puts out this sequence of letters, and using this code it's represented by this sequence of bits. Now what the decoder has to do is figure out what the code words are, when this is all it gets; it knows nothing other than what it receives. So it has to figure out the boundaries. What you have to be able to do is say: well, here's a 0, so that's got to be an a1, and that's got to be an a1. Now I get 1, 1, 0; well, there's the next code word. Here's a 0 for an a1. 1, 0, that's an a2. There's another a1. And finally we get an a4. You have to be able to do this without error and with no complications, like getting lost.
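Here is a minimal sketch of that decoding process. The codewords aren't all spelled out in the lecture, so I'm assuming the assignment consistent with the walk-through, a1 -> 0, a2 -> 10, a3 -> 110, a4 -> 111, and the sketch also recomputes B-bar = 1.75.

```python
probs = {"a1": 1/2, "a2": 1/4, "a3": 1/8, "a4": 1/8}
code  = {"a1": "0", "a2": "10", "a3": "110", "a4": "111"}   # assumed variable-length code
decode_table = {bits: letter for letter, bits in code.items()}

def encode(letters):
    return "".join(code[a] for a in letters)

def decode(bitstring):
    """Greedy prefix decoding: keep reading bits until they match a codeword."""
    letters, current = [], ""
    for bit in bitstring:
        current += bit
        if current in decode_table:
            letters.append(decode_table[current])
            current = ""
    return letters

message = ["a1", "a1", "a3", "a1", "a2", "a1", "a4"]
bits = encode(message)
print(bits)
print(decode(bits) == message)   # True: the boundaries are recovered from the bits alone

b_bar = sum(p * len(code[a]) for a, p in probs.items())
print(b_bar)                     # 1.75, the entropy limit for this source
```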
So this variable-length code has to be rigged so that no code word begins another code word. If it does, you're in big trouble: you're going to have trouble figuring out where the boundaries are. So how do you do that? In addition to satisfying the source coding theorem, you have to have a procedure for setting up a code so that the receiver can figure out what's going on. I'm going to show you how to build such a code; these are called Huffman codes. Dave Huffman was a graduate student in the 1950s, and as a student he figured out this code and proved it has very important properties. Rather than describe it in a formal way, I'm just going to give an example, because it's really easy to see how it works.

So I'm going to build what's called a coding tree; I'll show you what that is as we go. The first thing you do is write down the letters in decreasing order of probability, so the probabilities are going down; you may have to reorder the letters, but the important thing is that the probabilities are decreasing. The next thing you do is look at the two smallest-probability letters and merge them, as leaves coming out of a single binary tree node, and you define a new probability for that node, which is just the sum of the component probabilities. Then you repeat: look for the two smallest probabilities remaining, express them as leaves of a simple binary tree node, assign that node the sum of their probabilities, and keep merging. Eventually there is only one merge left to do, and the result is what's called a coding tree. It's just a way of defining a binary tree for a given alphabet in terms of its letters' probabilities.

The next thing you do is arbitrarily assign 0s and 1s to each branch. And this literally is arbitrary; I've done it in a kind of boring way, always putting a 0 on the upper branch and a 1 on the lower branch. You could swap those and everything would work fine. So that's the next step: assign bits to each branch. Then, to find your code, you figure out the path it takes to get to each letter from the root, the source node.
So we start here, at the source node, and follow the path to each letter. Suppose we look at the path for a3: it was 1, 1, 0, written in the same order as the path it took to get to that letter. The code word for a2, of course, reads off the same way: 1, 0. So it's pretty easy now to use the binary tree to figure out a source code. Now, because of the structure of this tree, and because there is a unique path through the tree to each letter, no code word begins any other code word, since each one is just a representation of a path through the tree. So you now have, it turns out, a code that obeys the rules, one that allows you to figure out, from the whole sequence of bits, where the boundaries are. So that's good.

This code has other important properties, too. The property that no code word begins any other is called being a prefix code; that's just the terminology in the field. And it turns out, as I'll show in an example in a second, that you may not actually produce a code that goes all the way to the entropy limit; the average code length may not quite reach H(A). But what Huffman proved was that no lossless source code can have a shorter average code length, a smaller B-bar, than what his procedure gives you. So his procedure is a more realistic reflection of the fewest number of bits you can use and still have a lossless compression algorithm. Shannon's result is just a bound: it says B-bar cannot be smaller than the entropy, but it could be as big as the entropy plus 1 bit. So you can take the Huffman code, set up a coding tree, and figure out the shortest possible average code length, the smallest B-bar. And that means Huffman coding is optimal: no other code can give you fewer bits per letter, on average, for an alphabet than a Huffman code. Very nice.
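Here is a minimal sketch of that tree-building procedure in code, repeatedly merging the two smallest-probability nodes and then reading codewords off the paths. The 0/1 branch assignment is arbitrary, so the particular codewords may differ from the slide, but the codeword lengths, and therefore B-bar, come out the same.

```python
import heapq
import itertools

def huffman_code(probs):
    """Build a Huffman code by repeatedly merging the two least probable nodes."""
    counter = itertools.count()   # tie-breaker so the heap never has to compare dicts
    heap = [(p, next(counter), {letter: ""}) for letter, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)                      # two smallest probabilities
        p2, _, codes2 = heapq.heappop(heap)
        merged = {a: "0" + c for a, c in codes1.items()}         # arbitrary: 0 on one branch...
        merged.update({a: "1" + c for a, c in codes2.items()})   # ...and 1 on the other
        heapq.heappush(heap, (p1 + p2, next(counter), merged))   # node probability = the sum
    return heap[0][2]

probs = {"a1": 1/2, "a2": 1/4, "a3": 1/8, "a4": 1/8}
code  = huffman_code(probs)
b_bar = sum(p * len(code[a]) for a, p in probs.items())
print(code)    # e.g. {'a1': '0', 'a2': '10', 'a3': '110', 'a4': '111'}, up to 0/1 swaps
print(b_bar)   # 1.75 for this alphabet: the smallest possible average length
```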
But there's a little problem with the Huffman code, which otherwise is great: the errors that the channel might introduce. Here's the same example I gave before, and here's our sequence of letters. Suppose a 0 comes out as a 1; pe is not 0, so suppose the channel changes that one bit to a 1. Well, now where are the boundaries? We have a 0; that's fine, that's an a1. But then comes 1, 1, 1, which decodes as an a4, and then a 0 and another 0, an a1 and an a1, et cetera. So what comes out is not the same as what was sent. If there's a channel error, the Huffman decoding can be entirely wrong. If there are no channel errors, it's always true that you can get back the original letter sequence. But if the channel introduces errors, the Huffman code, and in fact any variable-length code, is going to be very sensitive to those errors. So you want to be a little careful about that. We're going to take care of these channel errors in another video, but I'd just like to say: you have to be aware that channel errors can really mess things up. If you send your zip file to somebody and one of those bits gets changed, the entire file system you just put into that zip file can be totally corrupted and wrong, all the file names will be wrong, and so on. So it's really important not to have any errors in the transmission.
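Here is a minimal sketch of that sensitivity, reusing the assumed codeword assignment from the earlier sketch and flipping a single bit to show the decoder losing track of the letter boundaries.

```python
code = {"a1": "0", "a2": "10", "a3": "110", "a4": "111"}    # assumed codewords, as before
decode_table = {bits: letter for letter, bits in code.items()}

def decode(bitstring):
    letters, current = [], ""
    for bit in bitstring:
        current += bit
        if current in decode_table:
            letters.append(decode_table[current])
            current = ""
    return letters

message = ["a1", "a1", "a3", "a1", "a2", "a1", "a4"]
bits = "".join(code[a] for a in message)

# Flip one bit (the second one), as the channel might with probability pe.
corrupted = bits[:1] + ("1" if bits[1] == "0" else "0") + bits[2:]

print(decode(bits))        # the original message comes back exactly
print(decode(corrupted))   # a different, wrong letter sequence
```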
Now I want to point out something about Morse code. Morse code was invented for the telegraph, and I'm sure you know that SOS is dot, dot, dot, dash, dash, dash, dot, dot, dot. It's the early way of representing letters with dashes and dots, and if you think about it for a second, that's the same as representing the letters of the alphabet with 0s and 1s, just by a different name. And the Morse code turns out to be very reminiscent of the idea behind the Huffman code, in that the most probable letter, the E, is represented by a very short sequence, one dot, while the S, which has a smaller probability, has more characters, three dots. So that's the right kind of idea. The most improbable letter in the list, I think, is the J, and it's got the most dashes and dots of anything in the Morse code. So that's the right kind of idea. Well, how well does it work? If you go through, using the letter probabilities, and figure out the entropy, you come up with 4.14 bits. And if you figure out the average number of bits used by the Morse code, it turns out to be 5.56, which, I will point out, is more than a bit away from the entropy; that says you can do better than the Morse code.

Well, the question is why? What's going on? The answer turns out to be that there's one little thing they didn't do right. Like I said, an E is one dot and an S is three dots. Well, how do you know that an S is not really three E's? Three E's would be three dots, and an S is three dots. What they did in the telegraph case is the operator would send a letter and then pause before going on to the next letter of the alphabet. So for an S, the three dots would be right next to each other, but for three E's there would be pauses between the dots. Those pauses act as what are called commas. You cannot decode the original sequence of letters if everything is all jammed together. And if you count that extra pause, that's where the 5.56 comes from, it turns out. So it's inefficient.

So here is the Huffman code, which I found, for the same set of probabilities, and you can see some of these code words are very long. But what's the average code length? That's kind of a surprise: it turns out it does a whole lot better than the Morse code. It's 4.35. It doesn't get to the entropy limit, but Huffman's result says you can't do any better than 4.35: there is not going to be any lossless variable-length code with a shorter average code length than this one. It's not unique, the particular sequences of 0s and 1s may be different, but the smallest average you can achieve is 4.35. You can't reach the entropy limit, but you're only about 0.2 bits away from it, and that's significantly less than the Morse code: more than a bit is saved per letter, on average, with the Huffman code.

So that's the idea of source coding, what's called compression. Most data sources can be compressed, in a lossless sense, with far fewer bits than a simple code, which you might just pick off the shelf. The effect of doing that, by the way, is to lower the source data rate. So compared with the simple codes we've been talking about to date, we can do better: if we use a variable-length code, we can dramatically cut down the data rate. You can also use lossy compression, but then you can't get back the original sequence of letters, and you have to design things so that what you lose doesn't matter very much; that's what MPEG, JPEG, and MP3 do.
Now, the other downside, of course, is that these lossless, variable-length codes are going to be very sensitive to channel errors. And so what we're going to build up to in succeeding videos is ways of getting around those channel errors, so that they're not going to be much of a problem, even though the analog channel is noisy and has a pe that's greater than 0.