So in this video, we're going to delve into information theory and focus on the new, exciting results that Shannon published in 1948, which we're still trying to understand and appreciate today. This video is going to talk about representing digital sources with bits. That may seem kind of contradictory. Isn't a digital source already expressed in bits? Well, it turns out there are some interesting things you can do to make that representation more efficient.

So the first thing I want to do is talk about the digital channel model. It's a little bit different than what we're used to, a little bit simpler, and it's used all the time in information theory. We're going to talk about Shannon's source coding theorem, learn what it is and appreciate it, and I'll show you how it's useful today in representing sources with as few bits as possible. This leads us to compression, where you're going to discover that you've been using, all along, compression ideas where you actually cannot get back the signal you started with. We use these all the time, they're still very useful, and I'll show you how that works in just a second.

Okay, so here's my digital communication model. We've seen this many times. There's one little change, and it's going to be important today: there's something I call the source coder. The idea is that the source is spitting out a digital signal like text. It's a sequence of letters, not necessarily bits, but a sequence of letters. The job of the source coder is to convert that to bits, hopefully efficiently, and to do it in such a way that the source decoder can figure out what the original signal was without error. So that's what we're trying to do. We want the representation as efficient as possible, but not so "efficient" that we can't figure out what was sent. And as you know, once you have bits, you have to go through assigning a signal set, send it through a wireline or wireless channel, receive it with the correlation receiver, and put out the estimated bits.

Well, the simplification in the digital communication model is to ignore all the details about the analog channel, signal sets, correlation receivers, and so on. The bottom line is that when you put in a bit, out comes a bit, but it may be wrong. This little x-shaped thing is called a transition diagram, and it describes how the errors can occur. We start with a 0 and we send a 0. What the diagram says is that it will come out a 0 with probability 1 minus pe, and an error will occur, it will come out a 1, with probability pe, and we know that pe is related to all the details of the analog channel. Similarly, if you send a 1, it will come out a 1 with probability 1 minus pe and will be converted to a 0 with probability pe. So this is called a symmetric channel, for obvious reasons. pe summarizes all the design choices and channel characteristics that are needed in information theory. So we're just going to assume we have a pe: a bit sequence goes in, another bit sequence comes out, and those bits may be flipped, but only with probability pe.
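Just to make that channel model concrete, here's a minimal sketch in Python of a binary symmetric channel; the function name and the value of pe below are my own illustration, not anything from the slides. Each bit goes in and comes out, possibly flipped, independently of the others, with probability pe.

```python
import random

def binary_symmetric_channel(bits, pe):
    """Flip each bit independently with probability pe (the crossover probability)."""
    return [bit ^ 1 if random.random() < pe else bit for bit in bits]

# A short bit sequence through a fairly noisy channel, pe = 0.1
sent = [0, 1, 1, 0, 1, 0, 0, 1]
received = binary_symmetric_channel(sent, pe=0.1)
print(sent)
print(received)   # same length, but some bits may have been flipped
```

Everything about the analog channel, the signal sets, and the correlation receiver is hiding inside that one number pe.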
All right. So let's go over the assumptions that information theory makes. We're again going to assume the source emits letters, and "letters" here is now a technical term, that are drawn from an alphabet A. In this case there are capital K letters. It could literally be an alphabet, in other words, text that you type. But the alphabet could also consist of A/D converter output values: every sample takes on some quantized value, and you could consider each of those possible values a letter.

What Shannon did, and this was a really important assumption, is he assumed that the letters come out of the source in a random, probabilistic fashion. He said each letter has some probability, and what those probabilities turn out to be depends on what the source is. As you know, probabilities have to be between 0 and 1, and their sum has to equal 1. So it's a probabilistic model for sources. Now you may say, the things I type in my emails to people are not random. Well, it turns out that more complicated models related to this one, which are still probabilistic, do a pretty good job of describing actual text produced in many languages.

The next thing we have to assume is that the alphabet is known by both the source and the sink. I'm about to convert those letters into bits, and I have to know what those bits are trying to express, so I have to know what the alphabet is or else I'm in big trouble: once I convert to bits, I won't know how to get back to the original letters that were sent. So that's not a very hard assumption to make.

So there are two issues in information theory. The first is how letters should be represented by bits; that's what this video is about. The second is how to mitigate, to reduce, the errors that the channel introduces. The channel is going to introduce errors whether you like it or not; pe is the probability of a bit being changed, and we can't do anything about that. But is there something we can do before we send the bits, and afterwards, to reduce the effect of those errors? It turns out the answer is yes, and I think the answer can be pretty surprising, but that's going to come in a later video. This video is about how you represent letters by bits.

So we're going to start by talking about what Shannon called the source coding theorem, and to state it we first need to define something called the entropy. The entropy, capital H, of an alphabet is simply the negative of the sum, over the letters, of the probability of each letter times the log of that probability. It's just a definition for right now. We use log base 2 so that the answer comes out in bits. I'm going to do a sample calculation in just a second, but it's just a quantity defined from the probabilities of the letters, and essentially it summarizes what those probabilities are in one number.

Now the entropy has a couple of important properties. What set of probabilities maximizes the entropy? That occurs when the letters are equally likely: there are K letters, so the probability of each is 1 over K. You can show by taking derivatives, it's just a calculus exercise, that the entropy is maximized in this case, and the entropy comes out to the log base 2 of the number of letters in the alphabet. That's the biggest the entropy can be. The smallest it can be is 0, and that occurs when one of the letters has a probability of 1, which means all the other letters have a probability of 0, since the probabilities have to add to 1. In that case the entropy turns out to be 0. So the entropy, for any alphabet, is a number that lies between 0 and the log base 2 of the number of letters in the alphabet.
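As a quick sketch of that definition (my own illustration, not code from the video), here's the entropy of an alphabet computed from its letter probabilities; it skips letters with probability 0, for a reason I'll get to in a moment.

```python
import math

def entropy(probabilities):
    """H(A) = -sum over letters of p(a_k) * log2(p(a_k)), in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Equally likely letters give the maximum, log2(K); a certain letter gives the minimum, 0.
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0, which is log2(4)
print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0
```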
There is one little detail in doing this calculation: in the expression I may need to compute 0 times log base 2 of 0. In information theory, whenever you see that, it is just 0. That's the result you get when you take limits and things like that, but to keep things simple: 0 times the log of 0 is just 0. That's how you can easily confirm that the case where one letter has probability 1 results in an entropy of 0.

All right, so let's calculate an entropy for an example. I have here a 4-letter source, a1 through a4, with these probabilities, and I think it's pretty easy to see they add up to 1, so this is a good description of a source, and I want to calculate its entropy. If I just plug into the expression, this isn't difficult to do; this is the raw expression for the entropy. Now the question is, what are these logs? Well, one half is 2 to the minus 1, so its log base 2 is just minus 1. One quarter is 1 over 2 squared, which makes it 2 to the minus 2, so its log base 2 is minus 2. And the other two, the one eighths, you can easily see are the same kind of thing, so their logs are minus 3. When you multiply this all out, when all is said and done, the entropy of this alphabet is 1.75 bits. You don't know what the entropy is good for yet, but it's going to be important, as I'll show you in just a second. This is how you do entropy calculations for a given alphabet. It's pretty straightforward, not difficult to do.

So what is the source coding theorem? I'm finally going to state it here. I need to define one more thing, and that is b of a_k, the number of bits assigned to a given letter a_k. And b bar is the average number of bits required per letter to represent the alphabet, which is just the average of these b of a_k's, weighted by the letter probabilities. So the quality, how efficient the representation by bits is, is going to be summarized by b bar. If b bar is small, then we use very few bits on average to represent the alphabet. The question is, how small can it be?

Well, here's the result, published in Shannon's very famous 1948 paper. He said the following: there exists a coder and decoder pair that can represent our source in bits, so that you can retrieve it, as long as b bar is within this range. The lower limit is the entropy of the alphabet, and you don't need more bits than the entropy plus 1 bit. So there exists some coder-decoder pair that will do this, as long as the average number of bits, however you assign bits to letters, is within this range. Clearly, the most efficient code you could have is one where you get as close to the entropy as possible.

Well, there's more to the theorem. Furthermore, if you try to use fewer bits than the entropy, you cannot recover what the original letters were; there is going to be an error. So this lower limit, the entropy, is a hard limit for what's called lossless compression. I'll show you some compression schemes in a second, but what this means is that if I use that number of bits to represent my source, the entropy number, and I can find that code, then I can retrieve my original letters with no problem, and that's lossless compression. However, it turns out there are what are called lossy compression algorithms that we use all the time. They compress with fewer bits than the entropy, so you cannot get back the original letters. You may initially think that's not good, but it depends on what you mean by good.
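Going back to the little example for a second, here's the same calculation done numerically (a sketch of my own, reusing the entropy expression), along with the range the source coding theorem promises for the average code length b bar.

```python
import math

probs = [1/2, 1/4, 1/8, 1/8]                 # p(a1) ... p(a4) from the example
H = -sum(p * math.log2(p) for p in probs)
print(H)                                     # 1.75 bits

# Source coding theorem: some lossless coder/decoder pair exists with
#   H(A) <= b_bar < H(A) + 1
print(H, "<= b_bar <", H + 1)                # 1.75 <= b_bar < 2.75
```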
Now I'm going to point out that Shannon's theorem has a little problem, and it's all buried in this little phrase right there: "there exists." He doesn't tell you what the coder and decoder are that will satisfy this condition. He just says they're there, and he can prove one is there, but he doesn't know what it is, and at the time he wrote the paper in 1948, he didn't. Well, in the 50s a procedure for finding such a source code was developed, one that satisfies this bound every time, and I'm going to show it to you in just a second.

Now let me talk about lossless and lossy compression a little bit. Here are some compression approaches you may have heard of and how they fall into the lossless and lossy categories. LZW is a lossless compression algorithm; it's the kind of algorithm used when you zip a file. We're going to talk about the Huffman code in just a second. These represent a source, a sequence of letters, in a very efficient way with bits, and you can undo that assignment and get back the original file. I think this is pretty important if you're trying to compress your entire file system on your computer by zipping it up: you want the compression algorithm to use as few bytes as possible, but you want to get your original files back. That's what a lossless compression algorithm does for you.

As for lossy compression, you may be surprised to learn that MP3, MPEG, and JPEG are all lossy. If you convert a CD to a set of MP3 files, it turns out you're using fewer bits than were used in the original sampling by the A/D converter that made the CD, and you've introduced errors. The whole game with lossy compression is to reduce the effect of those errors. They're going to be there, since you're using fewer bits than the entropy, but you try to make it so that, in the case of MP3, you don't hear the errors, or in the case of MPEG and JPEG, you don't see the errors very much. So we use lossy compression algorithms all the time, but there are certainly cases in which you want compression schemes that are not lossy. It depends on the application.

So let me go through our little example again. We already calculated that the entropy is 1 and 3 quarters bits, so the source coding theorem says we should be able to come up with a lossless compression code whose average number of bits is between 1 and 3 quarters and 2 and 3 quarters. Well, let's try what's called a simple code. The simple code merely assigns the same number of bits to every letter, so the number of bits depends on how many letters you have, and it's given by this expression. I don't know if you've seen these little brackets on the side before; that's called the ceiling function, and it means round up: the smallest integer that is greater than or equal to the log base 2 of K. So the ceiling of 2.5 is 3, the ceiling of 2 is 2, and the ceiling of 2.01 is again 3. We had 4 letters in the alphabet, and the log base 2 of 4 is 2, so we can represent our letters with 2 bits each in this so-called simple code; you just assign the bit sequences in whatever way you want. Well, b bar for that, of course, is 2, and you'd say, okay, that's great, that falls within the limits of the source coding theorem. So here's a code that exists and works fine.
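Here's the simple code worked out the same way (again just a sketch; the particular bit patterns assigned to a1 through a4 below are arbitrary, as I said). Every letter gets the ceiling of log base 2 of K bits, so b bar is 2 regardless of the probabilities.

```python
import math

K = 4                                        # number of letters in the alphabet
bits_per_letter = math.ceil(math.log2(K))    # the ceiling function: round up
print(bits_per_letter)                       # 2

probs = {"a1": 1/2, "a2": 1/4, "a3": 1/8, "a4": 1/8}
simple_code = {"a1": "00", "a2": "01", "a3": "10", "a4": "11"}   # one arbitrary assignment
b_bar = sum(probs[a] * len(cw) for a, cw in simple_code.items())
print(b_bar)                                 # 2.0: inside [1.75, 2.75), but not the best possible
```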
Well, is there something better? Is there something that can do better than 2 bits, that can get closer to what's called the entropy limit, this lower bound? Let me give you an example of one, and it may look a little screwy at first, a little interesting. What I'm going to do is represent a1 with 1 bit, a 0; represent a2 with 2 bits, 1 0; and represent a3 and a4 with 3 bits each, 1 1 0 and 1 1 1. The idea is that I'm using fewer bits to represent the letters that have higher probabilities, and more bits for the ones with lower probabilities, in other words, the ones that won't occur as often. This is an example of what's called a variable length code. The simple binary code, if you will, is an equal length code: all the code words have the same number of bits. In a variable length code, you use a different number of bits for different letters. Well, if you go through the calculation to figure out what b bar is, guess what? It's 1.75. So this is as good as it gets: you can compress down to the entropy limit. What you have to do is use a different number of bits for each letter depending on what its probability is. It's also true, of course, that the decoder has to know what this code is in order to figure out what's going on. And this is not "code" as in a secret code; "code" is just the standard terminology in information theory for a way of representing letters with bits.

All right, so I'm happy about that. But there's a problem that comes up with variable length codes: I have to know where the letter boundaries are. Suppose the source puts out this sequence of letters, and using this code it's represented by this sequence of bits. Now what the decoder has to do is figure out what the code words are, when this bit sequence is all it gets; it doesn't know anything else other than what it receives. So it has to figure out the boundaries. What you have to be able to do is say: well, here's a 0, so that's got to be an a1, and that's got to be an a1. Now I get 1 1 0, well, there's the next code word, an a3. Here's a 0, for an a1. 1 0, that's an a2. There's another a1. And finally we get an a4. You have to be able to do this without error and without complications like getting lost. (I'll come back to this exact walkthrough in a little decoding sketch after the Huffman construction below.) So this variable length code has to be rigged so that no code word begins another code word. If it does, you're in big trouble; you're going to have trouble figuring out where the boundaries are.

So how do you do that? In addition to satisfying the source coding theorem, you have to have a procedure for setting up the code so that the receiver can figure out what's going on. I'm going to show you how to build such a code; these are called Huffman codes. Dave Huffman was a graduate student in the 1950s, and as a student he figured out this code and proved it has very important properties. Rather than try to describe it in a formal way, I'm just going to give an example, because it's really easy to see how it works.

So I'm going to build what's called a coding tree; I'll show you what that is as we go. The first thing you do is write down the letters in decreasing order of probability, so the probabilities are going down; you may have to reorder the letters to do that. The next thing you do is take the two smallest-probability letters and merge them, making them the leaves coming out of a simple binary tree node, and you define a new probability for that node, which is just the sum of the component probabilities.
Then you do it again: you look for the two smallest-probability things, merge them as leaves coming out of a simple binary tree node, and give that node a new probability, which is the sum of the probabilities of the things you merged. In our example there's only one more merge we can do after that. And that's what's called a coding tree. It's just a way of defining a binary tree for a given alphabet in terms of its letter probabilities.

The next thing you do is arbitrarily assign 0s and 1s to each branch. This literally is arbitrary; I've done it in a kind of boring way, always putting a 0 on the upper branch and a 1 on the lower branch. You could swap those and everything would work fine. So that's the next step: assign bits to each branch. Then, to find your code, you figure out the path it takes to get to each letter from the root node of the tree. So suppose we look at the path for a3: it was 1, 1, 0, and the bits are written in the same order as the path it took to get to that letter. And a2, of course, comes out as 1 0. So it's pretty easy to use the binary tree to figure out a source code. Now, because of the structure of this tree, and because there is a unique path through the tree to each letter, no code word begins any other code word, because each code word is just a representation of a path through the tree. So it turns out you now have a code that obeys the rules, a code that allows you to figure out, from the whole sequence of bits, where the boundaries are. That's good.

This code has other important properties too. The property that no code word begins any other is called being a prefix code; that's just the terminology of the field. It also turns out, and I'll give an example in a second, that the procedure may not produce a code that goes all the way to the entropy limit; the average code length may not quite reach H of A. But what Huffman proved is that no lossless source code can have a shorter average code length, a smaller b bar, than what his procedure gives you. So his procedure is a more realistic reflection of the fewest bits you can use and still have a lossless compression algorithm. Shannon's result is just a bound: it says b bar cannot be smaller than the entropy, but it could be as big as the entropy plus 1 bit. With a Huffman code, you can set up the coding tree and figure out exactly what the shortest possible average code length, the smallest b bar, really is. And that means Huffman coding is optimal: no other code can give you fewer bits per letter, on average, for an alphabet than a Huffman code. Very nice.

There's a little problem, though. The Huffman code is great, but there's a little problem with the errors the channel might introduce. Here's the same example I gave before, and here's our sequence of letters. Suppose a 0 comes out as a 1; pe is not 0, so suppose the channel changes that one bit to a 1. Well, now where are the boundaries? We have a 0, that's fine. But now it reads 1, 1, 1, well, that's an a4. And then a 0 and another 0, and so on, and it comes out as an a1 and an a4, then an a1 and an a1, et cetera. That is not the same as what was sent. So if there's a channel error, the Huffman decoding can be entirely wrong.
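Before moving on, here's a minimal sketch of the Huffman procedure we just walked through, written around Python's heapq: repeatedly merge the two smallest-probability nodes, then read each code word off as the path from the root. The tie-breaking and the 0/1 branch labels are arbitrary choices of mine, so the exact code words may differ from the ones on the slide, but the average length comes out the same.

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a Huffman code from a dict of letter -> probability."""
    order = count()                              # tie-breaker so the heap never compares trees
    heap = [(p, next(order), letter) for letter, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)        # the two smallest-probability nodes...
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(order), (left, right)))   # ...merged into one node
    code = {}
    def walk(node, path):
        if isinstance(node, tuple):              # internal node: label the branches 0 and 1
            walk(node[0], path + "0")
            walk(node[1], path + "1")
        else:                                    # leaf: the path from the root is its code word
            code[node] = path or "0"
    walk(heap[0][2], "")
    return code

probs = {"a1": 1/2, "a2": 1/4, "a3": 1/8, "a4": 1/8}
code = huffman_code(probs)
print(code)                                      # {'a1': '0', 'a2': '10', 'a3': '110', 'a4': '111'}
print(sum(probs[a] * len(cw) for a, cw in code.items()))   # b_bar = 1.75, the entropy limit here
```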
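And here's the decoding side of the same story, another sketch of mine: peel code words off the front of the bit stream, which works precisely because no code word begins another, then flip the second bit, the way a single channel error would, and watch the boundaries move.

```python
def decode(bitstring, code):
    """Decode a prefix code by reading code words off the front of the bit stream."""
    inverse = {cw: letter for letter, cw in code.items()}
    letters, current = [], ""
    for bit in bitstring:
        current += bit
        if current in inverse:                   # unambiguous: no code word begins another
            letters.append(inverse[current])
            current = ""
    return letters

code = {"a1": "0", "a2": "10", "a3": "110", "a4": "111"}
message = ["a1", "a1", "a3", "a1", "a2", "a1", "a4"]
bits = "".join(code[a] for a in message)
print(bits)                                      # 001100100111
print(decode(bits, code))                        # recovers the original seven letters

# Now the second bit, a 0, arrives as a 1, and every boundary after it can shift
corrupted = bits[:1] + "1" + bits[2:]
print(decode(corrupted, code))                   # ['a1', 'a4', 'a1', 'a1', 'a2', 'a1', 'a4'] -- wrong
```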
And if there are no channel errors, it's always true that you can get back the original letter sequence; that is always true. But if the channel introduces errors, the Huffman code, and in fact any variable length code, is going to be very sensitive to those errors. So you want to be a little careful about that. We're going to take care of these channel errors in another video, but I'd just like to say you have to be aware that channel errors can really mess things up. If you send your zip file to somebody and one of those bits gets changed, the entire file system you just put into that zip file can be totally corrupted and wrong, all the file names will be wrong, and so on. So it's really important not to have any errors in the transmission.

I also want to point out something about Morse code. Morse code was invented for the telegraph, and I'm sure you know that SOS is dot dot dot, dash dash dash, dot dot dot. It's the early way of representing letters with dashes and dots, which, if you think about it for a second, is the same as representing the letters of the alphabet with 0s and 1s, just by a different name. And the Morse code turns out to be very reminiscent of the idea behind the Huffman code, in that the most probable letter, the E, is represented by a very short sequence, one dot, and the S over here, which has a smaller probability, has more characters, three dots. So that's the right kind of idea. The most improbable letter in the list, I think, is the J, and it's got the most dashes and dots of anything in the Morse code. Again, the right kind of idea.

Well, how well does it work? If you go through and figure out the entropy using these letter probabilities, you come up with 4.14 bits. If you figure out the average number of bits used by the Morse code, it turns out to be 5.56, which, I will point out, is more than a bit away from the entropy, which says you can do better than the Morse code. The question is why; what's going on? The answer turns out to be that there's one little thing they didn't do right. Like I said, an E is one dot and an S is three dots. Well, how do you know that an S is not really three E's? Three E's would be three dots, and an S is three dots. So what they did in the telegraph case is they'd send a letter and then the operator would pause before going on to the next letter. For an S, the three dots would be right next to each other, but for three E's it would be dot, pause, dot, pause, dot. Those pauses are what are called commas; without them, you cannot decode the original sequence of letters if everything is all jammed together. If you count in that extra pause, that's where the 5.56 comes from, it turns out. So the Morse code is inefficient.

So here is the Huffman code, which I found, for the same set of letter probabilities. You can see some of these code words are very long, but what's the average code length? That's kind of a surprise: it does a whole lot better than the Morse code, 4.35 bits. It doesn't get to the entropy limit, but Huffman's result says you can't do any better than 4.35. There is not going to be any lossless variable length code with a shorter average code length than this code. The code isn't unique, the particular sequences of 0s and 1s may differ, but the smallest average you can get is 4.35.
You can't reach the entropy limit, but you're only about 0.2 bits away from it, and that's significantly less than the Morse code: more than a bit is saved per letter, on average, by using the Huffman code. So that's the idea of source coding, what's called compression. Most data sources can be compressed, in a lossless sense, with far fewer bits than a simple code you might just pick off the shelf. The effect of doing that, by the way, is to lower the source's data rate. Compared with the simple codes we've been talking about to date, we can actually do better; if we use a variable length code, we can dramatically cut down the data rate. You can always use lossy compression instead, but then you can't get back the original sequence of letters, and your lossy code has to have a way of reconstructing something close to what you really did mean; that's what MPEG, JPEG, and MP3 do. The other downside, of course, is that in the lossless, variable length case you're going to be very sensitive to channel errors. So what we're going to build up to in succeeding videos is finding ways to get around those channel errors, so they're not much of a problem, even though the analog channel is noisy and has a pe that's greater than 0.