So in this video, we're going to delve into information theory and focus on the new, exciting results that Shannon published in 1948, which we're still trying to understand and appreciate today. This video is going to talk about representing digital sources with bits. That may seem kind of contradictory. Isn't a digital source already expressed in bits? Well, it turns out there are some interesting things you can do to make that representation more efficient.

So the first thing I want to do is talk about the digital channel model. It's a little bit different than what we're used to, a little bit simpler, and it's used all the time in information theory. We're going to talk about Shannon's source coding theorem, learn what it is and appreciate it, and I'll show you how it's useful today in representing sources with as few bits as possible. This leads us to compression, where you're going to discover that you've been using, all along, compression ideas where you actually cannot get back the signal you started with. We use these all the time, they're still very useful, and I'll show you how that works in just a second.

Okay, so here's my digital communication model. We've seen this many times. There's one little change, and it's going to be important today: there's something I call the source coder. The idea is that the source is spitting out a digital signal like text. It's a sequence of letters, not necessarily bits, but a sequence of letters. The job of the source coder is to convert that to bits, hopefully efficiently, and to do it in such a way that the source decoder can figure out what the original signal was without error. So that's what we're trying to do. We want the representation as efficient as possible, but not so "efficient" that we can't figure out what was sent. And as you know, once you have bits, you have to go through assigning a signal set, send it through a wireline or wireless channel, receive it with the correlation receiver, and put out the estimated bits.

Well, the simplification in the digital communication model is to ignore all the details about the analog channel, signal sets, correlation receivers, and so on. The bottom line is that when you put in a bit, out comes a bit, but it may be wrong. This little x-shaped thing is called a transition diagram, and it describes how the errors can occur. We start with a 0 and we send a 0. What the diagram says is that it will come out a 0 with probability 1 minus pe, and an error will occur, it will come out a 1, with probability pe, and we know that pe is related to all the details of the analog channel. Similarly, if you send a 1, it will come out a 1 with probability 1 minus pe and will be converted to a 0 with probability pe. So this is called a symmetric channel, for obvious reasons. pe summarizes all the design choices and channel characteristics that are needed in information theory. So we're just going to assume we have a pe: a bit sequence goes in, another bit sequence comes out, and those bits may be flipped, but only with probability pe.
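Just to make that channel model concrete, here's a minimal sketch in Python of a binary symmetric channel; the function name and the value of pe below are my own illustration, not anything from the slides. Each bit goes in and comes out, possibly flipped, independently of the others, with probability pe.

```python
import random

def binary_symmetric_channel(bits, pe):
    """Flip each bit independently with probability pe (the crossover probability)."""
    return [bit ^ 1 if random.random() < pe else bit for bit in bits]

# A short bit sequence through a fairly noisy channel, pe = 0.1
sent = [0, 1, 1, 0, 1, 0, 0, 1]
received = binary_symmetric_channel(sent, pe=0.1)
print(sent)
print(received)   # same length, but some bits may have been flipped
```

Everything about the analog channel, the signal sets, and the correlation receiver is hiding inside that one number pe.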
All right. So let's go over the assumptions that information theory makes. We're again going to assume the source emits letters, and "letters" here is now a technical term, that are drawn from an alphabet A. In this case there are capital K letters. It could literally be an alphabet, in other words, text that you type. But the alphabet could also consist of A/D converter output values: every sample takes on some quantized value, and you could consider each of those possible values a letter.

What Shannon did, and this was a really important assumption, is he assumed that the letters come out of the source in a random, probabilistic fashion. He said each letter has some probability, and what those probabilities turn out to be depends on what the source is. As you know, probabilities have to be between 0 and 1, and their sum has to equal 1. So it's a probabilistic model for sources. Now you may say, the things I type in my emails to people are not random. Well, it turns out that more complicated models related to this one, which are still probabilistic, do a pretty good job of describing actual text produced in many languages.

The next thing we have to assume is that the alphabet is known by both the source and the sink. I'm about to convert those letters into bits, and I have to know what those bits are trying to express, so I have to know what the alphabet is or else I'm in big trouble: once I convert to bits, I won't know how to get back to the original letters that were sent. So that's not a very hard assumption to make.

So there are two issues in information theory. The first is how letters should be represented by bits; that's what this video is about. The second is how to mitigate, to reduce, the errors that the channel introduces. The channel is going to introduce errors whether you like it or not; pe is the probability of a bit being changed, and we can't do anything about that. But is there something we can do before we send the bits, and afterwards, to reduce the effect of those errors? It turns out the answer is yes, and I think the answer can be pretty surprising, but that's going to come in a later video. This video is about how you represent letters by bits.

So we're going to start by talking about what Shannon called the source coding theorem, and to state it we first need to define something called the entropy. The entropy, capital H, of an alphabet is simply the negative of the sum, over the letters, of the probability of each letter times the log of that probability. It's just a definition for right now. We use log base 2 so that the answer comes out in bits. I'm going to do a sample calculation in just a second, but it's just a quantity defined from the probabilities of the letters, and essentially it summarizes what those probabilities are in one number.

Now the entropy has a couple of important properties. What set of probabilities maximizes the entropy? That occurs when the letters are equally likely: there are K letters, so the probability of each is 1 over K. You can show by taking derivatives, it's just a calculus exercise, that the entropy is maximized in this case, and the entropy comes out to the log base 2 of the number of letters in the alphabet. That's the biggest the entropy can be. The smallest it can be is 0, and that occurs when one of the letters has a probability of 1, which means all the other letters have a probability of 0, since the probabilities have to add to 1. In that case the entropy turns out to be 0. So the entropy, for any alphabet, is a number that lies between 0 and the log base 2 of the number of letters in the alphabet.
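As a quick sketch of that definition (my own illustration, not code from the video), here's the entropy of an alphabet computed from its letter probabilities; it skips letters with probability 0, for a reason I'll get to in a moment.

```python
import math

def entropy(probabilities):
    """H(A) = -sum over letters of p(a_k) * log2(p(a_k)), in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Equally likely letters give the maximum, log2(K); a certain letter gives the minimum, 0.
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0, which is log2(4)
print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0
```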
There is one little detail in doing this calculation: in the expression I may need to compute 0 times log base 2 of 0. In information theory, whenever you see that, it is just 0. That's the result you get when you take limits and things like that, but to keep things simple: 0 times the log of 0 is just 0. That's how you can easily confirm that the case where one letter has probability 1 results in an entropy of 0.

All right, so let's calculate an entropy for an example. I have here a 4-letter source, a1 through a4, with these probabilities, and I think it's pretty easy to see they add up to 1, so this is a good description of a source, and I want to calculate its entropy. If I just plug into the expression, this isn't difficult to do; this is the raw expression for the entropy. Now the question is, what are these logs? Well, one half is 2 to the minus 1, so its log base 2 is just minus 1. One quarter is 1 over 2 squared, which makes it 2 to the minus 2, so its log base 2 is minus 2. And the other two, the one eighths, you can easily see are the same kind of thing, so their logs are minus 3. When you multiply this all out, when all is said and done, the entropy of this alphabet is 1.75 bits. You don't know what the entropy is good for yet, but it's going to be important, as I'll show you in just a second. This is how you do entropy calculations for a given alphabet. It's pretty straightforward, not difficult to do.

So what is the source coding theorem? I'm finally going to state it here. I need to define one more thing, and that is b of a_k, the number of bits assigned to a given letter a_k. And b bar is the average number of bits required per letter to represent the alphabet, which is just the average of these b of a_k's, weighted by the letter probabilities. So the quality, how efficient the representation by bits is, is going to be summarized by b bar. If b bar is small, then we use very few bits on average to represent the alphabet. The question is, how small can it be?

Well, here's the result, published in Shannon's very famous 1948 paper. He said the following: there exists a coder and decoder pair that can represent our source in bits, so that you can retrieve it, as long as b bar is within this range. The lower limit is the entropy of the alphabet, and you don't need more bits than the entropy plus 1 bit. So there exists some coder-decoder pair that will do this, as long as the average number of bits, however you assign bits to letters, is within this range. Clearly, the most efficient code you could have is one where you get as close to the entropy as possible.

Well, there's more to the theorem. Furthermore, if you try to use fewer bits than the entropy, you cannot recover what the original letters were; there is going to be an error. So this lower limit, the entropy, is a hard limit for what's called lossless compression. I'll show you some compression schemes in a second, but what this means is that if I use that number of bits to represent my source, the entropy number, and I can find that code, then I can retrieve my original letters with no problem, and that's lossless compression. However, it turns out there are what are called lossy compression algorithms that we use all the time. They compress with fewer bits than the entropy, so you cannot get back the original letters. You may initially think that's not good, but it depends on what you mean by good.
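Going back to the little example for a second, here's the same calculation done numerically (a sketch of my own, reusing the entropy expression), along with the range the source coding theorem promises for the average code length b bar.

```python
import math

probs = [1/2, 1/4, 1/8, 1/8]                 # p(a1) ... p(a4) from the example
H = -sum(p * math.log2(p) for p in probs)
print(H)                                     # 1.75 bits

# Source coding theorem: some lossless coder/decoder pair exists with
#   H(A) <= b_bar < H(A) + 1
print(H, "<= b_bar <", H + 1)                # 1.75 <= b_bar < 2.75
```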
Now I'm going to point out that Shannon's theorem has a little problem, and it's all buried in this little phrase right there: "there exists." He doesn't tell you what the coder and decoder are that will satisfy this condition. He just says they're there, and he can prove one is there, but he doesn't know what it is, and at the time he wrote the paper in 1948, he didn't. Well, in the 50s a procedure for finding such a source code was developed, one that satisfies this bound every time, and I'm going to show it to you in just a second.

Now let me talk about lossless and lossy compression a little bit. Here are some compression approaches you may have heard of and how they fall into the lossless and lossy categories. LZW is a lossless compression algorithm; it's the kind of algorithm used when you zip a file. We're going to talk about the Huffman code in just a second. These represent a source, a sequence of letters, in a very efficient way with bits, and you can undo that assignment and get back the original file. I think this is pretty important if you're trying to compress your entire file system on your computer by zipping it up: you want the compression algorithm to use as few bytes as possible, but you want to get your original files back. That's what a lossless compression algorithm does for you.

As for lossy compression, you may be surprised to learn that MP3, MPEG, and JPEG are all lossy. If you convert a CD to a set of MP3 files, it turns out you're using fewer bits than were used in the original sampling by the A/D converter that made the CD, and you've introduced errors. The whole game with lossy compression is to reduce the effect of those errors. They're going to be there, since you're using fewer bits than the entropy, but you try to make it so that, in the case of MP3, you don't hear the errors, or in the case of MPEG and JPEG, you don't see the errors very much. So we use lossy compression algorithms all the time, but there are certainly cases in which you want compression schemes that are not lossy. It depends on the application.

So let me go through our little example again. We already calculated that the entropy is 1 and 3 quarters bits, so the source coding theorem says we should be able to come up with a lossless compression code whose average number of bits is between 1 and 3 quarters and 2 and 3 quarters. Well, let's try what's called a simple code. The simple code merely assigns the same number of bits to every letter, so the number of bits depends on how many letters you have, and it's given by this expression. I don't know if you've seen these little brackets on the side before; that's called the ceiling function, and it means round up: the smallest integer that is greater than or equal to the log base 2 of K. So the ceiling of 2.5 is 3, the ceiling of 2 is 2, and the ceiling of 2.01 is again 3. We had 4 letters in the alphabet, and the log base 2 of 4 is 2, so we can represent our letters with 2 bits each in this so-called simple code; you just assign the bit sequences in whatever way you want. Well, b bar for that, of course, is 2, and you'd say, okay, that's great, that falls within the limits of the source coding theorem. So here's a code that exists and works fine.
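Here's the simple code worked out the same way (again just a sketch; the particular bit patterns assigned to a1 through a4 below are arbitrary, as I said). Every letter gets the ceiling of log base 2 of K bits, so b bar is 2 regardless of the probabilities.

```python
import math

K = 4                                        # number of letters in the alphabet
bits_per_letter = math.ceil(math.log2(K))    # the ceiling function: round up
print(bits_per_letter)                       # 2

probs = {"a1": 1/2, "a2": 1/4, "a3": 1/8, "a4": 1/8}
simple_code = {"a1": "00", "a2": "01", "a3": "10", "a4": "11"}   # one arbitrary assignment
b_bar = sum(probs[a] * len(cw) for a, cw in simple_code.items())
print(b_bar)                                 # 2.0: inside [1.75, 2.75), but not the best possible
```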
Well, is there something better? Is there something that can do better than 2 bits, that can get closer to what's called the entropy limit, this lower bound? Let me give you an example of one, and it may look a little screwy at first, a little interesting. What I'm going to do is represent a1 with 1 bit, a 0; represent a2 with 2 bits, 1 0; and represent a3 and a4 with 3 bits each, 1 1 0 and 1 1 1. The idea is that I'm using fewer bits to represent the letters that have higher probabilities, and more bits for the ones with lower probabilities, in other words, the ones that won't occur as often. This is an example of what's called a variable length code. The simple binary code, if you will, is an equal length code: all the code words have the same number of bits. In a variable length code, you use a different number of bits for different letters. Well, if you go through the calculation to figure out what b bar is, guess what? It's 1.75. So this is as good as it gets: you can compress down to the entropy limit. What you have to do is use a different number of bits for each letter depending on what its probability is. It's also true, of course, that the decoder has to know what this code is in order to figure out what's going on. And this is not "code" as in a secret code; "code" is just the standard terminology in information theory for a way of representing letters with bits.

All right, so I'm happy about that. But there's a problem that comes up with variable length codes: I have to know where the letter boundaries are. Suppose the source puts out this sequence of letters, and using this code it's represented by this sequence of bits. Now what the decoder has to do is figure out what the code words are, when this bit sequence is all it gets; it doesn't know anything else other than what it receives. So it has to figure out the boundaries. What you have to be able to do is say: well, here's a 0, so that's got to be an a1, and that's got to be an a1. Now I get 1 1 0, well, there's the next code word, an a3. Here's a 0, for an a1. 1 0, that's an a2. There's another a1. And finally we get an a4. You have to be able to do this without error and without complications like getting lost. (I'll come back to this exact walkthrough in a little decoding sketch after the Huffman construction below.) So this variable length code has to be rigged so that no code word begins another code word. If it does, you're in big trouble; you're going to have trouble figuring out where the boundaries are.

So how do you do that? In addition to satisfying the source coding theorem, you have to have a procedure for setting up the code so that the receiver can figure out what's going on. I'm going to show you how to build such a code; these are called Huffman codes. Dave Huffman was a graduate student in the 1950s, and as a student he figured out this code and proved it has very important properties. Rather than try to describe it in a formal way, I'm just going to give an example, because it's really easy to see how it works.

So I'm going to build what's called a coding tree; I'll show you what that is as we go. The first thing you do is write down the letters in decreasing order of probability, so the probabilities are going down; you may have to reorder the letters to do that. The next thing you do is take the two smallest-probability letters and merge them, making them the leaves coming out of a simple binary tree node, and you define a new probability for that node, which is just the sum of the component probabilities.
Then you do it again: you look for the two smallest-probability things, merge them as leaves coming out of a simple binary tree node, and give that node a new probability, which is the sum of the probabilities of the things you merged. In our example there's only one more merge we can do after that. And that's what's called a coding tree. It's just a way of defining a binary tree for a given alphabet in terms of its letter probabilities.

The next thing you do is arbitrarily assign 0s and 1s to each branch. This literally is arbitrary; I've done it in a kind of boring way, always putting a 0 on the upper branch and a 1 on the lower branch. You could swap those and everything would work fine. So that's the next step: assign bits to each branch. Then, to find your code, you figure out the path it takes to get to each letter from the root node of the tree. So suppose we look at the path for a3: it was 1, 1, 0, and the bits are written in the same order as the path it took to get to that letter. And a2, of course, comes out as 1 0. So it's pretty easy to use the binary tree to figure out a source code. Now, because of the structure of this tree, and because there is a unique path through the tree to each letter, no code word begins any other code word, because each code word is just a representation of a path through the tree. So it turns out you now have a code that obeys the rules, a code that allows you to figure out, from the whole sequence of bits, where the boundaries are. That's good.

This code has other important properties too. The property that no code word begins any other is called being a prefix code; that's just the terminology of the field. It also turns out, and I'll give an example in a second, that the procedure may not produce a code that goes all the way to the entropy limit; the average code length may not quite reach H of A. But what Huffman proved is that no lossless source code can have a shorter average code length, a smaller b bar, than what his procedure gives you. So his procedure is a more realistic reflection of the fewest bits you can use and still have a lossless compression algorithm. Shannon's result is just a bound: it says b bar cannot be smaller than the entropy, but it could be as big as the entropy plus 1 bit. With a Huffman code, you can set up the coding tree and figure out exactly what the shortest possible average code length, the smallest b bar, really is. And that means Huffman coding is optimal: no other code can give you fewer bits per letter, on average, for an alphabet than a Huffman code. Very nice.

There's a little problem, though. The Huffman code is great, but there's a little problem with the errors the channel might introduce. Here's the same example I gave before, and here's our sequence of letters. Suppose a 0 comes out as a 1; pe is not 0, so suppose the channel changes that one bit to a 1. Well, now where are the boundaries? We have a 0, that's fine. But now it reads 1, 1, 1, well, that's an a4. And then a 0 and another 0, and so on, and it comes out as an a1 and an a4, then an a1 and an a1, et cetera. That is not the same as what was sent. So if there's a channel error, the Huffman decoding can be entirely wrong.
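Before moving on, here's a minimal sketch of the Huffman procedure we just walked through, written around Python's heapq: repeatedly merge the two smallest-probability nodes, then read each code word off as the path from the root. The tie-breaking and the 0/1 branch labels are arbitrary choices of mine, so the exact code words may differ from the ones on the slide, but the average length comes out the same.

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a Huffman code from a dict of letter -> probability."""
    order = count()                              # tie-breaker so the heap never compares trees
    heap = [(p, next(order), letter) for letter, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)        # the two smallest-probability nodes...
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(order), (left, right)))   # ...merged into one node
    code = {}
    def walk(node, path):
        if isinstance(node, tuple):              # internal node: label the branches 0 and 1
            walk(node[0], path + "0")
            walk(node[1], path + "1")
        else:                                    # leaf: the path from the root is its code word
            code[node] = path or "0"
    walk(heap[0][2], "")
    return code

probs = {"a1": 1/2, "a2": 1/4, "a3": 1/8, "a4": 1/8}
code = huffman_code(probs)
print(code)                                      # {'a1': '0', 'a2': '10', 'a3': '110', 'a4': '111'}
print(sum(probs[a] * len(cw) for a, cw in code.items()))   # b_bar = 1.75, the entropy limit here
```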
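And here's the decoding side of the same story, another sketch of mine: peel code words off the front of the bit stream, which works precisely because no code word begins another, then flip the second bit, the way a single channel error would, and watch the boundaries move.

```python
def decode(bitstring, code):
    """Decode a prefix code by reading code words off the front of the bit stream."""
    inverse = {cw: letter for letter, cw in code.items()}
    letters, current = [], ""
    for bit in bitstring:
        current += bit
        if current in inverse:                   # unambiguous: no code word begins another
            letters.append(inverse[current])
            current = ""
    return letters

code = {"a1": "0", "a2": "10", "a3": "110", "a4": "111"}
message = ["a1", "a1", "a3", "a1", "a2", "a1", "a4"]
bits = "".join(code[a] for a in message)
print(bits)                                      # 001100100111
print(decode(bits, code))                        # recovers the original seven letters

# Now the second bit, a 0, arrives as a 1, and every boundary after it can shift
corrupted = bits[:1] + "1" + bits[2:]
print(decode(corrupted, code))                   # ['a1', 'a4', 'a1', 'a1', 'a2', 'a1', 'a4'] -- wrong
```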
And if there are no channel errors, it's always true that you can get back the original letter sequence; that is always true. But if the channel introduces errors, the Huffman code, and in fact any variable length code, is going to be very sensitive to those errors. So you want to be a little careful about that. We're going to take care of these channel errors in another video, but I'd just like to say you have to be aware that channel errors can really mess things up. If you send your zip file to somebody and one of those bits gets changed, the entire file system you just put into that zip file can be totally corrupted and wrong, all the file names will be wrong, and so on. So it's really important not to have any errors in the transmission.

I also want to point out something about Morse code. Morse code was invented for the telegraph, and I'm sure you know that SOS is dot dot dot, dash dash dash, dot dot dot. It's the early way of representing letters with dashes and dots, which, if you think about it for a second, is the same as representing the letters of the alphabet with 0s and 1s, just by a different name. And the Morse code turns out to be very reminiscent of the idea behind the Huffman code, in that the most probable letter, the E, is represented by a very short sequence, one dot, and the S over here, which has a smaller probability, has more characters, three dots. So that's the right kind of idea. The most improbable letter in the list, I think, is the J, and it's got the most dashes and dots of anything in the Morse code. Again, the right kind of idea.

Well, how well does it work? If you go through and figure out the entropy using these letter probabilities, you come up with 4.14 bits. If you figure out the average number of bits used by the Morse code, it turns out to be 5.56, which, I will point out, is more than a bit away from the entropy, which says you can do better than the Morse code. The question is why; what's going on? The answer turns out to be that there's one little thing they didn't do right. Like I said, an E is one dot and an S is three dots. Well, how do you know that an S is not really three E's? Three E's would be three dots, and an S is three dots. So what they did in the telegraph case is they'd send a letter and then the operator would pause before going on to the next letter. For an S, the three dots would be right next to each other, but for three E's it would be dot, pause, dot, pause, dot. Those pauses are what are called commas; without them, you cannot decode the original sequence of letters if everything is all jammed together. If you count in that extra pause, that's where the 5.56 comes from, it turns out. So the Morse code is inefficient.

So here is the Huffman code, which I found, for the same set of letter probabilities. You can see some of these code words are very long, but what's the average code length? That's kind of a surprise: it does a whole lot better than the Morse code, 4.35 bits. It doesn't get to the entropy limit, but Huffman's result says you can't do any better than 4.35. There is not going to be any lossless variable length code with a shorter average code length than this code. The code isn't unique, the particular sequences of 0s and 1s may differ, but the smallest average you can get is 4.35.
You can't reach the entropy limit, but you're only about 0.2 bits away from it, and that's significantly less than the Morse code: more than a bit is saved per letter, on average, by using the Huffman code. So that's the idea of source coding, what's called compression. Most data sources can be compressed, in a lossless sense, with far fewer bits than a simple code you might just pick off the shelf. The effect of doing that, by the way, is to lower the source's data rate. Compared with the simple codes we've been talking about to date, we can actually do better; if we use a variable length code, we can dramatically cut down the data rate. You can always use lossy compression instead, but then you can't get back the original sequence of letters, and your lossy code has to have a way of reconstructing something close to what you really did mean; that's what MPEG, JPEG, and MP3 do. The other downside, of course, is that in the lossless, variable length case you're going to be very sensitive to channel errors. So what we're going to build up to in succeeding videos is finding ways to get around those channel errors, so they're not much of a problem, even though the analog channel is noisy and has a pe that's greater than 0.