So in this video, we're going to delve into information theory and focus on the new, exciting results that Shannon published in 1948, which we're still trying to understand and appreciate today. This video is going to talk about representing digital sources with bits. That may seem kind of contradictory: isn't a digital source already expressed in bits? Well, it turns out there are some interesting things you can do to make that representation more efficient.

So the first thing I want to do is talk about the digital channel model. It's a little bit different than what we're used to, a little bit simpler, and it's used all the time in information theory. We're going to talk about Shannon's source coding theorem, learn to appreciate what it is, and I'll show you how it is useful today in representing sources with as few bits as possible. And this leads us to compression, where you're going to discover that you've been using, all along, compression ideas with which you actually cannot get back the signal you started with. But we use these all the time, they're still very useful, and I'll show you how that works in just a second.

Okay, so here's my digital communication model. We've seen this many times. There's one little change, and that's going to be important today: there's something I call the source coder. The idea is that this source is spitting out a digital signal like text. It's a sequence of letters, not necessarily bits, but a sequence of letters. The job of the source coder is to convert that to bits, hopefully efficiently, and to do it in such a way that the source decoder can figure out what the original signal was without error. So that's what we're trying to do. We want it as efficient as possible, but not so efficient that we can't figure out what happened.

And as you know, once you have bits you have to go through assigning a signal set, send it through a wireline or wireless channel, receive it with the correlation receiver, and put out the estimated bits. Well, the simplification in the digital communication model is to ignore all the details about the analog channel, signal sets, correlation receivers, and so on. The bottom line is that when you put in a bit, out comes a bit, but it may be wrong.
And this little x-shaped thing is called a transition diagram, which describes how the errors can occur. So we start with a 0. We send a 0. What this diagram says is that it will come out a 0 with probability 1 minus pe, and an error will be created with probability pe, and we know that pe is related to all the details in the analog channel. Similarly, if you send a 1, it will come out a 1 with probability 1 minus pe and will be converted to a 0 with probability pe. So this is called a symmetric channel, for obvious reasons. pe summarizes all the design choices and channel characteristics that are needed in information theory. So we're just going to assume we have a pe: a bit sequence goes in, another bit sequence comes out, and those bits may be flipped, changed, with probability pe.

All right. So let's go over the assumptions that information theory makes. We're again going to assume the source emits letters, and this is now a technical term, that are drawn from an alphabet A. In this case there are capital K letters. It could literally be an alphabet; in other words, it could be text that you type. This alphabet could also consist of A-to-D converter output values: every sample takes on some quantized value, and you could consider each of those possible values a letter. What Shannon did, which was a really important assumption, is that he assumed these letters come out of the source in a random, probabilistic fashion. He said each letter has some probability, and it depends on what the source is what those probabilities turn out to be. And as you know, probabilities have to be nonnegative, they have to be at most 1, and their sum has to equal 1. So it's a probabilistic model for sources. You may say that the things I type in my email to people are not random. Well, it turns out that more complicated models related to this one, which are probabilistic, do a pretty good job of describing actual text as it's produced in many languages.

The next thing we have to assume is that the alphabet is known by both the source and the sink. Because I'm about to convert those letters into bits, I have to know what those bits are trying to express, so I have to know what the alphabet is or else I'm in big trouble: once I convert to bits, I won't be able to get back to the original letters that were sent.
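To make these two ingredients concrete, here is a minimal sketch, not something from the lecture slides, of a memoryless source drawing letters from an alphabet with fixed probabilities, and a binary symmetric channel that flips each bit independently with probability pe. The alphabet, the probability values, and pe = 0.05 are placeholders chosen for illustration.

```python
import random

# Illustrative alphabet and letter probabilities (placeholders; they must sum to 1).
alphabet = ["a1", "a2", "a3", "a4"]
probs    = [0.5, 0.25, 0.125, 0.125]

def emit_letters(n, rng=random):
    """Memoryless source: each letter is drawn independently with the given probabilities."""
    return rng.choices(alphabet, weights=probs, k=n)

def binary_symmetric_channel(bits, pe, rng=random):
    """Flip each bit independently with probability pe; otherwise pass it through."""
    return [b ^ 1 if rng.random() < pe else b for b in bits]

letters  = emit_letters(10)
sent     = [random.randint(0, 1) for _ in range(16)]   # some bit sequence to transmit
received = binary_symmetric_channel(sent, pe=0.05)
print(letters)
print(sent)
print(received)
```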
Anyway, knowing the alphabet is not a very hard assumption to make.

So there are two issues in information theory. The first is how letters should be represented by bits; that's what this video is about. The second issue is how to mitigate, reduce, the errors that the channel introduces. The channel is going to introduce errors whether you like it or not; pe is going to be the probability of a bit being changed. We can't do anything about that, but is there something we can do before we send the bits, and afterwards, to reduce the effective errors? It turns out the answer is yes, and I think the answer can be pretty surprising, but that's going to be in a later video. This video is about how you represent letters by bits.

So we're going to start by talking about what Shannon called the source coding theorem. To state it, we first need to define something called the entropy. The entropy, capital H, of an alphabet is simply the negative of the sum, over the letters, of the probability of each letter times the log of that probability: H(A) = -sum_k P(a_k) log2 P(a_k). It's just a definition for right now. We use log base 2 so that the answer comes out in bits. I'm going to do a sample calculation in just a second, but it's simply a quantity related to the probabilities of the letters, and essentially it summarizes what those probabilities are in one number.

Now, the entropy has a couple of important properties. What set of probabilities maximizes the entropy? That occurs when the letters are equally likely. There are K letters, so the probability of each is 1 over K. You can show by taking derivatives, it's just a calculus exercise, that the entropy is maximized in this case, and the entropy equals the log base 2 of the number of letters in the alphabet. So that's the biggest the entropy can be. The smallest it can be is 0, and that occurs when one of the letters has a probability of 1, which means all the other letters have a probability of 0, since probabilities have to add to 1. In that case, the entropy turns out to be 0. So the entropy is a nonnegative number for any alphabet, lying between 0 and the log base 2 of the number of letters in the alphabet.

There's a little detail here in doing this calculation.
And that is, in this expression, I may need to compute 0 times log base 2 of 0. In information theory, whenever you see that term, it is just taken to be 0. That's the result you get when you take limits, but to simplify things: 0 log 0 is just 0. That's how you can easily confirm that the case of one certain letter results in an entropy of 0.

All right, so let's calculate the entropy for an example. I have here a four-letter source, a1 through a4, with these probabilities: 1/2, 1/4, 1/8, and 1/8. I think it's pretty easy to see these add up to 1, so this is a good description of a source, and I want to calculate its entropy. If I just plug into the expression, this isn't difficult to do; this is the raw expression for entropy. Now the question is, what are these logs? I point out that one half is 2 to the minus 1, so its log base 2 is just minus 1. One quarter is 1 over 2 squared, which makes it 2 to the minus 2, so its log base 2 is minus 2. The other two, as you can easily see, work the same way: one eighth is 2 to the minus 3, so its log base 2 is minus 3. When you multiply this all out, all said and done, the entropy of this alphabet is 1.75 bits. We don't know what the entropy is good for yet, but it's going to be important, as I'll show you in just a second. This is how you do entropy calculations for a given alphabet: pretty straightforward, not difficult to do.
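Here is a minimal sketch of that calculation in code, just to check the arithmetic; the probabilities are the ones from the example, and the extreme cases confirm the maximum and minimum discussed above.

```python
import math

def entropy(probs):
    """H(A) = sum p_k * log2(1/p_k), i.e. -sum p_k log2 p_k; terms with p_k = 0 contribute 0."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

probs = [1/2, 1/4, 1/8, 1/8]     # the four-letter example
print(entropy(probs))            # 1.75
print(entropy([1/4] * 4))        # equally likely letters: log2(4) = 2.0, the maximum
print(entropy([1, 0, 0, 0]))     # one certain letter: 0.0, the minimum
```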
So what is the source coding theorem? I'm finally going to state it here. I need to define one more thing, and that is B(ak), the number of bits assigned to a given letter, and B-bar, the average number of bits required per letter to represent the alphabet, which is just the probability-weighted average of these B(ak)'s. So the quality, how efficient the representation by bits is, is going to be summarized by B-bar. If B-bar is small, then we use very few bits on average to represent the alphabet. The question is: how small can it be?

Well, here's the result, published in Shannon's very famous 1948 paper. He said the following: there exists a coder and decoder pair that can represent our source in bits, so that you can retrieve it, as long as B-bar is within a certain range. The lower limit is the entropy of the alphabet, and you don't need more bits than the entropy plus 1 bit: H(A) <= B-bar < H(A) + 1. So there exists some coder-decoder pair that will do this, with the average number of bits, however you assign bits to letters, within this range. Clearly, the most efficient code you could have is one that gets as close to the entropy as possible.

There's more to the theorem. Furthermore, if you try to use fewer bits than the entropy, you cannot recover what the original letters were; there are going to be errors. So this lower limit, the entropy, is a hard limit for what's called lossless compression. I'll show you some compression schemes in a second, but what this means is: if I use this number of bits to represent my source, the entropy, and I can find that code, then I can retrieve my original letters with no problem, and that's lossless compression. However, it turns out there are what are called lossy compression algorithms that we use all the time. They compress with fewer bits than the entropy, so you cannot get back the original letters. You may initially think, well, that's not good; it depends on what you mean by good.

Now I'm going to point out that Shannon's theorem has a little problem, and it's all buried in the little phrase right there: "there exists." He doesn't tell you what coder and decoder will satisfy this condition. He just says they're there, and he can prove there is one, but he doesn't know what it is; at the time he wrote the paper in 1948, he didn't. Well, in the 1950s a procedure for finding such a source code was developed that does satisfy this bound all the time, and I'm going to show you that in just a second.

Now let me talk about lossless and lossy compression a little bit. Here are compression approaches that you may have heard of, and how they fall into the lossless and lossy categories. LZW is a lossless compression algorithm; it's the same style of algorithm that's used when you zip a file. We're going to talk about the Huffman code in just a second. These represent a source, a sequence of letters, in a very efficient way with bits, and you can undo that assignment and get back the original file.
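As an aside, here is a minimal sketch of that lossless round-trip property using Python's built-in zlib module. zlib implements DEFLATE rather than LZW, so it's a stand-in for the idea, not the exact algorithm named above: compress, decompress, and you get exactly the original bytes back.

```python
import zlib

original   = b"the quick brown fox jumps over the lazy dog " * 100
compressed = zlib.compress(original)

print(len(original), "->", len(compressed), "bytes")     # far fewer bytes
assert zlib.decompress(compressed) == original            # lossless: exact recovery
```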
So I think that would be pretty important if you were trying to compress your entire file system on your computer, say by zipping it up. You want that compression algorithm to use as few bytes as possible, but you want to get your original files back. That's what a lossless compression algorithm does for you. As for lossy compression, you may be surprised to know that MP3, MPEG, and JPEG are all lossy. If you convert a CD to a set of MP3 files, it turns out you're using fewer bits than were used in the original sampling, in the A-to-D converter used to make the CD, and you've introduced errors. The whole game with lossy compression is to reduce the effect of those errors. They're going to be there, since you're using fewer bits than the entropy, but you're trying to make it so that, in the case of MP3, you don't hear the errors, or in the case of MPEG and JPEG, you don't see them very much. So we use lossy compression algorithms all the time, but there are certain cases in which you want a compression scheme that is not lossy. It depends on the application.

So let me go through our little example that we've already done. We calculated that the entropy was 1 and 3 quarters bits. The source coding theorem says we should be able to come up with a lossless compression code whose average number of bits is between 1 and 3 quarters and 2 and 3 quarters.

Well, let's try what's called a simple code. The simple code merely assigns the same number of bits to each letter, so it depends on how many letters you have, and that number is given by this expression: the ceiling of log base 2 of K. I don't know if you've seen this notation before, these little brackets on the sides; that's called the ceiling function, and what it means is round up. So the ceiling of 2.5 is 3, but the ceiling of 2 is 2, and the ceiling of 2.01 is again 3. It just means the smallest integer that is at least the log base 2 of K. We had 4 letters in the alphabet, and the log base 2 of 4 is 2, so we can represent our letters with 2 bits each in what's called a simple code; you just assign the bit patterns to the letters in whatever way you want. Well, B-bar for that, of course, is 2. And you'd say, okay, that's great: it falls within the limits of the source coding theorem.
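Here is a minimal sketch of that fixed-length assignment, with the codeword mapping chosen arbitrarily since any assignment of distinct 2-bit patterns works; it also checks that B-bar lands inside the range from the source coding theorem.

```python
import math

probs = {"a1": 1/2, "a2": 1/4, "a3": 1/8, "a4": 1/8}

# Fixed-length "simple code": every letter gets ceil(log2 K) bits.
bits_per_letter = math.ceil(math.log2(len(probs)))             # 2 for K = 4
simple_code = {a: format(i, f"0{bits_per_letter}b")             # a1 -> 00, a2 -> 01, ...
               for i, a in enumerate(probs)}

H     = sum(p * math.log2(1 / p) for p in probs.values())       # entropy: 1.75 bits
b_bar = sum(p * len(simple_code[a]) for a, p in probs.items())  # average length: 2.0

print(simple_code)
print(H, b_bar, H <= b_bar < H + 1)                             # True: inside the range
```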
So here's a code that exists and works fine. Well, is there something better? Is there something that can do better than 2 bits, that can get closer to what's called the entropy limit, the lower bound? Let me give you an example of one, and it may look a little screwy at first, a little interesting. What I'm going to do is represent a1 with 1 bit, a 0; represent a2 with 2 bits; and a3 and a4 with 3 bits each. The idea is that I'm using fewer bits to represent the letters that have higher probabilities, and more bits to represent the ones with lower probabilities, in other words, the ones that will not occur as often. This is an example of what's called a variable-length code. The simple binary code, if you will, is an equal-length code: the number of bits is the same for every letter. In a variable-length code, you use a different number of bits. Well, if you go through the calculation to figure out what B-bar is, guess what? It's 1.75. So this is as good as it gets: you can compress down to the entropy limit. What you have to do is use a different number of bits for each letter, depending on what their probabilities are. It's also true, of course, that the decoder has to know what this code is in order to figure out what's going on. And this is not the word "code" as in a secret; this is a code in the standard sense used in information theory: a way of representing our letters with bits.

All right, so I'm happy about that. But there's a problem that comes up with variable-length codes: I have to know where the letter boundaries are. So let's suppose the source puts out this sequence of letters, and using this code it's represented by this sequence of bits. Now what the decoder has to do is figure out what the code words are, when this is all it gets; it knows nothing other than what it receives. So it has to figure out the boundaries. What you have to be able to do is say: well, here's a 0, so that's got to be an a1, and that's got to be an a1. Now I get 1, 1, 0; well, there's the next code word. Here's a 0 for an a1. 1, 0, that's an a2. There's another a1. And finally we get an a4. You have to be able to do this without error and with no complications, like getting lost.
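Here is a minimal sketch of that decoding process. The codewords aren't all spelled out in the lecture, so I'm assuming the assignment consistent with the walk-through, a1 -> 0, a2 -> 10, a3 -> 110, a4 -> 111, and the sketch also recomputes B-bar = 1.75.

```python
probs = {"a1": 1/2, "a2": 1/4, "a3": 1/8, "a4": 1/8}
code  = {"a1": "0", "a2": "10", "a3": "110", "a4": "111"}   # assumed variable-length code
decode_table = {bits: letter for letter, bits in code.items()}

def encode(letters):
    return "".join(code[a] for a in letters)

def decode(bitstring):
    """Greedy prefix decoding: keep reading bits until they match a codeword."""
    letters, current = [], ""
    for bit in bitstring:
        current += bit
        if current in decode_table:
            letters.append(decode_table[current])
            current = ""
    return letters

message = ["a1", "a1", "a3", "a1", "a2", "a1", "a4"]
bits = encode(message)
print(bits)
print(decode(bits) == message)   # True: the boundaries are recovered from the bits alone

b_bar = sum(p * len(code[a]) for a, p in probs.items())
print(b_bar)                     # 1.75, the entropy limit for this source
```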
So this variable-length code has to be rigged so that no code word begins another code word. If it does, you're in big trouble: you're going to have trouble figuring out where the boundaries are. So how do you do that? In addition to satisfying the source coding theorem, you have to have a procedure for setting up a code so that the receiver can figure out what's going on. I'm going to show you how to build such a code; these are called Huffman codes. Dave Huffman was a graduate student in the 1950s, and as a student he figured out this code and proved it has very important properties. Rather than describe it in a formal way, I'm just going to give an example, because it's really easy to see how it works.

So I'm going to build what's called a coding tree; I'll show you what that is as we go. The first thing you do is write down the letters in decreasing order of probability, so the probabilities are going down; you may have to reorder the letters, but the important thing is that the probabilities are decreasing. The next thing you do is look at the two smallest-probability letters and merge them, as leaves coming out of a single binary tree node, and you define a new probability for that node, which is just the sum of the component probabilities. Then you repeat: look for the two smallest probabilities remaining, express them as leaves of a simple binary tree node, assign that node the sum of their probabilities, and keep merging. Eventually there is only one merge left to do, and the result is what's called a coding tree. It's just a way of defining a binary tree for a given alphabet in terms of its letters' probabilities.

The next thing you do is arbitrarily assign 0s and 1s to each branch. And this literally is arbitrary; I've done it in a kind of boring way, always putting a 0 on the upper branch and a 1 on the lower branch. You could swap those and everything would work fine. So that's the next step: assign bits to each branch. Then, to find your code, you figure out the path it takes to get to each letter from the root, the source node.
So we start here, at the source node, and follow the path to each letter. Suppose we look at the path for a3: it was 1, 1, 0, written in the same order as the path it took to get to that letter. The code word for a2, of course, reads off the same way: 1, 0. So it's pretty easy now to use the binary tree to figure out a source code. Now, because of the structure of this tree, and because there is a unique path through the tree to each letter, no code word begins any other code word, since each one is just a representation of a path through the tree. So you now have, it turns out, a code that obeys the rules, one that allows you to figure out, from the whole sequence of bits, where the boundaries are. So that's good.

This code has other important properties, too. The property that no code word begins any other is called being a prefix code; that's just the terminology in the field. And it turns out, as I'll show in an example in a second, that you may not actually produce a code that goes all the way to the entropy limit; the average code length may not quite reach H(A). But what Huffman proved was that no lossless source code can have a shorter average code length, a smaller B-bar, than what his procedure gives you. So his procedure is a more realistic reflection of the fewest number of bits you can use and still have a lossless compression algorithm. Shannon's result is just a bound: it says B-bar cannot be smaller than the entropy, but it could be as big as the entropy plus 1 bit. So you can take the Huffman code, set up a coding tree, and figure out the shortest possible average code length, the smallest B-bar. And that means Huffman coding is optimal: no other code can give you fewer bits per letter, on average, for an alphabet than a Huffman code. Very nice.
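Here is a minimal sketch of that tree-building procedure in code, repeatedly merging the two smallest-probability nodes and then reading codewords off the paths. The 0/1 branch assignment is arbitrary, so the particular codewords may differ from the slide, but the codeword lengths, and therefore B-bar, come out the same.

```python
import heapq
import itertools

def huffman_code(probs):
    """Build a Huffman code by repeatedly merging the two least probable nodes."""
    counter = itertools.count()   # tie-breaker so the heap never has to compare dicts
    heap = [(p, next(counter), {letter: ""}) for letter, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)                      # two smallest probabilities
        p2, _, codes2 = heapq.heappop(heap)
        merged = {a: "0" + c for a, c in codes1.items()}         # arbitrary: 0 on one branch...
        merged.update({a: "1" + c for a, c in codes2.items()})   # ...and 1 on the other
        heapq.heappush(heap, (p1 + p2, next(counter), merged))   # node probability = the sum
    return heap[0][2]

probs = {"a1": 1/2, "a2": 1/4, "a3": 1/8, "a4": 1/8}
code  = huffman_code(probs)
b_bar = sum(p * len(code[a]) for a, p in probs.items())
print(code)    # e.g. {'a1': '0', 'a2': '10', 'a3': '110', 'a4': '111'}, up to 0/1 swaps
print(b_bar)   # 1.75 for this alphabet: the smallest possible average length
```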
But there's a little problem with the Huffman code, which otherwise is great: the errors that the channel might introduce. Here's the same example I gave before, and here's our sequence of letters. Suppose a 0 comes out as a 1; pe is not 0, so suppose the channel changes that one bit to a 1. Well, now where are the boundaries? We have a 0; that's fine, that's an a1. But then comes 1, 1, 1, which decodes as an a4, and then a 0 and another 0, an a1 and an a1, et cetera. So what comes out is not the same as what was sent. If there's a channel error, the Huffman decoding can be entirely wrong. If there are no channel errors, it's always true that you can get back the original letter sequence. But if the channel introduces errors, the Huffman code, and in fact any variable-length code, is going to be very sensitive to those errors. So you want to be a little careful about that. We're going to take care of these channel errors in another video, but I'd just like to say: you have to be aware that channel errors can really mess things up. If you send your zip file to somebody and one of those bits gets changed, the entire file system you just put into that zip file can be totally corrupted and wrong, all the file names will be wrong, and so on. So it's really important not to have any errors in the transmission.
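Here is a minimal sketch of that sensitivity, reusing the assumed codeword assignment from the earlier sketch and flipping a single bit to show the decoder losing track of the letter boundaries.

```python
code = {"a1": "0", "a2": "10", "a3": "110", "a4": "111"}    # assumed codewords, as before
decode_table = {bits: letter for letter, bits in code.items()}

def decode(bitstring):
    letters, current = [], ""
    for bit in bitstring:
        current += bit
        if current in decode_table:
            letters.append(decode_table[current])
            current = ""
    return letters

message = ["a1", "a1", "a3", "a1", "a2", "a1", "a4"]
bits = "".join(code[a] for a in message)

# Flip one bit (the second one), as the channel might with probability pe.
corrupted = bits[:1] + ("1" if bits[1] == "0" else "0") + bits[2:]

print(decode(bits))        # the original message comes back exactly
print(decode(corrupted))   # a different, wrong letter sequence
```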
Now I want to point out something about Morse code. Morse code was invented for the telegraph, and I'm sure you know that SOS is dot, dot, dot, dash, dash, dash, dot, dot, dot. It's the early way of representing letters with dashes and dots, and if you think about it for a second, that's the same as representing the letters of the alphabet with 0s and 1s, just by a different name. And the Morse code turns out to be very reminiscent of the idea behind the Huffman code, in that the most probable letter, the E, is represented by a very short sequence, one dot, while the S, which has a smaller probability, has more characters, three dots. So that's the right kind of idea. The most improbable letter in the list, I think, is the J, and it's got the most dashes and dots of anything in the Morse code. So that's the right kind of idea. Well, how well does it work? If you go through, using the letter probabilities, and figure out the entropy, you come up with 4.14 bits. And if you figure out the average number of bits used by the Morse code, it turns out to be 5.56, which, I will point out, is more than a bit away from the entropy; that says you can do better than the Morse code.

Well, the question is why? What's going on? The answer turns out to be that there's one little thing they didn't do right. Like I said, an E is one dot and an S is three dots. Well, how do you know that an S is not really three E's? Three E's would be three dots, and an S is three dots. What they did in the telegraph case is the operator would send a letter and then pause before going on to the next letter of the alphabet. So for an S, the three dots would be right next to each other, but for three E's there would be pauses between the dots. Those pauses act as what are called commas. You cannot decode the original sequence of letters if everything is all jammed together. And if you count that extra pause, that's where the 5.56 comes from, it turns out. So it's inefficient.

So here is the Huffman code, which I found, for the same set of probabilities, and you can see some of these code words are very long. But what's the average code length? That's kind of a surprise: it turns out it does a whole lot better than the Morse code. It's 4.35. It doesn't get to the entropy limit, but Huffman's result says you can't do any better than 4.35: there is not going to be any lossless variable-length code with a shorter average code length than this one. It's not unique, the particular sequences of 0s and 1s may be different, but the smallest average you can achieve is 4.35. You can't reach the entropy limit, but you're only about 0.2 bits away from it, and that's significantly less than the Morse code: more than a bit is saved per letter, on average, with the Huffman code.

So that's the idea of source coding, what's called compression. Most data sources can be compressed, in a lossless sense, with far fewer bits than a simple code, which you might just pick off the shelf. The effect of doing that, by the way, is to lower the source data rate. So compared with the simple codes we've been talking about to date, we can do better: if we use a variable-length code, we can dramatically cut down the data rate. You can also use lossy compression, but then you can't get back the original sequence of letters, and you have to design things so that what you lose doesn't matter very much; that's what MPEG, JPEG, and MP3 do.
Now, the other downside, of course, is that these lossless, variable-length codes are going to be very sensitive to channel errors. And so what we're going to build up to in succeeding videos is ways of getting around those channel errors, so that they're not going to be much of a problem, even though the analog channel is noisy and has a pe that's greater than 0.