So, Huffman coding is an optimal prefix-free code. Can we have a better data compression algorithm? As you might suspect, the answer is yes, and that's what we're going to look at next: the LZW algorithm, named after Abraham Lempel and Jacob Ziv, and the W is for Terry Welch, who got involved later. The way we improve on Huffman coding is by changing the rules under which it operates, and we've already changed the rules once, so let's take a high-level view.

One idea for codes is that, for all text we ever encounter, we use the same code, the same model of the text. Everybody's got that model, so we don't have to transmit it. But the problem is that one text might have a lot of As and another might have a lot of Zs; they have different statistical properties, so a single fixed code is not going to be optimal for a given text. Still, over the preponderance of texts that we see, it can be effective, and that's really what ASCII is. For Huffman, we changed that rule and said we're going to use a different encoding depending on what the text is: we look at the text, figure out the character frequencies, and then create a code that works for that text. Now we have to transmit the code along with the message; that's part of the Huffman method. For every text we've got a code that's suited to that text.

What we're going to look at now is another change in the rules, called an adaptive model, where we read the text one character at a time, but we use the text to update the model, learning about it as we read, and we assume the decoder does the same thing. This can give more accurate modelling and better compression throughout the text. It's a very effective method, so let's look at how it works, and we'll do that through an example. We're going to use hex values. In LZW encoding we start with our input, which is characters, say ASCII characters, and what we're going to do is keep a dictionary of code words. The code words are fixed length: we'll use eight bits, or two hex characters, for the output. We start out with just the ASCII value for each of the characters that might appear in the message. That's the starting point, and this is going to be compression for this example string.
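To make that setup concrete, here is a minimal sketch of the initial table, assuming a plain Java map rather than the trie we'll get to shortly; the class name LZWSketch, the 7-bit ASCII alphabet, and the choice of 0x80 as the stop code word are assumptions made to match the example, not the lecture's actual code.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not the lecture's TST-based code): the starting table,
// with one 8-bit code word per single character. Assumes 7-bit ASCII input
// and reserves 0x80 as the stop code word, as in the example.
public class LZWSketch {
    static final int R = 128;          // input alphabet: 7-bit ASCII characters
    static final int STOP = R;         // 0x80 marks the end of the message
    static final int MAX_CODES = 256;  // 8-bit (two-hex-digit) code words

    static Map<String, Integer> initialTable() {
        Map<String, Integer> table = new HashMap<>();
        for (int c = 0; c < R; c++)
            table.put("" + (char) c, c);   // e.g. "A" -> 0x41, "B" -> 0x42
        return table;
    }

    // (compression, expansion, and a small trace are sketched further below)
}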
At the beginning we just pick off characters and output their values from the table. But every character that we read gives us a little new information. For example, at this point we've seen AB, and what we're going to do is say: okay, we've seen AB; maybe it occurs again in the text, and since we've seen it, we know what it is, so we'll just give it the value 81 and maintain that in the table. If we see AB again, we'll use those eight bits. So we put out the B, but we remember the fact that we saw AB. Similarly, when we see BR, that becomes an 8-bit code word that we can put out later. And the idea is that the decoder, the expander, can do the same thing. We don't actually have to transmit this table; we know the expander is going to do the same thing and will build the same table.

So now we see the R. That makes BR, which is going to be encoded with 82, but the R itself is just 52, so we put that out. Now we see the A, so RA is going to be 83 if we see it again, but the A itself is just 41. Then AC is going to be 84 and the C is 43; CA is going to be 85 and the A is 41. So what we're doing is building up a table that is a model for what the string contains, and it will allow us to get better compression as we get later into the message. Now we're reading the A after the D, so DA is going to be 87.

But now we see that the next letter is B. We look in our table for AB, and it's there, so we can put out 81 instead of just putting that A out. At this point, when we look at the B, we know that we had AB; we look that up in the table and just put out 81. Similarly, when we look at the R, our previous match was AB, so now we're going to code ABR with 88. Again, RA: we look that up in the table and can put out 83, and we remember that RAB is going to be 89. So now, what happens next? We look at the next characters and find the longest prefix that we can match in the table, and that's the code word we put out. In this case we have BR, which is in there as 82, and the next character is A, so we put BRA in the table. Now this is what remains to be encoded, and we look for the longest prefix that we know the code for; in this case it's ABR, which is 88. So the previous characters in the text built a model that allows us to encode the text more effectively. The core step, finding the longest prefix already in the table, emitting its code word, and adding one new entry, looks roughly like the sketch below.
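This is a hedged sketch, continuing the hypothetical LZWSketch class from above, of that compression loop with a brute-force stand-in for the trie's longest-prefix search; the method names compress() and longestPrefixOf() and the use of a java.util.List for the output are illustrative assumptions, not the lecture's implementation.

// Compression loop: find the longest prefix of the remaining input that is
// already in the table, emit its code word, then add that prefix plus one
// lookahead character as the next unassigned code word.
static java.util.List<Integer> compress(String input) {
    Map<String, Integer> table = initialTable();
    java.util.List<Integer> output = new java.util.ArrayList<>();
    int nextCode = STOP + 1;                  // 0x81 is the first new entry
    int i = 0;
    while (i < input.length()) {
        String s = longestPrefixOf(table, input, i);   // longest match in the table
        output.add(table.get(s));                      // emit its fixed-length code word
        int t = i + s.length();
        if (t < input.length() && nextCode < MAX_CODES)
            table.put(input.substring(i, t + 1), nextCode++);  // e.g. "AB" -> 0x81
        i = t;                                         // scan past the matched prefix
    }
    output.add(STOP);                                  // stop code word at the end
    return output;
}

// Brute-force stand-in for the trie's longestPrefixOf() operation.
// Assumes the input is ASCII, so every single character is in the table.
static String longestPrefixOf(Map<String, Integer> table, String input, int i) {
    String best = "";
    for (int j = i + 1; j <= input.length(); j++) {
        String s = input.substring(i, j);
        if (table.containsKey(s)) best = s;
        else break;   // safe: every prefix of a stored key is also stored
    }
    return best;
}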
Back in the example, that's 88 for ABR, and then all that's left is the last A. For this small example the compression is not great, but the idea is still there: there are definitely some savings, because the input was 8-bit ASCII characters and these code words are also eight bits, and the key thing is that we don't have to transmit the table, because we can assume the expander is going to rebuild the table the same way we built it. That's the basic idea behind LZW compression. We also have a stop character at the end that says it's the end of the message.

So, this is the summary of LZW compression. We create a symbol table that associates fixed-length code words with string keys, and we initialize it with code words for single-character keys. When it's time to encode part of the input, we look for the longest string in the symbol table that's a prefix of the part of the input we haven't yet encoded; that string, the longest one, is the best we can do, and we have a fixed-length code word for it. Then we look ahead to the next character in the input and add a new entry to the symbol table with that one extra character. That's LZW compression.

One question that comes up is how we represent that code table in code, and actually the answer is really simple: we use a trie. If you remember, when we looked at tries, one of the operations they support is longest prefix match. So this trie represents all the prefixes that we have seen so far in the message, and it's very easy: if the next four characters in the text are ABRA, then we have the code word for it. For anything we've seen in the text, we've got a fixed-length code word. So, as the trie gets bigger and we can move further down into it, we get better compression, because we're using the same-length code word to compress more letters. A simplified version of that longest-prefix-match operation is sketched below.
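As a rough illustration of that idea, and only an illustration, since the lecture's code uses a TST rather than this R-way trie, here is what a trie node and the longest-prefix walk might look like; TrieNode, put(), and longestPrefixLength() are made-up names for this sketch.

// Each node has one link per character; nodes that end a stored key carry its
// code word. To find the longest stored prefix of the input, march down the
// trie one input character at a time, remembering the last code word seen.
static class TrieNode {
    Integer code;                            // code word, if this node ends a key
    TrieNode[] next = new TrieNode[R];       // one link per ASCII character
}

static void put(TrieNode root, String key, int code) {
    TrieNode x = root;
    for (int d = 0; d < key.length(); d++) {
        char c = key.charAt(d);
        if (x.next[c] == null) x.next[c] = new TrieNode();
        x = x.next[c];
    }
    x.code = code;
}

// Length of the longest stored key that is a prefix of input[i..].
static int longestPrefixLength(TrieNode root, String input, int i) {
    TrieNode x = root;
    int best = 0;
    for (int d = i; d < input.length() && x != null; d++) {
        x = x.next[input.charAt(d)];                   // follow the next character
        if (x != null && x.code != null) best = d - i + 1;
    }
    return best;
}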
So here's the implementation of LZW compression, a very simple implementation for such a sophisticated algorithm, really. We're actually going to use a TST, so that we won't have to worry about the extra space. The first thing we do is initialize the TST with a code word for each of the single characters; the radix is R, so there are R different letters, and we just put an entry into the trie for each of those letters along with its code. We only showed a few letters in the example, but in general we assign code word i to the ith letter.

Then the rest of the trie is built from the input string. The very first thing we do is find the longest prefix of the input string that's in the symbol table. That longestPrefixOf() method just marches down the trie, eating off the characters of the input string one at a time, until it gets to the bottom, where it has a code word. The symbol table method actually gives us back the string, and then we can use that string in a second call to get the value associated with it, which is the code word. Then we take the length of that longest prefix match, add one more character to it, the next character in the input, add that new entry to the symbol table, and scan past the match in the input. That's the entire compression algorithm for LZW using a trie. What it's doing is writing out a fixed-length code word while chewing up the longest substring that we've seen before, and at the end it writes out a stop code word and closes the output stream. So, using the trie technology that we've developed, we have a compact implementation of LZW compression. As a quick check on the sketches above, here's what running them on the example string should produce.
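This is a hypothetical main() that runs the earlier sketch on the example string, which from the walkthrough appears to be ABRACADABRABRABRA; if the sketch tracks the lecture's table, the printed code words should match the ones we just derived.

// Trace the sketch on the walkthrough's example string. Expected output:
// 41 42 52 41 43 41 44 81 83 82 88 41 80
public static void main(String[] args) {
    for (int code : compress("ABRACADABRABRABRA"))
        System.out.printf("%02X ", code);
    System.out.println();
}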
Now, let's look at the expansion. Expansion is going to be basically the same code. All that the expansion algorithm gets is the sequence of fixed-length code words, and it maintains its own code-word table in order to do the expansion. We start it out the same way, by generating the table entry for each single letter that might appear in the message. But now the roles are switched: the fixed-length code words are the keys and the strings are the values. So we just look up each key and write out its value. We see a 41, that's an A; a 42, that's a B. But not only do we want to put out the character corresponding to the code word, we also want to build up code words in the same way that the compression method would have done. The compression method, having seen A and then B, would have associated the string AB with a new code word.

Now we read a 52. We look up 52, that's an R, but we also put a new entry in the table for the string BR, the same way the compression algorithm would have done. We see 41, we get an A, and now we put RA in the table; 43, we put out a C and put AC in the table; 41, we put out an A and put CA in the table; 44, we put out a D and put AD in the table. And now 81: we look it up, and we have AB, and once we have seen the AB we know the compression would have put DA in the table. That's a little bit tricky, because the expansion is kind of one step behind the compression: it's got to put out the characters before it fully knows the table, and that does lead to a slightly tricky situation that we'll talk about. So now 83 is RA, and once we've put out the RA we know that compression would have added ABR. Now 82 is BR, and again we know compression would have put RAB in there. Then 88 is ABR, and compression would have put BRA in there. 41 is A, and once it put out that A it would have put ABRA in there. And then 80 is the stop character. So that's expansion: building the table in the same way the compression would have, and then using the table to figure out the string associated with each fixed-length code word.

This is the summary of expansion, which is pretty much the same as compression, just in reverse. We create a symbol table that has fixed-length keys and associates string values with them, and we put in the single-character values. We read a key, find the string value, and write it out. Then we update the symbol table from the last two things: the string we wrote out before, plus the first character of the string we just wrote out. And to represent the code table this time we can just use an array, because our keys are fixed length and we assign them in order, one at a time, so we can store the strings in an array and use the code-word values to index into the array and get the right string. So that part's easier. Putting the pieces together, a sketch of the whole expansion procedure follows.
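Here is a hedged sketch of expansion in the same hypothetical LZWSketch class, using a plain string array indexed by code word as just described; note that this plain version deliberately ignores the tricky case we're about to discuss.

// Expansion: code words are the keys (array indices) and strings are the
// values. After writing each string we add one new entry, the previous string
// plus the first character of the current one, exactly the entry the
// compressor would have added at that point.
static String expand(java.util.List<Integer> codes) {
    String[] table = new String[MAX_CODES];
    for (int c = 0; c < R; c++) table[c] = "" + (char) c;   // single characters
    int nextCode = STOP + 1;

    StringBuilder out = new StringBuilder();
    String prev = null;
    for (int code : codes) {
        if (code == STOP) break;
        String s = table[code];                  // look up the string for this code word
        out.append(s);
        if (prev != null && nextCode < MAX_CODES)
            table[nextCode++] = prev + s.charAt(0);   // e.g. "A" + 'B' -> entry 0x81
        prev = s;
    }
    return out.toString();
}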
So now there's a tricky case, and really everyone needs to work through it a few times to understand what's going on; it's worth doing once even in lecture. Let's look at the case where the input is ABABABA and just follow the algorithm through.

We get an A, we look it up, and it's 41, so that's what we put out. Then we look at the next character: it's B, so we remember AB as 81 in our code-word table. Then we read the B, look it up, and it's 42, and the next character is A, so BA is going to be 82. The next character is B; we look up AB and we have 81, so we can put out the 81. Now we look ahead, and the next character is A, so we're going to make a code for ABA: that's going to be 83. Now the rest of the string to be encoded is ABA, and our longest prefix match is ABA, so we put out 83, and that's the end of the string, so we put out the 80 that marks the end. That's compression for this string, working the same way as in the other example.

Now let's look at expansion for this case. It starts out the same way. We see a 41 and look it up in our table (again, this can just be an indexed array), so that's an A; that's our starting point. Then 42 is a B, and looking back, we just put out an A and then a B, so our compression algorithm would have encoded AB as 81; we know that. So when we see 81 we've got AB, so we can put out the AB, and the next entry in our table is going to be BA, because that's what the compression would have added. But now we've got a problem: the next code word that we need to expand is 83, but the key with value 83 is not in the symbol table yet.

That's definitely a tricky case, so let's go through it again. At the time that we read the 83, we need to know which string has value 83 before it gets into the symbol table. But we know what that entry has to be: it's the previous string, AB, followed by one more character, and that character is the first character of the string that 83 stands for, which itself starts with AB, so the character is A. That means 83 must stand for ABA. The code in the book, which you can examine, has a way to test for this case, and it's definitely a tricky case that has to be handled in LZW expansion: we expand 83 at the same time that we put it into the symbol table. A sketch of how that check might look is given below.
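Here is one hedged way the check might look, written as a revised version of the expand() sketch above (a replacement for it, not an addition): when the code word we read is exactly the next one the table is about to assign, its string must be the previous string plus that previous string's own first character.

// Revised expansion that handles the tricky case: e.g. reading 83 when the
// table only goes up to 82 decodes to "AB" + 'A' = "ABA".
static String expand(java.util.List<Integer> codes) {
    String[] table = new String[MAX_CODES];
    for (int c = 0; c < R; c++) table[c] = "" + (char) c;
    int nextCode = STOP + 1;

    StringBuilder out = new StringBuilder();
    String prev = null;
    for (int code : codes) {
        if (code == STOP) break;
        String s = (code < nextCode || prev == null)
                 ? table[code]                  // ordinary case: already in the table
                 : prev + prev.charAt(0);       // tricky case: one step ahead of the table
        out.append(s);
        if (prev != null && nextCode < MAX_CODES)
            table[nextCode++] = prev + s.charAt(0);
        prev = s;
    }
    return out.toString();
}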
Okay, so there are all different kinds of details for LZW; we've just given one version. Sometimes people find it effective to clear out the symbol table after a while: how big do we make the symbol table, how big do we let it get, and maybe it's not worthwhile to keep older parts of the message, so the model should adapt to later pieces of the message. There are many other variations like that that have been developed. For example, what GIF does is, when the symbol table is full, just throw it away and start over. Unix compress keeps a measure of how well it's doing and throws the table away when it's not being effective. And there are many, many other variations: why not add even longer substrings, why just one extra character at a time, why not two, and so forth; many variations like that have been developed.

Actually, in the real world the development of this technology was influenced by patent law. There's a version called LZ77 and another called LZW, because these researchers worked for a company and the method was patented; for a while you couldn't use LZW unless you paid the patent holder, but that patent expired in 2003, and then things opened up again. So there are lots and lots of effective methods that are variants of LZW, and really, to do a good job, you might also need to include Huffman coding for the characters, or run-length encoding; it's really combinations of these basic methods that we've talked about that are most effective. So you see this technology all over the computational infrastructure that you use every day. At the time we were talking about tries, they may have seemed a bit abstract, but tries are definitely part of the algorithmic infrastructure, and they're a really fine example of a clear, clean, elegant data structure and algorithmic idea being used to great effect in the real world.

And there are plenty of people still doing research in this area; even in lossless data compression there are still new ideas being developed. There's a famous benchmark set of texts: if you think you have a good new compression algorithm, you can run it on this benchmark. ASCII is seven bits per character (it's eight bits, but one of the bits is redundant, so you immediately get down to seven), and these are the kinds of compression ratios that people have achieved with various methods, mostly Lempel-Ziv variants, down to less than half of what you get with ASCII. But it continues to come down: there was a completely new method, the Burrows-Wheeler method, developed in the 90s, that took a big jump down, and a few more methods that have continued to improve, even through the 90s.
So there's still room for improvement in lossless data compression, but it's a really fine example of the power of good algorithmic technology.

Here's our quick summary on data compression. We considered two classic, fundamental algorithms. The first one, Huffman coding, represents fixed-length symbols with variable-length codes: the prefix-free code tries to use a smaller number of bits for the most frequently used symbols. The other way around, the Lempel-Ziv method takes variable-length symbols and uses fixed-length codes. It's interesting to think about those two ends of the spectrum. There are plenty of compression algorithms out there that don't try to reproduce the input exactly but just get an approximation, using the FFT, wavelets, and many other mathematical techniques, and that's appropriate for things like pictures and movies, where maybe you can miss a few bits. But lossless compression is still very important: when you download an application, which is machine code, onto your computer, you can't afford to have even one bit be wrong. So that's why lossless compression will always be with us.

There's a theory of compression, with theoretical limits. The measure is called the entropy, the Shannon entropy of a text, and it's a very fundamental way to indicate how much information there is in a text as a function of its character frequencies: it's a sum, over all characters, of the frequency times the log of the frequency (in standard form, H = -Σ p(c) log₂ p(c)). With these methods we can get very close to the entropy in some theoretical settings. But in practice, the most effective methods use tricks and techniques that exploit extra knowledge about the data being compressed to really get the best kind of results. As I explained, if you're compressing a page like this one, you'd better use a method that does really well on huge amounts of white space, for example. That's LZW compression, our last compression algorithm.