Today we're going to talk about data compression. This is a family of algorithms that everyone uses, and it's based on many of the classical ideas that we've covered in implementing basic algorithms. We'll start with an introduction: just what data compression is. The idea is to reduce the size of a file, really for two primary reasons. One is to save some space when storing it, and the other is to save some time when transmitting it. Often, getting significant savings is not too difficult, because most files have plenty of redundancy; but of course we're interested in efficient algorithms that can get the best savings possible.

Now, you might think that with the way technology has been improving over recent years, we wouldn't really need compression — we could just buy a bigger, faster disk. The problem with that idea is that even though Moore's law tells us we're going to get twice as much space every one and a half to two years, it's also the case that whatever space we have, we're going to want to fill it up with better, higher-quality data. You want a higher-resolution movie if you get more space. And if we can remove redundancy, you're always going to want to do that. So it's a cheap way to save space that lets us do more with the space we have available. Here's a report from last year on big data: every day we create 2.5 quintillion bytes of data — so much that 90 percent of the data in the world today has been created in the last two years alone. If we can use data compression to cut that amount of data in half, then that's all the data in the world created in a year.

So this is a very interesting topic to consider from an algorithmic point of view. Some of the basic concepts date back to the 1950s, when it was extremely important to save time when transmitting information, or space when storing it, because both were so costly. But some of the best technology — algorithmic technology — has been developed recently, and much of it uses the data structures that we've considered for other applications. We'll see that as we get into the topic.

Some specific applications may be familiar. For file compression, all file systems and disks have built-in compression technologies, such as zip and bzip2 and many others of a similar type.
For multimedia, everybody's familiar with JPEG, MP3, MPEG, and all those sorts of formats for images, sound, and video — those are all about data compression. For communication, it's what enabled fax, and it also enables newer technologies like Skype: the ability to reduce the amount of data that you actually need to send, for any kind of communications technology, is going to improve the quality of what you can do. Also, think of the massive amounts of data that people are saving nowadays: Google and Facebook and the other web giants are saving lots of data, and they'll be able to save more because of data compression.

Now, from a theoretical point of view, the type of compression we're going to consider — lossless compression — is conceptually very simple. We talk about a bitstream: everything can be encoded in bits, so we might as well just consider a stream of bits that we want to compress, and we'll call it B. A data compression algorithm is a black box that takes those bits and produces a compressed version, which we'll call C(B), and we hope C(B) is many fewer bits. That's what we save or send, but then we need a companion algorithm, an expansion algorithm, that takes C(B) and produces B back. That's lossless compression: we want to get back precisely the same bits we put in. That's very important in many applications. It's not so important in some applications, like images and video, where you're happy with an approximation to the original bitstream, but we're not going to consider that kind of algorithm. We're only going to consider lossless compression, where maybe it's a program, or something else where you can't afford to lose a single bit.

In terms of evaluating our algorithms, what we talk about is the compression ratio. We also care about efficiency — we don't want to spend a lot of time creating C(B) — but the compression ratio is the measure we use to compare algorithms: we're interested in what percentage of the bits we're able to save. And surprisingly, even for natural-language text, we can get 50 to 75 percent or even better.
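To make that contract concrete, here's a minimal sketch in Java. The method names and the byte-array interface are made up for illustration — they are not the library we'll use later in the lecture — and the compress and expand bodies are placeholders for a real algorithm.

import java.util.Arrays;

public class LosslessContract {

    // Hypothetical compressor: produces some encoding C(B) of the input bits.
    static byte[] compress(byte[] b) {
        return b.clone();   // placeholder: a real algorithm would usually return fewer bytes
    }

    // Companion expander: must reproduce B exactly, bit for bit, from C(B).
    static byte[] expand(byte[] cb) {
        return cb.clone();  // placeholder
    }

    public static void main(String[] args) {
        byte[] b  = "ABRACADABRA!".getBytes();
        byte[] cb = compress(b);

        // The lossless requirement: expand(compress(B)) equals B exactly.
        System.out.println("lossless? " + Arrays.equals(expand(cb), b));

        // Compression ratio = bits in C(B) / bits in B (smaller is better).
        System.out.println("compression ratio: " + (double) cb.length / b.length);
    }
}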
In the larger context, data compression has been with us since antiquity — it's always better to try to express what you're doing concisely. Mathematical notation is an example of that, or number systems, or even natural languages. Communications technology has evolved in ways that very much have to do with data compression, from braille to Morse code to the telephone system; those all depend on achieving good data compression. And in modern life, everybody's trying to get pictures or movies onto their devices, and that's definitely enabled by good data compression. So, something to think about is what role data compression will play in the future — but it's really hard to see it going away. Number one, it's effective. Number two, it's not that difficult to implement, and it's in software, so in many situations you don't have to buy a bigger disk to have a good compression algorithm.

Here's a simple example that sprang up just in recent years. We've talked in other applications about the fact that a genome is a string over a four-letter alphabet, and the string might be huge — billions of letters — and there might be lots of them, for different individuals and different species. So there are huge genomic databases out there holding very long strings over the alphabet A, C, T, and G. Now, the standard way to store them in standard programming languages on standard computers is to use ASCII, at eight bits per character, so an N-character genomic string takes 8N bits. If it's a billion characters, that's eight billion bits. But just look at the problem: there are really only four characters, so you can get by with just two bits per character. That's one quarter the storage — if you spent X dollars on the space to store your data, you could spend X/4, and that might be a significant number. In general, if you know you have a small alphabet, you can get savings this way. This seems very obvious and straightforward, but it's amazingly true that in the 1990s, when genomic databases started to emerge, they were four times too big: they used ASCII and didn't think to use this trivial encoding that could have cut costs by a factor of four. So again, if it's so easy to do, why not do it, if you can cut your cost by a factor of four? Where else would you so easily give up reducing cost by a factor of four? That's why we're interested in data compression.
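Here's a rough sketch of that two-bits-per-character idea — a simplified stand-in, not the booksite's genome code. It packs four bases into each byte and, to stay short, assumes the string length is a multiple of four.

public class GenomePacking {

    // Map each base to a 2-bit code: A=00, C=01, G=10, T=11.
    static int code(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            default:  return 3;   // 'T'
        }
    }

    // Pack a genome string at 2 bits per base, 4 bases per byte.
    // Simplifying assumption: genome.length() is a multiple of 4.
    static byte[] pack(String genome) {
        byte[] packed = new byte[genome.length() / 4];
        for (int i = 0; i < genome.length(); i++) {
            int whichByte = i / 4;
            int shift = 6 - 2 * (i % 4);                     // fill each byte from high bits to low
            packed[whichByte] |= code(genome.charAt(i)) << shift;
        }
        return packed;
    }

    public static void main(String[] args) {
        String genome = "ACGTACGTTTAG";                      // 12 bases = 96 bits in ASCII
        byte[] packed = pack(genome);
        System.out.println(genome.length() + " bytes in ASCII -> "
                + packed.length + " bytes packed");          // 3 bytes = 24 bits, one quarter
    }
}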
Now, we're going to need some tools that are slightly different from the ones we've been using in order to write effective data compression algorithms. We've got extensions of our standard input and standard output libraries that work with bits; these are the methods we'll use to read bits from standard input and write bits to standard output. These bitstreams are different from standard in and standard out — they're binary standard in and binary standard out. With binary standard in we can read a boolean, which is just one bit; or, to save some processing, we can read eight bits of data and put them in a char; or, in general, r bits, for r less than eight, into a char value. We can also put bits into int, long, and boolean values with similar methods, and we don't have to read all the bits if we've got a scheme where we only want to use a certain number of them. The idea is that we can read a variable number of bits and get them ready for processing easily, by putting them into one of Java's primitive types. Conversely, we have binary standard out, where we can write a single bit, or eight bits, or any number of bits from a char value or an int value, and so on. So this lets us take in bitstreams and write out bitstreams, and it takes care of all the conversion between Java primitive types and bitstreams. It's not too difficult to use, and it's very intuitive — you'll see when we come to the code.

Just to give an example of how having this can save space, here's a simple example of representing a date. If you represent the date 12/31/1999 as an ASCII character string in Java, then when you write out the string you've got ten characters, and each character is eight bits, so that's a total of 80 bits. The standard ASCII encoding is to look up the ASCII code of the '1', which is 00110001, then the '2', and so forth — the slash, the '3', the '1', and so on. Eight bits for each character, put together, is 80 bits; if it were Unicode, there would be even more. That's one way you might consider representing a date with bits. Now, if we instead represent it as three int values, that's not such an effective way to do it either: the month is an int, the day is an int, and the year is an int, and each of those is 32 bits, for a total of 96 bits — actually even worse. But the month and all of these fields fall within small ranges, which the packed encoding sketched below takes advantage of.
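Here's a minimal sketch of that packed encoding; the next paragraph walks through the same idea. It assumes the BinaryStdOut class from the booksite (edu.princeton.cs.algs4), and a real program would also need the companion reading code.

import edu.princeton.cs.algs4.BinaryStdOut;

public class PackedDate {
    public static void main(String[] args) {
        int month = 12, day = 31, year = 1999;

        // Write only as many bits as each field really needs.
        BinaryStdOut.write(month, 4);   //  4 bits: values 0..15 cover months 1..12
        BinaryStdOut.write(day,   5);   //  5 bits: values 0..31 cover days 1..31
        BinaryStdOut.write(year, 12);   // 12 bits: values 0..4095 cover the year

        // 4 + 5 + 12 = 21 bits; close() pads to a byte boundary, so 24 bits go out.
        BinaryStdOut.close();
    }
}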
So, with binary standard out, we can just write out the rightmost four bits of the month int, because that gives us a number between 0 and 15, and months are between 1 and 12. Then five bits for the day — that's between 0 and 31, which covers any possible day — and then twelve bits for the year. If we do that, we have a total of 21 bits. There's a little bit extra at the end, because our bitstreams really are implemented, reasonably, in most situations as byte streams, so they're going to get carved into chunks of size eight by the hardware or the drivers. We always round up to the nearest multiple of eight to get proper alignment on bytes. But that's still a pretty significant savings, from 80 bits down to 24, just by having the ability to write out portions of int values. And if we had a billion dates, that would be a huge cost savings that would translate to dollars as well.

The other thing we do, to give some visual impression of the effectiveness of our compression algorithms, particularly in the text, is to look at different ways to examine the contents of a bitstream. On the booksite we have a binary dump program that prints out the bits, in rows of sixteen bits at a time, and then the total number of bits. This file abra.txt is a standard character stream, so everything is encoded with eight bits; there are twelve characters, so it's 96 bits, and you can see what all the bits are. Sometimes it's more convenient to use a hex dump, where we print the hex digits, taking the bits four at a time; those 96 bits are twelve bytes, or 24 hex digits, and we print them out so we can see them. Another thing that's sometimes useful for big files is to look at the bits as a picture: we give the width and height of a pixel window and map the bits to white squares and black squares. This is the same abra.txt — it's just that you can see much more easily that the characters all start with 01. This visual representation sometimes gives an interesting perspective on what a compressed file, or any bunch of bits, looks like.
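Just to show how simple these inspection tools are, here's a rough stand-in for that kind of hex dump — a sketch, not the booksite's HexDump program. It reads bytes from standard input and prints each one as two hex digits, sixteen bytes to a row, then the bit count.

import java.io.IOException;

public class TinyHexDump {
    public static void main(String[] args) throws IOException {
        int b;
        int count = 0;
        // One byte = 8 bits = 2 hex digits.
        while ((b = System.in.read()) != -1) {
            System.out.printf("%02x ", b);
            if (++count % 16 == 0) System.out.println();
        }
        if (count % 16 != 0) System.out.println();
        System.out.println(8 * count + " bits");
    }
}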
This is just the hex-to-ASCII conversion table; if you're not familiar with it, it's a quick way to find the hexadecimal corresponding to an ASCII character. You look up, say, R in the table, and it's in row 5 and column 2, so it's 52 — that's the 52 for the R in ABRACADABRA. Those are the basic tools we're going to use when we talk about compression algorithms.

Now, there's one thing that's really important to think about, and that's the idea that you cannot have a compression algorithm that can compress every file — despite the fact that somebody has patented one, with a claim that the method is capable of compressing all files. And there's another item that came out of Slashdot a few years ago, where a company announced a breakthrough in data compression that does 100-to-1 lossless compression of random data. As the posters dryly noted, if that were true, our problems would have just gotten a lot smaller. There's no way it's true, so let's look at why.

The proposition is that no algorithm can compress every bit string. Here's one easy way to prove it, by contradiction. Suppose we have an algorithm that can compress every possible bit string. We take a bit string and use the algorithm to compress it, getting a smaller bit string. But since the algorithm can compress every bit string, why not use it on that one? Use it on that one, and you get a smaller one still, and we keep going. Since it can compress every bit string, eventually we get down to a bit string of size one, and since it can compress that one too, we get down to zero bits. So an algorithm that compresses every bit string would mean every bit string can be compressed to zero bits, and that's absurd. That's the proof by contradiction.

Another way to see the same thing is just by counting. Say you've got an algorithm that can compress all 1000-bit strings, just to pick a number. There are a lot of 1000-bit strings — there are 2^1000 of them. How many of those can be encoded with 999 bits or fewer? At most 1 + 2 + 4 + ... + 2^999, which adds up to 2^1000 − 1, so at least one 1000-bit string cannot be compressed at all. And if you want to use many fewer bits, you have a much worse problem: only about one out of every 2^499 of them can be encoded with 500 bits or fewer. A smaller number of bits simply gives you far fewer possible bit strings, and you have to be able to get your original bit string back, so a smaller number of bits puts a hard limit on the number of bit strings you can possibly compress.
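As a quick worked calculation, this just restates the counting above in symbols:

\[
1 + 2 + 4 + \cdots + 2^{999} \;=\; 2^{1000} - 1 \;<\; 2^{1000},
\qquad
\frac{2^{501} - 1}{2^{1000}} \;\approx\; 2^{-499}.
\]

The first sum counts every bit string of at most 999 bits, so at least one 1000-bit string has no shorter encoding; the second ratio says only about 1 in 2^499 of the 1000-bit strings can be encoded with 500 bits or fewer.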
So we can't have universal data compression. If you want to convince yourself of that, just try running your favorite compression algorithm on its own output, say for a photo. Most good compression algorithms will figure out that you're doing that and just give you the same file back, so at least it won't get larger — but there are plenty of methods out there that will make your file larger.

There's another basic idea, and that is that there's no way to find the best way to compress a file. If you think about it, optimal compression can be extremely complicated. This is just an example that illustrates the point, but it's also a relevant example, because it applies to so much of the data we deal with. At the top is a picture dump of a million-bit file. Those are a million bits, and they're random — well, not random, pseudorandom. What do I mean by pseudorandom? We've talked about that: they're generated by a program. This is a Java program that uses a pseudorandom number generator to generate bits, and that's where those bits came from. But if you think about it, this program is a representation of those million bits, and it's only a couple of hundred characters. So if you want to find the optimal way to compress those bits, one of the things you'd have to consider is this program — that's a pretty compact way to represent a million bits. And so much of the data that's out there actually was created by a program, so optimal compression amounts to finding the program that created the data. That's obviously a pretty difficult problem; in fact, it's known to be undecidable. So let's not think that we can find the best possible compression algorithm.
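A program along these lines — just a sketch with a made-up seed, not the actual program from the slide — is itself a complete representation of a million pseudorandom-looking bits in a few hundred characters:

import java.util.Random;
import edu.princeton.cs.algs4.BinaryStdOut;

public class MillionBits {
    public static void main(String[] args) {
        // A fixed seed makes the output reproducible: the program fully determines the bits.
        Random random = new Random(12345);
        for (int i = 0; i < 1000000; i++)
            BinaryStdOut.write(random.nextBoolean());   // emit one pseudorandom bit
        BinaryStdOut.close();
    }
}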
The other point I want to finish the introduction with is to realize that when we're talking about compressing, say, English-language text, it's amazing how much redundancy there actually is in the English language. This is a perturbed version of an English-language paragraph showing that you can change letters, and even delete letters, and still figure out what it says. In fact, toward the end, in a lot of situations you really only need the first and last couple of letters of each word for the text to remain readable. So there's a lot of freedom, because there's so much redundancy — and since there's so much redundancy, we can actually do really well compressing English-language text, for example. So that's an introduction to data compression, and we'll take a look at the algorithms next.