Today we're going to talk about data compression. This is a family of algorithms that everyone uses, and it's based on many of the classical ideas that we've covered in implementing basic algorithms. We'll start with an introduction: just what data compression is. The idea is to reduce the size of a file, really for two primary reasons. One is to save some space when storing it, and the other is to save some time when transmitting it. Often, getting significant savings is not too difficult, because most files have plenty of redundancy; but of course we're interested in efficient algorithms that can get the best savings possible.

Now, you might think that with the way technology has been improving over recent years, we wouldn't really need compression — we could just buy a bigger, faster disk. The problem with that idea is that even though Moore's law tells us we're going to get twice as much space every one and a half to two years, it's also the case that whatever space we have, we're going to want to fill it up with better, higher-quality data. You want a higher-resolution movie if you get more space. And if we can remove redundancy, you're always going to want to do that. So it's a cheap way to save space that lets us do more with the space we have available. Here's a report from last year on big data: every day we create 2.5 quintillion bytes of data — so much that 90 percent of the data in the world today has been created in the last two years alone. If we can use data compression to cut that amount of data in half, then that's all the data in the world created in a year.

So this is a very interesting topic to consider from an algorithmic point of view. Some of the basic concepts date back to the 1950s, when it was extremely important to save time when transmitting information, or space when storing it, because both were so costly. But some of the best technology — algorithmic technology — has been developed recently, and much of it uses the data structures that we've considered for other applications. We'll see that as we get into the topic.

Some specific applications may be familiar. For file compression, all file systems and disks have built-in compression technologies, such as zip and bzip2 and many others of a similar type.
For multimedia, everybody's familiar with JPEG, MP3, MPEG, and all those sorts of formats for images, sound, and video — those are all about data compression. For communication, it's what enabled fax, and it also enables newer technologies like Skype: the ability to reduce the amount of data that you actually need to send, for any kind of communications technology, is going to improve the quality of what you can do. Also, think of the massive amounts of data that people are saving nowadays: Google and Facebook and the other web giants are saving lots of data, and they'll be able to save more because of data compression.

Now, from a theoretical point of view, the type of compression we're going to consider — lossless compression — is conceptually very simple. We talk about a bitstream: everything can be encoded in bits, so we might as well just consider a stream of bits that we want to compress, and we'll call it B. A data compression algorithm is a black box that takes those bits and produces a compressed version, which we'll call C(B), and we hope C(B) is many fewer bits. That's what we save or send, but then we need a companion algorithm, an expansion algorithm, that takes C(B) and produces B back. That's lossless compression: we want to get back precisely the same bits we put in. That's very important in many applications. It's not so important in some applications, like images and video, where you're happy with an approximation to the original bitstream, but we're not going to consider that kind of algorithm. We're only going to consider lossless compression, where maybe it's a program, or something else where you can't afford to lose a single bit.

In terms of evaluating our algorithms, what we talk about is the compression ratio. We also care about efficiency — we don't want to spend a lot of time creating C(B) — but the compression ratio is the measure we use to compare algorithms: we're interested in what percentage of the bits we're able to save. And surprisingly, even for natural-language text, we can get 50 to 75 percent or even better.
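To make that contract concrete, here's a minimal sketch in Java. The method names and the byte-array interface are made up for illustration — they are not the library we'll use later in the lecture — and the compress and expand bodies are placeholders for a real algorithm.

import java.util.Arrays;

public class LosslessContract {

    // Hypothetical compressor: produces some encoding C(B) of the input bits.
    static byte[] compress(byte[] b) {
        return b.clone();   // placeholder: a real algorithm would usually return fewer bytes
    }

    // Companion expander: must reproduce B exactly, bit for bit, from C(B).
    static byte[] expand(byte[] cb) {
        return cb.clone();  // placeholder
    }

    public static void main(String[] args) {
        byte[] b  = "ABRACADABRA!".getBytes();
        byte[] cb = compress(b);

        // The lossless requirement: expand(compress(B)) equals B exactly.
        System.out.println("lossless? " + Arrays.equals(expand(cb), b));

        // Compression ratio = bits in C(B) / bits in B (smaller is better).
        System.out.println("compression ratio: " + (double) cb.length / b.length);
    }
}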
In the larger context, data compression has been with us since antiquity — it's always better to try to express what you're doing concisely. Mathematical notation is an example of that, or number systems, or even natural languages. Communications technology has evolved in ways that very much have to do with data compression, from braille to Morse code to the telephone system; those all depend on achieving good data compression. And in modern life, everybody's trying to get pictures or movies onto their devices, and that's definitely enabled by good data compression. So, something to think about is what role data compression will play in the future — but it's really hard to see it going away. Number one, it's effective. Number two, it's not that difficult to implement, and it's in software, so in many situations you don't have to buy a bigger disk to have a good compression algorithm.

Here's a simple example that sprang up just in recent years. We've talked in other applications about the fact that a genome is a string over a four-letter alphabet, and the string might be huge — billions of letters — and there might be lots of them, for different individuals and different species. So there are huge genomic databases out there holding very long strings over the alphabet A, C, T, and G. Now, the standard way to store them in standard programming languages on standard computers is to use ASCII, at eight bits per character, so an N-character genomic string takes 8N bits. If it's a billion characters, that's eight billion bits. But just look at the problem: there are really only four characters, so you can get by with just two bits per character. That's one quarter the storage — if you spent X dollars on the space to store your data, you could spend X/4, and that might be a significant number. In general, if you know you have a small alphabet, you can get savings this way. This seems very obvious and straightforward, but it's amazingly true that in the 1990s, when genomic databases started to emerge, they were four times too big: they used ASCII and didn't think to use this trivial encoding that could have cut costs by a factor of four. So again, if it's so easy to do, why not do it, if you can cut your cost by a factor of four? Where else would you so easily give up reducing cost by a factor of four? That's why we're interested in data compression.
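Here's a rough sketch of that two-bits-per-character idea — a simplified stand-in, not the booksite's genome code. It packs four bases into each byte and, to stay short, assumes the string length is a multiple of four.

public class GenomePacking {

    // Map each base to a 2-bit code: A=00, C=01, G=10, T=11.
    static int code(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            default:  return 3;   // 'T'
        }
    }

    // Pack a genome string at 2 bits per base, 4 bases per byte.
    // Simplifying assumption: genome.length() is a multiple of 4.
    static byte[] pack(String genome) {
        byte[] packed = new byte[genome.length() / 4];
        for (int i = 0; i < genome.length(); i++) {
            int whichByte = i / 4;
            int shift = 6 - 2 * (i % 4);                     // fill each byte from high bits to low
            packed[whichByte] |= code(genome.charAt(i)) << shift;
        }
        return packed;
    }

    public static void main(String[] args) {
        String genome = "ACGTACGTTTAG";                      // 12 bases = 96 bits in ASCII
        byte[] packed = pack(genome);
        System.out.println(genome.length() + " bytes in ASCII -> "
                + packed.length + " bytes packed");          // 3 bytes = 24 bits, one quarter
    }
}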
Now, we're going to need some tools that are slightly different from the ones we've been using in order to write effective data compression algorithms. We've got extensions of our standard input and standard output libraries that work with bits; these are the methods we'll use to read bits from standard input and write bits to standard output. These bitstreams are different from standard in and standard out — they're binary standard in and binary standard out. With binary standard in we can read a boolean, which is just one bit; or, to save some processing, we can read eight bits of data and put them in a char; or, in general, r bits, for r less than eight, into a char value. We can also put bits into int, long, and boolean values with similar methods, and we don't have to read all the bits if we've got a scheme where we only want to use a certain number of them. The idea is that we can read a variable number of bits and get them ready for processing easily, by putting them into one of Java's primitive types. Conversely, we have binary standard out, where we can write a single bit, or eight bits, or any number of bits from a char value or an int value, and so on. So this lets us take in bitstreams and write out bitstreams, and it takes care of all the conversion between Java primitive types and bitstreams. It's not too difficult to use, and it's very intuitive — you'll see when we come to the code.

Just to give an example of how having this can save space, here's a simple example of representing a date. If you represent the date 12/31/1999 as an ASCII character string in Java, then when you write out the string you've got ten characters, and each character is eight bits, so that's a total of 80 bits. The standard ASCII encoding is to look up the ASCII code of the '1', which is 00110001, then the '2', and so forth — the slash, the '3', the '1', and so on. Eight bits for each character, put together, is 80 bits; if it were Unicode, there would be even more. That's one way you might consider representing a date with bits. Now, if we instead represent it as three int values, that's not such an effective way to do it either: the month is an int, the day is an int, and the year is an int, and each of those is 32 bits, for a total of 96 bits — actually even worse. But the month and all of these fields fall within small ranges, which the packed encoding sketched below takes advantage of.
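Here's a minimal sketch of that packed encoding; the next paragraph walks through the same idea. It assumes the BinaryStdOut class from the booksite (edu.princeton.cs.algs4), and a real program would also need the companion reading code.

import edu.princeton.cs.algs4.BinaryStdOut;

public class PackedDate {
    public static void main(String[] args) {
        int month = 12, day = 31, year = 1999;

        // Write only as many bits as each field really needs.
        BinaryStdOut.write(month, 4);   //  4 bits: values 0..15 cover months 1..12
        BinaryStdOut.write(day,   5);   //  5 bits: values 0..31 cover days 1..31
        BinaryStdOut.write(year, 12);   // 12 bits: values 0..4095 cover the year

        // 4 + 5 + 12 = 21 bits; close() pads to a byte boundary, so 24 bits go out.
        BinaryStdOut.close();
    }
}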
So, with binary standard out, we can just write out the rightmost four bits of the month int, because that gives us a number between 0 and 15, and months are between 1 and 12. Then five bits for the day — that's between 0 and 31, which covers any possible day — and then twelve bits for the year. If we do that, we have a total of 21 bits. There's a little bit extra at the end, because our bitstreams really are implemented, reasonably, in most situations as byte streams, so they're going to get carved into chunks of size eight by the hardware or the drivers. We always round up to the nearest multiple of eight to get proper alignment on bytes. But that's still a pretty significant savings, from 80 bits down to 24, just by having the ability to write out portions of int values. And if we had a billion dates, that would be a huge cost savings that would translate to dollars as well.

The other thing we do, to give some visual impression of the effectiveness of our compression algorithms, particularly in the text, is to look at different ways to examine the contents of a bitstream. On the booksite we have a binary dump program that prints out the bits, in rows of sixteen bits at a time, and then the total number of bits. This file abra.txt is a standard character stream, so everything is encoded with eight bits; there are twelve characters, so it's 96 bits, and you can see what all the bits are. Sometimes it's more convenient to use a hex dump, where we print the hex digits, taking the bits four at a time; those 96 bits are twelve bytes, or 24 hex digits, and we print them out so we can see them. Another thing that's sometimes useful for big files is to look at the bits as a picture: we give the width and height of a pixel window and map the bits to white squares and black squares. This is the same abra.txt — it's just that you can see much more easily that the characters all start with 01. This visual representation sometimes gives an interesting perspective on what a compressed file, or any bunch of bits, looks like.
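Just to show how simple these inspection tools are, here's a rough stand-in for that kind of hex dump — a sketch, not the booksite's HexDump program. It reads bytes from standard input and prints each one as two hex digits, sixteen bytes to a row, then the bit count.

import java.io.IOException;

public class TinyHexDump {
    public static void main(String[] args) throws IOException {
        int b;
        int count = 0;
        // One byte = 8 bits = 2 hex digits.
        while ((b = System.in.read()) != -1) {
            System.out.printf("%02x ", b);
            if (++count % 16 == 0) System.out.println();
        }
        if (count % 16 != 0) System.out.println();
        System.out.println(8 * count + " bits");
    }
}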
This is just the hex-to-ASCII conversion table; if you're not familiar with it, it's a quick way to find the hexadecimal corresponding to an ASCII character. You look up, say, R in the table, and it's in row 5 and column 2, so it's 52 — that's the 52 for the R in ABRACADABRA. Those are the basic tools we're going to use when we talk about compression algorithms.

Now, there's one thing that's really important to think about, and that's the idea that you cannot have a compression algorithm that can compress every file — despite the fact that somebody has patented one, with a claim that the method is capable of compressing all files. And there's another item that came out of Slashdot a few years ago, where a company announced a breakthrough in data compression that does 100-to-1 lossless compression of random data. As the posters dryly noted, if that were true, our problems would have just gotten a lot smaller. There's no way it's true, so let's look at why.

The proposition is that no algorithm can compress every bit string. Here's one easy way to prove it, by contradiction. Suppose we have an algorithm that can compress every possible bit string. We take a bit string and use the algorithm to compress it, getting a smaller bit string. But since the algorithm can compress every bit string, why not use it on that one? Use it on that one, and you get a smaller one still, and we keep going. Since it can compress every bit string, eventually we get down to a bit string of size one, and since it can compress that one too, we get down to zero bits. So an algorithm that compresses every bit string would mean every bit string can be compressed to zero bits, and that's absurd. That's the proof by contradiction.

Another way to see the same thing is just by counting. Say you've got an algorithm that can compress all 1000-bit strings, just to pick a number. There are a lot of 1000-bit strings — there are 2^1000 of them. How many of those can be encoded with 999 bits or fewer? At most 1 + 2 + 4 + ... + 2^999, which adds up to 2^1000 − 1, so at least one 1000-bit string cannot be compressed at all. And if you want to use many fewer bits, you have a much worse problem: only about one out of every 2^499 of them can be encoded with 500 bits or fewer. A smaller number of bits simply gives you far fewer possible bit strings, and you have to be able to get your original bit string back, so a smaller number of bits puts a hard limit on the number of bit strings you can possibly compress.
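As a quick worked calculation, this just restates the counting above in symbols:

\[
1 + 2 + 4 + \cdots + 2^{999} \;=\; 2^{1000} - 1 \;<\; 2^{1000},
\qquad
\frac{2^{501} - 1}{2^{1000}} \;\approx\; 2^{-499}.
\]

The first sum counts every bit string of at most 999 bits, so at least one 1000-bit string has no shorter encoding; the second ratio says only about 1 in 2^499 of the 1000-bit strings can be encoded with 500 bits or fewer.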
So we can't have universal data compression. If you want to convince yourself of that, just try running your favorite compression algorithm on its own output, say for a photo. Most good compression algorithms will figure out that you're doing that and just give you the same file back, so at least it won't get larger — but there are plenty of methods out there that will make your file larger.

There's another basic idea, and that is that there's no way to find the best way to compress a file. If you think about it, optimal compression can be extremely complicated. This is just an example that illustrates the point, but it's also a relevant example, because it applies to so much of the data we deal with. At the top is a picture dump of a million-bit file. Those are a million bits, and they're random — well, not random, pseudorandom. What do I mean by pseudorandom? We've talked about that: they're generated by a program. This is a Java program that uses a pseudorandom number generator to generate bits, and that's where those bits came from. But if you think about it, this program is a representation of those million bits, and it's only a couple of hundred characters. So if you want to find the optimal way to compress those bits, one of the things you'd have to consider is this program — that's a pretty compact way to represent a million bits. And so much of the data that's out there actually was created by a program, so optimal compression amounts to finding the program that created the data. That's obviously a pretty difficult problem; in fact, it's known to be undecidable. So let's not think that we can find the best possible compression algorithm.
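A program along these lines — just a sketch with a made-up seed, not the actual program from the slide — is itself a complete representation of a million pseudorandom-looking bits in a few hundred characters:

import java.util.Random;
import edu.princeton.cs.algs4.BinaryStdOut;

public class MillionBits {
    public static void main(String[] args) {
        // A fixed seed makes the output reproducible: the program fully determines the bits.
        Random random = new Random(12345);
        for (int i = 0; i < 1000000; i++)
            BinaryStdOut.write(random.nextBoolean());   // emit one pseudorandom bit
        BinaryStdOut.close();
    }
}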
The other point I want to finish the introduction with is to realize that when we're talking about compressing, say, English-language text, it's amazing how much redundancy there actually is in the English language. This is a perturbed version of an English-language paragraph showing that you can change letters, and even delete letters, and still figure out what it says. In fact, toward the end, in a lot of situations you really only need the first and last couple of letters of each word for the text to remain readable. So there's a lot of freedom, because there's so much redundancy — and since there's so much redundancy, we can actually do really well compressing English-language text, for example. So that's an introduction to data compression, and we'll take a look at the algorithms next.