The introduction to Data Compression
by
Arturo San Emeterio Campos
Disclaimer
Copyright (c) Arturo San Emeterio Campos 2000. All rights reserved.
Permission is granted to make verbatim copies of this
document for private use only.
Note
"Compression programming" is on a early stage. This text is not finished.
Table of contents
How does compression work?
Text and
signals: lossless and lossy compression
Kind of models
Computing
probabilities with a finite context model
Why do we use a finite
context model?
Entropy and
the properties our codes must have
Some coders
Parts of a compressor
A compressor,
or how do we put it all together
Text compression in the
real world
Measuring the compression
rate
Measuring another
aspects of a compressor
What to do now?
References
Compression relies on the fact that the data is redundant, that till some extent it was generated following some rules, and that we can learn those rules, and thus predict accurately the data. A compressor can reduce the size of a file by deciding which data is more frequent and assigning it less bits than to less frequent data. Clearly compression has two parts: one guess which are the most frequent symbols, and other which outputs the "decision" of the first one.
Using the right terminology we would say that the first is called a model and that predicts the probability of the source, it then makes a probability distribution. And that the second one is the coder and it makes and emits codes based on the probabilities assigned by the model. The closer that are the predictions made by the model to the data, the more compression we can achieve, because we'll assign less bits to most of the data.
Immediately the two main questions of data compression arise: which
rules should we follow to assign probabilities to the symbols? and given
a probability distribution, which are the smallest codes that we can assign?
Information theory is a science whose role is to answer these questions.
As we'll see later the answer is a Markov model and the entropy respectively.
Text and signals: lossless and lossy compression
We have seen before that we may want to compress different kinds of data such as text, data bases, binary programs, sound, image and video. In practice we distinguish about text compression and signal compression. We do this separation because data bases, and binary programs have the same characteristic as text. Likewise sound, image and video are signals and thus share properties. In the other hand text and image data have nothing in common, and that's what they don't belong to the same group.
We also do this separation because for these two groups we use different kinds of compression. That comes from the nature of the data. Digital signals are an imperfect representation of an analogic signal, thus when compressing them we can discard some of the information to achieve more compression. This is done with transformation and quantization algorithms.
Let's say we have a byte from an image, its value is 65 and it represent the quantity of red in a given pixel. If when decompressing this byte is 66 we wouldn't notice the difference between red and a very little more of red. However if that was a text file 65 would be 'A' (assuming it's Ascii), and there's a big difference if we decompress 66 which would be a 'B' instead of 'A'. Due to the nature of text we can't afford errors.
So we use lossless compression for text where the original file must
be exact bit per bit to the original one. And lossy compression for signals
where some error is acceptable and in most of the cases is not detected.
However you should note that signals can be lossless compressed, though
then the compression achieved is far worse than with lossy compression.
In most of the cases signal compression goes of discarding as much data
as possible but retaining as much quality as possible.
There are different kinds of models such as finite state models, grammar models and ergodic models [1] but by far the most used ones a variation of Markov models: finite context models. They rely on the following property: in an order-kth Markov source the probability of the current symbol depends only on the kth previous symbols. To know the probability of the current symbol we take in account what happened before, expecting that the same will occur.
For example, in an order-1 finite context model we take in account the symbols that appeared after the current order-1 context. If we had to compress an 'h' preceded by a 't' (order-1 context) we would use the past probabilities but only when 't' preceded. Just look at english text, an 'h' is quite likely after a 't' (there, that, the).
There are four kinds of models depending on how often we update them:
static, semi static, adaptive and the unusual quasi static. The first one
never updates the probabilities for the symbols. A semi static model does
a first pass gathering all the probabilities, and a second one for coding
all the symbols with the probabilities of the first pass. The adaptive
model updates the probabilities every time a byte is coded. The quasi static
updates the probabilities every K bytes, which is tuned to speed up the
process while hurting little the compression achieved.
Computing probabilities with a finite context model
We have said that we use the probabilities of a context, but how is this done? We assume that the probability of a symbol is proportional to the number of times it has appeared. In other words if it has not appeared yet, probably it won't.
Let's say we have an order-0 finite context model for this example: "bacb". Order-0 means taking no context in account. The frequency, the number of times that a symbol has appeared, is: 'a',1 'b',2 'c',1 The total frequency is the sum of the frequencies: 4. Thus the probability for every symbol is: a,1/4 b,2/4 c,1/4. This is a probability distribution.
In the previous example if a 'd' occurred now, we wouldn't be able to
code it, then we wouldn't be able to lossless compress the file. So if
we want to do lossless compression, the model must assign a non zero probability
to all the symbols.
Why do we use a finite context model?
The goal of compression is to predict accurately symbols that appear more often and assign them less bits. Obviously if all the symbols have the same probability, there's no chance for compression! If you use an order-0 model (no context) to know the probabilities for this paragraph of text, you would notice that most of the symbols, in that cases letters and punctuation, have the same probability. Obviously that would achieve little or no compression at all.
Text is context sensitive, its probabilities seem generated with a finite
context model. That's why using a higher order makes the probabilities
more skewed. That is, giving more importance to a few symbols. Recall our
examples of modeling and apply them to a paragraph of text. In the simple
order-0 model, the probabilities seem more or less flat. If we go to order-3
things improve a lot, we notice for example that after 'tex' always occurs
a 't'. That makes the probability of 't' under this context quite high,
thus an small code can be assigned, and high compression can be achieved.
Entropy and the properties our codes must have
Coding is making a set of codewords, from a probability distribution.
We assign a codeword to every symbol.
| Symbol | Probability | Binary codeword |
| a | 1/4 | 10 |
| b | 2/4 | 1 |
| c | 1/4 | 01 |
For data compression the codewords must have some properties. The first is that they must have the prefix property: no codeword can be prefix of another codeword. For example, the previous example doesn't have the prefix property, '1' is prefix of '10', thus if we have 101, we don't know if it means 'bc' or 'ab'. Also they must be instantaneously decodable: we must be able to decode the symbol as soon as we see the codeword (code). See [3] for further explanation.
According to information theory the best code length on average that we can choose for a probability P is the entropy: -log2(P) (minus logarithm in base 2 of P). Clearly if we have a symbol with a probability of 1/2 the code should have a length of 1 bit (-log2(1/2)=1). The probability P is chosen by the model. If we want our codes to be instantaneously decodable and have the prefix property the minimum code length is the entropy.
We can feel tempted to think that we can do better than the entropy, but Shannon's Noiseless source coding theorem proves we can't [1]. Because codes must have characteristics like the prefix property and being instantaneously decodable [3]. Also this can be proved with Kraft inequalities [2].
ASCII is a simple example of the model and coder paradigm. It assigns
a flat probability (the same probability to every symbol) and also achieves
the entropy. For example the byte 'a' (as any other one) has a probability
of 1/256, and thus its probability is -log2(1/256)=8 bits, which is exactly
the code length used by ASCII.
The two main coders are huffman variations and arithmetic coding variations. Huffman reassembles more or less the example shown above. It makes codes with variables lengths. But the length must be an integer. Like in the example, the codes have different lengths, but no code has a fractional length. That posses a problem, what about if the entropy says that the ideal code length is 1.5? That's why huffman introduces redundancy in the codes.
Arithmetic coding is almost optimal compared with the entropy. Instead of making a code for every symbol, it makes a code for all the input. This makes it slow but optimal. It can code a symbol optimally when its ideal code length is like 2.32 bits. By saying optimally, I mean it codes the symbol exactly in 2.32 bits.
At the end all the coders output bits. So we have bitio routines which do this task. But they can also be used directly. For example if we only have to code 16 symbols, we can decide to make a flat probability (all the symbols with the same probability) and output them in 4 bits.
Simple bitio is used when the speed matters. If we want however more
compression we have to switch to huffman, which offers a good tradeoff
compression/speed. For highest compression we use an arithmetic coder,
however it's slower than huffman (it spends most of the time updating the
model). Range coder is a variation of arithmetic coder, which is twice
faster. If you plan to implement any lz variation use simple bitio or huffman.
For bwt use huffman or arithmetic coding. And for ppmc use only arithmetic
coding.
The model and the coder are common to all compressors. But we can use another algorithms in our compressors. They are transformation and quantization algorithms. If the compressor has any lossy step it is the transformation or quantization part.
Transformation algorithms are made to improve the probability estimation. They are quite often in lossy signal compression, in fact they are the backbones of it. FFT, DCT and wavelets, are some transformation algorithms used in signal compression. But text can be transformed too, capital letters can be substituted by a flag, alphabet reordering, etc.
Quantization is reducing the precision of the data to store it in less
bits. For example, in sound we store the preassure of the air. Quantization
would be storing it on 4 bits instead of 8. It tries to reduce the number
of symbols that represent the signal without losing much quality.
A compressor, or how do we put it all together
Now let's see the basic chart flow of a compression program.
'Input' is the data to compress, first we send the context (the past data) to the model, which then makes a probability distribution. The model supplies the code with it, and the coder outputs the codeword of the current symbol to the output. The output would be transmitted or stored, and when needed it will be used by the decoder, which would do the following:
'Input' is the compressed data, which includes the codes and all other relevant information to the process. The decoder converts the codes, based in the probabilities, in symbols which we can output and then use to update our model.
The decompression program is the inverse of the compression program.
If we want that decompression works fine, both compressor and decompressor
must be synchronized. They always must have the same probability distribution,
or otherwise the codes would be misunderstood, and the output symbols would
be wrong.
Text compression in the real world
Unfortunately there hasn't been any discussion about a classification, but we could say that in practice there are three different kind of compressors: complex models, bwt variants and lz variants.
Complex models, highest compression, slow. They use a chart flow similar to the one shown in the previous section. With a complex model which does accurate predictions of the input. Algorithms: ppmc, ppmd, ppmz, ctw.
Bwt variants, less compression than complex models but faster. Bwt is a lossless transformation algorithm for text, whose output contains a lot of runs of the same symbol. This redundancy is then captured with mtf and rle or another model, and is followed by a coder. The difference between most implementations is the sorting routine used in the transform.
Lz variants, less compression than bwt variants but faster. Lz variants code phrases from the input in a single code. Lz is the most simple and fastest compressor that can be found. Algorithms: lz77, lz78, lzp, lzw, lzrw, lzfg. The main difference between them is in how do they choose the phrases.
Once we choose one of these algorithms, we still have to choose the coder. Again we will do this selection based on the compression/speed tradeoff that we prefer. In practice we can combine different models/coders. For example we can use ppmc for the output of lzp, which is the main model, lzp, a secondary model, ppmc, and a coder, arithmetic coding.
Text can be transformed too. Some of the transformations are: Capital
letters substituted by a flag, the overhead of a flag to represent lower
case or upper case is worth the compression gained with more reliable contexts.
The alphabet can be reorder, to improve bwt. The practice of transforming
text is not widely extended, because text compression is not a goal itself
but rather a way to tune the models to make them work with all the kind
of files.
Measuring the compression rate
Bits Per Byte (bpb) is the most used way of measuring the compression achieved by a program. We compute it that way: (compressed_length / original_length) * 8 If we compressed a 400 bytes file down to 100 bytes: (100/400)*8 = 2bpb, which means that we need only 2 bits to represent one byte. This measure is accurate enough: (47/134)*8 = 2.805970149254 bpb, though usually we only get three digits, like 2.806 bpb (no rules concerning rounding). Note that when we know the expected compression of a given algorithm we can know the supposed length of the output: (input_length / 8) * bpb = output length. I recommend to use this kind of measurement.
You can also measure using %, (compressed_length / original_length) * 100, for example if we had a file with a size of 400 bytes and compress it down to 100 bytes, the ratio will be (100/400)*100 = 25% so the output compressed file is only 25% of the original. However there's other method: (1-(compressed_length / original_length)) * 100 In this case the ratio is 75% meaning that we have subtracted 75% of the original file. In both cases the compression is the same, but the ratios are different, thus this measure is not a good because we have to specify which one we used.
In advertisements they often use a measure which doesn't have this problem: 2:1, meaning that the file has been compressed to the half, or 4:1 meaning that we have compressed it to one quarter. However while we can have no doubt understanding this way of measuring, it has low precision (2.231:1 it's not pretty).
Also you can find bpc, Bit Per Char, in some cases is the same as bpb, but when specified, it refers to a given alphabet, which usually only contains characters that appear in text data. But this is not suitable for compression of binary data.
For image files one can use the term of bpp, which means Bits Per Pixel.
It's similar to bpb but instead of 8, we use the number of bits needed
to specify a pixel.
Measuring another aspects of a compressor
There are also some measures for speed. Like kbyps, Kilo BYtes Per Second, you can compute it with: file_length / seconds. Obviously the file length must be in kilobytes (divide its size in bytes by 1024) and seconds should have more precision than a second, using hundredths of seconds is enough. Unfortunately this has a big problem, a program runs faster in a faster computer. Obviously we can't use the same computer for testing all the programs (at least not for technical papers).
The only solution which seems like a good idea is comparing the speed of our compression program with an standard program. In [5] they test the speed of different algorithms against the speed of the utility compress. Unfortunately hardly anyone does that. Thus the best is running different compression programs with same characteristics to yours in the same computer.
To be able to compare the bpb of an algorithm with another, an standard
set of files is used. The Calgary Corpus is still widely used. Though I
would recommend to use also the large Canterbury Corpus. Both are available
at
[5].
Now you'll have to choose what to implement. Look again the section Text compression in the real world. It can be of a great help having a look to The Archive Comparison Test (http://www.geocities.com/SiliconValley/Park/4264/act.html). First decide if you are going to use huffman or arithmetic coding. In case that you choose huffman, read the new article about bitio. Otherwise go directly to the article about arithmetic coding. First implement the coder and be sure it works correctly. Then add the rest.
Due to studies so far I have only rewrited three articles: bitio,
lzp
and huffman. Till I can finish the new versions,
you'll have to stick with the old ones, which
are of a lower quality. If you want me to rewrite any article faster, it
can help to send me an email.
References
[1] T.C. Bell, J.G. Cleary and I.H. Witten, Text compression
(1990)
[2] Charles Bloom "Compression : News Postings : Kraft Inequality"
(http://www.cbloom.com)
[3] Debra A. Lelewer and Daniel S. Hirschberg, Data Compression
(http://www.ics.uci.edu/~dan/pubs/DataCompression.html)
[4] Charles Bloom Solving the problems of context modeling (http://www.cbloom.com)
[5] Compression corpus (http://corpus.canterbury.ac.nz)
[6] Comp.compression FAQ (ftp://rtfm.mit.edu/pub/usenet-by-hierarchy/comp/compression/)