Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved.
Permission is granted to make verbatim copies of this document for private
use only.
Download
Download the whole article zipped.
Table of contents
Introduction
This is an introduction to finite context modeling, which discuss most
of the topics needed to understand ppm variants. However this article talks
about some topics which are used in other modeling techniques, and which
may be useful for the programmer interested in compression. I personally
recommend to fully read this article as it includes some basic knowledge,
like measures of compression, a brief introduction to the entropy, why
we can't have magic compressors, and some definitions that are widely used.
The need for
compression
Nowadays there are bigger hard disks than before. However I doubt that
someone would reject to have twice the space on it. And that's where compression
does his job. Having more disk space is not the only usage of compression,
though. Communications via Internet are slow, and therefore if we can reduce
the amount of data to send, we are going to need less time to send it.
Therefore we are going to reduce the cost of transmission. So in fact data
compression can help us to get much more space from our hard disk or greatly
reduce our phone bills, but compression is not magic. It's a science that
has its roots in a branch of mathematics, information theory, which discusses
data storage and communication. Information Theory was founded around 1940
by Claude Shannon, and has theoretical limits which can't be surpassed
and know theories which can't be ignored. So don't expect to find the magic
compression algorithm which codes all the files to a given length. Also
don't believe anyone who tells you that he is capable of doing so.
Among all the compression algorithms, finite context modeling is widely
used, and both the state-of-arts and fastest compressors rely on it. Ppm
variants are the most know applications of finite context modeling. Lzp
is based in context modeling for selecting one offset. Bwt is also based
on it. Not to mention simple models for widely used coders, like huffman
or arithmetic coding. Such an important kind of models should be learned
from the roots. This should let you understand current algorithms or papers
about them, and in case you want to, you may be able to do improvements
on it by researching.
Measuring compression
There are different ways of knowing the compression achieved by a given
algorithm between a file and a compressed version of it. You can do it
with %, (compressed_length / original_length) * 100, for example if we
had a file with a size of 400 bytes and compress it down to 100 bytes,
the ratio will be (100/400)*100 = 25% so the output compressed file is
only 25% of the original. However there's other method: (1-(compressed_length
/ original_length)) * 100 In this case the ratio is 75% meaning that we
have subtracted 75% of the original file. In both cases the compression
is the same, but the ratios different, thus this is not a satisfactory
way of measuring compression, as we'll need to say the formula used.
In advertisements usually a measure is used which doesn't have this problem, it's the following one: 2:1 meaning that the file has been compressed to the half, or 4:1 meaning that we have compressed it to one quarter. However while we can have no doubt understanding this way of measuring, it still has a problem, it has low precision. How do we write that a file has been compressed to 40%? Thus this is neither satisfactory.
Fortunately there's one which does not present any of those problems. Bits Per Byte (bpb) the formula is quite easy: (compressed_length / original_length) * 8 In the previous example (100/400)*8 = 2bpb, meaning that we need only 2 bits to represent one byte. It's also accurate enough: (47/134)*8 = 2.805970149254 bpb, usually we only get three digits, like 2.806 bpb. Note that when we know the expected compression of a given algorithm we can know the supposed length of the output: (input_length / 8) * bpb = output length. Also this way of measuring has another good point, it's widely used, you'll find in most of the papers.
Also you can find bpc, Bit Per Char, in some cases is the same as bpb, and when specified, it refers to a given alphabet, which usually only contains characters that appear in text data. (but then this is not suitable for compressing binary data) In compression we code a file, and if the coded file is smaller than the original file we have achieved compression, otherwise we have expanded it. For image files one can use the term of bpp, which means Bits Per Pixel.
There are also some measures for speed. Like kbyps, Kilo BYtes Per Second, you can compute it with: file_length / seconds. Obviously the file length must be in kilobytes (divide its size in bytes by 1024) and seconds should have more precision than a second, using hundredths of seconds is enough. Unfortunately this has a big problem, a program runs faster in a faster computer. Obviously we can't use the same computer for testing all the programs (at least not for technical papers). The only solution which seems like a good idea is comparing the speed of our compression program with an standard program. In [5] they test the speed of different algorithms against the speed of the utility compress.
To be able to compare the bpb of an algorithm with another, an standard
set of files is used. The Calgary Corpus is still widely used. Though I
would recommend to use also the large Canterbury Corpus. Both are available
at [5].
Messages and symbols
A message is a sequence of symbols. A symbol can represent any kind
of data, it may be a byte, a character, or a pixel, it depends on your
application. An alphabet is the set of symbols which may appear in our
message. The alphabetsize is the number of symbols in the alphabet, usually
is represented with the symbol "Z". The message has a finite length, it's
usually represented with "N". The frequency of a symbol it's the number
of times it has appeared in the message.
For example, let's say we have a message like "987654321012" the alphabet contains all the numbers in decimal ("0","1","2","3","4","5","6","7","8","9") so the alphabetsize is 10. In this case, the length (N) of the message is 12. In our example "9" is the first symbol in the message, "8" the second and so on, till "2" which is the last. In our example the frequency of the symbol "1" is two and the frequency "5" is one. Optionally our alphabet could have another symbol, let's say '$', which would mark the end of the message, however we could avoid this symbol by transmitting the length of the message.
We have to compress a given file (the message) with a given length (know
beforehand), whose symbols are bytes. Some data like sound or image can
be lossy compressed, that is we can choose to only code the information
which is more relevant and throw away the rest of it, thus introducing
error on it. In the other hand, the loss of data is something which we
can't afford with text or binary data. Just think about the mess of a text
with almost random characters, or corrupted binary data which our programs
may need to run. Due to the need of a decompressed file identical to the
original one, we have to use lossless compression for this kind of data.
In most of the cases we name this kind of compression (the compression
of files without loss) text compression, commonly english text compression,
as English has an structure which has been taken as representative for
any language. Also this structure is not very different of that of binary
files, and thus if a given compressor performs well with text, it's going
to have a good performance with binary data.
Information
content and entropy
Let's have a look at the following message "theretherethere" we can
easily see that we could easily substitute "there" with only a reference
to the previous one. This is usually what Lz variants do. (lz77, lz78,
lzw, lzp...) And let's see another one "0123456789", here we could rely
on the fact that there's a little difference between the current symbol
and the previous, to try to recode it as the difference "0111111111". This
is called differential coding (coded_symbol=current_symbol-last_symbol).
Both of them use a model. A model is a set of rules and parameters which
based on the statistics of the message tries to predict its structure and
therefore tries to predict the exact probability of the next symbol.
As we see we are using different models which have differently results, but none of them seems optimal for all the messages (in fact there's no model optimal for all the messages). Both of them exploit the redundancy (the similarities, correlations) of the symbols in a message. If we take a look at the compression of a given message, we'll see two stages, the first gets the probabilities of the symbols in the message, this is the model, and the second, based on the probabilities of the model, generates codes, and this is the coder. (or encoder, both names mean the same) Together they make a compressed file. Needless to say we also have a decompressor, with a decoder and the same model, so it can reconstruct the original file based on the compressed file.
Let's see two different messages: "abcabcabc" "odwtmcbey" We can notice that both of them have the same length (and thus they take the same space). But we can clearly see that they have different structures, the first repeating a pattern ("abc") and the second being rather random. For analysing it we have the most powerful model, our brain (unfortunately we have not developed models better than our brain for text compression yet) so we can see that the length of a given message isn't a very reliable measure of how much can we compress it. A message has an internal structure with a probability for every symbol, however both are unknown to us, and the only information that a model has about the message is the message itself. Our model then assigns a probability to every symbol. Then the job of the coder is, assign a code to every symbol based on a probability distribution. (in theoretical words we would say that it maps the source message in codewords) But what's the optimal code length for a given probability distribution?
Between 1940 and 1950 Claude Shannon created the information theory,
a branch of mathematics, among his job we can find the answer to the previous
question. The entropy, which measures the content of information in a given
message, also gives the optimal code length in average for a probability
distribution. The entropy of a given symbol is defined as -Log2(P)
where P is the probability of a given symbol (the one whose code we want
to generate). Note that we do Log2 if we
want the result to be expressed in bits, otherwise we only have to change
the base of the logarithm, if we want it to be expressed in bytes we would
use: -Log256(P). Just as a curiosity, the
term entropy is also used in thermodynamics, a branch of physics, they
are called the same because both are analogous.
Finite context
modeling
Finite context models assign a probability to a symbol based on the
frequencies of the symbols that happened before the current symbol to code.
Before because when we code a symbol based on the previous symbols, we
can be sure that the decoder will have this information too. (as the previous
symbols have been already transmitted) Finite context modeling is based
in the fact that a symbol that has appeared recently will appear in a near
future, so the more times that it appears, the higher probability that
it appears again, so every time it's seen we increment its probability.
In this article I'll not deal with coders, but you can choose among static huffman, arithmetic coding, a variant of arithmetic coding, called range coder or many others. Arithmetic coding approaches the entropy, so from this point at on, we'll only use the entropy as reference. (assuming that we are using arithmetic coding)
Let's say we have a message like "aaabbc" and a finite context model
which assigns a probability of symbol_frequency/total_frequency to every
symbol, then we have:
| Symbol | Frequency | Probability | Optimal code length |
| a | 3 | 3/6 | -Log2(3/6) = 1 bit |
| b | 2 | 1/6 | -Log2(2/6) = 1.59 bits |
| c | 1 | 2/6 | -Log2(1/6) = 2.59 bits |
In this example the original length of the symbol "c" was 8 bits (we assume it was a byte) however our model assigned a code length of 2.59 bits, from this we can see that the redundancy was of 8-2.59 = 5.41 bits, we can generalize the following: for any model the redundancy of a given symbol is original_code_length-code_length. Note that the probability 1/6 for 'c', is the expected probably of the next symbol being a 'c'.
Just notice from the previous example that every time a symbol appears we increment its probability, and 6 is the total number of symbols seen, (total frequency) from there we can get the probability of a given symbol. When we implement this simple finite context model we have a table with Z entries (alphabetsize, in case of bytes: 256) which is initialised to 0, and then when a symbol appears we increment its value, something like: ++table[symbol].
Now I'm sure that you are wondering if we can do better than the entropy,
the answer is no. The compressed length of the message is (1*3)+(1.59*2)+(2.59*1)=
8.77 bits. Let's try to give a higher probability to the most probable
symbol:
| Symbol | Probability | Fake probability | Entropy |
| a | 3/6 = 6/12 | 9/12 | -Log2(9/12) = 0.42 bits |
| b | 2/6 = 4/12 | 2/12 | -Log2(2/12) = 2.59 bits |
| c | 1/6 = 2/12 | 1/12 | -Log2(1/12) = 3.59 bits |
So the final length is: (0.42*3)+(2.59*2)+(3.59*1)= 10.03 bits. We have coded one symbol (the most probable one) with fewer bits, but the resulting output file is bigger, so on average we can't do better than the entropy. This can be mathematically proved with the Kraft Inequalities, for further reading see [1]. Of course we can do better models, which give higher probabilities to the symbols being coded, and thus we achieve higher compression.
Finite context models are also called finite Markov models (Markov (1856-1922)
was a Russian mathematician), because they assume the input file to be
a Markovian sequence: every symbol depends only in the finite preceding
sequence of symbols. Note that also exist models for a message of an infinite
length, but in most of the cases this is not practical, see [4].
Also the term Markov model is used to refer to finite state
models, however this falls out of the scope of this article.
Context and order-n
As we have seen we can't do better than the entropy, so now the only
problem is to make a finite context model which gives higher probability
to the most probable symbols, the best solution is using the information
of the contexts. (which has proven to be very high)
The context of a given symbol is the finite length sequence of symbols before or after it. In finite context modeling we use the symbols before it. Because we can be sure that the decompressor's model will have this information. That ensures that it can have the same probabilities as the compressor's model and there would be no problem decoding the symbols. In the other hand if the compressor and decompressor have different probability distributions, the same code may mean different things, and thus we would get corrupted data. Due to this fact, a rule which you must follow is that both compressor and decompressor must have the same model always. (if you want to do lossless compression)
We use context because the probability of a letter appearing in a text heavily depends on the symbols that have appeared before, wouldn't it be more probable to see a "h" after "whic" rather than a "q" or any other letter?
In the previous example model, the probability of a symbol was taken from the number of time that it appeared in the file, when using the information of context we also have to take care of what is the context of the current symbol, and thus we take only the probability of the symbols which have appeared under this context. Both text and binary data are context sensible, that is, under a given context the same symbols tend to appear, so we can get a more reliable probability. (Binary data tends to be context sensible, however it shows a more random structure than text)
I'll only prove this for text data: "the text" (the context is in red colour and the symbols used for making the probabilities are in green) We are currently at the end of the message, we take care of the last symbol, in this case a "t", the probability distribution under the context "t" is 1/2 for "e" and 1/2 for "h". (we can also reserve probability space for meaning that the symbol is another one, that would be an escape code) If we don't take context information in account the probability of "h" would be 1/8 and thus it would be coded to 3 bits, instead of 1 bit using the last symbol as context. Implementing this goes something like the previous example, but instead of one table we would use Z tables, and depending on the context we would choose one.
As you can see, using the information from the context we have symbols with higher probabilities than if we don't take care of the context. This is called a skewed distribution, in the other hand we have a flat distribution, which means that all the symbols have a probability of 1/Z where Z is the alphabetsize, or any other probability distribution which assigns the same probability to every symbol. Just as an example a text file which is not compressed has a flat distribution, with every symbol having a probability of 1/256, thus every symbol is coded with 8 bits. The goal of a model is get a skewed probability distribution, which hopefully gives high probability to the symbols which actually occur, and thus we can code them with fewer bits.
When we use context in a finite context model, we don't say that we have a model using one byte as a context, we say that we have an order-1 model. When we use no context at all (one table for all the symbol) we are using order-0. Of course if we use 3 symbols it's order-3, if we use 5 it's order-5 and so on. There's also an special order, order-(-1) which is a table with a flat distribution for all the symbols, used for ppm.
You have to note that using high orders gives high probabilities only
if the data reassembles a Markov model, if it's context sensible, like
text is. (under a given context the same symbols appear again)
Blending
Once we have different orders the question of how to use them arises.
One of the solutions is blending, which is combining the probability of
all the orders in a single probability for every symbol. This is done by
adding together the probability of the same symbol under different contexts.
However as we have seen probabilities of higher orders tend to be more
reliable and thus achieve higher compression, the way we reflect this in
blending is by weighting higher orders, that is multiplying them to give
them more importance when they are added together. Every context has a
different weight which we can choose beforehand, or change while compressing.
Now let's say we have a message like "baabcab" and an order-2-1-0 model,
using the different weights values: 8,2,1 for order-2, order-1 and order-0
respectively. In this case once all the data has been processed the probability
distribution will look like the following:
|
Probability distribution for next symbol:
lengths showed above. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
However using blending is slow, though it gives good compression, there's
another method, which is faster, this is the use of escape codes, that's
the method used by ppm. In [2] you can find more about
blending, they also recommend making the weights sum up to 1. (to not surpass
the limit of a value that the arithmetic coder being used has)
Zero
frequency problem and escape codes
The zero information problem is when we have no information about the
input, we haven't seen any symbols yet, and the problem is what probability
distribution to choose. In this case the rational solution seems a flat
probability, however with a flat probability we achieve no compression.
The zero frequency problem specially arises when using high orders, look at this text: "we have a model using one byte" Let's say we are under a order-4 model, and we want to know the probability of the next symbol (the context is "byte"), as we have never seen before 'byte' the probability of the next symbol appearing is 0, then in a context where no symbol has appeared the most optimal solution seems to be a flat distribution, 1/Z (1/256) which gives the same probability to every new symbol. But should we assign an equal probability to all the symbols when more than one symbol has appeared? Look at the following example: If you have the following text "aa" and you have seen the symbol 'a' once and you start with a flat distribution, the probability of the next 'a' is 2/257 even if you see it again it will be 3/258 but you'll expect the probability to be much more higher than 3/258, may be something like 2/3 (because its frequency is 2 and nothing else has appeared), but you still have to leave some probability for the case of a new (unseen in this context) symbol appears. This is called an escape code, which means that the symbol which we are currently coding is not in the current context, and thus we have to try in another context. We start in a table (order-0) with all the symbols (Z) and the escape code, all the symbols have a probability of 0 and the escape code has a probability of 1. Let's say we have to code a given symbol now, as it doesn't appear in the order-0 table (when only the escape codes has a probability) you have to fallback to order-(-1) where all the symbols have a flat probability and you can safely code the symbol. An escape code is used to make the decoder know that it has to use the next table, we always go from high orders to lower order, till we hit order-(-1) where we can code the symbol in case it hasn't appeared in higher orders.
For example, the case of "aa" will be the following: the first 'a' is not in the order-0 table, so we emit an escape code (which is in the order-0 table) and then code the 'a' symbol with the order-(-1) table, with a probability of 1/Z, once the symbol is coded we have to update the order-0 table to reflect the "a" that has appeared once, however we don't update the order-(-1) table, as its only purpose is to code symbols which have never appeared, and symbols which have already appeared will be found in higher orders (probably resulting in more compression). Then the next 'a' is found in the order-0 table (which we have updated and now has two symbols on it, "a" and the escape code) and now it has a probability of 1/2, and thus we can code it to only one bit. In that case note that the real probability of the second "a" was 1 however our model assigned it a probability of 1/2, and then the coder coded the "surprise".
As you see this scheme of making the escape code is a little bit inefficient,
as it always has the same value, no matter how many symbols we have seen
till the date. Fortunately there are some other ones. The first which I've
described is method A (and thus a ppm implementation which uses it, would
be called ppma) Method B is based on "Don't believe it till you seen twice"
by subtracting one to the count of every symbol, but this is kind of old
stuff, so let's go to see the next one. In method C, the probability of
the escape code is equal to the number of different symbols seen. (also
remember to add the escape code probability to the total of probabilities)
Using escape codes is an alternative of blending, let's see the previous
example ("baabcab") but using escape codes with method C instead of blending:
(E is the escape code, whose probability is equal to all the different
symbols in the current context, obviously is not used in order-(-1) as
it already has all the symbols)
|
Probability distribution for next symbol:
|
Higher orders produce better results, however it has been proven that
orders above 5 are not worth the processing time and space requirements
in basic ppm implementations like ppma or ppmc, however ppmd, ppm* or ppmz
make use of higher orders to achieve more compression. Ppm (Prediction
by Partial Matching) is a finite context adaptive model which uses arithmetic
coding as a coder, and tries to accurately predict the next symbol, using
the current context in a context tree (or any other data structure like
a suffix tree or hash tables) for predicting the current symbol, using
the highest order possible and going to a lower order (using escape codes)
till the symbol is found. (actually this is not true for all the versions,
ppm* chooses the lowest deterministic context, and ppmz chooses the highest
deterministic context, or any context via LOE) Ppmc achieves around 2.4
bpb in the Calgary Corpus Set, and ppmz (by Charles Bloom) is the state
of art with 1.9 bpb. Ppmz is the most promising model for lossless compression,
in the case of infinite context modeling (which is not used) the state
of art is Ctw.
Unfortunately both of them (especially ctw) are very slow.
Models,
probability updating
The difference between finite context models are not only the order
or escape code method, but also how often the probabilities are updated.
In this topic we have three different kinds of models:
References
[1] Charles Bloom Kraft Inequality, http://www.cbloom.com
(Compression : News Postings : Kraft Inequality)
[2] Bell, T.C., Cleary, J.G. and Witten, I.H. (1990)
Text
compression, Prentice Hall, Englewood Cliffs, NJ.
[3] Charles Bloom Solving the Problems of Context
Modeling, http://www.cbloom.com
(ppmz.ps)
[4] The Context-Tree Weighting Project Url:
http://ei1.ei.ele.tue.nl/~frans/ResearchCTW.html
[5] http://corpus.canterbury.ac.nz
Closing words
I finish this article hoping that this article gives you all the information
you need to fully understand how ppm works. I've done my best with this
article so it's as accurate and complete as possible, so I'd be very grateful
hearing your suggestions for further improvements of it, or just comments
about it, just write me an email.
I want to thank Michael Schindler,
Sampo
Syreeni and Stefaan
for their help proofreading this article and answering my questions. I
plan to do another article talking about the entropy formula only. The
article of ppmc is already available. If I had
more knowledge about information theory (I'm only 17, still in the institute)
I'd write about it.
Contacting the
author
You can reach me via email at: arturo@arturocampos.com
See you in the next article!
This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.