Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved.
Permission is granted to make verbatim copies of this document for private
use only.
Download
Download the whole article zipped.
Table of contents
Introduction
Arithmetic coding, is entropy coder widely used, the only problem is
it's speed, but compression tends to be better than Huffman
can achieve. This presents a basic arithmetic coding implementation, if
you have never implemented an arithmetic coder, this is the article which
suits your needs, otherwise look for better implementations.
Arithmetic coding
The idea behind arithmetic coding is to have a probability line, 01,
and assign to every symbol a range in this line based on its probability,
the higher the probability, the higher range which assigns to it. Once
we have defined the ranges and the probability line, start to encode symbols,
every symbol defines where the output floating point number lands. Let's
say we have:
Symbol  Probability  Range 
a  2  [0.0 , 0.5) 
b  1  [0.5 , 0.75) 
c  1  [0.7.5 , 1.0) 
Note that the "[" means that the number is also included, so all the numbers from 0 to 5 belong to "a" but 5. And then we start to code the symbols and compute our output number. The algorithm to compute the output number is:




Implementation
As you can see from the example is a must that the whole floating point
number is passed to the decompressor, no rounding can be performed, but
with today's fpu the higher precision which
it ca offer is 80 bits, so we can't work with the whole number. So instead
we'll need to redefine our range, instead of 01 it will be 0000h to FFFFh,
which in fact is the same. And we'll also reduce the probabilities so we
don't need the whole part, only 16 bits. Don't you believe that it's the
same? let's have a look at some numbers:
0.000  0.250  0.500  0,750  1.000 
0000h  4000h  8000h  C000h  FFFFh 
If we take a number and divide it by the maximum (FFFFh) will clearly see it:
Underflow
Underflow occurs when both high and low get close to a number but theirs
msb don't match: High = 0,300001 Low = 0,29997 if we ever have such
numbers, and the continue getting closer and closer we'll not be able to
output the msb, and then in a few itinerations our 16 bits will not be
enough, what we have to do in this situation is to shift the second digit
(in our implementation the second bit) and when finally both msb are equal
also output the digits discarded.
Gathering
the probabilities
In this example we'll use a simple statical order0 model. You have
an array initialized to 0, and you count there the occurrences of every
byte. And then once we have them we have to adjust them, so they don't
make us need more than 16 bits in the calculations, if we want to accomplish
that, the total of our probabilities should be below 16,384. (2^14) To
scale them we divide all the probabilities by a factor till all of them
fit in 8 bits, however there's an easier (and faster) way of doing so,
you get the maximum probability, divide it by 256, this is the factor that
you'll use to scale the probabilities. Also when dividing, if the result
ever is 0 or below put it to one, so the symbol is has a range. The next
scaling deals with the maximum of 2^14, add to a value (initialized to
0) all the probabilities, and then check if it's above 2^14, if it is then
divide them by a factor, (2 or 4) and the following assumptions will be
true:
Saving the
probabilities
Our probabilities are one byte long, so we can save the whole array,
as a maximum it can be 256 bytes, and it's only written once, so it will
not hurt compression a lot. If you expect some symbols to not appear you
could rle code it. If you expect some probabilities
to have lower values than others, you can use a flag to say how many bits
the next probability uses, and then code one with 4 or 8 bits, anyway you
should tune the parameters.
Assign ranges
For every symbol we have to define its high value and low value, they
define the range, doing this is rather simple, we use its probability:
Symbol  Probability  Low  High  01 (x/4) 
a  2  0  2  [ 0.0 , 0.5 ) 
b  1  2  3  [ 0.5 , 0.75 ) 
c  1  3  4  [ 0.75 , 1 ) 
What we'll use is high and low, and when computing the number we'll
perform the division, to make it fit between 0 and 1. Anyway if you have
a look at high and low you'll notice that the low value of the current
symbol is equal to the high value of the last symbol, we can use it to
use half the memory, we only have to care about setting the 1 symbol with
a high value of 0:
Symbol  Probability  High 
1  0  0 
a  2  2 
b  1  3 
c  1  4 
And thus when reading the high value of a symbol we read it in its position,
and for the low value we read the entry "position1", I think you don't
need pseudo code for doing such routine, you just have to assign to the
high value the current probability of the symbol + the last high value,
and set it up with the symbol "1" with a high probability of 0. I.e.:
When reading the range of the symbol "b" we read its high value at the
current position (of the symbol in the table) "3" and for the low value,
the previous: "2". And because our probabilities take one byte, the whole
table will only take 256 bytes.
Pseudo code
And this is the pseudo code for the initialization:
Decoding
The first thing to do when decoding is read the probabilities, because
the encode did the scaling you just have to read them and to do the ranges.
The process will be the following: see in what symbol our number falls,
extract the code of this symbol from the code. Before starting we have
to init "code" this value will hold the bits from the input, init it to
the first 16 bits in the input. And this is how it's done:
When searching for the current number (temp) in the table we use a for loop, which based in the fact that the probabilities are sorted from low to high, have to do one comparison in the current symbol, until it's in the range of the number.Shift low to the left one time. Now we have to put in low, high and code new bits Shift high to the left one time, and or the lsb with the value 1 Shift code to the left one time, and or it the next bit in the input Repeat to the first loop.
Closing words
First of all thanks to Mark Nelson for some help with it. This is the
first version of this article, I hope to mend possible mistakes, which
you should report if you find. Also any idea is welcome. There are faster
implementations, but this was only an introduction, once you have a good
encoder you only need a good model to have good compression, so research
a little bit. If you want a faster arithmetic coder, look the range
coder.
Contacting the
author
You can reach me via email at: arturo@arturocampos.com
Also don't forget to visit my home page http://www.arturocampos.com
where you can find more and better info. See you in the next article!
This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.