Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved.
Permission is granted to make verbatim copies of this document for private
use only.
Download
Download the whole article zipped.
Table of contents
Introduction
The range coder is presented, a variation of arithmetic coding which
does renormalization in bytes instead of bits thus running twice faster,
and with 0.01% worse compression than an standard implementation of arithmetic
coding. The range coder is believed to be patent free.
Some experience on models and arithmetic
coding is needed. In this case I recommend the qs
model.
Entropy
and how to achieve it
The entropy is defined as -Log2(P)
where P is the probability (ranging from 0 to 1) of a given symbol. Huffman,
in most of the cases does not achieve the entropy, unless the probability
is 2^x, in this case it does: -Log2(4)=2
or -Log2(32)=5. But this is not what happens
in most of the cases, in such cases Huffman assign less or more bits than
it should. Just an example if the probability of a given symbol is 7/8
= 0.875, the entropy of it is: -Log2(0.875)=
0,192, but in this case Huffman will assign it 1 bit, which is by far more
than it should.
But we have the arithmetic coder which more or less (it actually loses some compression due to the not infinite precision, scaled probabilities and end of data symbol) achieves the entropy. Arithmetic coding works doing an interval between 0-1 and further subdiving it, based on the cumulative probability of the symbol which is currently being processed. Now let's see the routines which the range coder uses. For adjusting the interval it uses the following code:
|
Let's say we have to encode the following symbols "bab" and we have the probabilities which are at the left for them. The encoding and decoding will look like the following. |
|
Encode a symbol:
Decode and update:
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
More or less what we have done, is represented by the pic at the left. Note how the interval is first adjusted to represent 'b', and the new interval is again divided by the number of probabilities, and the range of 'a' is chose, again, the same is done for 'b'. (in case we have to encode b) |
Encoder implementation
The encoder is made of four routines: init, renormalize, encode a symbol
and flush encoder. Init must be called before starting to encode symbols,
both encode and renormalize code the symbols, and flushing is done when
you have encoded all the symbols. The next section discusses the pseudo
code.
Encoder Init:
Encoder renormalize:
Encoder encode symbol:
Encoder flush:
The decoding has five routines: init, renormalize, decode symbol, update state and flush. Update state must be called after knowing the symbol's range, so you can extract it from the number.
Decoder init:
Decoder renormalize:
Decoder decode symbol:
Decode update state:
Decoder flush:
Implementation
Low holds the floating point number, in the theory we use floating
point for implementations we use integer math, and instead of 0-1 range
we use 0-FFFFFFFFh (dependent of the implementation), which is a 32 bits
value. The bit 31 (at the left) is reserved to detect carry (when adding
tmp) In bits 30-22 we perform renormalization. The rest of the bits from
21-0 give the high precision. Its estimated that you can code 2^23 symbols
without problems.
When the range is below 800000h we know that the renormalization part of low will never change, so we can output it. To not suffer underflow, when the renormalization part of low (bits 30-22) are 0FFh we extract them and keep a counter of how many underflow bytes we have discarded. Once we can output the renormalization part, we can output all those bytes too, but in case that carry has occurred, instead of 0FFh we have to output 00h. (Obviusly if we have a value like 1FFFFh and we have carry, we must output 20000h instead, you see?)
If you are using a low level language you may want to check directly the carry, without using bit 31 for checking it, thus incrementing precision. If you are interested in such implementation contact Michael Schindler, who is able to tell you how to implement it.
About the byte based io. (input/output) If you are using a high level
language, you can skip this paragraph, as the compiler will do this itself.
When doing io you can't directly output a byte as soon as it's available,
that's very slow. Instead you keep a buffer and when its full or when you
have to flush it (end of compression) you write it to the output file.
This has a great impact in speed. Time is reported in machine clocks. Report
from file paper, from the Calgary
Corpus set.
| Buffer size (in bytes) | Clocks |
| 1 | 476679208 |
| 2 | 52550010 |
| 4 | 32170836 |
| 8 | 16429902 |
| 16 | 11381554 |
| 32 | 7301852 |
| 64 | 4177316 |
| 128 | 3673010 |
| 256 | 3259018 |
| 512 | 3121538 |
| 1024 | 3117984 |
| 2048 | 3018602 |
| 4096 | 3010252 |
| 8192 | 3018116 |
Models
In this article I've only discussed the routines to encode, now you
need a model which gives the probabilities (in form of cumulative probabilities).
You can use a simple static order-0 model, as presented for arithmetic
coding. Or you can use a Quasi Static model
which is a good decision. You may want to use the data structures presented
by Peter Fenwick
in his paper "A New Data Structure for Cumulative Probability Tables: an
Improved Frequency-to-Symbol Algorithm." The range coder can be used with
ppm just as arithmetic coding is used, there's no difference in its use.
References
The range coder package. Available here.
Michael Schindler, private communication.
Closing words
The range coder is faster due to the renormalization in bytes (renormalization
happens less usually) and byte based output.
Hope you've found this article accurate enough. If you ever used arithmetic coding in any compressor, you can easily plug the range coder. If you find any mistake in this article or want something to improve email me.
Thanks to Michael Schindler for his help and ideas. If you find any
mistake in this article, email me.
Contacting the
author
You can reach me via email at: arturo@arturocampos.com
Also don't forget to visit my home page http://www.arturocampos.com
where you can find more and better info. See you in the next article!
This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.