Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved. Permission is granted to make verbatim copies of this document for private use only.
Download the whole article zipped.
Table of contents
The range coder is presented, a variation of arithmetic coding which does renormalization in bytes instead of bits thus running twice faster, and with 0.01% worse compression than an standard implementation of arithmetic coding. The range coder is believed to be patent free.
Some experience on models and arithmetic
coding is needed. In this case I recommend the qs
and how to achieve it
The entropy is defined as -Log2(P) where P is the probability (ranging from 0 to 1) of a given symbol. Huffman, in most of the cases does not achieve the entropy, unless the probability is 2^x, in this case it does: -Log2(4)=2 or -Log2(32)=5. But this is not what happens in most of the cases, in such cases Huffman assign less or more bits than it should. Just an example if the probability of a given symbol is 7/8 = 0.875, the entropy of it is: -Log2(0.875)= 0,192, but in this case Huffman will assign it 1 bit, which is by far more than it should.
But we have the arithmetic coder which more or less (it actually loses some compression due to the not infinite precision, scaled probabilities and end of data symbol) achieves the entropy. Arithmetic coding works doing an interval between 0-1 and further subdiving it, based on the cumulative probability of the symbol which is currently being processed. Now let's see the routines which the range coder uses. For adjusting the interval it uses the following code:
|Let's say we have to encode the following symbols "bab" and we have the probabilities which are at the left for them. The encoding and decoding will look like the following.|
Encode a symbol:
Decode and update:
||More or less what we have done, is represented by the pic at the left. Note how the interval is first adjusted to represent 'b', and the new interval is again divided by the number of probabilities, and the range of 'a' is chose, again, the same is done for 'b'. (in case we have to encode b)|
The encoder is made of four routines: init, renormalize, encode a symbol and flush encoder. Init must be called before starting to encode symbols, both encode and renormalize code the symbols, and flushing is done when you have encoded all the symbols. The next section discusses the pseudo code.
Encoder encode symbol:
The decoding has five routines: init, renormalize, decode symbol, update state and flush. Update state must be called after knowing the symbol's range, so you can extract it from the number.
Decoder decode symbol:
Decode update state:
Low holds the floating point number, in the theory we use floating point for implementations we use integer math, and instead of 0-1 range we use 0-FFFFFFFFh (dependent of the implementation), which is a 32 bits value. The bit 31 (at the left) is reserved to detect carry (when adding tmp) In bits 30-22 we perform renormalization. The rest of the bits from 21-0 give the high precision. Its estimated that you can code 2^23 symbols without problems.
When the range is below 800000h we know that the renormalization part of low will never change, so we can output it. To not suffer underflow, when the renormalization part of low (bits 30-22) are 0FFh we extract them and keep a counter of how many underflow bytes we have discarded. Once we can output the renormalization part, we can output all those bytes too, but in case that carry has occurred, instead of 0FFh we have to output 00h. (Obviusly if we have a value like 1FFFFh and we have carry, we must output 20000h instead, you see?)
If you are using a low level language you may want to check directly the carry, without using bit 31 for checking it, thus incrementing precision. If you are interested in such implementation contact Michael Schindler, who is able to tell you how to implement it.
About the byte based io. (input/output) If you are using a high level
language, you can skip this paragraph, as the compiler will do this itself.
When doing io you can't directly output a byte as soon as it's available,
that's very slow. Instead you keep a buffer and when its full or when you
have to flush it (end of compression) you write it to the output file.
This has a great impact in speed. Time is reported in machine clocks. Report
from file paper, from the Calgary
|Buffer size (in bytes)||Clocks|
In this article I've only discussed the routines to encode, now you need a model which gives the probabilities (in form of cumulative probabilities). You can use a simple static order-0 model, as presented for arithmetic coding. Or you can use a Quasi Static model which is a good decision. You may want to use the data structures presented by Peter Fenwick in his paper "A New Data Structure for Cumulative Probability Tables: an Improved Frequency-to-Symbol Algorithm." The range coder can be used with ppm just as arithmetic coding is used, there's no difference in its use.
The range coder package. Available here.
Michael Schindler, private communication.
The range coder is faster due to the renormalization in bytes (renormalization happens less usually) and byte based output.
Hope you've found this article accurate enough. If you ever used arithmetic coding in any compressor, you can easily plug the range coder. If you find any mistake in this article or want something to improve email me.
Thanks to Michael Schindler for his help and ideas. If you find any
mistake in this article, email me.
You can reach me via email at: firstname.lastname@example.org Also don't forget to visit my home page http://www.arturocampos.com where you can find more and better info. See you in the next article!
This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.