"Range coder"
Arturo San Emeterio Campos

Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved. Permission is granted to make verbatim copies of this document for private use only.

Download the whole article zipped.

Table of contents

  • Introduction
  • Entropy and how to achieve it
  • Encoder Implementation
  • Decoder implementation
  • Implementation
  • Models
  • References
  • Closing words
  • Contacting the author



    The range coder is presented, a variation of arithmetic coding which does renormalization in bytes instead of bits thus running twice faster, and with 0.01% worse compression than an standard implementation of arithmetic coding. The range coder is believed to be patent free.

    Some experience on models and arithmetic coding is needed. In this case I recommend the qs model.

    Entropy and how to achieve it
    The entropy is defined as -Log2(P) where P is the probability (ranging from 0 to 1) of a given symbol. Huffman, in most of the cases does not achieve the entropy, unless the probability is 2^x, in this case it does: -Log2(4)=2 or -Log2(32)=5. But this is not what happens in most of the cases, in such cases Huffman assign less or more bits than it should. Just an example if the probability of a given symbol is 7/8 = 0.875, the entropy of it is: -Log2(0.875)= 0,192, but in this case Huffman will assign it 1 bit, which is by far more than it should.

    But we have the arithmetic coder which more or less (it actually loses some compression due to the not infinite precision, scaled probabilities and end of data symbol) achieves the entropy. Arithmetic coding works doing an interval between 0-1 and further subdiving it, based on the cumulative probability of the symbol which is currently being processed. Now let's see the routines which the range coder uses. For adjusting the interval it uses the following code:

    Where range is initialized to 1 and low to 0. This operations should be done with infinite floating point precision. When decoding, first we have to see what symbol is, and then update the interval.
    Symbol  a b c
    Probability 2 1 1
    Cumulative p.  0 2 3
    Let's say we have to encode the following symbols "bab" and we have the probabilities which are at the left for them. The encoding and decoding will look like the following.
    Action Variable Operation Result
    Init range 1
    Init low 0
    Encode b  r 1/4 0.25
    tmp 0.25*2 0.5
    range 0.25*1 0.25
    low 0+0.5 0.5
    Encode a r 0.25/4 0.0625
    tmp 0.0625*0 0
    range 0.0625*2 0.125
    low 0.5+0 0.5
    Action Variable Operation Result
    Init range 1
    Init low 0.5
    Decode help 1/4 0.25
    Output b tmp 0.5/0.25 2
    Update tmp 0.25*1 0.25
    low 0.5-0.25 0.25
    range 1-0.25 0.75
    Decode help 0.75/4 0.1875
    Output a tmp 0.25/0.1875 1.3333
    Update tmp 0.1875*0 0
    low 0.25-0 0.25
    range 0.1875*2 0.375

    Encode a symbol:
    • r = range / cumulative_probability[last_symbol] 
    • tmp = r * cumulative_probability[current_symbol]
    • range = r * probability[current_symbol]
    • low += tmp


    Decode and update:

    • help = range / cumulative_probability[last_symbol]
    • tmp = low / help
    • Note that based on the cumulative probability, we find the current symbol, and thus we can know its probability.
    • tmp = help *  cumulative_probability[current_symbol]
    • low -= tmp
    • range = help * probability[current_symbol]
    Encoding bab
    More or less what we have done, is represented by the pic at the left. Note how the interval is first adjusted to represent 'b', and the new interval is again divided by the number of probabilities, and the range of 'a' is chose, again, the same is done for 'b'. (in case we have to encode b)


    Encoder implementation
    The encoder is made of four routines: init, renormalize, encode a symbol and flush encoder. Init must be called before starting to encode symbols, both encode and renormalize code the symbols, and flushing is done when you have encoded all the symbols. The next section discusses the pseudo code.

    Encoder Init:


    Encoder renormalize:

    Explanation of predefined values like Bottom_value are at the end of this section.

    Encoder encode symbol:


    Encoder flush:

    Predefined values:

    Decoder implementation

    The decoding has five routines: init, renormalize, decode symbol, update state and flush. Update state must be called after knowing the symbol's range, so you can extract it from the number.

    Decoder init:

    Decoder renormalize:

    Decoder decode symbol:

    Decode update state:

    Decoder flush:

    Low holds the floating point number, in the theory we use floating point for implementations we use integer math, and instead of 0-1 range we use 0-FFFFFFFFh (dependent of the implementation), which is a 32 bits value. The bit 31 (at the left) is reserved to detect carry (when adding tmp) In bits 30-22 we perform renormalization. The rest of the bits from 21-0 give the high precision. Its estimated that you can code 2^23 symbols without problems.

    When the range is below 800000h we know that the renormalization part of low will never change, so we can output it. To not suffer underflow, when the renormalization part of low (bits 30-22) are 0FFh we extract them and keep a counter of how many underflow bytes we have discarded. Once we can output the renormalization part, we can output all those bytes too, but in case that carry has occurred, instead of 0FFh we have to output 00h. (Obviusly if we have a value like 1FFFFh and we have carry, we must output 20000h instead, you see?)

    If you are using a low level language you may want to check directly the carry, without using bit 31 for checking it, thus incrementing precision. If you are interested in such implementation contact Michael Schindler, who is able to tell you how to implement it.

    About the byte based io. (input/output) If you are using a high level language, you can skip this paragraph, as the compiler will do this itself. When doing io you can't directly output a byte as soon as it's available, that's very slow. Instead you keep a buffer and when its full or when you have to flush it (end of compression) you write it to the output file. This has a great impact in speed. Time is reported in machine clocks. Report from file paper, from the Calgary Corpus set.
    Buffer size (in bytes) Clocks
    1 476679208
    2 52550010
    4 32170836
    8 16429902
    16 11381554
    32 7301852
    64 4177316
    128 3673010
    256 3259018
    512 3121538
    1024 3117984
    2048 3018602
    4096 3010252
    8192 3018116


    In this article I've only discussed the routines to encode, now you need a model which gives the probabilities (in form of cumulative probabilities). You can use a simple static order-0 model, as presented for arithmetic coding. Or you can use a Quasi Static model which is a good decision. You may want to use the data structures presented by Peter Fenwick in his paper "A New Data Structure for Cumulative Probability Tables: an Improved Frequency-to-Symbol Algorithm." The range coder can be used with ppm just as arithmetic coding is used, there's no difference in its use.

    The range coder package. Available here.
    Michael Schindler, private communication.

    Closing words
    The range coder is faster due to the renormalization in bytes (renormalization happens less usually) and byte based output.

    Hope you've found this article accurate enough. If you ever used arithmetic coding in any compressor, you can easily plug the range coder. If you find any mistake in this article or want something to improve email me.

    Thanks to Michael Schindler for his help and ideas. If you find any mistake in this article, email me.

    Contacting the author
    You can reach me via email at: arturo@arturocampos.com  Also don't forget to visit my home page http://www.arturocampos.com where you can find more and better info. See you in the next article!

    Arturo San Emeterio Campos, Barcelona 17-Nov-1999

    This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.