"Range coder"
by
Arturo San Emeterio Campos

Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved. Permission is granted to make verbatim copies of this document for private use only.

• Introduction
• Entropy and how to achieve it
• Encoder Implementation
• Decoder implementation
• Implementation
• Models
• References
• Closing words
• Contacting the author

•

Introduction
The range coder is presented, a variation of arithmetic coding which does renormalization in bytes instead of bits thus running twice faster, and with 0.01% worse compression than an standard implementation of arithmetic coding. The range coder is believed to be patent free.

Some experience on models and arithmetic coding is needed. In this case I recommend the qs model.

Entropy and how to achieve it
The entropy is defined as -Log2(P) where P is the probability (ranging from 0 to 1) of a given symbol. Huffman, in most of the cases does not achieve the entropy, unless the probability is 2^x, in this case it does: -Log2(4)=2 or -Log2(32)=5. But this is not what happens in most of the cases, in such cases Huffman assign less or more bits than it should. Just an example if the probability of a given symbol is 7/8 = 0.875, the entropy of it is: -Log2(0.875)= 0,192, but in this case Huffman will assign it 1 bit, which is by far more than it should.

But we have the arithmetic coder which more or less (it actually loses some compression due to the not infinite precision, scaled probabilities and end of data symbol) achieves the entropy. Arithmetic coding works doing an interval between 0-1 and further subdiving it, based on the cumulative probability of the symbol which is currently being processed. Now let's see the routines which the range coder uses. For adjusting the interval it uses the following code:

• r = range / cumulative_probability[last_symbol]    range/maximum_cumulative_probability
• tmp = r * cumulative_probability[current_symbol]
• range = r * probability[current_symbol]
• low += tmp
Where range is initialized to 1 and low to 0. This operations should be done with infinite floating point precision. When decoding, first we have to see what symbol is, and then update the interval.
• help = range / cumulative_probability[last_symbol]
• tmp = low / help    Now tmp holds the cumulative probability of the current symbol
• Once we know the symbol, we can update the interval for decoding the next one
• tmp = help *  cumulative_probability[current_symbol]
• low -= tmp
• range = help * probability[current_symbol]
 Symbol a b c Probability 2 1 1 Cumulative p. 0 2 3
Let's say we have to encode the following symbols "bab" and we have the probabilities which are at the left for them. The encoding and decoding will look like the following.
 Encoding Action Variable Operation Result Init range 1 Init low 0 Encode b r 1/4 0.25 tmp 0.25*2 0.5 range 0.25*1 0.25 low 0+0.5 0.5 Encode a r 0.25/4 0.0625 tmp 0.0625*0 0 range 0.0625*2 0.125 low 0.5+0 0.5 Decoding Action Variable Operation Result Init range 1 Init low 0.5 Decode help 1/4 0.25 Output b tmp 0.5/0.25 2 Update tmp 0.25*1 0.25 low 0.5-0.25 0.25 range 1-0.25 0.75 Decode help 0.75/4 0.1875 Output a tmp 0.25/0.1875 1.3333 Update tmp 0.1875*0 0 low 0.25-0 0.25 range 0.1875*2 0.375

Encode a symbol:
• r = range / cumulative_probability[last_symbol]
• tmp = r * cumulative_probability[current_symbol]
• range = r * probability[current_symbol]
• low += tmp

Decode and update:

• help = range / cumulative_probability[last_symbol]
• tmp = low / help
• Note that based on the cumulative probability, we find the current symbol, and thus we can know its probability.
• tmp = help *  cumulative_probability[current_symbol]
• low -= tmp
• range = help * probability[current_symbol]
More or less what we have done, is represented by the pic at the left. Note how the interval is first adjusted to represent 'b', and the new interval is again divided by the number of probabilities, and the range of 'a' is chose, again, the same is done for 'b'. (in case we have to encode b)

Encoder implementation
The encoder is made of four routines: init, renormalize, encode a symbol and flush encoder. Init must be called before starting to encode symbols, both encode and renormalize code the symbols, and flushing is done when you have encoded all the symbols. The next section discusses the pseudo code.

Encoder Init:

• Low = 0    Both low and range keep track of the number
• Range = Top_value    Top_value is a defined macro
• Buffer = 0    This is the first byte to output
• Help = 0    No underflow bytes
• Bytecount = 0    No bytes outputted
Variables:
• Low and range. They have the floating point number.
• Buffer. This is the byte which we have to output. The first time we output a byte which will not have any information of the coding, but which is needed to avoid an "if" to see if we don't have changed byte yet. (in the renormalization)
• Help. This variable keeps track of the underflow bytes.
• Bytecount. It keeps track of the number of bytes that the encoder has outputted. (only needed for error recovery or if you want to do an archiver)

Encoder renormalize:

• While ( range <= Bottom_value )    As long as it's not above than Bottom_value (it means we can still output)
• if ( low < ( 0FFh << Shift_bits ) )    If higher part is 0FFh a carry will occur which we can't afford
• Yes    We had a range with a potential carry (which we can't capture with our precision)
• Output_byte ( buffer )    Output the byte
• For ( ; help != 0 ; --help )    Output all underflow bytes pending for output
• Output_byte ( 0FFh )
• Buffer = low >> Shift_bits    Assign to the byte to output the high bytes of low
• No    Now check for actual carry
• if ( ( low And Top_value) != 0)    Check carry bit
• Yes
• Output_byte ( buffer + 1 )    The +1 comes from the carry which propagates till the byte saved
• For ( ; help != 0 ; --help )    Output all underflow bytes pending for output
• Output_byte ( 00h )    0FFh's become 0's (the carry!)
• Buffer = low >> Shift_bits    Assign to the byte to output the high bytes of low
• No
• ++Buffer
• Range << 8    Shift out byte which we have already outputted or are going to be outputted (in case of 0FFh)
• Low << 8
• Low And ( Top_value - 1 )    Clear carry bit
• Repeat
Explanation of predefined values like Bottom_value are at the end of this section.

Encoder encode symbol:

• Renormalize( );    Before encoding, renormalize
• R = range / tot_f    New range for all the symbols
• Tmp = r * lt_f    Start of the range of the current symbol
• If ( lt_f + sy_f < tot_f )    Last frequency?
• Yes    For all the symbols but the last
• Range = r * sy_f    Assign to range the range of the current range
• No    For the last symbol
• Range -= tmp
• Low += tmp    Update low
Variables:
• Tot_f. This is the maximum cumulative probability.
• Lt_f. This is the cumulative probability of the symbol before the one to encode. (low part of range, in a simple order-0 model this should be cump[symbol])
• Sy_f. How big the range of our symbol is. In a simple order-0 model, it will be: freq[symbol+1]-freq[symbol]

Encoder flush:

• Renormalize( );    Renormalize state before flushing
• Temp = low >> 23    Output two bytes more, and Buffer
• If ( temp > 0FFh )    Check for overflow
• Output_byte ( Buffer+1 )    There was overflow
• For ( ; help != 0 ; --help )    Output all underflow bytes pending for output
• Output_byte ( 00h )    0FFh's become 0's (the carry!)
• Else
• Output_byte ( Buffer )    No overflow
• For ( ; help != 0 ; --help )
• Output_byte ( 0FFh )
• Output_byte ( temp & 0FFh)    Output Low
• Output_byte ( (temp = low >> (23-8)) & 0FFh)
Predefined values:
• Top_value. 80000000h (hex)
• Bottom_value. 00800000h.
• Shift_bits. 23.
• Extra_bits. 7.

Decoder implementation

The decoding has five routines: init, renormalize, decode symbol, update state and flush. Update state must be called after knowing the symbol's range, so you can extract it from the number.

Decoder init:

• C = inbyte();    The first byte, a dummy one
• Buffer = inbyte();    Get first encoded byte
• Low = buffer >> ( 8 - Extra_bits )    Init low
• Range = 1 << Extra_bits    Init range

Decoder renormalize:

• While ( range <= Bottom_value )    Read the bytes that the encode outputted
• Low = ( low << 8 ) Or ( ( buffer << Extra_bits ) And 0FFh )    Put bits in low
• Buffer = inbyte( );
• Low = Low Or ( buffer >> ( 8 - Extra_bits) )
• Repeat

Decoder decode symbol:

• Renormalize( );    Remember to use the decoder version in the decompressor
• Help = range / tot_f    The range for all the symbols. Note that help this time doesn't keep track of the underflow bytes, but rather it helps with decoding.
• Tmp = low / help    Get the range of the current symbol
• If ( tmp > tot_f )
• Yes
• No
• Return tmp    The cumulative probability

Decode update state:

• Tmp = help * lt_f
• low -= tmp    Extract number from low
• If ( lt_f + sy_f < tot_f )    Check to see if that is symbol with highest frequency
• Yes
• Range = help * sy_f    Extract number from range
• No
• Range -= tmp

Decoder flush:

• Renormalize ( );    Read the bytes that the encoder wrote (in case it did)

Implementation
Low holds the floating point number, in the theory we use floating point for implementations we use integer math, and instead of 0-1 range we use 0-FFFFFFFFh (dependent of the implementation), which is a 32 bits value. The bit 31 (at the left) is reserved to detect carry (when adding tmp) In bits 30-22 we perform renormalization. The rest of the bits from 21-0 give the high precision. Its estimated that you can code 2^23 symbols without problems.

When the range is below 800000h we know that the renormalization part of low will never change, so we can output it. To not suffer underflow, when the renormalization part of low (bits 30-22) are 0FFh we extract them and keep a counter of how many underflow bytes we have discarded. Once we can output the renormalization part, we can output all those bytes too, but in case that carry has occurred, instead of 0FFh we have to output 00h. (Obviusly if we have a value like 1FFFFh and we have carry, we must output 20000h instead, you see?)

If you are using a low level language you may want to check directly the carry, without using bit 31 for checking it, thus incrementing precision. If you are interested in such implementation contact Michael Schindler, who is able to tell you how to implement it.

About the byte based io. (input/output) If you are using a high level language, you can skip this paragraph, as the compiler will do this itself. When doing io you can't directly output a byte as soon as it's available, that's very slow. Instead you keep a buffer and when its full or when you have to flush it (end of compression) you write it to the output file. This has a great impact in speed. Time is reported in machine clocks. Report from file paper, from the Calgary Corpus set.

 Buffer size (in bytes) Clocks 1 476679208 2 52550010 4 32170836 8 16429902 16 11381554 32 7301852 64 4177316 128 3673010 256 3259018 512 3121538 1024 3117984 2048 3018602 4096 3010252 8192 3018116

Models
In this article I've only discussed the routines to encode, now you need a model which gives the probabilities (in form of cumulative probabilities). You can use a simple static order-0 model, as presented for arithmetic coding. Or you can use a Quasi Static model which is a good decision. You may want to use the data structures presented by Peter Fenwick in his paper "A New Data Structure for Cumulative Probability Tables: an Improved Frequency-to-Symbol Algorithm." The range coder can be used with ppm just as arithmetic coding is used, there's no difference in its use.

References
The range coder package. Available here.
Michael Schindler, private communication.

Closing words
The range coder is faster due to the renormalization in bytes (renormalization happens less usually) and byte based output.

Hope you've found this article accurate enough. If you ever used arithmetic coding in any compressor, you can easily plug the range coder. If you find any mistake in this article or want something to improve email me.

Thanks to Michael Schindler for his help and ideas. If you find any mistake in this article, email me.

Contacting the author
You can reach me via email at: arturo@arturocampos.com  Also don't forget to visit my home page http://www.arturocampos.com where you can find more and better info. See you in the next article!

Arturo San Emeterio Campos, Barcelona 17-Nov-1999