Restricting huffman codes to a maximum length
by
Arturo San Emeterio Campos
Disclaimer
Copyright (c) Arturo San Emeterio Campos 2000. All rights reserved.
Permission is granted to make verbatim copies of this
document for private use only.
Note
"Compression programming" is on a early stage. However the text itself
is complete and will suffer only minor modifications.
Table of contents
Introduction
What's the maximum code length?
The probabilities
that make an unbalanced tree
When does a code
need more than 32 bits?
How to solve this problem
All the article assumes that we have an order-0 model, and an alphabet
of 256 symbols, bytes.
What's the maximum code length?
We want to restrict our codes to 32 bytes. To do so we must know when does the algorithm makes long codes. To know so let's see the two most different shapes that a binary tree can have. In the first example there's a balanced huffman binary tree with four symbols on it. The second example is the opposite, and unbalanced one. Other shapes would be just intermediate between them.
The balanced tree is produced with a flat probability, that is when all the symbols have the same probability. It isn't of interest for us now, because the maximum code length will be log2(1/number_of_symbols). Which in the case of coding bytes (256 symbols) would be 8 bits.
·
/ \
a ·
/ \
b ·
/ \
c d
This is obviously the worst case. In case we have A symbols, the maximum
code length can be up to A-1. That means that for our case, the maximum
code length is 255 instead of 32!
The probabilities that make an unbalanced tree
Such a degenerate case can only happen if the probabilities are like a fibbonaci sequence. A fibbonaci sequence is a sequence where the first number is 0, the second is 1, and the next ones are the sum of the two previous numbers in the sequence. The first eight numbers from the fibbonaci sequence are: 0, 1, 1, 2, 3, 5, 8, 13. Recall the example of an unbalanced tree.
If the tree must have this shape, the probability of 'b' must be at least the sum of the probabilities of 'c' and 'd'. And the probability of 'a' must be at least the sum of 'b', 'c' and 'd'.
Let's plug there the fibbonaci sequence. We discard the first number, 0, because if it had a probability of 0 it wouldn't be included in the tree. So the probability of both 'c' and 'd' must be equal. And to know the worst case, minimum too. Which means that they are both 1. Now the probability of 'b' must be at least the sum of 'c' and 'd', which is 2. To end with the probability of 'a' must be at least equal to the probability of the fictive node which has as children 'b' and another fictive node whose children are 'c' and 'd'.
The final shape shows an unbalanced tree. And the probabilities of the
symbols are indeed part of the fibbonaci sequence, "1, 1, 2 and 3".
When does a code need more than 32 bits?
Now we want to know which situation makes a tree which needs more than 32 bits for the code length. To do so, let's analyze our example. We have 4 symbols. Because in the lowest part of the tree there are two symbols, the maximum code length is 3. From that we know that if we want to restrict the code length to 2, we must have a maximum of 3 symbols. In that case even if they had a fibbonaci distribution our program would run fine. If we restrict our codes to N bits, if we have at least N+2 symbols, we know we could get in trouble (if we had N+1 there would be no problem, though).
From the previous we can know how many bytes must occur in our example before the code length is too high. In our case we need to have at least 34 symbols with a fibbonaci distribution. The first one would have a probability of 1, the second of 1, 2, 3, 5 and so on. If we sum them together, the result is 14,930,351. This is the minimum number of symbols that must appear to make a fibbonaci sequence which can degenerate our binary tree. So we know that the case is rather rare. And that if you don't plan to code more than 14 megas of data, you don't need to care about this problem. Note that if you restricted your code length to 16 bits, the minimum is reduced to 6764 bytes. And if for any reason you used 8 bit codes, it would be 142 bytes!
But that's only a minimum. In practice you'll get much more data than
data, and your codes will never be that long. Real probabilities are not
that degenerated at all! However we have to care about worst case, because
that's lossless compression.
The only thing we can do is modify the probabilities of the symbols. So that's what we are going to do. The trick is changing them, so they are not like a fibbonaci sequence, but that they are still more or less like they were (otherwise compression would get hurt). In that case what we do is dividing them by a factor. In our case 4, because that can be emulated with a right shift. Also we may have to care about zeroing any probability. The fast way to do both things is: probability = ( probability SHIFT RIGHT 2 ) OR 1. And the slow one is to care if it's 0, then change it to 1.
In the compressor and the decompressor (in case of an adaptative model),
the code for doing the rescaling would be called from the huffman routines.
From the function that makes the codes. If it detects a length which is
too high, calls the rescaling routine, and after that it calls again the
main routine for making the huffman codes.