Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved.
Permission is granted to make verbatim copies of this document for private
use only.
Download
Download the whole article zipped,
Table of contents
Introduction
This article is about static huffman. The aim of huffman is to provide
shorter codes to bytes with higher frequency. It's not a standalone codec,
but it can be used, and it is used, with other schemes like Lz variants
(arj, zip) or even in image compression (jpeg). So it's very useful. Huffman
is, as far as I know, not patented, so you can freely use it. It's
not the best entropy coder, but it's easy, fast, good and free of patents.
If you want to learn this you only need programming experience, nothing
else, I hope I've written enough explanations and examples so you can easily
learn this.
Static Huffman
This algorithm is quite old, David Huffman presented it in 1953, one
year before Claude Shannon and Fano presented their method. Huffman's method
is efficient, while Shannon-Fano isn't always efficient, probably this
is because huffman sorts the probabilities after combining them, and Shannon-Fano
doesnn't. The algorithm uses the symbols probabilities and then it makes
the codes, which are of variable length. Note that one code can only have
one symbol. (One code can't represent 'abc' only 'a', this example
is called block-variables codes while huffman is block-block codes) Huffman
uses binary trees, so after explaining huffman I'll explain binary trees,
if you have already used them, skip this section.
If you want to make variable codes you have many solutions, some of
them will not work at all E-) The codes must be unique decodable,
so the decoder will have no problems. For example if we have the following
codes:
| Byte | Code |
|
|
|
|
|
|
|
|
|
When the decoder gets a 0 it doesn't know if it's a 'b' or the start
of the code for 'a'. (probably it will choose 'b') Codes that will make
no error could be:
| Byte | Code |
|
|
|
|
|
|
|
|
|
|
|
|
This follows that rule: no code can be prefix of another code, (In our
first case the 0b code was prefix of the 01b code) this is called the prefix
property. But the codes presented above waste too many bits, they aren't
optimal; that's why we use huffman codes. The huffman codes would be something
like that:
| Byte | Code |
|
|
|
|
|
|
|
|
|
They have the prefix propierty and also are optimal. Note that those
codes are also instantaneous (they can be unique decoded), so we only need
the bits of a code itself for recognizing it. (once we have 00b we now
that this code corresponds to 'c')
Binary trees
A binary tree looks like that:
1
/ \
3 2
/ \
3
3
The '1' is the root of the tree. '2' is a node. And the '3' are the
leaves. If we want to scan a binary tree (and we want ;-) we must have
a difference between a node and a leaf, so the structure of every element
of the tree will have an integer, telling what it is, a node (or the root)
or a leaf. Also you can see that every element (except the leaves)
has two pointers, to other elements; so the data structure of an element
is as follows:
| Kind | Description |
| INT value (word) | identify the element |
| POINTER left | pointers to next elements |
| POINTER right |
The integer has the value of the element itself. In our case (huffman
codes), if it's between 0-255 then it's a byte (a leaf) if it's 257 then
it's a node. The pointers will be 0, if they aren't pointing to other elements;
if they aren't 0, then the pointer will point to the address (in memory)
of the next element. Lets do the tree again. This binary tree is invented,
just an example of how a binary tree should be, this is not the binary
trees that huffman does. (but almost the same) Imagine those offsets for
every element:
| Element | Offset |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
And we have three bytes: 21,34,56 So here it is our tree:
|
(257,01h,04h) <- 1 (root)
/ \ (node) 2-> (257,02h,03h) (56,00h,00h) <-5 (leaf) / \ (leaf) 3-> (21,00h,00h) (34,00h,00h) <-4 (leaf) |
Note that the nodes (and the root) have a value of 257, what makes them
no 'valid' bytes (not leafs). Also note that the pointers of the leaves
are 00h because they don't point to anywhere. Note too that for scanning
the binary tree we only need the pointer to the first element, the root.
Of course there are other binary trees, which need to scan from leave to
root, in that case we would store a pointer to the parents too. But huffman
doesn't need it. Our huffman trees will have this structure:
| Kind | Description |
| INT value | if it's a byte or a node |
| UNSIGNED LONG probability | the probability of the byte |
| POINTER left | pointer to next element |
| POINTER right | pointer to next element |
The probability is used for making the tree. (combining elements) When
I implemented it, my structure was:
| Kind | Description |
| INT value | if it's a byte or a node |
| INT dummy | just to fill space |
| UNSIGNED LONG probability | the probability of the byte |
| POINTER left | pointer to next element |
| POINTER right | pointer to next element |
So it was 16 byte long and made sorting easier. (There was need of doing
muls, so I emulated them with shifts)
How to scan
a binary tree
Because we don't know the length (the number of nodes and leaves) of
a binary tree, the routine to use must be recursive. The length of a binary
tree is also called height. (the height of our example of a binary tree
is 2) It will start at the root. Then check if we are in a leaf (in that
case do what needed, we'll see what), else then check the pointer right,
if it's NOT 0, then call the same function but using the pointer to the
next element as argument (instead of the pointer to the root, which we
use at the start). And then the same for the left pointer. Ok? (that's
the recursion)
How it works
I'll present you a simple string as an example: 'acbacaa' First we
must have the probabilities of every symbol. For doing that we keep a list
of unsigned longs (double word) with 256 entries (yes, one for every byte)
and scan the whole file (or whatever (it may be the result of a compressor))
and increment the entry of the byte. (Not very difficult ;-)
| Byte | Probability |
|
|
|
|
|
|
|
|
|
Then we sort them, the biggest first. (probability)
| Byte | Probability |
|
|
|
|
|
|
|
|
|
Now we start to build the binary tree. What we will do is the following:
Get the last two elements and combine them in another binary tree, first
we put both elements and the fictive new node pointing to them, then discard
both of them, in the probabilities list, and put a new element, with the
probabilities of both of them. (in the list of probabilities) (in the binary
tree '&c' means a pointer to the node containing 'c')
|
|
|
|
Some things to note: During this process we have two pointers, one to
the list of probabilities and the other to the binary tree. The pointer
to the list, points to the LAST element. The pointer to the binary tree
points to the start of the binary tree, and when the tree is being build
to the next position. What we do is the following: first we create the
new fictive node (in the pointer to the binary tree E-), put the value
to 257 (not a byte) the probabilities get it by adding the probabilities
from the last two elements of the list, and then the pointers to the new
elements just to the next position in the binary tree. And then following
this fictive node put the two combined elements. Now increment the pointer
of the list (so you skip the LAST element) and then put here the new fictive
node, just by adding the last probability and the current one, and putting
the value 257. (in fact I first copied the elements of the list, and then
made the new fictive node) Once this is over, sort the elements of the
probabilities list again. And repeat this till you only have two elements.
When there's only two elements in the probabilities list you shouldn't
do the same, this is an special case, (it's the end) what shall you do?
read it in the example. Phew, what a long explanation, now an example:
(Note one thing: the data structure of the list of probabilities is itself
the (start of the) binary tree.)
|
|
||||||||||||||||||||||||||||||||
As you can see the list of probabilities hold the byte itself, its probability and two pointers, this will make creating the binary tree easier. Let's assume the value of the pointer to the (start of the) binary tree is 00050000h. And now let's start with the real example, step by step:
1- Copy both LAST elements
|
|
||||||||||||||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||||||||||||||||||
3- Discard both elements and do a new one (fictive element)
|
|
|||||||||||||||||||||||||||||||||||||||||
4- Sort the list
|
|
|||||||||||||||||||||||||||||||||||||||||
5- and then continue
First check if you have tow elements, if you have more, repeat the same, starting at 1, otherwise, when we only have TWO elements left, we can't work as we used to do, we have to treat it in a different way:
6- Copy FIRST element
|
|
||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||
| Symbol | Prob. | Rp. | Lp. | Offset |
|
|
|
50020h | 50030h | 50040h |
|
|
|
|
|
50030h |
|
|
|
50010h | 50000h | 50020h |
|
|
|
|
|
50010h |
|
|
|
|
|
50000h |
As you see making the root is as easy as making a fictive node (in fact is the same) Now your binary tree is finished, you should save the pointer to the ROOT, so you can scan the binary tree. Do this in any variable.
And about the way you make your probabilities list: It can be not sorted.
However this will make huffman codes not optimal, and this is what we want
when using huffman codes at all. E-) And when combining elements: Of course
you can avoid sorting them. I never implemented it (without sorting) but
I think it will make the same codes as Shannon-Fano does.
Scanning
the huffman binary tree
Once we are here, we have our binary tree, now we'll do the codes.
As you see the bytes with more probability are at the top and those bytes
with LOWEST probability are at the bottom of the binary tree. So let's
take profit of that, we'll scan the binary tree, every time we go to the
right we assign a 0 to the code, and when we go to the left we assign a
1. The length of the code is equal to the height of the byte in the tree.
(so based in that we know that 'a' will have a code with length of 1 bit.
and 'c' and 'b' two bits.) Once we arrive to a byte, we assign that code
to the byte. Note that we must keep track also of the length of the code.
Pseudo code:
Do_codes:
Example of
doing codes
Now as an example let's do the codes from our example.
|
|
||||||||||||||||||||||||||||||||||||
Code = 0b Length = -1
Note that lenght starts with -1 so we can easily enter the recursive
routine.
| Byte | Code | Length |
|
|
|
|
|
|
|
|
|
|
|
|
In the output list (256 entries) you save the code and the length.
Bit stream
Once you have this list you reescan the whole file and output
for every byte its code with its length. Example for our string 'acbacaa':
Saving the codes
You should note that the bits of the codes MUST be send from
left to right. Because they were created from left to right.
For example the code 011b must be sent as: 0b 1b 1b If you don't do
it as above, then imagine our list:
| Byte | Code | Length |
|
|
|
|
|
|
|
|
|
|
|
|
If with code for 'b' we output it from right to left, then we output the 0b first instead of the 1b. So a decoder will get the 0b and just output 'a' !! Another solution will be invert all the codes, this will make coding and decoding faster, let's take as an example:
|
|
||||||||||||||||||||||||||||||||||||
Output of a
huffman coder
Now here it comes how to save the file. Of course you only don't
have to output the bit stream, the decoder also needs to know what represent
every code and how long they are. So we need to save also the codes. I'll
present you two ways of
doing it, of course there are another ways which can be even better.
Save the binary
tree
This is the solution that you'll find everywhere. Save the whole
binary tree. Then the decompressor (or decoder, name it as you want ;-)
reads it. Once it has that, starts to decompress. Get one bit, scan the
binary tree. If we found a leaf, stop scanning and output the byte at this
leaf. If that was a node, then get another bit and continue scanning till
you found a leaf. I read a lot of times that this scheme is slow. I never
implemented it, so I can't tell you if it's slow or not.
Save the codes
and lengths
When I implemented huffman coder I needed (of course ;-) a decoder,
and then I invented this scheme. Probably someone has already invented
and implemented it, however I never saw it. Feel free to use it. But keep
in mind that is not the fastest one. My idea is the following: We divide
all the codes in lengths. For example the length 5 (5 bits of code) has
3 different codes. I keep track of how many codes a length has. Then I
get a bit. I search in the length 1 for the current code. If I found it,
output the byte. If I didn't found it, then get another bit and search
in the next list of lengths. If you implement this you have to care about
somethings: what's the minimum length that a code has? (this is different
for every input file) And then start getting those minimum bits before
starting. Also you must care about something else: lists that don't have
any code at all.
How I did everything? First, save it: I sorted the list by the bytes.
Then a loop. Read byte [0], it has a code? output 1 if yes, and then the
code and the length. If no simply output a 0. And this for the 255 bytes.
For outputting the length and the code I recommend you that first output
5 bits with the length, and then exactly the bits needed for this code.
(the decoder will have no problems, it already knows the length) The decompressor
reads its and makes a nice decompression list. First the minimum numbers
of bits needed. Then how many codes with the minimum length we have. If
it's different than 0 put the codes, and bytes that correspond to the code.
Then next length, put the number... Once this list is finished, starts
to decompress. Get one bit. read how many codes are. If 0, skip (read another
bit), and next list. I think you don't need an example about this, however
if you feel it's needed, let me know and I'll (probably) do a new version.
Improvements
As long as I know static huffman may only be improved (in the
ratio) with one trick: canonical huffman.
I'll not fully explain there, just a phrase. If you want to learn it, visit
this
page. Canonical huffman: We do everything in the same way, but we don't
care about the codes, just the lengths, then based on the length of the
codes we make the codes themselves. So when saving the tree (or in another
way) we only output the lengths, the decoder just have to read the lengths
and do the codes again. I highly reccomend to visit this page, because
you'll learn canonical huffman, and a fast decoder.
Of course you could use the static huffman scheme for order-1 instead of order-0 model. (just the current byte)
There's another improvement, this is adaptive huffman. The way it works
is a little bit more difficult than the static scheme. If you need an adaptative
entropy scheme you should check any (unpatented) arithmetic coder. It's
better than huffman because huffman assumes that the minimum length for
a code is 1 bit while arithmetic coding don't.
Closing words
A lot of effort has been put in this article, as you can see from the
explanations and examples, so feedback will be welcome. New versions of
this article will be released if you think that something isn't clearer
enough. also grammar corrections are welcome. Once, you have implemented
a simple order-0 model statical huffman, you could use it for the output
of a dictionary based scheme, or something like that. Also, thanks to Michael
Schindler, for his help with canonical huffman. Btw: you can learn
canonical
huffman and also how to avoid a bug of huffman.
If you want a faster decoding, visit this one http://www.compressconsult.com/huffman.
Also in the compression pointers, there was a link to a ps file about fast
decoding of huffman codes, http://www.internz.com/compression-pointers.html.
If you find any mistake in this article, email me.
Contacting the
author
You can reach me via email at: arturo@arturocampos.com
Also don't forget to visit my home page http://www.arturocampos.com
where you can find more and better info. See you in the next article!
This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.