"Static huffman"
by
Arturo San Emeterio Campos





Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved. Permission is granted to make verbatim copies of this document for private use only.
 

Download
Download the whole article zipped,
 

Table of contents

  • Introduction
  • Static huffman
  • Binary trees
  • How to scan a binary tree
  • How it works
  • Scanning the huffman binary tree
  • Example of doing codes
  • Bit stream
  • Saving the codes
  • Output of a huffman coder
  • Save the binary tree
  • Save the codes and lengths
  • Improvements
  • Closing words
  • Contacting the author

  •  

     
     
     

    Introduction
    This article is about static huffman. The aim of huffman is to provide shorter codes to bytes with higher frequency. It's not a standalone codec, but it can be used, and it is used, with other schemes like Lz variants (arj, zip) or even in image compression (jpeg). So it's very useful. Huffman is, as far as I know, not patented, so you can freely use it. It's not the best entropy coder, but it's easy, fast, good and free of patents. If you want to learn this you only need programming experience, nothing else, I hope I've written enough explanations and examples so you can easily learn this.
     

    Static Huffman
    This algorithm is quite old, David Huffman presented it in 1953, one year before Claude Shannon and Fano presented their method. Huffman's method is efficient, while Shannon-Fano isn't always efficient, probably this is because huffman sorts the probabilities after combining them, and Shannon-Fano doesnn't. The algorithm uses the symbols probabilities and then it makes the codes, which are of variable length. Note that one code can only have one symbol. (One code can't represent 'abc' only 'a', this example is called block-variables codes while huffman is block-block codes) Huffman uses binary trees, so after explaining huffman I'll explain binary trees, if you have already used them, skip this section.

    If you want to make variable codes you have many solutions, some of them will not work at all E-) The codes must be unique decodable, so the decoder will have no problems. For example if we have the following codes:
     
    Byte  Code
    'a'
    01b
    'b'
    0b
    'c'
    11b

    When the decoder gets a 0 it doesn't know if it's a 'b' or the start of the code for 'a'. (probably it will choose 'b') Codes that will make no error could be:
     
    Byte  Code 
    'a'
    1b
    'b' 
    01b
    'c'
    001b
    'd'
    0001b 

    This follows that rule: no code can be prefix of another code, (In our first case the 0b code was prefix of the 01b code) this is called the prefix property. But the codes presented above waste too many bits, they aren't optimal; that's why we use huffman codes. The huffman codes would be something like that:
     
    Byte  Code 
    'a'
    1b
    'b'
    00b
    'c'
    01b

    They have the prefix propierty and also are optimal. Note that those codes are also instantaneous (they can be unique decoded), so we only need the bits of a code itself for recognizing it. (once we have 00b we now that this code corresponds to 'c')
     

    Binary trees
    A binary tree looks like that:

              1
             /  \
           3    2
                /  \
              3    3

    The '1' is the root of the tree. '2' is a node. And the '3' are the leaves. If we want to scan a binary tree (and we want ;-) we must have a difference between a node and a leaf, so the structure of every element of the tree will have an integer, telling what it is, a node (or the root) or a leaf.  Also you can see that every element (except the leaves) has two pointers, to other elements; so the data structure of an element is as follows:
     
    Kind  Description 
    INT value (word) identify the element
    POINTER left  pointers to next elements 
    POINTER right 

    The integer has the value of the element itself. In our case (huffman codes), if it's between 0-255 then it's a byte (a leaf) if it's 257 then it's a node. The pointers will be 0, if they aren't pointing to other elements; if they aren't 0, then the pointer will point to the address (in memory) of the next element. Lets do the tree again. This binary tree is invented, just an example of how a binary tree should be, this is not the binary trees that huffman does. (but almost the same) Imagine those offsets for every element:
     
    Element  Offset 
    1
    00h
    2
    01h
    3
    02h
    4
    03h
    5
    04h

    And we have three bytes: 21,34,56 So here it is our tree:
     

                             (257,01h,04h) <- 1 (root) 
                                      /        \ 
    (node) 2-> (257,02h,03h)   (56,00h,00h) <-5 (leaf) 
                                /         \ 
    (leaf) 3-> (21,00h,00h)   (34,00h,00h) <-4 (leaf)

    Note that the nodes (and the root) have a value of 257, what makes them no 'valid' bytes (not leafs). Also note that the pointers of the leaves are 00h because they don't point to anywhere. Note too that for scanning the binary tree we only need the pointer to the first element, the root. Of course there are other binary trees, which need to scan from leave to root, in that case we would store a pointer to the parents too. But huffman doesn't need it. Our huffman trees will have this structure:
     
    Kind  Description 
    INT value if it's a byte or a node
    UNSIGNED LONG probability  the probability of the byte 
    POINTER left  pointer to next element 
    POINTER right pointer to next element 

    The probability is used for making the tree. (combining elements) When I implemented it, my structure was:
     
    Kind  Description 
    INT value if it's a byte or a node
    INT dummy just to fill space
    UNSIGNED LONG probability  the probability of the byte 
    POINTER left  pointer to next element 
    POINTER right pointer to next element 

    So it was 16 byte long and made sorting easier. (There was need of doing muls, so I emulated them with shifts)
     

    How to scan a binary tree
    Because we don't know the length (the number of nodes and leaves) of a binary tree, the routine to use must be recursive. The length of a binary tree is also called height. (the height of our example of a binary tree is 2) It will start at the root. Then check if we are in a leaf (in that case do what needed, we'll see what), else then check the pointer right, if it's NOT 0, then call the same function but using the pointer to the next element as argument (instead of the pointer to the root, which we use at the start). And then the same for the left pointer. Ok? (that's the recursion)
     

    How it works
    I'll present you a simple string as an example: 'acbacaa' First we must have the probabilities of every symbol. For doing that we keep a list of unsigned longs (double word) with 256 entries (yes, one for every byte) and scan the whole file (or whatever (it may be the result of a compressor)) and increment the entry of the byte. (Not very difficult ;-)
     
    Byte  Probability 
    'a'
    4
    'b'
    1
    'c'
    2

    Then we sort them, the biggest first. (probability)
     
    Byte  Probability 
    'a'
    4
    'c'
    2
    'b'
    1

    Now we start to build the binary tree. What we will do is the following: Get the last two elements and combine them in another binary tree, first we put both elements and the fictive new node pointing to them, then discard both of them, in the probabilities list, and put a new element, with the probabilities of both of them. (in the list of probabilities) (in the binary tree '&c' means a pointer to the node containing 'c')
     
    Byte  Probability  Join
    'a'
    4

    \ F. node1 
    / (prob= 3)
    'c'
    2
    'b'
    1
     
    Binary tree
    (257(1),3,&c,&b) 
                     /     \ 
         ('c',2,0,0)  ('b',1,0,0) 
     
    Then after doing this we SORT the list again and join the last two elements (we should do this till we only have one element so we can't combine more):
     
    Byte  Probability  Join
    'a'
    4
    \ F. node2 
    / (prob= 7)
    fn1
    3
     
    Binary tree
    (257(2),7,&257(1),&a) 
                         /           \ 
    (257(1),3,&c,&b)   ('a',4,0,0) 
                     /     \ 
         ('c',2,0,0)  ('b',1,0,0) 
     

    Some things to note: During this process we have two pointers, one to the list of probabilities and the other to the binary tree. The pointer to the list, points to the LAST element. The pointer to the binary tree points to the start of the binary tree, and when the tree is being build to the next position. What we do is the following: first we create the new fictive node (in the pointer to the binary tree E-), put the value to 257 (not a byte) the probabilities get it by adding the probabilities from the last two elements of the list, and then the pointers to the new elements just to the next position in the binary tree. And then following this fictive node put the two combined elements. Now increment the pointer of the list (so you skip the LAST element) and then put here the new fictive node, just by adding the last probability and the current one, and putting the value 257. (in fact I first copied the elements of the list, and then made the new fictive node) Once this is over, sort the elements of the probabilities list again. And repeat this till you only have two elements. When there's only two elements in the probabilities list you shouldn't do the same, this is an special case, (it's the end) what shall you do? read it in the example. Phew, what a long explanation, now an example: (Note one thing: the data structure of the list of probabilities is itself the (start of the) binary tree.)
     
    Pointer to list 
    Symbol   Prob.  Rp.   Lp. 
    'a'
    4
    0
    0
    'c'
    2
    0
    0
    'b'
    1
    0
    0
     
    Pointer to tree
    Symbol   Prob.  Rp.   Lp. 
    empty
     

    As you can see the list of probabilities hold the byte itself, its probability and two pointers, this will make creating the binary tree easier. Let's assume the value of the pointer to the (start of the) binary tree is 00050000h. And now let's start with the real example, step by step:

    1- Copy both LAST elements
    Pointer to list 
    Symbol   Prob.  Rp.   Lp. 
    'a'
    4
    0
    0
    'c'
    2
    0
    0
    'b'
    1
    0
    0
     
    Pointer to tree
    Symbol   Prob.  Rp.   Lp.  Offset
    'c'
    2
    0
    0
    50010h 
    'b'
    1
    0
    0
    50000h
     
    2- Create the new fictive node
    Pointer to list 
    Symbol   Prob.  Rp.   Lp. 
    'a'
    4
    0
    0
    'c'
    2
    0
    0
    'b'
    1
    0
    0
     
    Pointer to tree
    Symbol   Prob.  Rp.   Lp.  Offset
    257
    3
    50010h  50000h  50020h
    'c'
    2
    0
    0
    50010h 
    'b'
    1
    0
    0
    50000h
     
    Note that putting the correct offsets in the fictive node is only matter of subtracting to the current pointer (to the fictive node) the lenght of every element. (in our case 10h, 16 bytes)

    3- Discard both elements and do a new one (fictive element)
    Pointer to list 
    Symbol   Prob.  Rp.   Lp. 
    'a'
    4
    0
    0
    257
    3
    0
    0
     
    Pointer to tree
    Symbol   Prob.  Rp.   Lp.  Offset
    257
    3
    50010h  50000h  50020h
    'c'
    2
    0
    0
    50010h 
    'b'
    1
    0
    0
    50000h
     
    Note that we don't initialize the pointers (no need) and also we have to do add to his probability the previous ('b': 1h) and instead of his value put a 257.

    4- Sort the list
    Pointer to list 
    Symbol   Prob.  Rp.   Lp. 
    'a'
    4
    0
    0
    257
    3
    0
    0
     
    Pointer to tree
    Symbol   Prob.  Rp.   Lp.  Offset
    257
    3
    50010h  50000h  50020h
    'c'
    2
    0
    0
    50010h 
    'b'
    1
    0
    0
    50000h
     
    In that case sorting the list of probabilities didn't changed it, but you'll find that most of the time it will do.

    5- and then continue

    First check if you have tow elements, if you have more, repeat the same, starting at 1, otherwise, when we only have TWO elements left, we can't work as we used to do, we have to treat it in a different way:

    6- Copy FIRST element
    Pointer to list 
    Symbol   Prob.  Rp.   Lp. 
    'a'
    4
    0
    0
    257
    3
    0
    0
     
    Pointer to tree
    Symbol   Prob.  Rp.   Lp.  Offset
    'a'
    4
    0
    0
    50030h
    257
    3
    50010h  50000h  50020h
    'c'
    2
    0
    0
    50010h 
    'b'
    1
    0
    0
    50000h
     
    7- As usual make fictive pointer (this is the ROOT)
    Pointer to tree
    Symbol   Prob.  Rp.   Lp.  Offset
    257
    7
    50020h 50030h 50040h
    'a'
    4
    0
    0
    50030h
    257
    3
    50010h  50000h  50020h
    'c'
    2
    0
    0
    50010h 
    'b'
    1
    0
    0
    50000h

    As you see making the root is as easy as making a fictive node (in fact is the same) Now your binary tree is finished, you should save the pointer to the ROOT, so you can scan the binary tree. Do this in any variable.

    And about the way you make your probabilities list: It can be not sorted. However this will make huffman codes not optimal, and this is what we want when using huffman codes at all. E-) And when combining elements: Of course you can avoid sorting them. I never implemented it (without sorting) but I think it will make the same codes as Shannon-Fano does.
     

    Scanning the huffman binary tree
     Once we are here, we have our binary tree, now we'll do the codes. As you see the bytes with more probability are at the top and those bytes with LOWEST probability are at the bottom of the binary tree. So let's take profit of that, we'll scan the binary tree, every time we go to the right we assign a 0 to the code, and when we go to the left we assign a 1. The length of the code is equal to the height of the byte in the tree. (so based in that we know that 'a' will have a code with length of 1 bit. and 'c' and 'b' two bits.) Once we arrive to a byte, we assign that code to the byte. Note that we must keep track also of the length of the code. Pseudo code:

    Do_codes:

    Ok? Of course you must care about pushing 'code' after calling again, and nothing else.
     

    Example of doing codes
     Now as an example let's do the codes from our example.
     
     (257(2),7,&257(1),&a) 
                          /          \ 
    (257,3,&c,&b)      ('a',4,0,0) 
                 /       \ 
    ('c',2,0,0)     ('b',1,0,0) 
     
    Pointer to tree
    Symbol   Prob.  Rp.   Lp.  Offset
    257
    7
    50020h 50030h 50040h
    'a'
    4
    0
    0
    50030h
    257
    3
    50010h  50000h  50020h
    'c'
    2
    0
    0
    50010h 
    'b'
    1
    0
    0
    50000h
     

    Code = 0b Length = -1
    Note that lenght starts with -1 so we can easily enter the recursive routine.

    The final codes are:
     
    Byte  Code  Length 
    'a'
    0b
    1
    'b'
    10b
    2
    'c'
    11b
    2

    In the output list (256 entries) you save the code and the length.
     

    Bit stream
     Once you have this list you reescan the whole file and output for every byte its code with its length. Example for our string 'acbacaa':

    And this makes this bit stream: 0111001100b. As you see should be not error decoding this stream. Of course you call the putbits with the length of the code, and as a value the code. (but remember that your putbits must put the codes from left to right, or else when decoding codes bit per bit, you'll have trouble, because you'll read first the last bits)
     

    Saving the codes
     You should note that the bits of the codes MUST be send from left to right. Because they were created from left to right.
    For example the code 011b must be sent as: 0b 1b 1b If you don't do it as above, then imagine our list:
     
    Byte  Code  Length 
    'a'
    0b
    1
    'b'
    10b
    2
    'c'
    11b
    2

    If with code for 'b' we output it from right to left, then we output the 0b first instead of the 1b. So a decoder will get the 0b and just output 'a' !! Another solution will be invert all the codes, this will make coding and decoding faster, let's take as an example:
    Original
    Byte  Code  Length 
    'a'
    0b
    1
    'b'
    10b
    2
    'c'
    110b
    3
    'd'
    111b
    3
     
    Inverted
    Byte  Code  Length 
    'a'
    0b
    1
    'b'
    01b
    2
    'c'
    011b
    3
    'd'
    111b
    3
     
    Now the codes are from right to left, so we can save them with only a putbits, no need of doing slow routines for putting codes. Of course fast ways of inverting the codes are welcome. (I have not implemented it yet (you know, no time ;-)) Note that if you need to invert the codes or not, comes from your implementation of the putbits, you should know how it saves the bits.
     

    Output of a huffman coder
     Now here it comes how to save the file. Of course you only don't have to output the bit stream, the decoder also needs to know what represent every code and how long they are. So we need to save also the codes. I'll present you two ways of
    doing it, of course there are another ways which can be even better.
     

    Save the binary tree
     This is the solution that you'll find everywhere. Save the whole binary tree. Then the decompressor (or decoder, name it as you want ;-) reads it. Once it has that, starts to decompress. Get one bit, scan the binary tree. If we found a leaf, stop scanning and output the byte at this leaf. If that was a node, then get another bit and continue scanning till you found a leaf. I read a lot of times that this scheme is slow. I never implemented it, so I can't tell you if it's slow or not.
     

    Save the codes and lengths
    When I implemented huffman coder I needed (of course ;-) a decoder, and then I invented this scheme. Probably someone has already invented and implemented it, however I never saw it. Feel free to use it. But keep in mind that is not the fastest one. My idea is the following: We divide all the codes in lengths. For example the length 5 (5 bits of code) has 3 different codes. I keep track of how many codes a length has. Then I get a bit. I search in the length 1 for the current code. If I found it, output the byte. If I didn't found it, then get another bit and search in the next list of lengths. If you implement this you have to care about somethings: what's the minimum length that a code has? (this is different for every input file) And then start getting those minimum bits before starting. Also you must care about something else: lists that don't have any code at all.

    How I did everything? First, save it: I sorted the list by the bytes. Then a loop. Read byte [0], it has a code? output 1 if yes, and then the code and the length. If no simply output a 0. And this for the 255 bytes. For outputting the length and the code I recommend you that first output 5 bits with the length, and then exactly the bits needed for this code. (the decoder will have no problems, it already knows the length) The decompressor reads its and makes a nice decompression list. First the minimum numbers of bits needed. Then how many codes with the minimum length we have. If it's different than 0 put the codes, and bytes that correspond to the code. Then next length, put the number... Once this list is finished, starts to decompress. Get one bit. read how many codes are. If 0, skip (read another bit), and next list. I think you don't need an example about this, however if you feel it's needed, let me know and I'll (probably) do a new version.
     

    Improvements
     As long as I know static huffman may only be improved (in the ratio) with one trick: canonical huffman. I'll not fully explain there, just a phrase. If you want to learn it, visit this page. Canonical huffman: We do everything in the same way, but we don't care about the codes, just the lengths, then based on the length of the codes we make the codes themselves. So when saving the tree (or in another way) we only output the lengths, the decoder just have to read the lengths and do the codes again. I highly reccomend to visit this page, because you'll learn canonical huffman, and a fast decoder.

    Of  course you could use the static huffman scheme for order-1 instead of order-0 model. (just the current byte)

    There's another improvement, this is adaptive huffman. The way it works is a little bit more difficult than the static scheme. If you need an adaptative entropy scheme you should check any (unpatented) arithmetic coder. It's better than huffman because huffman assumes that the minimum length for a code is 1 bit while arithmetic coding don't.
     

    Closing words
    A lot of effort has been put in this article, as you can see from the explanations and examples, so feedback will be welcome. New versions of this article will be released if you think that something isn't clearer enough. also grammar corrections are welcome. Once, you have implemented a simple order-0 model statical huffman, you could use it for the output of a dictionary based scheme, or something like that. Also, thanks to Michael Schindler, for his help with canonical huffman.  Btw: you can learn canonical huffman and also how to avoid a bug of huffman. If you want a faster decoding, visit this one http://www.compressconsult.com/huffman. Also in the compression pointers, there was a link to a ps file about fast decoding of huffman codes, http://www.internz.com/compression-pointers.html.  If you find any mistake in this article, email me.
     

    Contacting the author
    You can reach me via email at: arturo@arturocampos.com  Also don't forget to visit my home page http://www.arturocampos.com where you can find more and better info. See you in the next article!
     
     

    Arturo San Emeterio Campos, Barcelona 24-Jul-1999


    This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.