Implementing huffman code construction and its data structures
by
Arturo San Emeterio Campos








Disclaimer
Copyright (c) Arturo San Emeterio Campos 2000. All rights reserved. Permission is granted to make verbatim copies of this
document for private use only.

Note
"Compression programming" is on a early stage. However the text itself is complete and will suffer only minor modifications.
 

Table of contents
Introduction
Data structures
Making the huffman binary tree
Sorting the list in the main loop
Making the code table
Outputting the codes
 
 

Introduction

Before implementing this, you have to implement a simple order-0 model. For learning intentions as well as for practical matters, I recommend using qsm. Once you have this, you can start. We'll start with the data structures, because the source code is based on the data structures.
 

Data structures

The input list and the binary tree both use the same data structure. The basic unit is the node of a binary tree. For our matter, we need to store the probability associated with the node. In the year 2000 the best decision seems using 4 bytes to store it (it is big enough, it's what a computer uses and it's well aligned). We also have to store the symbol. However the node could not be a symbol but a fictive node. We could use two bytes. The first one would say if the node is a symbol, and the second one would be the symbol. But again we can freely use 4 bytes for it. We are not going to allocate more than 514 of such nodes so memory is not a problem. We can mark a fictive node with a value higher than any input symbol. Because we are using bytes as the input symbol, a 256 will represent a fictive node. Of course we also need two pointers to the children. Summing up we have the following data structure:

Binary tree node
- symbol, 0-255 is a symbol, 256 is a fictive node (4 bytes)
- probability (4 bytes)
- pointer to left child (4 bytes)
- pointer to right child (4 bytes)

When the binary tree is finished we need to store the codes in a table, because we can't read them directly from the tree. For every symbol we need an entry in the code table, it's not too big either, and we can access the entry of a symbol by "code_table[symbol]". for every symbol we have to store the code length and the code itself.

Code table entry
-code, these are the bits to output (4 bytes)
-code length, the number of bits of this code (4 bytes)

If we have an alphabet with n symbols, we would need a list of n entries, and an output binary tree of 2*n nodes. And talking about restrictions, we have assumed that the maximum code length is 32 bits.

Once we know which data structures are we going to use, we'll use the code for the functions that use them.
 

Making the huffman binary tree

Now we have a table with the probabilities of the current order (in your case, probably order-0). The first thing to do is to copy all the symbols from this table to the list. They are now nodes, but before they are valid ones, we need to zero the pointers. The list is an array with 256 entries (assuming we are working with bytes). Then we have to sort them, the symbol with higher probability first. Let's say you have an order-0 table which is accessed like order0_prob[symbol]. And that the list is named huffman_list.

for( counter=0 ; counter!=256 ; counter++)
 {

 //Put the value of the byte, they are in order
 huffman_list[counter].value = counter;

 //get probability of current byte
 huffman_list[counter].probability = order0_prob[counter];

 //zero children pointers
 huffman_list[counter].left_child = 0; //this is a pointer
 huffman_list[counter].right_child = 0;
 }

//Now sort list based on "huffman_list[counter].probability"
 

To later initialize pointers we need to know how many symbols have a non zero probability, and thus that we have to make a code for them. That's an easy task because they are sorted: the first entry in the list which has a 0 probability is the number of symbols.

for( counter=0 ; counter!=256 ; counter++)
 if(huffman_list[counter].probability==0) //a zero count, no more elements
  break; //get out of the loop

number_of_symbols=counter;
 

We have already allocated memory for huffman_binary_tree[512]. Now we have to initialize the pointers that we'll use in the main loop. The task is copying the last two symbols from the list to the binary tree. Then remove them. After that put a new element on the list with a probability that is the sum of the removed symbols. Ok, to start with the pointer to the list will point at the end of the list, because we need to get the last two nodes (or symbols). However the pointer to the binary tree will point to the first position, because it will be there where we start to put new nodes.

pointer_to_list = start of "huffman_list" + "number_of_symbols" - 1; //point to the end
pointer_to_binary_tree = start of "binary_tree";
 

The main loop is finished when there's only one element left in the list. So the end condition is if the pointer to the list is equal to the start of the list (that is, to its memory address). In the main loop we have to copy the last two elements of the list to the binary tree. That's just a matter of copying all the contents from the pointer "pointer_to_list" to "pointer_to_binary_tree", and for the second one, copy from "pointer_to_list-1" to "pointer_to_binary_tree+1". But before we deal with updating the pointers, let's care about the new node. What we'll do is 'remove' the last element by updating the pointer to the list, that is decrementing its value. After that we'll add to the node in the current position the probability of the next one. That means we are adding the probabilities of the children in the parent (which now uses the entry of one children of its children). Next we mark it as a fictive node by making his value equal to 256. After that we init the pointer to the children: because we haven't updated "pointer_to_binary_tree" yet, the right children is in "pointer_to_binary_tree" and the left one in "pointer_to_binary_tree+1".

while ( pointer_to_list != start of "huffman_list" )
 {
 //copy the last two nodes to the binary tree
 copy values "pointer_to_list" to "pointer_to_binary_tree"
 copy values "pointer_to_list-1" to "pointer_to_binary_tree+1"

 //remove the last node
 decrement "pointer_to_list"

 //make fictive node
 pointer_to_list->value = 256; //make it a fictive node
 //add to it the probability of the other child
 pointer_to_list->probability += (pointer_to_list+1)->probability;
 //initialize pointers
 pointer_to_list->right_child = "pointer_to_binary_tree";
 pointer_to_list->left_child = "pointer_to_binary_tree+1";

 //Now sort the list
 }
 

Sorting the list in the main loop

In the main loop we also have to sort the list. One option is calling qsort (remember to pass it the right length). And another one is taking profit of the fact that the whole array but the last node is sorted. We thus have to care only about putting the last node in the right position. It could happen that the list looked like:
 
Symbol Prob
a 6
b 3
c 2
f1 4

 And had to be changed to:
 
Symbol Prob
a 6
f1 4
b 3
c 2

We see that the only thing we have to do is to swap the last node with the previous one until the previous one has a probability higher or equal to the current one. But notice that we could do more than one swap, like in the example. In the worst situation the last node would have to be moved up to the first position. So in the loop we also have to check that it has not reached yet the first position. We don't have to sort if there's only one element left. Note that we don't really need this first "if" because the second condition of the while would break too, and nothing would be modified.

//Sort routine, must be done before looping again
if ( pointer_to_list != start of "huffman_list" )//don't sort if there's only one left
 {
 //temporal pointer that we'll use instead
 pointer_to_list_temp = pointer_to_list;

 //swap node as long as the previous node has a lower probability or that we
 //don't reach the first position. (both conditions must be true, thus we use &&)
 while ( pointer_to_list_temp != start of "huffman_list" &&
 pointer_to_list_temp->probability > ((pointer_to_list_temp-1)->probability))
  {
  swap nodes, "pointer_to_list_temp" with "pointer_to_list_temp-1"
  decrement "pointer_to_list_temp"
  }
 }

When the main loop is exited, the whole binary tree is done, and the root is in the first position of the list (while the rest of the binary tree is in its array). In the next section I'll tell you how to make the code table from this binary tree.
 

Making the code table

We saw that the code table is made with a recursive function, now we'll see how to implement it. I'll assume you know what recursion is and how it's done with your compiler.

Our function will be called with a pointer to the root. Now take a look at the example of the binary tree and its code table. The code length is equal to the depth of the leaf in the binary tree (the depth of the root is 0). Then our function will be called with the value of the depth, which initially is 0. Also we have to care about the code, at first is 0, and then depending on whatever we go to the left or the right, we put a 0 or a 1 in the right position. To do this we shift the bits of the code to the left, and put in the rightmost position the new bit. When it's called it has to check if current node is a symbol, in that case it outputs the code and code length of it. Otherwise it updates the code and depth (code length) and does recursion. The function would be like the following:

make_codes(current_node, depth, code)
{
//check if it's a symbol
if( current_node->value != 256)
 {
 //it's a symbol, put code and code length
 //"current_node->value" is the byte
 huffman_code_table[current_node->value].code=code;
 huffman_code_table[current_node->value].codelength=depth;
 return;  //end recursion
 }
else
 {
 //it's a fictive node, update code and do recursion
 //First the left node. We have to shift the code to the left.
 //And also put a 0 (but we don't need that step).
 shift "code" one position to the left

 //we are going to call the function for the children, so
 //depth will be higher
 increment depth

 //recursion for left node
 make_codes(current_node->left_node,depth,code);

 //Now we do recursion with the right node, we have to put
 //a 1 in the rightmost position of the code with an 'or'.
 code = code OR 1; //In c: "code|=1;"

 //recursion for right node
 make_codes(current_node->right_node,depth,code);
 }
}

We have to call it with the following parameters: make_codes(start of "huffman_list",0,0). Now we'll discuss about outputting the codes, which is almost the last part of your huffman coder.
 

Outputting the codes

Now we have codes for every symbol to output. And due to the fact that they have the prefix property and are instantaneously decodable the decompressor will be able to decode them, even if they are of variable length. But you have to be careful about how do you output them. Again let's recall our example.

We have decided to make the codes left to right. In the leftmost used bit we have the first bit from the binary tree. Let's assume we store the bit code in a byte (in practice we use 4 bytes). The code of 'i' is stored in the byte in that way (where 'u' means unused bit):

u,u,u,u,u,0,0,1

Due to the length of the code, we know which bits should we use. We have two options for outputting them. First, we can make a custom function which outputs single bits from the code (in our case, 0, then 0, and finally 1). Or, we can use a bitio module which can write a bit variable code and then read it left to right. Like it's the case of the bitio that I explain. In the second case outputting a code is done just with "put_bits(code_length,code);".

To make it sure that your huffman coder works correctly, read Restricting huffman codes to a maximum length.
 


First version of this article: 21-July-2000 Last update of this article: 17-Sept-2000
This is part of Compression programming by Arturo Campos (email address: arturo@arturocampos.com).
http://www.arturocampos.com Visit again soon for new and updated compression articles and software.