Implementing lzp
by
Arturo San Emeterio Campos
Disclaimer
Copyright (c) Arturo San Emeterio Campos 2000. All rights reserved.
Permission is granted to make verbatim copies of this
document for private use only.
Note
"Compression programming" is on a early stage. This text is not finished.
Table of contents
Lz algorithm and its chart
flow
How does
the lzp algorithm chooses the phrases
Output of
the lzp algorithm and its chart flow
Using a
hash table to access context information
Lzp algorithm
with hash tables and its pseudo code
Using pointers to access
the text
Using hashing to access
order-3
Matching two phrases
Coding of literals,
matches and flags
Some final remarks
Closing words
Lz algorithm and its chart flow
Lz77 variations store the input seen till the date, or part of it, in a window. Like for example "The second key feature of speech communication stems from the telephone. It is the universality of the". Then they substitute the next bytes with references to the window. Let's say that this text follows the other one "telephone receiver itself that is important here". We then would look at the past and match "telephone". To let this know to the decompressor we output a code with the position of the current match in the window and also the number of bytes matched. With this information the decompressor can successfully restore the original file.
But what if we don't find any match in the input? In that case we have to output the current byte "t", which is called a literal. Then we can repeat the process and try to match the rest of the data "elephone receiver itself...". But if we can output two different codes another question arises, how does the decompressor know when it has to decode a match or a literal? Before outputting any, we code a flag. A flag has two states, thus it takes one bit, 0 or 1. A 0 means that a literals follow in the bit stream, and a 1 means that the codes of a match follow.
If we put this together, we have the following chart flow:
Find a match.
- No match
- Output a 0 flag
- Output a literal
- Next data
- Match
- Output a 0 flag
- Output codes that describe match
- Next data
- Repeat as long as there's data to compress
Before sending the encoded data, the compressor sends (or stores) the length of the input file. That way the decompressor can know when to stop decoding. Then the decompressor has to do the opposite process:
- Get a flag
- 0 flag
- Get a literal and put it on the output
- Next data
- 1 flag
- Get the description of the match
- Copy match to current position
- Next data
- Repeat as long as the whole file isn't decompressed
Note that when the decompressor copies the match, all the past text
is available to it.
How does the lzp algorithm chooses the phrases
The last context is a good predictor. That's the base of lzp. Instead of looking for matches in all the window, it only checks one position. The last position where the same context occurs. Let's have a look at the example text.
"The second key feature of speech communication stems from the telephone.
It is the universality of the " we are here "telephone receiver itself
that is important here".
Let's say our lzp model uses order-3. The order-3 context of "telephone receiver itself..." is "he " (space character included). We can find two positions that have the same context:
Context Content
"he " "second key feature..."
"he " "telephone. It is..."
But we only choose the last one. Now we match this phrase with the one
at the current match, and found that 9 bytes match "telephone".
Output of the lzp algorithm and its chart flow
At first the decompressor has the same information as the compressor: the past text, including the current context. Thus it can make the same decisions as the compressor. It then selects the same phrase. Due to that while describing the match, we don't need to say its position, only its length.
Then lzp has to output flags, literals and a description of the match which is only its We only select one position in the past text, so we don't need to output its position because the decompressor will choose the same one. Let's see the basic chart flow of lzp.
Select position in the past text
Match against current text
- No match
- Output a 0 flag
- Output a literal
- Next data
- Match
- Output a 0 flag
- Output match length
- Next data
- Repeat as long as there's data to compress
Using a hash table to access context information
In lzp we have to access the information of a context. For that matter we use a hash table. A hash table is an array which is indexed with a value that represents part of the entry. In our case we use the context to index it. In the hash entry itself we store the rest of the information, in our case a pointer to the past text which has this context. Let's see an example.
Context content
"The se" "cond"
We have to put the previous phrase in the hash table. We are going to
use order-2, "se". Every character is a byte, and we have two of them which
represents 2^16=65536 different values, a word. So if we want to index
the hash table with a word, it must have a size of at least 65536 entries.
Now we access the hash entry of the context "se", which is hash_table["se"].
There we store a pointer to "cond", hash_table["se"].pointer= pointer to
"cond".
Lzp algorithm with hash tables and its pseudo code
Now finding the last position with the same context it's just a matter of accessing the hash table with the current context as hash key. The hash key is the value formed with the context to access the hash table. But it could happen that there wasn't any entry yet. In that case we have to output a literal directly. We don't need to output a flag, obviously a match can't occur if there isn't any previous position suitable.
But we also have to update the hash table. If a match occurs we don't update the table. Because if the position worked once for matching it probably will do again. If there wasn't any entry we have to put in the hash entry the current position, or otherwise our table will be empty and we'll do nothing else but outputting literals!
The last case is when there was an entry but its content wasn't equal to the current data. In that case what should we do? keep the last position or put the new one? Remember the main idea of lzp? The last context is a good predictor. That means we have to favor the newest one. Also we could say that the old one proved it wouldn't make matches.
By the way, in the init part we have to ensure that every hash entry is marked as unused. To do so we init the value of the offset to 0. Now let's add the hash table to our chart flow, and make some pseudo code:
Init hash table, read input
Loop.
Make hash key with current context
See if hash_table[hash_value] is unused
- No offset yet
- Put pointer to current position in hash_table[hash_key]
- Output a literal
- Next data (that is, loop again if there's data left)
- There was a valid offset
- Match against current text
- No match
- Put pointer to current position in hash_table[hash_key]
- Output a 0 flag
- Output a literal
- Next data
- Match
- Output a 0 flag
- Output match length
- Next data
Now it's your job to make the code for the decompressor.
Using pointers to access the text
In the compression program we store all the text in a buffer. In the compression loop we have to use a pointer to the current position in the text, to this buffer, we'll call it "current_position". To know when we are done we compare this pointer against a pointer to the last position of the buffer (and thus of the text to compress). Now we know that in "current_position" we have the first byte of the current phrase to match against. And that in "current_position+1" we have the second one. It's obvious then that the context is backwards, obviously order-1 is in "current_position-1", and the second byte (or character) of the context is in "current_position-2".
It would be logical to think that the start of the loop "current_position" points to the first byte. But keep on mind that we have to use the context. So the first bytes that are the context (two in case of order-2), have to outputted so the decompressor also has access to the context. Then in case of order-2 "current_position" points to the third byte.
In the previous pseudo code we have seen just "next data", but how is this done? In the case of outputting a literal we just have to increment "current_position". And for the case of a match we have to add it the match length. Example: you are in "telephone receiver itself..." ("current_position" points to its start "t") and you matched 4 bytes, in that case you add it to the pointer and have "phone receiver itself...".
To make the hash value we use the following hash key: hash_value= "current_position-1" + ("current_position-2" << 8). "<< 8" means shifted eight times to the left. That way we put the two bytes in a single word.
Let's put this new concretions to the code:
Init hash table, read input
Output bytes corresponding to the context
Init current_position and last_position
While(current_position != last_position) //while they are different
-Make hash value with current context
-hash_value= "current_position-1" + ("current_position-2" <<
8)
-if(hash_table[hash_value].pointer==0) //0 means unused
- Unused entry
- hash_table[hash_value].pointer=current_position
- Output a literal
- current_position++ //next position
- There was a valid offset
- How much bytes do hash_table[hash_value].pointer and current_position
share?
- 0
- hash_table[hash_value].pointer=current_position
- Output a 0 flag
- Output a literal
- current_position++
- 1 or more
- Output a 0 flag
- Output match length
- current_position = current_position + match_length
Using hashing to access order-3
Now we are going to improve the compression of the previous implementation by only reducing a little bit the speed. The trick is in the hashing. In the code shown we used a hash key which mapped every bit from the context to the hash value. Let's say "1" represents the first byte of the context, "2" the second, and "h" the hash value. The most significant bit is at the left most part. The hash key was "hash_value= "current_position-1" + ("current_position-2" << 8)" and we can represent it that way:
1111 1111
2222 2222
hhhh hhhh hhhh hhhh
Now we want to hash order-3 but we want to keep the same size for the hash table. In that case the hash value must be of the same length. Thus some bits of the hash values must be shared by at least two input bits. We do it in the following way:
1111 1111
222 2222 2
3333 3
hhhh hhhh hhhh hhhh
The code is: "hash_value= "current_position-1" + ("current_position-2" << 7) + ("current_position-3" << 11)".
Because now we some of the hash value are the shared, different contexts will share the same entry. You may think that this is a problem, of course it doesn't give it as good results as using all the three bytes, but is much better than order-2. Let's say we were using order-2 but that the hash value only had 15 bits and we mapped it that way:
1111
1111
222 2222
hhh hhhh hhhh hhhh
In that case there are 65536 different contexts and 32768 entries (2^15).
Thus a hash entry is shared by 65536/32768=2 different contexts. But in
fact such contexts aren't that different! To start with their order-1 is
the same. Their only difference is the highest bit. In our hash key for
order-3 it happens the same. Hash entries are shared by contexts which
are not very different.
Once we have a pointer to the past text we have to see how many bytes match with the current position. Obviously we have to compare byte per byte, if they are equal keep comparing and increment the match length:
match_length=0
while( *(last_position+match_length) == *(current_position+match_length))
{
match_length++ //increment match length
}
In practice we may want to restrict the match length, so we can code it in a fixed number of bits. A good value is 15 or 255. To restrict it, we break the loop if the match length is already the maximum. Note that "&&" means tat both conditions must be true to keep executing the loop.
match_length=0
while( *(last_position+match_length) == *(current_position+match_length)
&& match_length != 15
)
{
match_length++ //increment match length
}
But still we have to care about something else, what about if while we are matching we read further the end of the buffer? To solve we have to stop if the pointer to the current position plus the match length is equal to the last position of the buffer.
match_length=0
while( *(last_position+match_length) == *(current_position+match_length)
&& match_length != 15
&& (current_position+match_length)!=last_position
)
{
match_length++ //increment match length
}
Coding of literals, matches and flags
What we have seen till now is the lzp model. Now we have to decide how to code, how to output, the literals, flags and match lengths. You have different options depending on speed and compression.
If you want the highest speed, output them directly with bitio routines. Output literals in 8 bits. Flags in 1 bit. And match lengths in, for example, 4 bits (maximum length restricted to 15 or 16). But that offers too little compression, around 6 bpb.
Where we can gain more compression is in a good coding of literals, around 3.5 bpb. You can use huffman or arithmetic coding. Use the first if you prefer speed. The difference in compression is little though, for order-4 only 0.06 bpb. For highest compression use ppmc.
The match lengths can be compressed too. But the flags don't offer much
compression at all, because their probability is almost the same.
In our pseudo code we have assumed that we allocated a buffer with a size equal to the file size. In practice you may have to compress files (4 megas) and you may not want to allocate so much memory. In that case you have two options, first using an sliding window, or compressing the file by blocks of a maximum size.
I'll explain the second which is a lot easier, though achieves less
compression. What you do is compressing every block of the file as if it
was a different file. For every block you read the file again (allocate
memory only once!) init the hash table and compress it. The only thing
that you may leave the same are probability tables for huffman or arithmetic
coding, if you use any. Also the decompressor must know the size of every
block and the end of the file. To do so before outputting the compressed
block, output its uncompressed length. When you have compressed the whole
file output a block length of 0. That will tell the decompressor to stop.
The example text is an extract of book2 from the Calgary Corpus. If
you want to access higher orders, use linked lists.
And for better compression use any code like huffman
or arithmetic coding for the output of the literals.