"Lzp order-3 with linked lists"
by
Arturo San Emeterio Campos




Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved. Permission is granted to make verbatim copies of this document for private use only.
 

Download
Download the whole article zipped. It contains the images also.
 

Table of contents

  • Introduction
  • Linked list

  •     - Add a new element
        - Search an element
        - Update an element
  • Lzp order-3
  • Doing it faster
  • Closing words
  • Contacting the author

  •  

     
     
     

    Introduction
    A linked list will be used with an order-3 model not for good results, but rather to show how can it be done, so you can apply it to higher orders,and this will lead to good results. Of course knowledge about lzp is needed.
     

    Linked lists
    If you know how a linked list is used, then better skip this section, if you don't, read it. A linked list is a group of elements which may not be continue in memory but they can still be read sequentially. Every element of the list has a pointer to the next element in the list, the last one will have as a pointer 0. (Null) The structure for an element of the linked list may look as the following:

  • 1 byte - value
  • 4 bytes - pointer to next element

  • Let's say we have the values a, b and c, they'll be something like:
    Value  Offset 
    a offset of b
    b offset of c
    c 0
    A "graphical" representation: 
    a ® b ®
     

    This kind of linked lists is a singly one, because every element contains only one pointer, there's another kind of linked lists, called doubly, where every element contains two pointers, to the next and previous elements. With linked lists we can perform some basic operations common in any container (a data structure used for storing and reading data), for lzp we'll only need three of them:

    1. Add a new element
    2. Search an element
    3. Update an element
    Let's see them:
     

    Add a new element.
    What we actually do is to have a big buffer and then put there new elements, of course we have to keep track what is the next free position where we can put a new entry. (without overwritten another) The first job to do is to allocate enough memory for the big buffer, probably 2 megabytes is enough. (this value should be tuned according your needs) And then save in a variable the position of the first free entry. (this pointer will be modified to reflect the bytes which are being used)

    The first operation, to add a new offset, must care about two situations: when we haven't write any element yet, and when there's one element or more on it. For lzp we'll store in our hash table a pointer to the first element, in that case we only have to check our entry in the hash table, if it's initialised, then we know that there's at least one element in the linked lists, otherwise we have to put a new one.

    In this case we get the pointer of the big buffer (in the variable) and put it in the hash table (making it point to the element which we'll write now) then we put our element in the big buffer, in our case we'll write the value (let's say "a") and then the offset to the next element, because this is the last element in the linked lists we'll put a 0 there. Also don't forget to update the variable holding the pointer to the big buffer: you have to add it the length of an element. (for this example 5, 1 byte for the value and 4 for the pointer) When used linked lists we don't need our data structures to be aligned, because we don't use random access.

    In the case that our entry in the hash table was already initialised we have to go thru the whole linked lists and at the end of it put our new element. What we have to do is to take the pointer to the first element, and then check if it's offset is 0 (= it's the last element) if it's then we have found the last element, otherwise get its pointer and check the next element, till you found the last element. Once you've found the last element you have to put the next element: you get the pointer from the variable and put it in the pointer of the last element, and then in the position of the big buffer you put your new element, first the value and then for the pointer a value of 0. (coz now this is the last element)

    Want some pseudo code? Here it is.

    Explanation of the variables used: This is all you need for adding new elements. With lzp we don't need to delete elements from the linked lists, we have to update them, but this is done while searching for one.
     

    Search an element
    This is quite easy, seeing the structure of a linked list everyone can imagine how to search an element. You compare your value with the value of the first element, if it's note equal, then you get the pointer of the next element, and continue comparing till you find the same value or there's no more elements. (you get an offset and it's 0) Now we must have a look at the values stored along with the pointers in the entries of the linked lists. In our case we'll have a hash table with order-2 using direct hashing. Every entry will have a pointer to the first element in the linked lists. In every linked lists we'll have all the contexts which have the same order-2 context, so we don't need to put the order-2 context on the element, we only have to put the order-3, which takes one byte. Let's say we have to insert "abc", "bc" is inserted in the hash table, and then in the linked lists we put "a". In this linked lists we'll have all the context which start with "...bc". Then the values are:

    1. Order-3, one byte.
    2. Pointer, the offset of the match. (just as we used to)
    We have to scan all the linked lists till we find the same order-3, in this case we can use the pointer at this entry, otherwise, if we reach the end of the linked lists without having found the same order-3, then we have to insert it and also act as if there was no entry. (probably outputting a literal without a flag)

    Let's have a look at the pseudo code:

    The variables:


    Update an element
    As you know lzp needs (or at least it improves the compression) to update the pointer. You can do so in two ways, depending on whatever you conditionally update, or you always update.

    If you always update the read pointer, then you only have to update in the same routine that reads it.

    However if you conditionally update the pointers (as showed in my article "Lzp research") you can scan again when needed to update, however this will be twice slower. A better solution is once you find the offset save in a variable the offset to this element in the linked lists, so you don't need to go thru the whole linked lists again, you just have to get the offset and update the element.
     

    Why linked lists?
    Let's see how and why a linked list may help us: Hash tables provide random access (you can access any of its elements without having to read the previous or the next ones, the complexity is O(1)) but it needs big amounts of memory: you allocate memory for all the possible elements. (even if some of them will never appear) In the other hand, linked list only reserve memory for the symbols (in our case contexts and offsets) which are currently using, its problem is that it cannot be random accessed, if you want to read the third element you have to read the first and the second before you get to it. Knowing how lzp works we could work with order-3 context with everything in linked lists, but in this case if we want to read an element in the position N, we have to read N elements, which makes it very slow. (having in mind that we could have up to 2^24 elements. One solution is to use a combination of them, first a hash table for lower orders (they don't take a lot of memory, in our case order-2) which provides fast random access, and for higher orders (in this case order-3) linked lists, which take less memory.
     

    Lzp order-3
    Once you know why let's see how we'll implement it. We'll have a hash table for order-2, with 2^16 entries, of course direct hashing. (direct addressing) And then for order-3 we'll use linked lists. The hash table will hold an offset to the first element in a linked list which contain all the elements which had the same order-2 context. (note that this list doesn't need to hold the order-2 context, because direct hashing doesn't make hash collisions) Then the linked list will hold the order-3 and the pointers to the input. Let's see an example:
     
    The hash table contain a pointer in every entry to its linked lists, and the linked lists hold the pointer and the order-3 context. This is a hash table with linked list after having inserted on it the following contexts: "d-00h-01h" "b-00h-001h" "a-FFh-FFh"
    Then usual lzp is used:


    Doing it faster
    As soon as you implement you'll notice that (obviously) it isn't as fast as another version without linked lists can be (of course you'll also notice an improvement in the compression) Unfortunately as long as I know there is only a way to speed it up. As you now in the linked lists elements at the first positions are read first (so accessed faster), and what's the algorithm which assign the first positions to most used elements? exactly, move to front. It will allow most used elements to be at the first positions, so the total average will be lower, and speed higher.

    How to implement it? easy, as soon as an element is read it's moved to the front. Note that we don't put new elements at the start of the linked lists. In the case of linked lists we don't need to move the offsets which are used for pointing to the next element, we'll leave them alone, we'll only move the data itself, in our case the order-3 context and the pointer to the input data.
     

    Closing words
    The use of linked lists is one of the keys to access to higher orders. If we use higher orders our hashing with chaining (using linked lists) will start to seem a context trie, so it's probably that a suffix tree could improve the overall speed. Some of improvements of lzp are described at my article "Lzp research" If you like lzp you should read this article. If you want to get good compression it's needed to use an entropy coder, fortunately for you, I provide three articles about different schemes, Static huffman, arithmetic coding and the range coder. If you find any mistake in this article, contact me.
     

    Contacting the author
    You can reach me via email at: arturo@arturocampos.com  Also don't forget to visit my home page http://www.arturocampos.com where you can find more and better info. See you in the next article!
      


    Arturo San Emeterio Campos, Barcelona 23-Jul-1999


    This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.