Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved.
Permission is granted to make verbatim copies of this document for private
use only.
Download
Download the whole article zipped. It
contains the images also.
Table of contents
Introduction
A linked list will be used with an order-3 model not for good results,
but rather to show how can it be done, so you can apply it to higher orders,and
this will lead to good results. Of course knowledge about lzp
is needed.
Linked lists
If you know how a linked list is used, then better skip this section,
if you don't, read it. A linked list is a group of elements which may not
be continue in memory but they can still be read sequentially. Every element
of the list has a pointer to the next element in the list, the last one
will have as a pointer 0. (Null) The structure for an element of the linked
list may look as the following:
|
A "graphical" representation:
|
This kind of linked lists is a singly one, because every element contains only one pointer, there's another kind of linked lists, called doubly, where every element contains two pointers, to the next and previous elements. With linked lists we can perform some basic operations common in any container (a data structure used for storing and reading data), for lzp we'll only need three of them:
Let's see them:Add a new element.
What we actually do is to have a big buffer and then put there new
elements, of course we have to keep track what is the next free position
where we can put a new entry. (without overwritten another) The first job
to do is to allocate enough memory for the big buffer, probably 2 megabytes
is enough. (this value should be tuned according your needs) And then save
in a variable the position of the first free entry. (this pointer will
be modified to reflect the bytes which are being used)
The first operation, to add a new offset, must care about two situations: when we haven't write any element yet, and when there's one element or more on it. For lzp we'll store in our hash table a pointer to the first element, in that case we only have to check our entry in the hash table, if it's initialised, then we know that there's at least one element in the linked lists, otherwise we have to put a new one.
In this case we get the pointer of the big buffer (in the variable) and put it in the hash table (making it point to the element which we'll write now) then we put our element in the big buffer, in our case we'll write the value (let's say "a") and then the offset to the next element, because this is the last element in the linked lists we'll put a 0 there. Also don't forget to update the variable holding the pointer to the big buffer: you have to add it the length of an element. (for this example 5, 1 byte for the value and 4 for the pointer) When used linked lists we don't need our data structures to be aligned, because we don't use random access.
In the case that our entry in the hash table was already initialised we have to go thru the whole linked lists and at the end of it put our new element. What we have to do is to take the pointer to the first element, and then check if it's offset is 0 (= it's the last element) if it's then we have found the last element, otherwise get its pointer and check the next element, till you found the last element. Once you've found the last element you have to put the next element: you get the pointer from the variable and put it in the pointer of the last element, and then in the position of the big buffer you put your new element, first the value and then for the pointer a value of 0. (coz now this is the last element)
Want some pseudo code? Here it is.
Search an element
This is quite easy, seeing the structure of a linked list everyone
can imagine how to search an element. You compare your value with the value
of the first element, if it's note equal, then you get the pointer of the
next element, and continue comparing till you find the same value or there's
no more elements. (you get an offset and it's 0) Now we must have a look
at the values stored along with the pointers in the entries of the linked
lists. In our case we'll have a hash table with order-2 using direct hashing.
Every entry will have a pointer to the first element in the linked lists.
In every linked lists we'll have all the contexts which have the same order-2
context, so we don't need to put the order-2 context on the element, we
only have to put the order-3, which takes one byte. Let's say we have to
insert "abc", "bc" is inserted in the hash table, and then in the linked
lists we put "a". In this linked lists we'll have all the context which
start with "...bc". Then the values are:
Let's have a look at the pseudo code:
Update an element
As you know lzp needs (or at least it improves the compression) to
update the pointer. You can do so in two ways, depending on whatever you
conditionally update, or you always update.
If you always update the read pointer, then you only have to update in the same routine that reads it.
However if you conditionally update the pointers (as showed in my article
"Lzp research") you can scan again when
needed to update, however this will be twice slower. A better solution
is once you find the offset save in a variable the offset to this element
in the linked lists, so you don't need to go thru the whole linked lists
again, you just have to get the offset and update the element.
Why linked lists?
Let's see how and why a linked list may help us: Hash tables provide
random access (you can access any of its elements without having to read
the previous or the next ones, the complexity is O(1)) but it needs big
amounts of memory: you allocate memory for all the possible elements. (even
if some of them will never appear) In the other hand, linked list only
reserve memory for the symbols (in our case contexts and offsets) which
are currently using, its problem is that it cannot be random accessed,
if you want to read the third element you have to read the first and the
second before you get to it. Knowing how lzp works we could work with order-3
context with everything in linked lists, but in this case if we want to
read an element in the position N, we have to read N elements, which makes
it very slow. (having in mind that we could have up to 2^24 elements. One
solution is to use a combination of them, first a hash table for lower
orders (they don't take a lot of memory, in our case order-2) which provides
fast random access, and for higher orders (in this case order-3) linked
lists, which take less memory.
Lzp order-3
Once you know why let's see how we'll implement it. We'll have a hash
table for order-2, with 2^16 entries, of course direct
hashing. (direct addressing) And then for order-3 we'll use linked
lists. The hash table will hold an offset to the first element in a linked
list which contain all the elements which had the same order-2 context.
(note that this list doesn't need to hold the order-2 context, because
direct hashing doesn't make hash collisions) Then the linked list will
hold the order-3 and the pointers to the input. Let's see an example:
| The hash table contain a pointer in every entry to its linked lists, and the linked lists hold the pointer and the order-3 context. This is a hash table with linked list after having inserted on it the following contexts: "d-00h-01h" "b-00h-001h" "a-FFh-FFh" |
|
Doing it faster
As soon as you implement you'll notice that (obviously) it isn't as
fast as another version without linked lists can be (of course you'll also
notice an improvement in the compression) Unfortunately as long as I know
there is only a way to speed it up. As you now in the linked lists elements
at the first positions are read first (so accessed faster), and what's
the algorithm which assign the first positions to most used elements? exactly,
move
to front. It will allow most used elements to be at the first positions,
so the total average will be lower, and speed higher.
How to implement it? easy, as soon as an element is read it's moved
to the front. Note that we don't put new elements at the start of the linked
lists. In the case of linked lists we don't need to move the offsets which
are used for pointing to the next element, we'll leave them alone, we'll
only move the data itself, in our case the order-3 context and the pointer
to the input data.
Closing words
The use of linked lists is one of the keys to access to higher orders.
If we use higher orders our hashing with chaining (using linked lists)
will start to seem a context trie, so it's probably that a suffix tree
could improve the overall speed. Some of improvements of lzp are described
at my article "Lzp research" If you like lzp you should read this article.
If you want to get good compression it's needed to use an entropy coder,
fortunately for you, I provide three articles about different schemes,
Static
huffman, arithmetic coding and the
range
coder. If you find any mistake in this article, contact me.
Contacting the
author
You can reach me via email at: arturo@arturocampos.com
Also don't forget to visit my home page http://www.arturocampos.com
where you can find more and better info. See you in the next article!
This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.