Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved.
Permission is granted to make verbatim copies of this document for private
use only.
Download
Download the whole article zipped, with
the figures also.
Table of contents
Introduction
The best lz variation till the date is lzp, which you can tune to obtain
more compression or more speed. This article introduces lzp and the basic
things to know about it, you are encouraged to visit my h-page
where you'll find other articles about lzp about advanced topics of it.
Lzp
This scheme was invented by Charles Bloom, this is scheme is like an
hybrid of lzrw and ppmcb, and some articles of DCC'93. Lzrw was its first
step, when Ross Williams, proved that an lz like tree, could be constructed
with a Markov model. It was the first step, the first step to a very intelligent
hybrid of lz and markov, then, Charles Bloom did the other step in the
summer of 1994 or so, while he was working for Micom. Lzp is a dictionary
based scheme, as his two first letters say "lz" (Lempel-ziv) but what's
the meaning of the "p"? It stands for "prediction". It relies in the fact
that any lz77 coder needs too much bits for the offset, so it's compression
isn't as good as the entropy, (the minimum length which any compressed
file may have, based on the probabilities of the file itself) and then
it tries (and succeed) avoiding to pass to the decoder the offset, lzp
does that trying to predict it based on the last bytes seen. (context)
How to predict the
offset
Now the big question is how to predict this offset. Let's have a look
at those words: "There Where" Both of them have the same text, 'here' The
way we'll predict the offset is just with finite markov. (this is not the
only way of doing it, but it's the most used and fastest) We'll keep a
list of pointers based on the context, so two equal context, will have
the same entry in this table, and we then can predict the offset. If we
were using an order-2 model, our offset could be: 'he' which in our case:
"There Where" the first time it's seen, it will keep an offset to the context
so the context 'he' will have a pointer to "re Where" And the second time
we found the context 'he' again, we compare the text at the current position
"re" with the text at the predicted offset "re" so we see that we have
a match of three bytes. Want more examples? let's say we have an order-1
model (it makes the prediction based just on the last byte) so our table
should have 256 entries. Let's compress "abcdabc" In our case (using a
order-1 model) we have to pass to the decompressor the first byte, so it
can predict the offset.
| Context | Offset |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
length (2) |
And a decompressor will do the following:
| Context | Offset |
|
Output |
|
|
|
|
"a" |
|
|
|
|
"ab" |
|
|
|
|
"abc" |
|
|
|
|
"abcd" |
|
|
|
|
"abcda" |
|
|
|
length (2) | "abcdabc" |
Of course both the compressor and the decompressor must work with the
same model. (do the same based on the incoming text)
Hashing
This is the key for understanding lzp. Hashing is making a value based
on the context of the current position. The way you do that is called the
hash key. The hash index is a value we use for accessing a table filled
with pointers to the input string. We save in a table the offset for this
context, and once we found this context again we probably have the same
data, so a match. Once we have the hash index we check if there was any
offset, if it weren't we put the offset to the current position in case
it was valid we check if the bytes from this offset match the bytes in
our position, and then we output the number of bytes that match has. Example,
let's say our table has 255 entries, our hash key will be based on the
last byte. Our hash key is: hash_key= (last byte]) AND 255 Note that we
don't allow any value above 255. We are encoding "ababa". First we put
'a' as a literal (we need at least one byte for the hash key)
| Current byte | Predicted? |
|
|
|
|
|
|
|
|
|
|
|
|
|
And decompressing:
| Current byte | Input | Predicted? |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
length |
|
|
Lzp output
In that example you also have to output a flag saying if there's a
byte or a match. But in fact you don't always need to output a flag, you
only need it in the case you find the current offset valid (not 0) so you
have to check if there's a match. But if the offset wasn't valid (it was
0) then you don't have need to output a flag, saying if there's a match
or a literal coz, you'll always output a literal, remember you only have
to output flags when you've done a prediction. Let me show this:
| Current byte | Predicted? |
|
Flag? |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
This improves compression. As you can see the output of lzp is of two
kinds:
| Prediction | No prediction |
|
|
|
|
|
Of course you can spend less bits in the length or do variable codes,
or entropy code the literals and lengths. If you don't use any entropy
coder, you should tune the best length (in bits) for the length, also you
should change the range of the length, its usual range is from 0-255, but
in fact we'll never write a 0 length (but remember that 1 is the minimum),
so we can change the range from 1-256. You should be careful when incrementing
the 0FFh, coz if you are using bytes, you may have problems, a 0.
Hash keys
The way we get the entry of the table is very basic an almost is only
for examples. (you can use it in an lzp compressor, but the ratio will
be poor) Let's say we have the string "abacab" The second time we get the
offset from the context 'a' this isn't a match at all ('ba'!='ca'), so
we must output 'c' and then update the table with a pointer to 'c' and
the third time we do the hash key of 'a' we neither have a match, ('c'!='b').
This case is called a hash collision. What we need is to use any other
order than order-1 model. Order-2 will be fine, and it'll take 65,535 entries,
which if using dwords for pointers, makes a memory usage of 262,144 bytes.
Which is not that big. If we want to use an order-3 model, we'll have 16,777,216
entries, and we'll need 67,108,864. But we even continue, and use
order-4, we'll have 4,294,967,296 entries, and a memory usage of 17,179,869,184
bytes!! With actual computers we can afford even 67 megas, (using order-3
model) but not all the users have 17 gigas of memory. ;-) So we need to
restrict the size of our table (the number of entries) while using more
than order-1. How could we do something like that? Continue reading. In
hashing we need precision, more precision your hash key has, better prediction
so better compression. The last hash key was:
| hhhh | hhhh | <= Hash key |
|
|
|
<= last byte |
Note that I'm putting in a column a nibble, the first column, at the
left, will contain the msb, of a byte, word or dword, of course the last
column, at the left will hold the lsb (less significant byte). We could
also use a two byte hash key like that:
| hhhh | hhhh | hhhh | hhhh |
|
|
|
1111 | 1111 |
Where 'h' is the hash key, 1 the first byte and 2 the second byte. I.e.:
Our context (last bytes) are 45h 0Dh so our hash key is: 450Dh Now we are
using a 2^16 (64k) entries so our table is (2^16)*4 bytes long. (262.144
bytes, 262k) Note that there's no difference between 450Dh or 0D45h, coz
they access at the same entry at all. (as long as both the encoder and
decoder do that in the same way) Note that when using hash keys with xor
this do makes difference. Those kind of hash keys are what I call
direct hash, coz for doing them you only need to do a move, no need of
xors, they are faster, and if you have enough memory they are good. Of
course the hash keys that you'll see now, achieve more compression while
using less memory. (but they are slower to implement) Of course there are
some hash keys which are more precise, they also care about the value of
the third byte, while having a table of 2^16 entries. How? using xors.
This is the hash key that Charles Bloom used in lzp3:
| hhhh | hhhh | hhhh | hhhh |
|
|
|
1111 | 1111 |
|
222
|
2222 | 2 | |
| 3333 | 3 |
The way this is coded is the following:
| hhhh | hhhh | hhhh | hhhh | hhhh |
|
|
2222 |
|
1111 | 1111 |
Will give more compression, but it will need 4.194.304 bytes! And something
like that:
| hhhh | hhhh | hhhh | hhhh | hhhh | hhhh |
|
|
3333 | 2222 |
|
1111 | 1111 |
Will also make a better result, but it will need 67.108.864 bytes of
Ram memory! But you should note that in this moment (1999) you can buy
a Pc with 128 megas of Ram memory, and this is enough for this hash key,
however not everybody has 128 megas or even 64 megas. I still have 16 megas!
;-( The kind of hashing we are using also is called finite context Markov
prediction coz we use a finite context (the bytes we use for the hash key)(it's
finite coz we have a maximum number of bytes to use) to make the prediction.
Also you should associate Markov with probabilities, and when we are doing
prediction what we are doing in fact is using the probabilities of a file
for predicting whatever. (in our case an offset to the context of the current
byte)
Charles Bloom's
hash keys
And this is the way which Charles Bloom defines hash keys:
x, this defines the length in bits of our context, if we have an order-1 model which reads the whole last bit, this will be 12. However we could have an order-2 model (or more) which uses 2 bytes, but not uses the whole 16 bits.
H, this is the hash index, it tells the position in the table for this context, where we save the offset. (and the context confirmation if needed)
h, this is the 'and' so we restrict the size of our hash index. Let's say our table has 4096 entries, then h will be 12, and 1<<12=4096, and 4096-1= 4095, 0FFFh, which will have the purpose of erasing (putting to 0) all the bits but the 12 at the right.
p, this is one of the most important (the other which also is very important is the size of the hash table) of the parameters, which should be carefully tunned to produce the better results.
Context confirmation
This is used when a bit of the hash key is the result of the xor of
two or more bits. I.e.:
|
|
For higher order like order-4, context confirmation, around 0,2 bpb
of improvement.
Different hash keys
And now, let's see some different hash keys. One implementation which
was faster than the famous version of Lz, Stack Lzs
algorithm, while achieving 1,4 bpb less than it. (more ratio) This
one uses
a 2k window and a table with 1024 entries, so it uses around 5k.
X=order-2 model x=
Under construction.
After a match
Lz77 outputs pairs of offset/length, and after them it always output
a literal. Lzp does the same. Why? If you have found a match just after
it the probabilities of getting another match are rather low. Why? coz
the context of the next match is the current match, and in case that the
new match is really a match it should have matched in the previous phrase.
So if you output a literal after a match, you get more speed, and little
bit of more compression.
Phrases
We shouldn't see lzp just as try to predict matches. For example let's
take lz77, it keeps in a sliding window phrases to the input file, but
let's focus more on this, lzw (or lz78), it just keeps phrases, stand alone
phrases. We should try to do the same, get phrases. Well, in fact we get
possible phrases every time we add a new offset to the hash table. Charles
Bloom proposed the following, just update the hash table at phrase boundaries,
at the start of the phrase. So a match like the following: "there"
should only store an offset to the start, ("t") the phrase boundary, but
it shouldn't do it at the rest "here". This idea improves both compression
and speed. I've read some papers about compression (I don't remember what
paper exactly) and read about the suffix property, (I think I'm not wrong)
most lz algorithms have the suffix property: all the suffix of a phrase
are also phrases in the dictionary. I.e.: In a lz77 compressor we have
the phrase "there" then in our dictionary we also have its suffixes, all
the phrases inside of it: "here" "ere" "re" "e" Do you see? Well, this
and also the fact, that compression on little files isn't as good as it
could be I had the following idea: I try to update only at phrase boundaries,
but also try to quickly fill up our hash table, how I did it? Easy, I update
inside a phrase, only if the entry corresponding to the suffix wasn't initialized
yet. Example: We have the phrase "there", we are under an order-1 model
(for example purposes is fine) First we update at the phrase boundaries,
with an offset to "t". Then we conditionally update for the suffixes of
the phrase:
Sliding Window
Lzp, as I've already mentioned is a dictionary based scheme, so like
lz77, it keeps an sliding window, the size of this window can vary, it's
just your decision. An sliding windows of 65535 bytes is as good as an
infinite window can be, and the difference between an sliding windows of
65535 bytes and another of 16384, is rather low. You can reduce the size
of the window without really hurting compression, while having more speed.
I always use infinite windows (well, they aren't real infinite, they just
take as many memory as the user have) this makes everything easier. Here
there's a graphical example of how the size of the windows affects compression:
![]() |
The y is bpb, and the x size of the
window, in the way 2^n.
Numbers from lzp_old by Charles Bloom. (for paper1) |
If you have any experience with finite windows, let me know.
When to update
the table
The hash table which holds the offsets should be updated at phrase
boundaries, we have to update it for all the contexts but when we are inside
a match, thus a phrase. This is in fact very easy to do, at your inner
loop, you always update the table, and once you have a match, coz you've
already updated the hash table for its start, then you skip the matched
bytes, and don't do any update for them. In that way we adapt the "dictionary"
to the incoming bytes, and so it compress better. Let's say we have an
string like: "which where when", the first time we do the offset for the
context "wh" (order-2 model) it points to "ich" and the second time it
points to "ere" but if now we don't update the offset we lose a match in
the third one, "e". Do you want some numbers? here they are, the first
table is using order-3 with hashing, a 2^16 entries table, and also entropy
coding, updating always but in phrase boundaries and when we have a match,
the second does the same than the first, but also updates when it finds
new matches.
| File | Original | Length1 | bpb1 | length2 | bpb2 |
| bib | 117541 | 43575 | 2,9657 | 43514 | 2,9616 |
| book1 | 785393 | 368476 | 3,7532 | 367666 | 3,7450 |
| book2 | 626490 | 250254 | 3,1956 | 247255 | 3,1573 |
| geo | 102400 | 72937 | 5,6982 | 72953 | 5,6994 |
| news | 387168 | 161574 | 3,3385 | 160900 | 3,3246 |
| obj1 | 21504 | 12028 | 4,4747 | 11969 | 4,4527 |
| obj2 | 246814 | 106347 | 3,4470 | 102824 | 3,3328 |
| paper1 | 54411 | 22567 | 3,3180 | 22441 | 3,2994 |
| paper2 | 83930 | 35272 | 3,3620 | 35144 | 3,3498 |
| pic | 513216 | 65077 | 1,0144 | 57951 | 0,9033 |
| progc | 41098 | 16321 | 3,1769 | 15936 | 3,1020 |
| progl | 73890 | 20431 | 2,2120 | 19557 | 2,1174 |
| progp | 51345 | 14483 | 2,2565 | 13636 | 2,1246 |
| trans | 94487 | 24041 | 2,0354 | 22554 | 1,9095 |
| Average | 3199687 | 1213383 | 3,1605 | 1194300 | 3,0028 |
The files are from the Calgary/Canterbury
text compression corpus. Bpb (bits per byte) is calculated in that
way: bpb = (compressed_length*8) / uncompressed_length.
More than one offset
This was my approach of compression improvements to lzp, I didn't know
anything else, so I had to invent something to get it better, I haven't
yet Charles Bloom's articles, so I tried more and more things, till I did
that. The idea is not to keep only a single offset, instead you keep 2^n
offsets. Then once you check for matches you have to check the whole offsets
and then get the one who gives you better compression. Of course you don't
put two offsets with the same context! I.e.: Under an order-1 model and
after compressing this phrase: "there " we'll have two offsets for the
"e" context, one pointing to "re " and another pointing to " ". Once we
found another e, we see how long the match is at the first offset, then
the second, till the last offset, and write the number of the best offset
in N bits. How do we update this table? let's say we have 2^n, n=1 => 1
different offsets, then the first offset we get for a given context we
assign it the position 0 when we saw another one, we assign to it the position
1, and the next one will have the position 0, and again and again. Another
way of doing it would be something like a simple mtf scheme for updating
the offset, so you can keep the most used offsets. Also we could use a
qsort, based on the number of times that we have used them. However such
solutions have problems, if the probabilities of the text change along
the file, then we may have real compression problems, coz old offsets will
have high probabilities, so new offsets will be quickly discarded. I've
not implemented yet such tricks, so if you do, please tell me. This way
of improving compression have a negative impact on speed, a serious problem
which makes it not a very good idea if you care about speed. Also my test
showed me that using more than 2^4, didn't payed for the 4 bits that we
had to write, also the speed was poor. But hey! I've succeed improving
compression. E-) Another idea of mine which my be useful.
Pseudo code
Once you've understood how hashing works this is very easy, all of
those examples use an infinite window. (some experience on Lz compression
would be good)
Improvements
Some improvements are pretty obvious (using an entropy
coder), but there others which you should care about. Also you can
try what's the maximum length, how many bits will you use for the length,
this isn't very important when using an entropy coder, but without it...
You can also do varible length
codes. This decision will make ratio better in one files and worser
in others. Also you can find some other improvements. Speed improvements:
of course a good speed improvement will be doing a fast
putbits or to do byte based output. (read the article lz77, tags)
Any other possible speed improvements are when comparison are done, just
see the next section. Courtesy of Dario Phong. E-D
Comparisons
A fast string comparison routine should be developed. First some test
string should be defined, this should cover as most cases as possible:
| Name | String |
| s_1 | "abcdef00" |
| s_11 | "abcdef11" |
| s_2 | "abcdefgh00" |
| s_22 | "abcdefgh01" |
| s_3 | "ae" |
| s_33 | "ad" |
| s_4 | "000000000000000000000000000000000000001" |
| s_44 | "000000000000000000000000000000000000002" |
String s_1 will be compared with string s_11, string s_2 will be compared with s_22 ... All the routines assume: DS:EDI-> first string DS:ESI-> second string, also they assume that EDI shouldn't be changed. The first comparison was a byte based one:
dec
esi
@@copy_inner:
inc
esi
mov
al,[edi+ecx]
inc
ecx
cmp
al,[esi]
je short @@copy_inner
dec
ecx
That easy, with the following timings, which were taken in a Pentium
plain 133:
| String | Timing |
| s_1 | 210-80 |
| s_2 | 120-89 |
| s_3 | 119-49 |
| s_4 | 400-253 |
The first timing is the first time it's executed, so the processor has to load the code, instructions of more than one byte cannot be paired, etc. The second, is when the average of 5 repetitions. It's measured in clocks. The last implementation was a word based comparison:
sub
esi,2
@@lzp_more_equal:
add
esi,2
mov
ax,[edi+ecx]
add
ecx,2
cmp
[esi],ax
je short @@lzp_more_equal
mov
al,[edi+ecx-2] ;compare the last byte
sub
ecx,2
cmp
al,[esi]
jne short @@nooop
inc
ecx
@@nooop:
And the timings:
| String | Timing |
| s_1 | 202-83 |
| s_2 | 147-81 |
| s_3 | 97-43 |
| s_4 | 338-203 |
As you can see the speed improved unless in short strings, that's the
main cause to make me not to do a dword version, the operations out of
the 'unrolled' loop, will not pay for the improvements. Anyway if someone
improves the presented routines, or do a dword version, email-me.
Closing words
First of all I should thank to Charles
Bloom for all his help, indirectly, articles, and directly, answering
me a lot of questions; Thanks Charles! As you can see while reading this
article, I've put a lot of effort in this article, so some feedback will
be welcome, congratulations are welcome, but rather I prefer ideas of how
to improve the article. Actually (in the moment which I'm writing this
article) I'm doing lzp+huffman.
Contacting the
author
You can reach me via email at: arturo@arturocampos.com
Also don't forget to visit my home page http://www.arturocampos.com
where you can find more and better info. See you in the next article!
This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.