"Lzp"
by
Arturo San Emeterio Campos





Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved. Permission is granted to make verbatim copies of this document for private use only.
 

Download
Download the whole article zipped, including the figures.
 

Table of contents

  • Introduction
  • Lzp
  • How to predict the offset
  • Hashing
  • Lzp output
  • Hash keys
  • Charles Bloom's hash keys
  • Context confirmation
  • Different hash keys
  • After a match
  • Phrases
  • Sliding Window
  • When to update the table
  • More than one offset
  • Pseudo code
  • Improvements
  • Comparisons
  • Closing words
  • Contacting the author


    Introduction
    The best lz variation to date is lzp, which you can tune for more compression or more speed. This article introduces lzp and the basic things to know about it; you are encouraged to visit my home page, where you'll find other articles covering more advanced lzp topics.
     

    Lzp
    This scheme was invented by Charles Bloom. It is like a hybrid of lzrw and ppmcb, and of some articles from DCC'93. Lzrw was its first step: Ross Williams proved that an lz-like tree could be constructed with a Markov model. That was the first step towards a very intelligent hybrid of lz and Markov; then Charles Bloom took the next step, in the summer of 1994 or so, while he was working for Micom. Lzp is a dictionary based scheme, as its first two letters say: "lz" (Lempel-Ziv). But what's the meaning of the "p"? It stands for "prediction". Lzp relies on the fact that any lz77 coder needs too many bits for the offset, so its compression isn't as close to the entropy (the minimum length which any compressed file may have, based on the probabilities of the file itself) as it could be. So lzp tries (and succeeds) to avoid passing the offset to the decoder at all: it predicts the offset from the last bytes seen (the context).
     

    How to predict the offset
    Now the big question is how to predict this offset. Let's have a look at these words: "There Where". Both of them contain the same text, 'here'. The way we'll predict the offset is with a finite context Markov model (this is not the only way of doing it, but it's the most used and the fastest). We keep a table of pointers indexed by the context, so two equal contexts land on the same entry of this table, and we can then predict the offset. Say we use an order-2 model; our context would be 'he'. In our case, "There Where", the first time 'he' is seen the table entry for 'he' gets a pointer to "re Where". The second time we find the context 'he', we compare the text at the current position, "re", with the text at the predicted offset, "re", and we see that we have a match of two bytes. Want more examples? Let's say we have an order-1 model (it makes the prediction based just on the last byte), so our table has 256 entries. Let's compress "abcdabc". In our case (using an order-1 model) we have to pass the first byte to the decompressor, so it can start predicting offsets.
     
    Context   Offset      Output
    ...       ...         'a'
    'a'       "bcdabc"    'b'
    'b'       "cdabc"     'c'
    'c'       "dabc"      'd'
    'd'       "abc"       'a'
    'a'       match       length (2)

    And a decompressor will do the following:
     
    Context   Offset    Get          Output
    ...       ...       'a'          "a"
    'a'       "..."     'b'          "ab"
    'b'       "..."     'c'          "abc"
    'c'       "..."     'd'          "abcd"
    'd'       "..."     'a'          "abcda"
    'a'       match     length (2)   "abcdabc"

    Of course both the compressor and the decompressor must use the same model (do the same thing based on the incoming text).
     

    Hashing
    This is the key to understanding lzp. Hashing means computing a value from the context of the current position; the way you do that is called the hash key. The hash index is a value we use to access a table filled with pointers into the input string. We save in this table the offset for the current context, and once we find the same context again we probably have the same data following it, so a match. Once we have the hash index we check whether an offset is stored there. If there isn't one, we store the offset of the current position. If there is, we check whether the bytes at that offset match the bytes at our position, and then we output how many bytes the match has. Example: let's say our table has 256 entries and our hash key is based on the last byte: hash_key = (last_byte) AND 255. Note that we don't allow any value above 255. We are encoding "ababa". First we put 'a' through as a literal (we need at least one byte for the hash key):
     
    Current byte   Predicted?   Offset
    'b'            no           table[a]=offset 'baba'
    'a'            no           table[b]=offset 'aba'
    'b'            yes          'baba', 2 matches

    And decompressing:
     
    Current byte   Input    Predicted?   Offset
    ...            'a'      ...          ...
    'b'            'b'      no           table[a]=offset 'baba'
    'a'            'a'      no           table[b]=offset 'aba'
    'b'            length   yes          'baba', 2 matches
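
    Before moving on, here is a minimal C sketch of the order-1 scheme used in the tables above. It is hypothetical demo code (a table of 256 pointers, no flags, no real output format); it just prints what would be emitted:

      #include <stdio.h>
      #include <stddef.h>
      #include <string.h>

      /* table[last byte] holds a pointer into the input; NULL means "no offset yet" */
      static const unsigned char *table[256];

      static void lzp_order1_demo(const unsigned char *in, size_t len)
      {
          size_t pos;

          memset(table, 0, sizeof(table));
          printf("literal '%c'\n", in[0]);           /* the first byte goes out raw */

          for (pos = 1; pos < len; ) {
              unsigned ctx = in[pos - 1];            /* hash key = last byte AND 255 */
              const unsigned char *pred = table[ctx];
              table[ctx] = &in[pos];                 /* store the offset for this context */

              if (pred != NULL) {                    /* we have a prediction: count the match */
                  size_t match = 0;
                  while (pos + match < len && pred[match] == in[pos + match])
                      match++;
                  if (match > 0) {
                      printf("match, length %lu\n", (unsigned long)match);
                      pos += match;                  /* skip the matched bytes */
                      continue;
                  }
              }
              printf("literal '%c'\n", in[pos]);     /* no prediction, or the prediction failed */
              pos++;
          }
      }

      int main(void)
      {
          lzp_order1_demo((const unsigned char *)"ababa", 5);
          return 0;
      }

    Running it on "ababa" prints exactly the sequence from the tables: literal 'a', 'b', 'a', then a match of length 2.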

    Lzp output
    In that example you also have to output a flag saying whether there's a byte or a match. But in fact you don't always need to output a flag: you only need it when the stored offset is valid (not 0), because only then is there a prediction to check against. If the offset wasn't valid (it was 0), then you don't need a flag saying whether there's a match or a literal, because you'll always output a literal. Remember: you only have to output flags when you've made a prediction. Let me show this:
     
    Current byte   Predicted?   Offset                   Flag?
    'b'            no           table[a]=offset 'baba'   no
    'a'            no           table[b]=offset 'aba'    no
    'b'            yes          'baba', 2 matches        yes

    This improves compression. As you can see the output of lzp is of two kinds:
     
    Prediction                        No prediction
    literal flag (1) + literal (8)    literal (8)
    match flag (1) + length (8)

    Of course you can spend fewer bits on the length, use variable length codes, or entropy code the literals and lengths. If you don't use an entropy coder you should tune the best size (in bits) for the length field, and you should also change its range: the usual range is 0-255, but we'll never write a length of 0 (remember that 1 is the minimum), so we can shift the range to 1-256 by storing length-1. Just be careful when adding the 1 back to a stored 0FFh: if you are using byte variables it wraps around to 0.
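
    As a sketch of that rule (hypothetical put_bit/put_byte helpers assumed, lengths capped at 256 for simplicity):

      #include <stdint.h>
      #include <stddef.h>

      void put_bit(int bit);          /* assumed bit writer, not shown  */
      void put_byte(uint8_t byte);    /* assumed byte writer, not shown */

      /* There was a valid offset, so the decoder also predicts and expects a flag. */
      void emit_with_prediction(size_t match_len, uint8_t literal)
      {
          if (match_len > 0) {
              put_bit(1);                            /* match flag              */
              put_byte((uint8_t)(match_len - 1));    /* 1..256 stored as 0..255 */
          } else {
              put_bit(0);                            /* literal flag            */
              put_byte(literal);
          }
      }

      /* No offset was stored: no prediction, so no flag at all, just the literal. */
      void emit_without_prediction(uint8_t literal)
      {
          put_byte(literal);
      }

    On the decoding side the length comes back as get_byte() + 1; if you keep it in a byte variable, that +1 is exactly where the 0FFh wrap-around mentioned above can bite you.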
     

    Hash keys
    The way we got the table entry above is very basic and is almost only useful for examples (you can use it in an lzp compressor, but the ratio will be poor). Let's say we have the string "abacab". The second time we get the offset from the context 'a' it isn't a match at all ('ba' != 'ca'), so we must output 'c' and then update the table with a pointer to the 'c'; and the third time we hash the context 'a' we don't have a match either ('c' != 'b'). This case is called a hash collision. What we need is to use some order higher than order-1. Order-2 will be fine, and it takes 65,536 entries, which, using dwords for the pointers, makes a memory usage of 262,144 bytes. That is not that big. If we want to use an order-3 model we'll have 16,777,216 entries and we'll need 67,108,864 bytes. And if we go even further and use order-4, we'll have 4,294,967,296 entries and a memory usage of 17,179,869,184 bytes!! With current computers we can afford even 67 megs (an order-3 model), but not all users have 17 gigs of memory. ;-) So we need to restrict the size of our table (the number of entries) while still using more than order-1. How can we do something like that? Keep reading. In hashing we need precision: the more precision your hash key has, the better the prediction, and thus the better the compression. The last hash key was:
     
    hhhh hhhh   <= hash key
    1111 1111   <= last byte

    Note that each column holds a nibble; the first column, at the left, contains the msb of a byte, word or dword, and of course the last column, at the right, holds the lsb (least significant bits). We could also use a two byte hash key like this:
     
    hhhh hhhh hhhh hhhh   <= hash key
    2222 2222 1111 1111   <= last two bytes

    Where 'h' is the hash key, 1 the first byte and 2 the second byte. I.e.: our context (the last two bytes) is 45h 0Dh, so our hash key is 450Dh. Now we are using 2^16 (64k) entries, so our table is (2^16)*4 bytes long (262,144 bytes). Note that it makes no difference whether you build the key as 450Dh or 0D45h (which byte goes in the high part), as long as both the encoder and the decoder do it the same way. (When using hash keys with xor, the order does make a difference.) These kinds of hash keys are what I call direct hashes, because to build them you only need a move, no xors; they are faster, and if you have enough memory they are good. Of course the hash keys that you'll see now achieve more compression while using less memory (but they are slower to compute). There are hash keys which are more precise: they also care about the value of the third byte, while still using a table of 2^16 entries. How? Using xors. This is the hash key that Charles Bloom used in lzp3:
     
    hhhh hhhh hhhh hhhh   <= hash key
              1111 1111   <= byte 1
     222 2222 2           <= byte 2 (shifted left 7)
    3333 3                <= byte 3 (shifted left 11)

    The way this is coded is the following:

    1.  hash_index = byte_1
    2.  hash_index XOR (byte_2 shifted to the left 7 times)
    3.  hash_index XOR (byte_3 shifted to the left 11 times)
    4.  hash_index AND 1111111111111111b
    In the last step we AND the value, so the range of the hash key is 0 to 2^16 - 1. (Of course only the least significant bits of byte 3 take part in the hash key.) More memory makes it better; for example a hash key like this:
     
    hhhh hhhh hhhh hhhh hhhh   <= hash key (20 bits)
    3333 2222 2222 1111 1111   <= bytes 3, 2 and 1

    will give more compression, but it will need 4,194,304 bytes! And something like this:
     
    hhhh hhhh hhhh hhhh hhhh hhhh   <= hash key (24 bits)
    3333 3333 2222 2222 1111 1111   <= bytes 3, 2 and 1

    will give an even better result, but it will need 67,108,864 bytes of RAM! Note that at this moment (1999) you can buy a PC with 128 megs of RAM, which is enough for this hash key; however not everybody has 128 megs, or even 64. I still have 16 megs! ;-( The kind of hashing we are using is also called finite context Markov prediction, because we use a finite context (the bytes used by the hash key; it's finite because we use a maximum number of bytes) to make the prediction. You should associate Markov with probabilities: when we are doing prediction, what we are really doing is using the probabilities of the file to predict something (in our case an offset to the context of the current byte).
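
    Going back to the xor key used in lzp3, steps 1-4 above translate almost literally into C. A small sketch (byte_1 is the last byte, byte_2 and byte_3 the ones before it):

      #include <stdint.h>

      static uint32_t hash_order3(uint8_t byte_1, uint8_t byte_2, uint8_t byte_3)
      {
          uint32_t h = byte_1;
          h ^= (uint32_t)byte_2 << 7;     /* byte 2 shifted left 7 times  */
          h ^= (uint32_t)byte_3 << 11;    /* byte 3 shifted left 11 times */
          return h & 0xFFFF;              /* keep 16 bits: 2^16 entries   */
      }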
     

    Charles Bloom's hash keys
    And this is the way Charles Bloom defines hash keys (his general formula and its diagrams are not reproduced here): x is the number of context bits, h is the number of bits of the hash index (so the table has 2^h entries), and p is the number of bits shifted out. In our example the key only worked with two bytes (an order-2 model), but of course we can work with more than 2. Let's analyse how the parameters interact. The best value for p is p = x - h. So, say we have an order-2 model which uses the whole 16 bits, so x is 16, and the hash table has 65,536 entries (2^16), so h is 16; then p = x - h => p = 16 - 16 => p = 0. This means that we won't shift out any bit, all the bits will take part in the hash index, and it will be a good hash key (the best possible with an order-2 model). Note that it doesn't make sense to have h above x: with an order-1 model x is 8, and if h is 9 or more it makes no sense at all, because only 8 bits can change the hash index. Also note that Charles Bloom found empirically that a value of p which is a multiple of 4 is not a good idea: with an order-2 model a good value of h is 12 (4,096 entries), and p = 3 was better than p = 4 by 0.1 bpb.
     

    Context confirmation
    This is used when a bit of the hash key is the result of the xor of two or more bits. Take the lzp3-style key shown above: bit number 7 of the hash index is the xor of the msb of the last byte and the lsb of the penultimate byte, so it can introduce 'errors'. Let me explain: not real errors, but two different contexts may get the same hash entry, and this can (and will) hurt compression. So we need a way of knowing whether the current context is the same one which saved the offset. The way we do that is the following: along with the offset we save the context. When we compute the next hash index, we first check whether there is an offset; if there is, we check whether the stored context is the same as the current one. If it is, the offset is valid; otherwise we proceed as if there were no offset at all. Context confirmation improves bpb, but it doubles the memory usage and decreases speed.
    File Original  Confirmation  No confirmation 
    bib 117541 43514 43169
    book1 785393 367666  366898
    book2 626490 247255 246950
    geo 102400 72953 73948
    news 387168 160900 160942
    obj1 21504 11969 12053
    obj2 246814 102824 102561
    paper1 54411 22441 22313
    paper2  83930 35144 35074
    pic 513216 57951 59973
    progc 41098 15936 15904
    progl 73890 19557 19533
    progp 51345 13636 13543
    trans 94487 22554 22500
    Average  3199687  1194300  1195361
     
    The table used had 2^16 entries, and the hash key used was the one presented above. The only difference between the two versions is that one uses context confirmation and the other doesn't. As you can see, the difference in compression is rather low (about 1k!), and the version without context confirmation used half the memory and compressed the files in less time. Note that it used an order-3 model (and entropy coding of literals and lengths). Also note that some files are even compressed better without context confirmation (but this only holds for low order models).
     

    For higher orders, like order-4, context confirmation gives around 0.2 bpb of improvement.
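
    A sketch of how context confirmation can be kept next to the offsets (hypothetical layout; here the context bytes are packed into a dword):

      #include <stdint.h>

      struct entry {
          const uint8_t *offset;    /* predicted position, NULL when the slot is empty */
          uint32_t       context;   /* the context bytes that stored this offset       */
      };

      static struct entry table[1 << 16];

      /* Returns the confirmed prediction (or NULL) and refreshes the slot. */
      static const uint8_t *predict(uint32_t hash, uint32_t context, const uint8_t *current)
      {
          const uint8_t *pred = NULL;

          if (table[hash].offset != NULL && table[hash].context == context)
              pred = table[hash].offset;       /* same context: the offset is trusted   */
          /* otherwise proceed as if no offset had been stored at all */

          table[hash].offset  = current;       /* update the entry for the new position */
          table[hash].context = context;
          return pred;
      }

    The doubled memory usage mentioned above comes precisely from that extra context field.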
     

    Different hash keys
    And now let's see some different hash keys. One implementation was faster than the famous Stac LZS algorithm while achieving 1.4 bpb less than it (a better ratio); it uses a 2k window and a table with 1024 entries, so it needs around 5k of memory.

    Under construction.
     

    After a match
    Lz77 coders output offset/length pairs, and after each pair they always output a literal; lzp does the same. Why? Because if you have just found a match, the probability of getting another match right after it is rather low: the context of the next position is the end of the current match, and if the following bytes really were a match, they would have been part of the previous phrase already. So if you output a literal right after a match you get more speed and a little more compression.
     

    Phrases
    We shouldn't see lzp just as trying to predict matches. Take lz77: it keeps phrases from the input file in a sliding window. Or, closer to the point, lzw (lz78): it just keeps phrases, stand-alone phrases. We should try to do the same: get phrases. Well, in fact we get possible phrases every time we add a new offset to the hash table. Charles Bloom proposed the following: update the hash table only at phrase boundaries, at the start of the phrase. So for a match like "there" we should only store an offset to the start ("t"), the phrase boundary, and not to the rest ("here"). This idea improves both compression and speed. I've read some compression papers (I don't remember which ones exactly) which mention the suffix property (I think I'm not wrong): most lz algorithms have the suffix property, meaning all the suffixes of a phrase are also phrases in the dictionary. I.e., in an lz77 compressor, if we have the phrase "there" then our dictionary also contains its suffixes, all the phrases inside it: "here", "ere", "re", "e". Do you see? Well, from this, and from the fact that compression of little files isn't as good as it could be, I had the following idea: update only at phrase boundaries, but also try to quickly fill up the hash table. How? Easy: inside a phrase, update only if the entry corresponding to the suffix wasn't initialized yet. Example: we have the phrase "there", under an order-1 model (fine for example purposes). First we update at the phrase boundary, with an offset to "t". Then we conditionally update for the suffixes of the phrase:

    1.  Context: "t". Offset: "here". Uninitialized, update table.
    2.  Context: "h". Offset: "ere". Uninitialized, update table.
    3.  Context: "e". Offset: "re". Uninitialized, update table.
    4.  Context: "r". Offset: "e". Uninitialized, update table.
    5.  Context: "e". Offset: "...". Initialized, not update.
    Conditional updating (invented by Dario Phong) helps compression on all files, but especially on little ones; however, speed gets hurt. It's up to you whether to use it or not; a small sketch follows below.
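
    Here is a small sketch of conditional updating for an order-1 model (hypothetical helper; tab[] is the usual table of 256 pointers, and the phrase is assumed not to start at the very first byte nor to end the input):

      #include <stddef.h>

      static const unsigned char *tab[256];

      static void update_phrase(const unsigned char *phrase, size_t phrase_len)
      {
          size_t i;

          tab[phrase[-1]] = phrase;                  /* phrase boundary: always update */

          for (i = 1; i <= phrase_len; i++) {        /* the suffixes inside the phrase */
              unsigned ctx = phrase[i - 1];
              if (tab[ctx] == NULL)                  /* only if not initialized yet    */
                  tab[ctx] = &phrase[i];
          }
      }

    Run it on "there" and you get exactly the five steps listed above: 't', 'h', 'e' and 'r' fill empty entries, and the entry for the final 'e' is already taken, so it is left alone.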
     

    Sliding Window
    Lzp, as I've already mentioned, is a dictionary based scheme, so like lz77 it keeps a sliding window. The size of this window can vary; it's just your decision. A sliding window of 65,535 bytes is about as good as an infinite window can be, and the difference between a sliding window of 65,535 bytes and one of 16,384 is rather low. You can reduce the size of the window without really hurting compression, while gaining speed. I always use infinite windows (well, they aren't really infinite, they just take as much memory as the user has); this makes everything easier. Here is a graphical example of how the size of the window affects compression:
     

    (Figure not reproduced: bpb on the y axis against window size, as 2^n, on the x axis. Numbers from lzp_old by Charles Bloom, for paper1.)

    If you have any experience with finite windows, let me know.

    When to update the table
    The hash table which holds the offsets should be updated at phrase boundaries; we have to update it for every context except when we are inside a match, thus inside a phrase. This is in fact very easy to do: in your inner loop you always update the table, and once you have a match, since you've already updated the hash table for its start, you just skip the matched bytes and don't do any updates for them. In this way we adapt the "dictionary" to the incoming bytes, and so it compresses better. Let's say we have a string like "which where when": the first time we store an offset for the context "wh" (order-2 model) it points to "ich", and the second time it points to "ere"; but if we didn't update the offset at that second occurrence we would lose the match in the third one, the "e". Do you want some numbers? Here they are: the first version uses order-3 with hashing, a 2^16 entries table and entropy coding, updating everywhere except inside phrases and where a match is found; the second does the same as the first, but also updates when it finds a new match.
     

    File Original  Length1  bpb1  length2  bpb2 
    bib 117541 43575 2,9657  43514 2,9616 
    book1 785393 368476 3,7532 367666  3,7450
    book2 626490 250254 3,1956 247255 3,1573
    geo 102400 72937 5,6982 72953 5,6994
    news 387168 161574 3,3385 160900 3,3246
    obj1 21504 12028 4,4747 11969 4,4527
    obj2 246814 106347 3,4470 102824 3,3328
    paper1 54411 22567 3,3180 22441 3,2994
    paper2  83930 35272 3,3620 35144 3,3498
    pic 513216 65077 1,0144 57951 0,9033
    progc 41098 16321 3,1769 15936 3,1020
    progl 73890 20431 2,2120 19557 2,1174
    progp 51345 14483 2,2565 13636 2,1246
    trans 94487 24041 2,0354 22554 1,9095
    Average  3199687  1213383  3,1605 1194300  3,0028

    The files are from the Calgary/Canterbury text compression corpus. Bpb (bits per byte) is calculated this way: bpb = (compressed_length*8) / uncompressed_length.
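    For example, taking length2 for bib: bpb = (43514 * 8) / 117541 = 348112 / 117541 = 2.9616, which is the value in the table.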
     

    More than one offset
    This was my own approach to improving lzp's compression. I didn't know anything else (I didn't have Charles Bloom's articles yet), so I had to invent something to make it better; I tried more and more things until I came up with this. The idea is not to keep only a single offset per context, but 2^n offsets. When you check for matches you try all of them and pick the one which gives you the best compression. Of course you don't store two identical offsets for the same context! I.e., under an order-1 model, after compressing the phrase "there " we'll have two offsets for the 'e' context, one pointing to "re " and another pointing to " ". Once we find another 'e', we see how long the match is at the first offset, then at the second, and so on until the last offset, and we write the number of the best offset in n bits. How do we update this table? Let's say we have 2^n offsets, n = 1 => 2 different offsets. The first offset we get for a given context is assigned position 0; when we see another one we assign it position 1, the next one gets position 0 again, and so on. Another way of doing it would be a simple mtf scheme for updating the offsets, so you keep the most used ones; we could also qsort them by the number of times they have been used. However such solutions have a problem: if the statistics of the text change along the file we may have real compression problems, because old offsets will keep high counts and new offsets will be quickly discarded. I haven't implemented such tricks yet, so if you do, please tell me. This way of improving compression has a negative impact on speed, a serious problem which makes it a poor idea if you care about speed. Also, my tests showed that using more than 2^4 offsets didn't pay for the 4 bits we had to write, and the speed was poor. But hey, I succeeded in improving compression! E-) Another idea of mine which may be useful; a small sketch follows.
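
    A sketch of the idea (hypothetical names; here n = 1, so two offsets per context and the offset number costs one bit):

      #include <stddef.h>
      #include <stdint.h>

      #define NUM_OFFSETS 2                 /* 2^n offsets per context, n = 1 here */

      struct slot {
          const uint8_t *off[NUM_OFFSETS];  /* NULL means the position is free     */
          unsigned       next;              /* next position to overwrite: 0,1,0,1 */
      };

      static size_t match_len(const uint8_t *a, const uint8_t *b, size_t max)
      {
          size_t n = 0;
          while (n < max && a[n] == b[n])
              n++;
          return n;
      }

      /* Try every stored offset and return the best length; *which is the
         offset number to write in n bits. */
      static size_t best_match(const struct slot *s, const uint8_t *cur, size_t max,
                               unsigned *which)
      {
          size_t   best = 0;
          size_t   len;
          unsigned i;

          *which = 0;
          for (i = 0; i < NUM_OFFSETS; i++) {
              if (s->off[i] == NULL)
                  continue;
              len = match_len(s->off[i], cur, max);
              if (len > best) {
                  best = len;
                  *which = i;
              }
          }
          return best;
      }

      static void add_offset(struct slot *s, const uint8_t *cur)
      {
          s->off[s->next] = cur;                        /* overwrite in round-robin order */
          s->next = (s->next + 1) % NUM_OFFSETS;
      }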
     

    Pseudo code
    Once you've understood how hashing works this is very easy; all of these examples use an infinite window. (Some experience with Lz compression will help.)

    This is a simple lzp compressor; a sketch is given below. You can also encode the lengths and literal bytes with an entropy coder, but that is up to you. The decompressor mirrors it. The flag is one bit: if it's 0 we have a literal byte; if it's 1 we have a match. We output the bytes used by the hash key raw, because we need them before we can start hashing! Of course the way you build your hash key is up to you, and this will improve bpb (bits per byte, how many bits you need to represent a byte; the formula is (output_length*8)/input_length). That is just a basic example; a better one would use the hash key H = (X ^ (X >> P) ^ (X >> P2)) & ((1 << h) - 1), so it needs context confirmation, and it also skips the flag when there is no offset. The decoder is quite similar. Some things should be noted: "Init dictionary" means clear the table, put all the entries to 0; and when you compress a match, you advance the current-position pointer by the length of the match.
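
    Here is one possible sketch in C (hypothetical put_bit/put_byte helpers; the order-3 xor key from the "Hash keys" section; no context confirmation or entropy coding; the flag is skipped when no offset is stored):

      #include <stdint.h>
      #include <stddef.h>
      #include <string.h>

      void put_bit(int bit);                     /* assumed bit/byte writers, not shown */
      void put_byte(uint8_t b);

      #define HASH_BITS 16
      #define HASH_SIZE (1u << HASH_BITS)

      /* order-3 xor key: p points at the current position, p[-1] is the last byte */
      static uint32_t hash3(const uint8_t *p)
      {
          return (p[-1] ^ ((uint32_t)p[-2] << 7) ^ ((uint32_t)p[-3] << 11))
                 & (HASH_SIZE - 1);
      }

      void lzp_compress(const uint8_t *in, size_t len)
      {
          static const uint8_t *table[HASH_SIZE];
          size_t pos;

          memset(table, 0, sizeof(table));       /* "init dictionary": all entries to 0 */

          for (pos = 0; pos < 3 && pos < len; pos++)
              put_byte(in[pos]);                 /* the bytes used by the hash key go out raw */

          while (pos < len) {
              uint32_t h = hash3(&in[pos]);
              const uint8_t *pred = table[h];
              size_t match = 0;

              table[h] = &in[pos];               /* update the offset for this context */

              if (pred == NULL) {                /* no offset: literal, and no flag needed */
                  put_byte(in[pos++]);
                  continue;
              }

              while (pos + match < len && match < 256 && pred[match] == in[pos + match])
                  match++;                       /* lengths capped so length-1 fits a byte */

              if (match == 0) {
                  put_bit(0);                    /* literal flag */
                  put_byte(in[pos++]);
              } else {
                  put_bit(1);                    /* match flag */
                  put_byte((uint8_t)(match - 1));/* range 1..256 stored as 0..255 */
                  pos += match;                  /* skip the match; no updates inside it */
              }
          }
      }

    The decompressor does the same thing in reverse: same table, same key; when its entry is empty it reads a raw byte, otherwise it reads the flag and then either a literal or length-1, copying the match from the predicted offset.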
     

    Improvements
    Some improvements are pretty obvious (using an entropy coder), but there are others you should care about. You can experiment with the maximum match length and with how many bits you use for the length; this isn't very important when using an entropy coder, but without one... You can also use variable length codes; this decision will make the ratio better on some files and worse on others. And you can surely find other improvements. Speed improvements: a good one is a fast putbits routine, or byte based output (read the lz77 article, tags). Other possible speed improvements concern the comparisons themselves; see the next section. Courtesy of Dario Phong. E-D
     

    Comparisons
    A fast string comparison routine should be developed. First some test strings should be defined; these should cover as many cases as possible:
     
    Name  String 
    s_1 "abcdef00"
    s_11 "abcdef11" 
    s_2 "abcdefgh00"
    s_22 "abcdefgh01" 
    s_3 "ae"
    s_33 "ad"
    s_4 "000000000000000000000000000000000000001" 
    s_44 "000000000000000000000000000000000000002" 

    String s_1 will be compared with string s_11, string s_2 with s_22, and so on. All the routines assume DS:EDI -> first string and DS:ESI -> second string; they also assume that EDI must not be changed. The first comparison was byte based:

          dec         esi             ; pre-decrement, undone by the inc at the loop top
@@copy_inner:
          inc         esi             ; advance the second string
          mov         al,[edi+ecx]    ; byte from the first string (ecx = bytes matched so far)
          inc         ecx
          cmp         al,[esi]        ; compare with the second string
          je short    @@copy_inner    ; equal? keep matching
          dec         ecx             ; undo the count of the mismatched byte

    That easy, with the following timings, which were taken on a plain Pentium 133:
     
    String  Timing 
    s_1 210-80 
    s_2  120-89
    s_3 119-49
    s_4 400-253 

    The first timing is for the first time the routine is executed, so the processor has to load the code, instructions of more than one byte cannot be paired, etc. The second is the average of 5 repetitions. Both are measured in clocks. The second implementation is a word based comparison:

          sub         esi,2               ; pre-decrement, undone by the add at the loop top
@@lzp_more_equal:
          add         esi,2               ; advance the second string by a word
          mov         ax,[edi+ecx]        ; word from the first string
          add         ecx,2               ; count two more bytes
          cmp         [esi],ax            ; compare with the second string
          je short    @@lzp_more_equal    ; equal? keep matching

          mov         al,[edi+ecx-2]      ; compare the last byte: first byte of the mismatching word
          sub         ecx,2               ; undo the count of that word
          cmp         al,[esi]            ; did its first byte match?
          jne short   @@nooop
          inc         ecx                 ; yes: one more matching byte
@@nooop:
     

    And the timings:
     
    String  Timing 
    s_1 202-83 
    s_2  147-81
    s_3 97-43 
    s_4 338-203 

    As you can see, the speed improved except on short strings; that's the main reason I didn't do a dword version: the operations outside the 'unrolled' loop would not pay for the improvement. Anyway, if someone improves the presented routines, or does a dword version, email me.
     

    Closing words
    First of all I should thank Charles Bloom for all his help, both indirect (his articles) and direct (answering a lot of my questions); thanks Charles! As you can see while reading it, I've put a lot of effort into this article, so some feedback will be welcome; congratulations are welcome too, but I'd rather have ideas on how to improve the article. At the moment of writing this I'm working on lzp+huffman.
     

    Contacting the author
    You can reach me via email at: arturo@arturocampos.com. Also don't forget to visit my home page, http://www.arturocampos.com, where you can find more and better info. See you in the next article!
     
     

    Arturo San Emeterio Campos, Barcelona 07-Jul-1999


    This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.