"Lzp research"
by
Arturo San Emeterio Campos




Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved. Permission is granted to make verbatim copies of this document for private use only.
 

Download
Download the whole article zipped, with the figures also.
 

Table of contents

  • Introduction
  • Lz
  • Offset probability
  • Standard Lzp
  • Orders
  • Result of orders
  • Deterministic contexts and optimal contexts
  • Implementing deterministic contexts and optimal contexts
  • Results for updating schemes
  • Lzp with optimal contexts scheme 1 pseudo code
  • Match lengths
  • More than one offset
  • No match literals
  • Encoders
  • References
  • Closing words
  • Contacting the author

  •  

     
     

    Introduction
    In this article it's presented all my research till now in lzp. Some standard models are presented and briefly analyzed. Also improvements for any kind of lzp model are presented: deterministic contexts and optimal contexts. The idea of using more than one offset is seen too, though it's not very explored. Enough results are given, so you can choose the more appropriate model for your compressor. This article is mainly focused on models, encoders are briefly discussed too.

    This article assumes knowledge of lzp, different implementations of it, and ppm. (at least the theory behind it)

    If the tables from this article aren't properly displayed, I could send you or upload to my page the tables in plain text version. Anyway I encourage to view this article in Netscape, instead of Internet Explorer. 800x600 is recommended.
     

    Lz
    We can classify the compression of text in two big worlds, Markov models and Lz dictionary compressors, both of them do the same: assign less bits to more common events; thus some researchers explained Lz schemes thru Markov eyes. Both schemes are different: one provides high compression, the other one high speed. The question raised in the brains of some researchers.

    Time passed and Ross Williams did his approach with lzrw, the idea in theory worked, also lzrw became one of the fastest Lz compressor, though it did not had impressive compression ratios. But it was the first step to the hybrid between contexts coders and Lz.

    Charles Bloom did the next step: Lzp. Which seems to be the best approach of Markov and Lz, high speed and high compression, which can be tuned in order to fit your needs. His paper presented lzp and the most important implementations of it. However one can miss an explanation of what's happening behind it or more results. Most of the implementations had in mind speed, however some of them had a very low memory requirements, and one was better than some ppm variants. (but it uses ppmc as a coder for literals, thus dramatically reducing the speed of the whole compressor)

    Peter Fenwick did some research on it, but he focused it on symbol ranking.

    And then here it comes my work, with the hope that it can further develop lzp.
     

    Offset probability
    Can we see parsing as a markov model? yes, we can, but this is not the point of this article, (see articles from Ross Williams) let's analyze  them. Let's say we are under an lz77 like model, and we have parsed "baa", this will produce the following phrases with the following probabilities:
    Note that lz77 has the suffix property: The suffixes of a phrase in a given dictionary are phrases of the dictionary too. The entropy (under a simple order-0 model) of every phrase is 2.32 bits, and effectively, we can't represent it with less than 2.3 bits, well, in fact we'll need 2 bits for the offset and two bits more for the length of the phrase (which is more than we should assign, one can clearly see that this model isn't the best one), even we could assign a value to every phrase, and thus coding them with only 3 bits.
    Phrase Probability
    "a" 1/5
    "b"  1/5
    "aa"  1/5
    "ba" 1/5
    "baa" 1/5
    We can see that in a Lz like compressor where all the events have a flat probability (the same probability for all the symbols) the only way to improve the probability of a given symbols (and thus needing less bits to encode it) is having less events to encode. I.e.: If we have a dictionary with 4096 phrases defined on it, we'll need at least ent ( 1/4096 ) = 12 bits but if we reduce the number of phrases down to 1024 phrases, we'll need ent ( 1/1024 ) = 10 bits, and with even less phrases, let's say 200, we'll need ent ( 1/200 ) = 7.643 bits. (ent() means the entropy of a symbol)

    A standard version of Lzp solves this problem assigning a probability of 1/2 to every offset from a given context (the other 1/2 is reserved to tell the decompressor that this was a failed match), so it needs ent ( 1/2 ) = 1 bit. In some cases lzp may even assign a probability of 1/1 to a given symbol, but then in the case of no match we  have to encode a 0 match length and a literal. (which may result in degradation of the total compression)
     

    Standard Lzp
    We should define a standard implementation of lzp. The pseudo code of it will look like the following:


    Orders
    If we want to do good models, we have to know how many of them can we use, in this I'll discuss and analyze different orders (the number of symbols which are used as context). We can use different orders, every of them will give different results, but first we have to define a set of standard orders. The symbol it's assumed to be a byte.

    Most of the implementations of lzp use only one context, which is accessed with a hash table. It's fast and also easy to implement. The author has never implemented lzp with context trees or suffixes trees, for various reasons, they are not easy to implement, they are not as fast as hashing, and mainly because lzp should not be treated as a context coder, no blending can be performed, and going to lower orders when no match is done seems no good idea, mainly because no scheme to assign the probability to the escape code has been developed yet for lzp. (also the use of arithmetic coding makes it not suitable for applications where high speed is needed)
     

    Result of orders
    Now that we have defined different contexts, let's see how they perform.
     
    Predictions % Bad Good % Matched %
    O1 544434 87 463592 80842 15 162800 26
    O2 432742 69 321783 110959 26 301605 48
    O2h 446314 71 340311 106003 24 284535 45
    O2hc 401479 64 298006 103473 26 281646 45
    O3 360188 57 242612 117576 33 359255 57
    O3h 373524 60 253083 120441 32 361928 58
    O3hc 331039 53 220359 110680 33 332299 53
    O4 314976 50 198489 116487 37 380558 61
    O5 275372 44 168923 106449 39 362334 58
    O6 235102 38 141649 93453 40 328266 52
    O7 166140 27 97888 68252 41 251772 40
    O8 161348 26 95427 65921 41 244967 39
    1- O1
    2 - O2
    3 - O2h
    4 - O2hc
    5 - O3
    6 - O3h
    7 - O3hc
    8 - O4
    9 - O5
    10 - O6
    11 - O7
    12 - O8

    Numbers for graphics.

    A little explanation of this table: Results come from book2. A bad prediction is one which produce no match (0 match length) a good prediction is one which is not bad. (at the right the % of the good predictions, I mean (good/(good+bad))*100) The column "predictions" holds the number of times that under a given contexts we had an offset (and then we have to check for match). At his right there's the %, in respect with the file size of book2. (626,490 bytes) In the column "matched" you have the total bytes that could be matched. (the sum of all match lengths) And at its right the % of bytes matched on all the file. ((matched/file_size)*100)

    Let's see some graphics:
     
    % of predictions
    Here you can see the % of predictions ((predictions/file_length)*100) 
    As you can see the higher the order, the less predictions, this is due to the zero frequency problem (when we haven't seen any context of the same kind yet). Have in mind that all those tables don't take in account only different orders, but also different ways of accessing them. (I.e.: O2, O2h, and O2hc, all of them use the same order, but the way an offset is stored changes)
    % Good predictions
    This is the % of good predictions ((good_predictions/total_predictions)*100) 
    Obviously also, the higher the order, the more reliable the predictions are. Higher orders than 8 were not explored, it's a know fact that orders above 8 don't make any real improvement in text compression. (and the resources for accessing them will be higher and it will be far slower too)
    % of matched bytes
    That graphic represents the % of matched bytes (as defined above) ((matched_bytes/file_length)*100)

    Conclusions: Low orders produce a lot of wrong predictions, and higher orders are very effected by the zero frequency problem. The best trade off (at least for book2) between memory/speed/compression seems to be o4, however order-5 seems good too.

    With all this information you should be able to choose the order which best suits your needs.
     

    Deterministic contexts and optimal contexts
    Some time ago (I mean around June, 1999), in private communication with Charles Bloom, he explained me what deterministic contexts are, they opened a new set of models, which where not done yet. So I started to work on it. Here I present all my approaches.

    But first let's define what is a deterministic context: A deterministic context is a context which has not produced any bad prediction. (a bad prediction is defined above)

    My first scheme to handle deterministic contexts was only based on this information, however I developed some improved versions of it, which don't use deterministic contexts, so a new term should be introduced. (and I base this in the fact that for lzp they are better than deterministic contexts)

    An optimal contexts is a contexts which has produced more good predictions than bad predictions.

    Optimal contexts will only be implemented in lzp, at least in this article, however it's probable that they'll work for ppm too.

    Before starting to explain the different models, you must know this: a given contexts (let's say "th") can have different offsets which provide different matches. (the context "th" can predict "e ", "ere ", "us ", "an ", "ink "...) the goal of the schemes which I'll present now, is to get the best of them.
     

    Implementing deterministic contexts and optimal contexts
    Once we know that there are some offsets which work better than others, how can we build our model so it captures it?
    Let's think about some facts of lzp: We can't change when a offset is inserted for first time (mainly because we don't have any information about it, and also because we need to fill the table) Then we only have two more cases dealing with offset in a given context: when it's updated and deleted.

    1. Updating. For implementing deterministic contexts and optimal contexts, we'll never update offsets. (in a standard versions of lzp as defined before, the offset of a given context is always updated)
    2. Deleting. We delete an offset when it has produced a wrong match, this leads the model to adapt to the incoming data, however we could define this as the greedy approach. So here it is where we can capture the best offsets.
    First let's implement deterministic contexts, and then implementing optimal contexts will be easier. What we have to do is to save along with the offset in a given context a flag, which may have two states:
    1. No deterministic context. (you may choose what value represent them)
    2. Deterministic contexts.
    (note that if we care about memory requirements, we could pack them, a byte for every 8 contexts) And that's the way we use it: This scheme produces already produce better results than an standard lzp model, check the table which is at the end of this section. Also note that using this scheme is fast, the only problem is (perhaps) the memory needed. (one byte more for every context if we don't pack it)

    For optimal contexts there's a main scheme, scheme 1, and then variations of the same, which proved to not be better. (but are presented to show you different ideas)

    Optimal contexts scheme 1:

    This is the main scheme for optimal contexts, note that an offset isn't updated till it proves that it is not good anymore. While it produces enough matches, we keep it. And now let's have a look at the different versions:

    Optimal contexts scheme 2:

    Optimal contexts scheme 3: Optimal contexts scheme 4: The scheme 1 for optimal contexts produce good results, and given the information which theirs state bytes give, (good_predictions and bad_predictions) one could create a scheme for assigning the right probability to the escape code (meaning no match) however this will need an arithmetic coder, which in most of the cases, one can't use, also the time needed in computing so, will make it slower, however here I see a new frontier which remains unexplored. (at least till I start to work on it) But this idea has not been explored yet.

    Also there's another question: Are optimal contexts optimal? While this may seem a paradox, this is a serious question. Let's say we have a text like "the ... there ... the...there ... the..." (where "..." means that there is other text) Here we can see that the scheme for optimal contexts does his job, to not discard an optimal context, "th"->"e " (where "th" is the context and "e " the match that it can produce) but while it's an optimal context, it doesn't produce more bytes matches than "th"->"ere " can produce (twice the bytes matched) so the offset "th"->"ere " is optimal while "th"->"e " is not (even if it's an optimal context).

    However now the question arises: How to keep track of the bytes matched from offsets in a given context which aren't currently used? This would need to store more than one pointer, and check them. Probably this scheme will not produce a lot better results, and I'm sure that it will be more slower. Another solution could be to save the number of matched bytes in a given context, and also assign to the literals a given length, and then do the same as in optimal contexts, however this remains unexplored. Anyway it doesn't seem possible to do this without needing more speed and memory than optimal contexts do.
     

    Results for updating schemes
    In all this examples I only care about the model and the probabilities which it produces, while it's not a ppm model which directly has the probability of a symbol, one can at least know how a given coder will perform given the probability distribution.
     

     
    0
    1
    2
    3
    4
    5
    BIB 59 27 56 57 31 59 56 33 61 57 32 60 56 33 61 56 33 61
    BOOK1 83 24 36 82 27 40 82 30 43 82 29 42 82 30 43 82 30 43
    BOOK2 69 26 48 68 29 51 68 32 53 68 31 52 68 32 53 69 32 53
    GEO 79 23 25 79 23 26 79 23 26 79 23 26 79 23 25 79 23 25
    NEWS 63 24 51 62 27 53 62 29 55 62 28 54 63 29 54 64 29 53
    OBJ2 43 35 67 46 39 67 47 38 66 47 38 66 48 38 65 49 37 64
    PAPER1 64 27 50 63 31 54 62 33 56 62 32 55 63 33 55 63 32 54
    PAPER2 73 26 45 71 30 48 71 33 51 71 32 50 71 33 51 71 32 50
    PIC 15 27 88 18 38 88 20 48 89 17 31 88 20 48 89 21 47 89
    PROGC 52 31 60 52 35 62 53 37 62 53 36 62 54 37 61 55 36 60
    PROGL 39 33 72 40 38 74 42 40 73 42 39 73 43 40 73 44 40 72
    PROGP 33 38 77 35 43 77 37 45 77 36 44 77 38 44 76 39 44 75
    TRANS 32 33 77 32 37 78 33 39 78 32 37 78 34 39 77 35 39 77
    TOTAL 58 26 55 58 30 57 59 32 59 58 31 58 59 32 58 60 32 58
    There are 6 different schemes:
    0- Default lzp.
    1- Deterministic contexts scheme.
    2- Optimal contexts scheme 1.
    3- Optimal contexts scheme 2.
    4- Optimal contexts scheme 3.
    5- Optimal contexts scheme 4.
    And three different columns:
    1- Predictions (%)
    2- Good predictions (%)
    3- Matched bytes (%)

     

    Full results of different contexts are showed now for better understanding.
     
    Default
    Predictions Good Bad Matched
    BIB 69810 19079 50731 65276
    BOOK1 654998 155561 499437 284127
    BOOK2 432742 110959 321783 301605
    GEO 81047 18664 62383 26107
    NEWS 242789 58229 184560 198295
    OBJ2 106140 37588 68552 166225
    PAPER1 34780 9358 25422 27430
    PAPER2 60990 15894 45096 37491
    PIC 77268 20517 56751 453606
    PROGC 21247 6617 14630 24719
    PROGL 29137 9690 19447 53241
    PROGP 16930 6350 10580 39308
    TRANS 29992 9981 20011 72516
    TOTAL 1857870 478487 1379383 1749946
    AVERAGE 142913 36807 106106 134611
    % 26 74
    Detc.
    Predictions Good Bad Matched
    BIB 67334 20626 46708 69299
    BOOK1 646152 176859 469293 314271
    BOOK2 426438 125267 301171 322217
    GEO 81050 18798 62252 26238
    NEWS 241101 65245 175856 206999
    OBJ2 112993 43713 69280 165497
    PAPER1 34067 10474 23593 29259
    PAPER2 59691 17732 41959 40628
    PIC 90205 34183 56022 454194
    PROGC 21531 7549 13982 25367
    PROGL 29783 11454 18329 54359
    PROGP 18152 7806 10346 39542
    TRANS 30371 11368 19003 73773
    TOTAL 1858868 551074 1307794 1821643
    AVERAGE 142990 42390 100600 140126
    % 30 70
    Opc. s1
    Predictions Good Bad Matched
    BIB 65913 21690 44223 71784
    BOOK1 640496 193362 447134 336430
    BOOK2 424905 136046 288859 334529
    GEO 81031 18865 62166 26324
    NEWS 240289 70395 169894 212961
    OBJ2 116465 44579 71886 162891
    PAPER1 33561 11136 22425 30427
    PAPER2 59516 19414 40102 42485
    PIC 103630 50059 53571 456823
    PROGC 21621 7947 13674 25675
    PROGL 30982 12522 18460 54228
    PROGP 19123 8647 10476 39412
    TRANS 30834 11907 18927 73600
    TOTAL 1868366 606569 1261797 1867569
    AVERAGE 143720 46659 97061 143659
    % 32 68
    Opc. s2
    Predictions Good Bad Matched
    BIB 66721 21294 45427 70580
    BOOK1 642739 187364 455375 328189
    BOOK2 427430 132949 294481 328907
    GEO 81026 18721 62305 26185
    NEWS 240220 67385 172835 210020
    OBJ2 115824 44294 71530 163247
    PAPER1 33781 10766 23015 29837
    PAPER2 59612 18816 40796 41791
    PIC 85345 26363 58982 451375
    PROGC 21723 7748 13975 25374
    PROGL 30949 11963 18986 53702
    PROGP 18611 8098 10513 39375
    TRANS 30336 11306 19030 73496
    TOTAL 1854317 567067 1287250 1842078
    AVERAGE 142640 43621 99019 141698
    % 31 69
    Opc. s3
    Predictions Good Bad Matched
    BIB 65946 21760 44186 71821
    BOOK1 640557 193507 447050 336514
    BOOK2 428085 137806 290279 333109
    GEO 81153 18677 62476 26014
    NEWS 244923 70889 174034 208821
    OBJ2 118735 44662 74073 160704
    PAPER1 34318 11214 23104 29748
    PAPER2 59469 19376 40093 42494
    PIC 103565 49267 54298 455927
    PROGC 22354 8179 14175 25174
    PROGL 31430 12694 18736 53952
    PROGP 19282 8433 10849 39039
    TRANS 31937 12394 19543 72983
    TOTAL 1881754 608858 1272896 1856300
    AVERAGE 144750 46835 97915 142792
    % 32 68
    Opc. s4
    Predictions Good Bad Matched
    BIB 65927 21665 44262 71745
    BOOK1 642670 193821 448849 334715
    BOOK2 429632 136889 292743 330645
    GEO 81150 18306 62844 25646
    NEWS 248970 71411 177559 205296
    OBJ2 121540 45194 76346 158431
    PAPER1 34309 10973 23336 29516
    PAPER2 59887 19460 40427 42160
    PIC 105427 50013 55414 454913
    PROGC 22751 8233 14518 24831
    PROGL 32177 12831 19346 53342
    PROGP 20019 8853 11166 38722
    TRANS 33195 13099 20096 72683
    TOTAL 1897654 610748 1286906 1842645
    AVERAGE 145973 46981 98993 141742
    % 32 68
    And let's see the bpb (bits per byte, (compressed_length/uncompressed_length)*8) for all of them:
     
     
    Default Encoded bpb
    Literals 324883 198541 4,889
    Match l. 110959 33934 2,447
    Flags 54091
    Total 286566 3,659
     
    Detc 1 Encoded bpb
    Literals 304216 187926 4,942
    Match l. 125196 36985 2,363
    Flags 53288
    Total 278199 3,552
     
    Opt c.S1 Encoded bpb
    Literals 291959 181573 4,975
    Match l. 136046 38815 2,282
    Flags 53112
    Total 273500 3,492
    Opt c.S2 Encoded bpb
    Literals 297581 184679 4,965
    Match l. 132949 38148 2,295
    Flags 53428
    Total 276255 3,528
    Opt c.S3 Encoded bpb
    Literals 293379 182693 4,982
    Match l. 137806 38816 2,253
    Flags 53509
    Total 275018 3,512
    Opt c.S4 Encoded bpb
    Literals 295843 184330 4,985
    Match l. 136889 38609 2,256
    Flags 53703
    Total 276642 3,533
     
    Models Bpb Improvement
    1- Default 3,659 ...
    2- Deterministic contexts 3,552  0,107
    3- Optimal contexts s1 3,492  0,167
    4- Optimal contexts s2 3,528  0,131
    5- Optimal contexts s3 3,512  0,147
    6- Optimal contexts s4 3,533  0,126
    Bpb of the different models
    Bpb of the different models
    Bpb of encoded literals
    Bpb of encoded literals
    Bpb of encoded match lengths
    Bpb of encoded match lengths
    Optimal contexts scheme 1 is the best model for lzp, while using o2 (Note that all those results are for o2). Other orders should be explored to see how it impacts them, however the idea of this models is to choose the best offsets among some of them in a given contexts, but in higher orders, (due to the zero frequency problem) we have less offsets, so probably these models will have less impact. All those results are for book2.
     

    Lzp with optimal contexts scheme 1 pseudo code
    This is the pseudo coder for it.

    Match lengths
    This idea was proposed by Charles Bloom. It's based on the fact that in a given context the same match tends to occur often, so the same match length is coded. We try to capture it with differential coding, you know, coding the current value as the difference with the previous one. Charles encoded a no match flag if the match was below than the saved one. However I think that one can do it better (in both time and compression) using the overflow of the bytes itself. And differential code a byte value with another byte, and let an encoder do the rest of the job.

    Note that for implementing this, we have to save in every context one byte (or whatever we use for match lengths) which is the last match length.

    Here there are two graphics, the first is the match lengths outputted (at its done in an standard lzp model) and the second is the match lengths which were first differential coded. (note that both of them can be coded with the same routines, so this doesn't add complexity, and it just takes a little bit of time)
     

    Match lengths outputted
    Differential coded match lengths

    Note that the scale is logarithmic adjusted, to make it fit in the pic.We can learn more seeing the output:
    (click here to go to the end of this table and read conclusions) All the results come from book2.
     
     
    Match lenghts  Differential coded 
    0: 0
    1: 52514
    2: 24379
    3: 13609
    4: 8450
    5: 5803
    6: 4227
    7: 3075
    8: 2148
    9: 1328
    10: 952
    11: 762
    12: 639
    13: 478
    14: 325
    15: 227
    16: 232
    17: 162
    18: 118
    19: 115
    20: 110
    21: 89
    22: 83
    23: 56
    24: 44
    25: 39
    26: 49
    27: 28
    28: 41
    29: 19
    30: 19
    31: 14
    32: 11
    33: 15
    34: 13
    35: 12
    36: 13
    37: 10
    38: 10
    39: 12
    40: 9
    41: 14
    42: 7
    43: 5
    44: 2
    45: 7
    46: 9
    47: 6
    48: 6
    49: 7
    50: 8
    51: 5
    52: 5
    53: 6
    54: 2
    55: 2
    56: 5
    57: 7
    58: 2
    59: 4
    60: 5
    61: 4
    62: 6
    63: 2
    64: 3
    65: 0
    66: 2
    67: 1
    68: 4
    69: 2
    70: 1
    71: 2
    72: 4
    73: 0
    74: 3
    75: 6
    76: 1
    77: 0
    78: 1
    79: 2
    80: 3
    81: 1
    82: 2
    83: 1
    84: 0
    85: 0
    86: 2
    87: 2
    88: 4
    89: 1
    90: 1
    91: 3
    92: 2
    93: 0
    94: 8
    95: 3
    96: 0
    97: 0
    98: 1
    99: 1
    100: 0
    101: 1
    102: 1
    103: 0
    104: 1
    105: 1
    106: 0
    107: 0
    108: 0
    109: 0
    110: 2
    111: 0
    112: 0
    113: 0
    114: 0
    115: 0
    116: 0
    117: 0
    118: 1
    119: 0
    120: 0
    121: 0
    122: 0
    123: 0
    124: 0
    125: 0
    126: 0
    127: 0
    128: 0
    129: 0
    130: 0
    131: 0
    132: 0
    133: 1
    134: 1
    135: 0
    136: 0
    137: 0
    138: 0
    139: 0
    140: 0
    141: 0
    142: 0
    143: 0
    144: 0
    145: 0
    146: 0
    147: 0
    148: 0
    149: 0
    150: 0
    151: 0
    152: 0
    153: 0
    154: 0
    155: 0
    156: 0
    157: 0
    158: 0
    159: 0
    160: 0
    161: 0
    162: 0
    163: 0
    164: 0
    165: 0
    166: 0
    167: 0
    168: 0
    169: 0
    170: 0
    171: 0
    172: 0
    173: 0
    174: 0
    175: 0
    176: 0
    177: 0
    178: 0
    179: 0
    180: 0
    181: 0
    182: 0
    183: 0
    184: 0
    185: 0
    186: 0
    187: 0
    188: 0
    189: 0
    190: 0
    191: 0
    192: 0
    193: 0
    194: 0
    195: 0
    196: 0
    197: 0
    198: 0
    199: 0
    200: 0
    201: 0
    202: 0
    203: 0
    204: 0
    205: 0
    206: 0
    207: 0
    208: 0
    209: 0
    210: 0
    211: 0
    212: 0
    213: 0
    214: 0
    215: 0
    216: 0
    217: 0
    218: 0
    219: 0
    220: 0
    221: 0
    222: 0
    223: 0
    224: 0
    225: 0
    226: 0
    227: 0
    228: 0
    229: 0
    230: 0
    231: 0
    232: 0
    233: 0
    234: 0
    235: 0
    236: 0
    237: 0
    238: 0
    239: 0
    240: 0
    241: 0
    242: 0
    243: 0
    244: 0
    245: 0
    246: 0
    247: 0
    248: 0
    249: 0
    250: 0
    251: 0
    252: 0
    253: 0
    254: 0
    255: 0
    0: 92562
    1: 10169
    2: 3922
    3: 2087
    4: 1430
    5: 882
    6: 580
    7: 355
    8: 219
    9: 172
    10: 131
    11: 88
    12: 63
    13: 52
    14: 39
    15: 39
    16: 30
    17: 19
    18: 26
    19: 16
    20: 15
    21: 11
    22: 17
    23: 12
    24: 5
    25: 18
    26: 8
    27: 9
    28: 3
    29: 2
    30: 4
    31: 8
    32: 5
    33: 7
    34: 4
    35: 4
    36: 4
    37: 2
    38: 7
    39: 5
    40: 2
    41: 3
    42: 1
    43: 1
    44: 1
    45: 1
    46: 1
    47: 3
    48: 0
    49: 1
    50: 0
    51: 1
    52: 1
    53: 1
    54: 1
    55: 0
    56: 1
    57: 1
    58: 2
    59: 4
    60: 1
    61: 1
    62: 1
    63: 0
    64: 0
    65: 1
    66: 1
    67: 1
    68: 1
    69: 1
    70: 0
    71: 0
    72: 1
    73: 0
    74: 0
    75: 0
    76: 2
    77: 0
    78: 2
    79: 0
    80: 0
    81: 0
    82: 0
    83: 1
    84: 0
    85: 0
    86: 2
    87: 1
    88: 3
    89: 1
    90: 1
    91: 0
    92: 1
    93: 1
    94: 0
    95: 0
    96: 0
    97: 0
    98: 0
    99: 1
    100: 0
    101: 0
    102: 0
    103: 0
    104: 0
    105: 0
    106: 1
    107: 0
    108: 0
    109: 0
    110: 0
    111: 0
    112: 0
    113: 0
    114: 0
    115: 0
    116: 0
    117: 0
    118: 0
    119: 0
    120: 0
    121: 0
    122: 0
    123: 0
    124: 0
    125: 0
    126: 0
    127: 0
    128: 0
    129: 0
    130: 0
    131: 0
    132: 0
    133: 0
    134: 0
    135: 0
    136: 0
    137: 0
    138: 0
    139: 0
    140: 0
    141: 0
    142: 0
    143: 0
    144: 0
    145: 0
    146: 0
    147: 0
    148: 0
    149: 0
    150: 1
    151: 0
    152: 0
    153: 0
    154: 0
    155: 0
    156: 0
    157: 0
    158: 0
    159: 0
    160: 0
    161: 0
    162: 0
    163: 0
    164: 0
    165: 1
    166: 0
    167: 0
    168: 3
    169: 1
    170: 0
    171: 0
    172: 0
    173: 0
    174: 0
    175: 1
    176: 1
    177: 0
    178: 0
    179: 1
    180: 1
    181: 0
    182: 0
    183: 0
    184: 0
    185: 1
    186: 0
    187: 0
    188: 0
    189: 2
    190: 0
    191: 0
    192: 1
    193: 0
    194: 0
    195: 1
    196: 0
    197: 2
    198: 3
    199: 0
    200: 0
    201: 2
    202: 0
    203: 0
    204: 0
    205: 0
    206: 0
    207: 0
    208: 0
    209: 1
    210: 0
    211: 1
    212: 0
    213: 2
    214: 1
    215: 2
    216: 1
    217: 1
    218: 8
    219: 1
    220: 0
    221: 2
    222: 4
    223: 7
    224: 1
    225: 3
    226: 2
    227: 2
    228: 2
    229: 2
    230: 4
    231: 15
    232: 5
    233: 7
    234: 7
    235: 3
    236: 12
    237: 6
    238: 13
    239: 7
    240: 14
    241: 16
    242: 25
    243: 32
    244: 35
    245: 45
    246: 86
    247: 133
    248: 153
    249: 263
    250: 459
    251: 713
    252: 1219
    253: 1754
    254: 3301
    255: 9029

    One can easily see that in most of the cases the same match length occurs (a 0 differential coded byte) or little changes are done, this does not only helps to have a more skewed probabilities (where some symbols have higher probabilities than others) but also gives a very high probability to one symbol, the 0. In the default lzp, the symbol with highest probability is 1 (meaning a match length of value 1) with a probability of 52514 and we have 120441 match lengths, so its entropy is: ent ( 52514/120441 ) = 1,197 bits while when we differential code the match lengths, the symbol with highest probability, 0, has an entropy of: en ( 92562/120441) = 0,379 bits. One can clearly see why it's worth to use differential coding in match lengths, and it only adds one byte more fore every context.

    I haven't checked to discard the offset if the match was below than the previous, so I don't know whatever it works better or worser. But probably it will hurt compression. (I don't think it's a good idea to discard matches)

    Charles also proposed me another idea, to output the number without caring about the sign (I mean all the values are positive, so -1 becomes 1), and then a bit with the sign, this idea has not been tested, however I think that it will not get better results than the one presented by me, because positive numbers have higher probabilities, thus mixing them and output a bit more, will not be better. In case you want to implement this, I recommend first outputting the number itself, and in case that it's 0 (in most of the cases) don't output the flag after it.
     

    More than one offset
    It's clear than lzp in its quest for choosing only one offset, is losing the chance of more matches, so I thought, that it will be worth to analyze how lzp performs with more than one offset.

    I'll present the results for lzp o2, which keeps track of 2 , 4 or 8 offsets, not the last offsets, but rather the last best offsets, a simple mtf scheme, can easily do that. This is a greedy approach too, it reads the offset 0 first and goes till the last one, whenever it gets one which gives a match, it uses it, without testing if this is the best one. (I mean the offset which gives the longest match)
     
    Using one offsets (default mode)
    times used % Entropy
    Offset 0 110959 100 0
    Good 110959 25
    Bad 321783 75
    Predictions  432742 69
    Matched 301605 48
     
    Using two offsets
     
    % Entropy
    Offset 0 93155 61 0,7235
    Offset 1 60656 39 1,3424
    Total 153811 18603
     
    Good 153811 40
    Bad 231997 60
    Predictions 385808 62
    Matched 394491 63
    Using four offsets
     
    % Entropy
    Offset 0 80306 40 1,3058
    Offset 1 52505 26 1,9188
    Offset 2 37536 19 2,403
    Offset 3 28184 14 2,8164
    Total 198531 46899
     
    Good 198531 58
    Bad 146518 42
    Predictions 345049 55
    Matched 479970 77
    Using height offsets
     
    % Entropy
    Offset 0 74500 31 1,6909
    Offset 1 48304 20 2,316
    Offset 2 34151 14 2,8163
    Offset 3 25688 11 3,2271
    Offset 4 19633 8 3,6149
    Offset 5 15512 6 3,9548
    Offset 6 12461 5 4,2708
    Offset 7 10288 4 4,5472
    Total 240537 81155
    Good 240537 75
    Bad 80304 25
    Predictions 320841 51
    Matched 546184 87

    Some plots:
     
    % of good predictions
     % of good predictions
    % of matched bytes
     % of matched bytes
    Bytes for coding the offset
     Bytes for coding the offset
    It's not difficult to see that using more offsets, leads to more bytes matched, (more good predictions too) but of course we need more bytes to choose among the different offsets. Of course we've matched more (almost most of the) literals.
    (plots don't take in account using only one offset)

    And how much every model will compress book2 if we assume literals to be coded to 4bpb and match lengths to 2bpb:
     
    1 offset
    Literals: 162441
    Matchs: 55479
    Offset: 0
    Flag 1: 13869
    Flag 0: 40222
    Total: 272011
    Bpb: 3,473
    2 offset
    Literals: 115998
    Matchs: 38452
    Offset: 18603
    Flag 1: 19226
    Flag 0: 28999
    Total: 221278
    Bpb: 2,825
    4 offsets
    Literals: 73259
    Matchs: 49632
    Offset: 46899
    Flag 1: 30067
    Flag 0: 10038
    Total: 209895
    Bpb: 2,680
    8 offsets
    Literals: 40152
    Matchs: 60134
    Offset: 81155
    Flag 1: 30067
    Flag 0: 10038
    Total: 221546
    Bpb: 2,829

    Using more offsets is worth the job, only using one more can improve the bpb in 0.648! Also note that we are only using o2, and also that our assumption of coding match lengths to 2bpb is a little bit pessimistic (I've proved that more than half of them can be coded down to 0,3bpb) Unfortunately I don't see a way of choosing a given offset, and thus assigning it a higher probability, though I'm doing some research in giving probability to the flags.
     

    No match literals
    This is another improvement, this is not meant to improve the compression (though it does it, but only a little bit) but instead to make the encoder faster by reducing the number of operations to code a literal.

    The idea is based in this: if under a given context the same phrases tend to occur, then it's very probable that the same literal occurs too.

    Because this idea is not very good, it will not be further developed, and results will not be shown.

    Note that this scheme of saving the last seen unmatched literal in a context is like saving an offset to the last seen unmatched position and to not allow the match length to exceed one.
     

    Encoders
    You can use the encoder that you want, it's up to you. If you want to get more compression at cost of speed, you should think about coding the flags with an arithmetic coder, because usually the probability of a flag isn't 50% (which the entropy codes to 1 bit, just like we do in standard lzp) but rather  30% and 70% or so, it depends on the model you use. And if you want speed you better use Huffman. If you really want to get high compression, no matter what happens, then you better use any ppm variant for coding of the literals, or plain order-0 arithmetic coding. If you want a good tradeoff between compression and speed, a good choice is the range coder.

    References
    "LZP: a new data compression algorithm" Charles Bloom (lzp.ps)
    "Using Prediction to Improve LZ77 Coders" Charles Bloom (lzp_old.ps)
    "Lzp" Dario Phong (dp_lzp.html)
    "Lzp order-3 with linked lists" Dario Phong (dp_lzp_o3_ll.html)

    The article of Charles Bloom are at his home page: http://www.cbloom.com
    And mines are at Wisdom Vault: http://www.arturocampos.com also there you'll found articles about other topics of data compression, such as encoders or other algorithms.
     

    Closing words
    This article isn't fully finished, as soon as I have more time, I'll improve some sections or add new ones, if needed. I.e.: it will be good to see how all orders are affected with the different updating schemes, using more than one offset, differential coding of match lengths outputting the sign flag, and more things. However I'll be busy with other projects, so if you want me to write the next version of this article you better email me.

    Conclusion of my research till the date: I've developed new schemes for the models and investigated other things, and now I can safely say: lzp is very close to its own own limits, and for fast implementation nothing else should be added to it.
    It will never reach the compression ratios of ppm, for the way it adapts to the data. Also it will never achieve the ratios of bwt, as it can't access orders that high. Anyway it seems like one of the best algorithms for getting good compression at high speeds, though it must be finely tuned.

    Thanks to Charles Bloom for all the help and ideas that he provided me about lzp. If you want to discuss about lzp, email me.
     

    Contacting the author
    You can reach me via email at: arturo@arturocampos.com  Also don't forget to visit my home page http://www.arturocampos.com where you can find more and better info. See you in the next article!
     
     

    Arturo San Emeterio Campos, Barcelona 04-Aug-1999


    This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.