Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved.
Permission is granted to make verbatim copies of this document for private
use only.
Download
Download the whole article zipped,
with the figures also.
Table of contents
Introduction
In this article it's presented all my research till now in lzp. Some
standard models are presented and briefly analyzed. Also improvements for
any kind of lzp model are presented: deterministic contexts and optimal
contexts. The idea of using more than one offset is seen too, though it's
not very explored. Enough results are given, so you can choose the more
appropriate model for your compressor. This article is mainly focused on
models, encoders are briefly discussed too.
This article assumes knowledge of lzp, different implementations of it, and ppm. (at least the theory behind it)
If the tables from this article aren't properly displayed, I could send
you or upload to my page the tables in plain text version. Anyway I encourage
to view this article in Netscape, instead of Internet Explorer. 800x600
is recommended.
Lz
We can classify the compression of text in two big worlds, Markov models
and Lz dictionary compressors, both of them do the same: assign less bits
to more common events; thus some researchers explained Lz schemes thru
Markov eyes. Both schemes are different: one provides high compression,
the other one high speed. The question raised in the brains of some researchers.
Time passed and Ross Williams did his approach with lzrw, the idea in theory worked, also lzrw became one of the fastest Lz compressor, though it did not had impressive compression ratios. But it was the first step to the hybrid between contexts coders and Lz.
Charles Bloom did the next step: Lzp. Which seems to be the best approach of Markov and Lz, high speed and high compression, which can be tuned in order to fit your needs. His paper presented lzp and the most important implementations of it. However one can miss an explanation of what's happening behind it or more results. Most of the implementations had in mind speed, however some of them had a very low memory requirements, and one was better than some ppm variants. (but it uses ppmc as a coder for literals, thus dramatically reducing the speed of the whole compressor)
Peter Fenwick did some research on it, but he focused it on symbol ranking.
And then here it comes my work, with the hope that it can further develop
lzp.
Offset probability
Can we see parsing as a markov model? yes, we can, but this is not
the point of this article, (see articles from Ross Williams) let's analyze
them. Let's say we are under an lz77 like model, and we have parsed "baa",
this will produce the following phrases with the following probabilities:
| Note that lz77 has the suffix property: The suffixes of a phrase in a given dictionary are phrases of the dictionary too. The entropy (under a simple order-0 model) of every phrase is 2.32 bits, and effectively, we can't represent it with less than 2.3 bits, well, in fact we'll need 2 bits for the offset and two bits more for the length of the phrase (which is more than we should assign, one can clearly see that this model isn't the best one), even we could assign a value to every phrase, and thus coding them with only 3 bits. |
|
A standard version of Lzp solves this problem assigning a probability
of 1/2 to every offset from a given context (the other 1/2 is reserved
to tell the decompressor that this was a failed match), so it needs ent
( 1/2 ) = 1 bit. In some cases lzp may even assign a probability of 1/1
to a given symbol, but then in the case of no match we have to encode
a 0 match length and a literal. (which may result in degradation of the
total compression)
Standard Lzp
We should define a standard implementation of lzp. The pseudo code
of it will look like the following:
Orders
If we want to do good models, we have to know how many of them can
we use, in this I'll discuss and analyze different orders (the number of
symbols which are used as context). We can use different orders, every
of them will give different results, but first we have to define a set
of standard orders. The symbol it's assumed to be a byte.
Result of orders
Now that we have defined different contexts, let's see how they perform.
|
1- O1
2 - O2 3 - O2h 4 - O2hc 5 - O3 6 - O3h 7 - O3hc 8 - O4 9 - O5 10 - O6 11 - O7 12 - O8 Numbers for graphics. |
A little explanation of this table: Results come from book2. A bad prediction is one which produce no match (0 match length) a good prediction is one which is not bad. (at the right the % of the good predictions, I mean (good/(good+bad))*100) The column "predictions" holds the number of times that under a given contexts we had an offset (and then we have to check for match). At his right there's the %, in respect with the file size of book2. (626,490 bytes) In the column "matched" you have the total bytes that could be matched. (the sum of all match lengths) And at its right the % of bytes matched on all the file. ((matched/file_size)*100)
Let's see some graphics:
|
Here you can see the % of predictions ((predictions/file_length)*100)
As you can see the higher the order, the less predictions, this is due to the zero frequency problem (when we haven't seen any context of the same kind yet). Have in mind that all those tables don't take in account only different orders, but also different ways of accessing them. (I.e.: O2, O2h, and O2hc, all of them use the same order, but the way an offset is stored changes) |
|
This is the % of good predictions ((good_predictions/total_predictions)*100)
Obviously also, the higher the order, the more reliable the predictions are. Higher orders than 8 were not explored, it's a know fact that orders above 8 don't make any real improvement in text compression. (and the resources for accessing them will be higher and it will be far slower too) |
|
That graphic represents the % of matched bytes (as defined above) ((matched_bytes/file_length)*100) |
Conclusions: Low orders produce a lot of wrong predictions, and higher orders are very effected by the zero frequency problem. The best trade off (at least for book2) between memory/speed/compression seems to be o4, however order-5 seems good too.
With all this information you should be able to choose the order which
best suits your needs.
Deterministic
contexts and optimal contexts
Some time ago (I mean around June, 1999), in private communication
with Charles Bloom, he explained me what deterministic contexts are, they
opened a new set of models, which where not done yet. So I started to work
on it. Here I present all my approaches.
But first let's define what is a deterministic context: A deterministic context is a context which has not produced any bad prediction. (a bad prediction is defined above)
My first scheme to handle deterministic contexts was only based on this information, however I developed some improved versions of it, which don't use deterministic contexts, so a new term should be introduced. (and I base this in the fact that for lzp they are better than deterministic contexts)
An optimal contexts is a contexts which has produced more good predictions than bad predictions.
Optimal contexts will only be implemented in lzp, at least in this article, however it's probable that they'll work for ppm too.
Before starting to explain the different models, you must know this:
a given contexts (let's say "th") can have different offsets which provide
different matches. (the context "th" can predict "e ", "ere ", "us ", "an
", "ink "...) the goal of the schemes which I'll present now, is to get
the best of them.
Implementing
deterministic contexts and optimal contexts
Once we know that there are some offsets which work better than others,
how can we build our model so it captures it?
Let's think about some facts of lzp: We can't change when a offset
is inserted for first time (mainly because we don't have any information
about it, and also because we need to fill the table) Then we only have
two more cases dealing with offset in a given context: when it's updated
and deleted.
For optimal contexts there's a main scheme, scheme 1, and then variations of the same, which proved to not be better. (but are presented to show you different ideas)
Optimal contexts scheme 1:
Optimal contexts scheme 2:
Also there's another question: Are optimal contexts optimal? While this may seem a paradox, this is a serious question. Let's say we have a text like "the ... there ... the...there ... the..." (where "..." means that there is other text) Here we can see that the scheme for optimal contexts does his job, to not discard an optimal context, "th"->"e " (where "th" is the context and "e " the match that it can produce) but while it's an optimal context, it doesn't produce more bytes matches than "th"->"ere " can produce (twice the bytes matched) so the offset "th"->"ere " is optimal while "th"->"e " is not (even if it's an optimal context).
However now the question arises: How to keep track of the bytes matched
from offsets in a given context which aren't currently used? This would
need to store more than one pointer, and check them. Probably this scheme
will not produce a lot better results, and I'm sure that it will be more
slower. Another solution could be to save the number of matched bytes in
a given context, and also assign to the literals a given length, and then
do the same as in optimal contexts, however this remains unexplored. Anyway
it doesn't seem possible to do this without needing more speed and memory
than optimal contexts do.
Results
for updating schemes
In all this examples I only care about the model and the probabilities
which it produces, while it's not a ppm model which directly has the probability
of a symbol, one can at least know how a given coder will perform given
the probability distribution.
|
|
|
|
|
|
|
|||||||||||||||||||
| BIB | 59 | 27 | 56 | 57 | 31 | 59 | 56 | 33 | 61 | 57 | 32 | 60 | 56 | 33 | 61 | 56 | 33 | 61 | ||||||
| BOOK1 | 83 | 24 | 36 | 82 | 27 | 40 | 82 | 30 | 43 | 82 | 29 | 42 | 82 | 30 | 43 | 82 | 30 | 43 | ||||||
| BOOK2 | 69 | 26 | 48 | 68 | 29 | 51 | 68 | 32 | 53 | 68 | 31 | 52 | 68 | 32 | 53 | 69 | 32 | 53 | ||||||
| GEO | 79 | 23 | 25 | 79 | 23 | 26 | 79 | 23 | 26 | 79 | 23 | 26 | 79 | 23 | 25 | 79 | 23 | 25 | ||||||
| NEWS | 63 | 24 | 51 | 62 | 27 | 53 | 62 | 29 | 55 | 62 | 28 | 54 | 63 | 29 | 54 | 64 | 29 | 53 | ||||||
| OBJ2 | 43 | 35 | 67 | 46 | 39 | 67 | 47 | 38 | 66 | 47 | 38 | 66 | 48 | 38 | 65 | 49 | 37 | 64 | ||||||
| PAPER1 | 64 | 27 | 50 | 63 | 31 | 54 | 62 | 33 | 56 | 62 | 32 | 55 | 63 | 33 | 55 | 63 | 32 | 54 | ||||||
| PAPER2 | 73 | 26 | 45 | 71 | 30 | 48 | 71 | 33 | 51 | 71 | 32 | 50 | 71 | 33 | 51 | 71 | 32 | 50 | ||||||
| PIC | 15 | 27 | 88 | 18 | 38 | 88 | 20 | 48 | 89 | 17 | 31 | 88 | 20 | 48 | 89 | 21 | 47 | 89 | ||||||
| PROGC | 52 | 31 | 60 | 52 | 35 | 62 | 53 | 37 | 62 | 53 | 36 | 62 | 54 | 37 | 61 | 55 | 36 | 60 | ||||||
| PROGL | 39 | 33 | 72 | 40 | 38 | 74 | 42 | 40 | 73 | 42 | 39 | 73 | 43 | 40 | 73 | 44 | 40 | 72 | ||||||
| PROGP | 33 | 38 | 77 | 35 | 43 | 77 | 37 | 45 | 77 | 36 | 44 | 77 | 38 | 44 | 76 | 39 | 44 | 75 | ||||||
| TRANS | 32 | 33 | 77 | 32 | 37 | 78 | 33 | 39 | 78 | 32 | 37 | 78 | 34 | 39 | 77 | 35 | 39 | 77 | ||||||
| TOTAL | 58 | 26 | 55 | 58 | 30 | 57 | 59 | 32 | 59 | 58 | 31 | 58 | 59 | 32 | 58 | 60 | 32 | 58 |
| There are 6 different schemes:
0- Default lzp. 1- Deterministic contexts scheme. 2- Optimal contexts scheme 1. 3- Optimal contexts scheme 2. 4- Optimal contexts scheme 3. 5- Optimal contexts scheme 4. |
And three different columns:
1- Predictions (%) 2- Good predictions (%) 3- Matched bytes (%)
|
| Default | ||||
| Predictions | Good | Bad | Matched | |
| BIB | 69810 | 19079 | 50731 | 65276 |
| BOOK1 | 654998 | 155561 | 499437 | 284127 |
| BOOK2 | 432742 | 110959 | 321783 | 301605 |
| GEO | 81047 | 18664 | 62383 | 26107 |
| NEWS | 242789 | 58229 | 184560 | 198295 |
| OBJ2 | 106140 | 37588 | 68552 | 166225 |
| PAPER1 | 34780 | 9358 | 25422 | 27430 |
| PAPER2 | 60990 | 15894 | 45096 | 37491 |
| PIC | 77268 | 20517 | 56751 | 453606 |
| PROGC | 21247 | 6617 | 14630 | 24719 |
| PROGL | 29137 | 9690 | 19447 | 53241 |
| PROGP | 16930 | 6350 | 10580 | 39308 |
| TRANS | 29992 | 9981 | 20011 | 72516 |
| TOTAL | 1857870 | 478487 | 1379383 | 1749946 |
| AVERAGE | 142913 | 36807 | 106106 | 134611 |
| % | 26 | 74 | ||
| Detc. | ||||
| Predictions | Good | Bad | Matched | |
| BIB | 67334 | 20626 | 46708 | 69299 |
| BOOK1 | 646152 | 176859 | 469293 | 314271 |
| BOOK2 | 426438 | 125267 | 301171 | 322217 |
| GEO | 81050 | 18798 | 62252 | 26238 |
| NEWS | 241101 | 65245 | 175856 | 206999 |
| OBJ2 | 112993 | 43713 | 69280 | 165497 |
| PAPER1 | 34067 | 10474 | 23593 | 29259 |
| PAPER2 | 59691 | 17732 | 41959 | 40628 |
| PIC | 90205 | 34183 | 56022 | 454194 |
| PROGC | 21531 | 7549 | 13982 | 25367 |
| PROGL | 29783 | 11454 | 18329 | 54359 |
| PROGP | 18152 | 7806 | 10346 | 39542 |
| TRANS | 30371 | 11368 | 19003 | 73773 |
| TOTAL | 1858868 | 551074 | 1307794 | 1821643 |
| AVERAGE | 142990 | 42390 | 100600 | 140126 |
| % | 30 | 70 | ||
| Opc. s1 | ||||
| Predictions | Good | Bad | Matched | |
| BIB | 65913 | 21690 | 44223 | 71784 |
| BOOK1 | 640496 | 193362 | 447134 | 336430 |
| BOOK2 | 424905 | 136046 | 288859 | 334529 |
| GEO | 81031 | 18865 | 62166 | 26324 |
| NEWS | 240289 | 70395 | 169894 | 212961 |
| OBJ2 | 116465 | 44579 | 71886 | 162891 |
| PAPER1 | 33561 | 11136 | 22425 | 30427 |
| PAPER2 | 59516 | 19414 | 40102 | 42485 |
| PIC | 103630 | 50059 | 53571 | 456823 |
| PROGC | 21621 | 7947 | 13674 | 25675 |
| PROGL | 30982 | 12522 | 18460 | 54228 |
| PROGP | 19123 | 8647 | 10476 | 39412 |
| TRANS | 30834 | 11907 | 18927 | 73600 |
| TOTAL | 1868366 | 606569 | 1261797 | 1867569 |
| AVERAGE | 143720 | 46659 | 97061 | 143659 |
| % | 32 | 68 | ||
| Opc. s2 | ||||
| Predictions | Good | Bad | Matched | |
| BIB | 66721 | 21294 | 45427 | 70580 |
| BOOK1 | 642739 | 187364 | 455375 | 328189 |
| BOOK2 | 427430 | 132949 | 294481 | 328907 |
| GEO | 81026 | 18721 | 62305 | 26185 |
| NEWS | 240220 | 67385 | 172835 | 210020 |
| OBJ2 | 115824 | 44294 | 71530 | 163247 |
| PAPER1 | 33781 | 10766 | 23015 | 29837 |
| PAPER2 | 59612 | 18816 | 40796 | 41791 |
| PIC | 85345 | 26363 | 58982 | 451375 |
| PROGC | 21723 | 7748 | 13975 | 25374 |
| PROGL | 30949 | 11963 | 18986 | 53702 |
| PROGP | 18611 | 8098 | 10513 | 39375 |
| TRANS | 30336 | 11306 | 19030 | 73496 |
| TOTAL | 1854317 | 567067 | 1287250 | 1842078 |
| AVERAGE | 142640 | 43621 | 99019 | 141698 |
| % | 31 | 69 | ||
| Opc. s3 | ||||
| Predictions | Good | Bad | Matched | |
| BIB | 65946 | 21760 | 44186 | 71821 |
| BOOK1 | 640557 | 193507 | 447050 | 336514 |
| BOOK2 | 428085 | 137806 | 290279 | 333109 |
| GEO | 81153 | 18677 | 62476 | 26014 |
| NEWS | 244923 | 70889 | 174034 | 208821 |
| OBJ2 | 118735 | 44662 | 74073 | 160704 |
| PAPER1 | 34318 | 11214 | 23104 | 29748 |
| PAPER2 | 59469 | 19376 | 40093 | 42494 |
| PIC | 103565 | 49267 | 54298 | 455927 |
| PROGC | 22354 | 8179 | 14175 | 25174 |
| PROGL | 31430 | 12694 | 18736 | 53952 |
| PROGP | 19282 | 8433 | 10849 | 39039 |
| TRANS | 31937 | 12394 | 19543 | 72983 |
| TOTAL | 1881754 | 608858 | 1272896 | 1856300 |
| AVERAGE | 144750 | 46835 | 97915 | 142792 |
| % | 32 | 68 | ||
| Opc. s4 | ||||
| Predictions | Good | Bad | Matched | |
| BIB | 65927 | 21665 | 44262 | 71745 |
| BOOK1 | 642670 | 193821 | 448849 | 334715 |
| BOOK2 | 429632 | 136889 | 292743 | 330645 |
| GEO | 81150 | 18306 | 62844 | 25646 |
| NEWS | 248970 | 71411 | 177559 | 205296 |
| OBJ2 | 121540 | 45194 | 76346 | 158431 |
| PAPER1 | 34309 | 10973 | 23336 | 29516 |
| PAPER2 | 59887 | 19460 | 40427 | 42160 |
| PIC | 105427 | 50013 | 55414 | 454913 |
| PROGC | 22751 | 8233 | 14518 | 24831 |
| PROGL | 32177 | 12831 | 19346 | 53342 |
| PROGP | 20019 | 8853 | 11166 | 38722 |
| TRANS | 33195 | 13099 | 20096 | 72683 |
| TOTAL | 1897654 | 610748 | 1286906 | 1842645 |
| AVERAGE | 145973 | 46981 | 98993 | 141742 |
| % | 32 | 68 |
|
|
||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||
|
|
|
|
||||||||||||||||||||||
|
|
Lzp
with optimal contexts scheme 1 pseudo code
This is the pseudo coder for it.
Note that for implementing this, we have to save in every context one byte (or whatever we use for match lengths) which is the last match length.
Here there are two graphics, the first is the match lengths outputted
(at its done in an standard lzp model) and the second is the match lengths
which were first differential coded. (note that both of them can be coded
with the same routines, so this doesn't add complexity, and it just takes
a little bit of time)
![]() |
![]() |
Note that the scale is logarithmic adjusted, to make it fit in the pic.We
can learn more seeing the output:
(click here to go to the end of this
table and read conclusions) All the results come from book2.
| Match lenghts | Differential coded |
| 0: 0
1: 52514 2: 24379 3: 13609 4: 8450 5: 5803 6: 4227 7: 3075 8: 2148 9: 1328 10: 952 11: 762 12: 639 13: 478 14: 325 15: 227 16: 232 17: 162 18: 118 19: 115 20: 110 21: 89 22: 83 23: 56 24: 44 25: 39 26: 49 27: 28 28: 41 29: 19 30: 19 31: 14 32: 11 33: 15 34: 13 35: 12 36: 13 37: 10 38: 10 39: 12 40: 9 41: 14 42: 7 43: 5 44: 2 45: 7 46: 9 47: 6 48: 6 49: 7 50: 8 51: 5 52: 5 53: 6 54: 2 55: 2 56: 5 57: 7 58: 2 59: 4 60: 5 61: 4 62: 6 63: 2 64: 3 65: 0 66: 2 67: 1 68: 4 69: 2 70: 1 71: 2 72: 4 73: 0 74: 3 75: 6 76: 1 77: 0 78: 1 79: 2 80: 3 81: 1 82: 2 83: 1 84: 0 85: 0 86: 2 87: 2 88: 4 89: 1 90: 1 91: 3 92: 2 93: 0 94: 8 95: 3 96: 0 97: 0 98: 1 99: 1 100: 0 101: 1 102: 1 103: 0 104: 1 105: 1 106: 0 107: 0 108: 0 109: 0 110: 2 111: 0 112: 0 113: 0 114: 0 115: 0 116: 0 117: 0 118: 1 119: 0 120: 0 121: 0 122: 0 123: 0 124: 0 125: 0 126: 0 127: 0 128: 0 129: 0 130: 0 131: 0 132: 0 133: 1 134: 1 135: 0 136: 0 137: 0 138: 0 139: 0 140: 0 141: 0 142: 0 143: 0 144: 0 145: 0 146: 0 147: 0 148: 0 149: 0 150: 0 151: 0 152: 0 153: 0 154: 0 155: 0 156: 0 157: 0 158: 0 159: 0 160: 0 161: 0 162: 0 163: 0 164: 0 165: 0 166: 0 167: 0 168: 0 169: 0 170: 0 171: 0 172: 0 173: 0 174: 0 175: 0 176: 0 177: 0 178: 0 179: 0 180: 0 181: 0 182: 0 183: 0 184: 0 185: 0 186: 0 187: 0 188: 0 189: 0 190: 0 191: 0 192: 0 193: 0 194: 0 195: 0 196: 0 197: 0 198: 0 199: 0 200: 0 201: 0 202: 0 203: 0 204: 0 205: 0 206: 0 207: 0 208: 0 209: 0 210: 0 211: 0 212: 0 213: 0 214: 0 215: 0 216: 0 217: 0 218: 0 219: 0 220: 0 221: 0 222: 0 223: 0 224: 0 225: 0 226: 0 227: 0 228: 0 229: 0 230: 0 231: 0 232: 0 233: 0 234: 0 235: 0 236: 0 237: 0 238: 0 239: 0 240: 0 241: 0 242: 0 243: 0 244: 0 245: 0 246: 0 247: 0 248: 0 249: 0 250: 0 251: 0 252: 0 253: 0 254: 0 255: 0 |
0: 92562
1: 10169 2: 3922 3: 2087 4: 1430 5: 882 6: 580 7: 355 8: 219 9: 172 10: 131 11: 88 12: 63 13: 52 14: 39 15: 39 16: 30 17: 19 18: 26 19: 16 20: 15 21: 11 22: 17 23: 12 24: 5 25: 18 26: 8 27: 9 28: 3 29: 2 30: 4 31: 8 32: 5 33: 7 34: 4 35: 4 36: 4 37: 2 38: 7 39: 5 40: 2 41: 3 42: 1 43: 1 44: 1 45: 1 46: 1 47: 3 48: 0 49: 1 50: 0 51: 1 52: 1 53: 1 54: 1 55: 0 56: 1 57: 1 58: 2 59: 4 60: 1 61: 1 62: 1 63: 0 64: 0 65: 1 66: 1 67: 1 68: 1 69: 1 70: 0 71: 0 72: 1 73: 0 74: 0 75: 0 76: 2 77: 0 78: 2 79: 0 80: 0 81: 0 82: 0 83: 1 84: 0 85: 0 86: 2 87: 1 88: 3 89: 1 90: 1 91: 0 92: 1 93: 1 94: 0 95: 0 96: 0 97: 0 98: 0 99: 1 100: 0 101: 0 102: 0 103: 0 104: 0 105: 0 106: 1 107: 0 108: 0 109: 0 110: 0 111: 0 112: 0 113: 0 114: 0 115: 0 116: 0 117: 0 118: 0 119: 0 120: 0 121: 0 122: 0 123: 0 124: 0 125: 0 126: 0 127: 0 128: 0 129: 0 130: 0 131: 0 132: 0 133: 0 134: 0 135: 0 136: 0 137: 0 138: 0 139: 0 140: 0 141: 0 142: 0 143: 0 144: 0 145: 0 146: 0 147: 0 148: 0 149: 0 150: 1 151: 0 152: 0 153: 0 154: 0 155: 0 156: 0 157: 0 158: 0 159: 0 160: 0 161: 0 162: 0 163: 0 164: 0 165: 1 166: 0 167: 0 168: 3 169: 1 170: 0 171: 0 172: 0 173: 0 174: 0 175: 1 176: 1 177: 0 178: 0 179: 1 180: 1 181: 0 182: 0 183: 0 184: 0 185: 1 186: 0 187: 0 188: 0 189: 2 190: 0 191: 0 192: 1 193: 0 194: 0 195: 1 196: 0 197: 2 198: 3 199: 0 200: 0 201: 2 202: 0 203: 0 204: 0 205: 0 206: 0 207: 0 208: 0 209: 1 210: 0 211: 1 212: 0 213: 2 214: 1 215: 2 216: 1 217: 1 218: 8 219: 1 220: 0 221: 2 222: 4 223: 7 224: 1 225: 3 226: 2 227: 2 228: 2 229: 2 230: 4 231: 15 232: 5 233: 7 234: 7 235: 3 236: 12 237: 6 238: 13 239: 7 240: 14 241: 16 242: 25 243: 32 244: 35 245: 45 246: 86 247: 133 248: 153 249: 263 250: 459 251: 713 252: 1219 253: 1754 254: 3301 255: 9029 |
One can easily see that in most of the cases the same match length occurs (a 0 differential coded byte) or little changes are done, this does not only helps to have a more skewed probabilities (where some symbols have higher probabilities than others) but also gives a very high probability to one symbol, the 0. In the default lzp, the symbol with highest probability is 1 (meaning a match length of value 1) with a probability of 52514 and we have 120441 match lengths, so its entropy is: ent ( 52514/120441 ) = 1,197 bits while when we differential code the match lengths, the symbol with highest probability, 0, has an entropy of: en ( 92562/120441) = 0,379 bits. One can clearly see why it's worth to use differential coding in match lengths, and it only adds one byte more fore every context.
I haven't checked to discard the offset if the match was below than the previous, so I don't know whatever it works better or worser. But probably it will hurt compression. (I don't think it's a good idea to discard matches)
Charles also proposed me another idea, to output the number without
caring about the sign (I mean all the values are positive, so -1 becomes
1), and then a bit with the sign, this idea has not been tested, however
I think that it will not get better results than the one presented by me,
because positive numbers have higher probabilities, thus mixing them and
output a bit more, will not be better. In case you want to implement this,
I recommend first outputting the number itself, and in case that it's 0
(in most of the cases) don't output the flag after it.
More than one offset
It's clear than lzp in its quest for choosing only one offset, is losing
the chance of more matches, so I thought, that it will be worth to analyze
how lzp performs with more than one offset.
I'll present the results for lzp o2, which keeps track of 2 , 4 or 8
offsets, not the last offsets, but rather the last best offsets, a simple
mtf scheme, can easily do that. This is a greedy approach too, it reads
the offset 0 first and goes till the last one, whenever it gets one which
gives a match, it uses it, without testing if this is the best one. (I
mean the offset which gives the longest match)
| Using one offsets (default mode) | |||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| Using two offsets | |||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| Using four offsets | |||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
| Using height offsets | |||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
Some plots:
|
|
||
|
It's not difficult to see that using more offsets, leads to more bytes
matched, (more good predictions too) but of course we need more bytes to
choose among the different offsets. Of course we've matched more (almost
most of the) literals.
(plots don't take in account using only one offset) |
And how much every model will compress book2 if we assume literals to
be coded to 4bpb and match lengths to 2bpb:
|
|
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Using more offsets is worth the job, only using one more can improve
the bpb in 0.648! Also note that we are only using o2, and also that our
assumption of coding match lengths to 2bpb is a little bit pessimistic
(I've proved that more than half of them can be coded down to 0,3bpb) Unfortunately
I don't see a way of choosing a given offset, and thus assigning it a higher
probability, though I'm doing some research in giving probability to the
flags.
No match literals
This is another improvement, this is not meant to improve the compression
(though it does it, but only a little bit) but instead to make the encoder
faster by reducing the number of operations to code a literal.
The idea is based in this: if under a given context the same phrases tend to occur, then it's very probable that the same literal occurs too.
Because this idea is not very good, it will not be further developed, and results will not be shown.
Note that this scheme of saving the last seen unmatched literal in a
context is like saving an offset to the last seen unmatched position and
to not allow the match length to exceed one.
Encoders
You can use the encoder that you want, it's up to you. If you want
to get more compression at cost of speed, you should think about coding
the flags with an arithmetic coder, because usually the probability of
a flag isn't 50% (which the entropy codes to 1 bit, just like we do in
standard lzp) but rather 30% and 70% or so, it depends on the model
you use. And if you want speed you better use Huffman.
If you really want to get high compression, no matter what happens, then
you better use any ppm variant for coding of the literals, or plain order-0
arithmetic
coding. If you want a good tradeoff between compression and speed,
a good choice is the range coder.
References
"LZP: a new data compression algorithm" Charles Bloom (lzp.ps)
"Using Prediction to Improve LZ77 Coders" Charles Bloom (lzp_old.ps)
"Lzp" Dario Phong (dp_lzp.html)
"Lzp order-3 with linked lists" Dario Phong (dp_lzp_o3_ll.html)
The article of Charles Bloom are at his home page: http://www.cbloom.com
And mines are at Wisdom Vault: http://www.arturocampos.com
also there you'll found articles about other topics of data compression,
such as encoders or other algorithms.
Closing words
This article isn't fully finished, as soon as I have more time, I'll
improve some sections or add new ones, if needed. I.e.: it will be good
to see how all orders are affected with the different updating schemes,
using more than one offset, differential coding of match lengths outputting
the sign flag, and more things. However I'll be busy with other projects,
so if you want me to write the next version of this article you better
email me.
Conclusion of my research till the date: I've developed
new schemes for the models and investigated other things, and now I can
safely say: lzp is very close to its own own limits, and for fast implementation
nothing else should be added to it.
It will never reach the compression ratios of ppm, for the way it adapts
to the data. Also it will never achieve the ratios of bwt, as it can't
access orders that high. Anyway it seems like one of the best algorithms
for getting good compression at high speeds, though it must be finely tuned.
Thanks to Charles Bloom for all the help and ideas that he provided
me about lzp. If you want to discuss about lzp, email me.
Contacting the
author
You can reach me via email at: arturo@arturocampos.com
Also don't forget to visit my home page http://www.arturocampos.com
where you can find more and better info. See you in the next article!
This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.