Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved.
Permission is granted to make verbatim copies of this document for private
use only.
Download
Download the whole article zipped,
with the figures also.
Table of contents
Introduction
In this article it's presented all my research till now in lzp. Some
standard models are presented and briefly analyzed. Also improvements for
any kind of lzp model are presented: deterministic contexts and optimal
contexts. The idea of using more than one offset is seen too, though it's
not very explored. Enough results are given, so you can choose the more
appropriate model for your compressor. This article is mainly focused on
models, encoders are briefly discussed too.
This article assumes knowledge of lzp, different implementations of it, and ppm. (at least the theory behind it)
If the tables from this article aren't properly displayed, I could send
you or upload to my page the tables in plain text version. Anyway I encourage
to view this article in Netscape, instead of Internet Explorer. 800x600
is recommended.
Lz
We can classify the compression of text in two big worlds, Markov models
and Lz dictionary compressors, both of them do the same: assign less bits
to more common events; thus some researchers explained Lz schemes thru
Markov eyes. Both schemes are different: one provides high compression,
the other one high speed. The question raised in the brains of some researchers.
Time passed and Ross Williams did his approach with lzrw, the idea in theory worked, also lzrw became one of the fastest Lz compressor, though it did not had impressive compression ratios. But it was the first step to the hybrid between contexts coders and Lz.
Charles Bloom did the next step: Lzp. Which seems to be the best approach of Markov and Lz, high speed and high compression, which can be tuned in order to fit your needs. His paper presented lzp and the most important implementations of it. However one can miss an explanation of what's happening behind it or more results. Most of the implementations had in mind speed, however some of them had a very low memory requirements, and one was better than some ppm variants. (but it uses ppmc as a coder for literals, thus dramatically reducing the speed of the whole compressor)
Peter Fenwick did some research on it, but he focused it on symbol ranking.
And then here it comes my work, with the hope that it can further develop
lzp.
Offset probability
Can we see parsing as a markov model? yes, we can, but this is not
the point of this article, (see articles from Ross Williams) let's analyze
them. Let's say we are under an lz77 like model, and we have parsed "baa",
this will produce the following phrases with the following probabilities:
| Note that lz77 has the suffix property: The suffixes of a phrase in a given dictionary are phrases of the dictionary too. The entropy (under a simple order-0 model) of every phrase is 2.32 bits, and effectively, we can't represent it with less than 2.3 bits, well, in fact we'll need 2 bits for the offset and two bits more for the length of the phrase (which is more than we should assign, one can clearly see that this model isn't the best one), even we could assign a value to every phrase, and thus coding them with only 3 bits. |
|
A standard version of Lzp solves this problem assigning a probability
of 1/2 to every offset from a given context (the other 1/2 is reserved
to tell the decompressor that this was a failed match), so it needs ent
( 1/2 ) = 1 bit. In some cases lzp may even assign a probability of 1/1
to a given symbol, but then in the case of no match we have to encode
a 0 match length and a literal. (which may result in degradation of the
total compression)
Standard Lzp
We should define a standard implementation of lzp. The pseudo code
of it will look like the following:
Orders
If we want to do good models, we have to know how many of them can
we use, in this I'll discuss and analyze different orders (the number of
symbols which are used as context). We can use different orders, every
of them will give different results, but first we have to define a set
of standard orders. The symbol it's assumed to be a byte.
Result of orders
Now that we have defined different contexts, let's see how they perform.
|
1- O1
2 - O2 3 - O2h 4 - O2hc 5 - O3 6 - O3h 7 - O3hc 8 - O4 9 - O5 10 - O6 11 - O7 12 - O8 Numbers for graphics. |
A little explanation of this table: Results come from book2. A bad prediction is one which produce no match (0 match length) a good prediction is one which is not bad. (at the right the % of the good predictions, I mean (good/(good+bad))*100) The column "predictions" holds the number of times that under a given contexts we had an offset (and then we have to check for match). At his right there's the %, in respect with the file size of book2. (626,490 bytes) In the column "matched" you have the total bytes that could be matched. (the sum of all match lengths) And at its right the % of bytes matched on all the file. ((matched/file_size)*100)
Let's see some graphics:
|
Here you can see the % of predictions ((predictions/file_length)*100)
As you can see the higher the order, the less predictions, this is due to the zero frequency problem (when we haven't seen any context of the same kind yet). Have in mind that all those tables don't take in account only different orders, but also different ways of accessing them. (I.e.: O2, O2h, and O2hc, all of them use the same order, but the way an offset is stored changes) |
|
This is the % of good predictions ((good_predictions/total_predictions)*100)
Obviously also, the higher the order, the more reliable the predictions are. Higher orders than 8 were not explored, it's a know fact that orders above 8 don't make any real improvement in text compression. (and the resources for accessing them will be higher and it will be far slower too) |
|
That graphic represents the % of matched bytes (as defined above) ((matched_bytes/file_length)*100) |
Conclusions: Low orders produce a lot of wrong predictions, and higher orders are very effected by the zero frequency problem. The best trade off (at least for book2) between memory/speed/compression seems to be o4, however order-5 seems good too.
With all this information you should be able to choose the order which
best suits your needs.
Deterministic
contexts and optimal contexts
Some time ago (I mean around June, 1999), in private communication
with Charles Bloom, he explained me what deterministic contexts are, they
opened a new set of models, which where not done yet. So I started to work
on it. Here I present all my approaches.
But first let's define what is a deterministic context: A deterministic context is a context which has not produced any bad prediction. (a bad prediction is defined above)
My first scheme to handle deterministic contexts was only based on this information, however I developed some improved versions of it, which don't use deterministic contexts, so a new term should be introduced. (and I base this in the fact that for lzp they are better than deterministic contexts)
An optimal contexts is a contexts which has produced more good predictions than bad predictions.
Optimal contexts will only be implemented in lzp, at least in this article, however it's probable that they'll work for ppm too.
Before starting to explain the different models, you must know this:
a given contexts (let's say "th") can have different offsets which provide
different matches. (the context "th" can predict "e ", "ere ", "us ", "an
", "ink "...) the goal of the schemes which I'll present now, is to get
the best of them.
Implementing
deterministic contexts and optimal contexts
Once we know that there are some offsets which work better than others,
how can we build our model so it captures it?
Let's think about some facts of lzp: We can't change when a offset
is inserted for first time (mainly because we don't have any information
about it, and also because we need to fill the table) Then we only have
two more cases dealing with offset in a given context: when it's updated
and deleted.
For optimal contexts there's a main scheme, scheme 1, and then variations of the same, which proved to not be better. (but are presented to show you different ideas)
Optimal contexts scheme 1:
Optimal contexts scheme 2:
Also there's another question: Are optimal contexts optimal? While this may seem a paradox, this is a serious question. Let's say we have a text like "the ... there ... the...there ... the..." (where "..." means that there is other text) Here we can see that the scheme for optimal contexts does his job, to not discard an optimal context, "th"->"e " (where "th" is the context and "e " the match that it can produce) but while it's an optimal context, it doesn't produce more bytes matches than "th"->"ere " can produce (twice the bytes matched) so the offset "th"->"ere " is optimal while "th"->"e " is not (even if it's an optimal context).
However now the question arises: How to keep track of the bytes matched
from offsets in a given context which aren't currently used? This would
need to store more than one pointer, and check them. Probably this scheme
will not produce a lot better results, and I'm sure that it will be more
slower. Another solution could be to save the number of matched bytes in
a given context, and also assign to the literals a given length, and then
do the same as in optimal contexts, however this remains unexplored. Anyway
it doesn't seem possible to do this without needing more speed and memory
than optimal contexts do.
Results
for updating schemes
In all this examples I only care about the model and the probabilities
which it produces, while it's not a ppm model which directly has the probability
of a symbol, one can at least know how a given coder will perform given
the probability distribution.
|
|
|
|
|
|
|
|||||||||||||||||||
| BIB | 59 | 27 | 56 | 57 | 31 | 59 | 56 | 33 | 61 | 57 | 32 | 60 | 56 | 33 | 61 | 56 | 33 | 61 | ||||||
| BOOK1 | 83 | 24 | 36 | 82 | 27 | 40 | 82 | 30 | 43 | 82 | 29 | 42 | 82 | 30 | 43 | 82 | 30 | 43 | ||||||
| BOOK2 | 69 | 26 | 48 | 68 | 29 | 51 | 68 | 32 | 53 | 68 | 31 | 52 | 68 | 32 | 53 | 69 | 32 | 53 | ||||||
| GEO | 79 | 23 | 25 | 79 | 23 | 26 | 79 | 23 | 26 | 79 | 23 | 26 | 79 | 23 | 25 | 79 | 23 | 25 | ||||||
| NEWS | 63 | 24 | 51 | 62 | 27 | 53 | 62 | 29 | 55 | 62 | 28 | 54 | 63 | 29 | 54 | 64 | 29 | 53 | ||||||
| OBJ2 | 43 | 35 | 67 | 46 | 39 | 67 | 47 | 38 | 66 | 47 | 38 | 66 | 48 | 38 | 65 | 49 | 37 | 64 | ||||||
| PAPER1 | 64 | 27 | 50 | 63 | 31 | 54 | 62 | 33 | 56 | 62 | 32 | 55 | 63 | 33 | 55 | 63 | 32 | 54 | ||||||
| PAPER2 | 73 | 26 | 45 | 71 | 30 | 48 | 71 | 33 | 51 | 71 | 32 | 50 | 71 | 33 | 51 | 71 | 32 | 50 | ||||||
| PIC | 15 | 27 | 88 | 18 | 38 | 88 | 20 | 48 | 89 | 17 | 31 | 88 | 20 | 48 | 89 | 21 | 47 | 89 | ||||||
| PROGC | 52 | 31 | 60 | 52 | 35 | 62 | 53 | 37 | 62 | 53 | 36 | 62 | 54 | 37 | 61 | 55 | 36 | 60 | ||||||
| PROGL | 39 | 33 | 72 | 40 | 38 | 74 | 42 | 40 | 73 | 42 | 39 | 73 | 43 | 40 | 73 | 44 | 40 | 72 | ||||||
| PROGP | 33 | 38 | 77 | 35 | 43 | 77 | 37 | 45 | 77 | 36 | 44 | 77 | 38 | 44 | 76 | 39 | 44 | 75 | ||||||
| TRANS | 32 | 33 | 77 | 32 | 37 | 78 | 33 | 39 | 78 | 32 | 37 | 78 | 34 | 39 | 77 | 35 | 39 | 77 | ||||||
| TOTAL | 58 | 26 | 55 | 58 | 30 | 57 | 59 | 32 | 59 | 58 | 31 | 58 | 59 | 32 | 58 | 60 | 32 | 58 |
| There are 6 different schemes:
0- Default lzp. 1- Deterministic contexts scheme. 2- Optimal contexts scheme 1. 3- Optimal contexts scheme 2. 4- Optimal contexts scheme 3. 5- Optimal contexts scheme 4. |
And three different columns:
1- Predictions (%) 2- Good predictions (%) 3- Matched bytes (%)
|
| Default | ||||
| Predictions | Good | Bad | Matched | |
| BIB | 69810 | 19079 | 50731 | 65276 |
| BOOK1 | 654998 | 155561 | 499437 | 284127 |
| BOOK2 | 432742 | 110959 | 321783 | 301605 |
| GEO | 81047 | 18664 | 62383 | 26107 |
| NEWS | 242789 | 58229 | 184560 | 198295 |
| OBJ2 | 106140 | 37588 | 68552 | 166225 |
| PAPER1 | 34780 | 9358 | 25422 | 27430 |
| PAPER2 | 60990 | 15894 | 45096 | 37491 |
| PIC | 77268 | 20517 | 56751 | 453606 |
| PROGC | 21247 | 6617 | 14630 | 24719 |
| PROGL | 29137 | 9690 | 19447 | 53241 |
| PROGP | 16930 | 6350 | 10580 | 39308 |
| TRANS | 29992 | 9981 | 20011 | 72516 |
| TOTAL | 1857870 | 478487 | 1379383 | 1749946 |
| AVERAGE | 142913 | 36807 | 106106 | 134611 |
| % | 26 | 74 | ||
| Detc. | ||||
| Predictions | Good | Bad | Matched | |
| BIB | 67334 | 20626 | 46708 | 69299 |
| BOOK1 | 646152 | 176859 | 469293 | 314271 |
| BOOK2 | 426438 | 125267 | 301171 | 322217 |
| GEO | 81050 | 18798 | 62252 | 26238 |
| NEWS | 241101 | 65245 | 175856 | 206999 |
| OBJ2 | 112993 | 43713 | 69280 | 165497 |
| PAPER1 | 34067 | 10474 | 23593 | 29259 |
| PAPER2 | 59691 | 17732 | 41959 | 40628 |
| PIC | 90205 | 34183 | 56022 | 454194 |
| PROGC | 21531 | 7549 | 13982 | 25367 |
| PROGL | 29783 | 11454 | 18329 | 54359 |
| PROGP | 18152 | 7806 | 10346 | 39542 |
| TRANS | 30371 | 11368 | 19003 | 73773 |
| TOTAL | 1858868 | 551074 | 1307794 | 1821643 |
| AVERAGE | 142990 | 42390 | 100600 | 140126 |
| % | 30 | 70 | ||
| Opc. s1 | ||||
| Predictions | Good | Bad | Matched | |
| BIB | 65913 | 21690 | 44223 | 71784 |
| BOOK1 | 640496 | 193362 | 447134 | 336430 |
| BOOK2 | 424905 | 136046 | 288859 | 334529 |
| GEO | 81031 | 18865 | 62166 | 26324 |
| NEWS | 240289 | 70395 | 169894 | 212961 |
| OBJ2 | 116465 | 44579 | 71886 | 162891 |
| PAPER1 | 33561 | 11136 | 22425 | 30427 |
| PAPER2 | 59516 | 19414 | 40102 | 42485 |
| PIC | 103630 | 50059 | 53571 | 456823 |
| PROGC | 21621 | 7947 | 13674 | 25675 |
| PROGL | 30982 | 12522 | 18460 | 54228 |
| PROGP | 19123 | 8647 | 10476 | 39412 |
| TRANS | 30834 | 11907 | 18927 | 73600 |
| TOTAL | 1868366 | 606569 | 1261797 | 1867569 |
| AVERAGE | 143720 | 46659 | 97061 | 143659 |
| % | 32 | 68 | ||
| Opc. s2 | ||||
| Predictions | Good | Bad | Matched | |
| BIB | 66721 | 21294 | 45427 | 70580 |
| BOOK1 | 642739 | 187364 | 455375 | 328189 |
| BOOK2 | 427430 | 132949 | 294481 | 328907 |
| GEO | 81026 | 18721 | 62305 | 26185 |
| NEWS | 240220 | 67385 | 172835 | 210020 |
| OBJ2 | 115824 | 44294 | 71530 | 163247 |
| PAPER1 | 33781 | 10766 | 23015 | 29837 |
| PAPER2 | 59612 | 18816 | 40796 | 41791 |
| PIC | 85345 | 26363 | 58982 | 451375 |
| PROGC | 21723 | 7748 | 13975 | 25374 |
| PROGL | 30949 | 11963 | 18986 | 53702 |
| PROGP | 18611 | 8098 | 10513 | 39375 |
| TRANS | 30336 | 11306 | 19030 | 73496 |
| TOTAL | 1854317 | 567067 | 1287250 | 1842078 |
| AVERAGE | 142640 | 43621 | 99019 | 141698 |
| % | 31 | 69 | ||
| Opc. s3 | ||||
| Predictions | Good | Bad | Matched | |
| BIB | 65946 | 21760 | 44186 | 71821 |
| BOOK1 | 640557 | 193507 | 447050 | 336514 |
| BOOK2 | 428085 | 137806 | 290279 | 333109 |
| GEO | 81153 | 18677 | 62476 | 26014 |
| NEWS | 244923 | 70889 | 174034 | 208821 |
| OBJ2 | 118735 | 44662 | 74073 | 160704 |
| PAPER1 | 34318 | 11214 | 23104 | 29748 |
| PAPER2 | 59469 | 19376 | 40093 | 42494 |
| PIC | 103565 | 49267 | 54298 | 455927 |
| PROGC | 22354 | 8179 | 14175 | 25174 |
| PROGL | 31430 | 12694 | 18736 | 53952 |
| PROGP | 19282 | 8433 | 10849 | 39039 |
| TRANS | 31937 | 12394 | 19543 | 72983 |
| TOTAL | 1881754 | 608858 | 1272896 | 1856300 |
| AVERAGE | 144750 | 46835 | 97915 | 142792 |
| % | 32 | 68 | ||
| Opc. s4 | ||||
| Predictions | Good | Bad | Matched | |
| BIB | 65927 | 21665 | 44262 | 71745 |
| BOOK1 | 642670 | 193821 | 448849 | 334715 |
| BOOK2 | 429632 | 136889 | 292743 | 330645 |
| GEO | 81150 | 18306 | 62844 | 25646 |
| NEWS | 248970 | 71411 | 177559 | 205296 |
| OBJ2 | 121540 | 45194 | 76346 | 158431 |
| PAPER1 | 34309 | 10973 | 23336 | 29516 |
| PAPER2 | 59887 | 19460 | 40427 | 42160 |
| PIC | 105427 | 50013 | 55414 | 454913 |
| PROGC | 22751 | 8233 | 14518 | 24831 |
| PROGL | 32177 | 12831 | 19346 | 53342 |
| PROGP | 20019 | 8853 | 11166 | 38722 |
| TRANS | 33195 | 13099 | 20096 | 72683 |
| TOTAL | 1897654 | 610748 | 1286906 | 1842645 |
| AVERAGE | 145973 | 46981 | 98993 | 141742 |
| % | 32 | 68 |
|
|
||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||
|
|
|
|
||||||||||||||||||||||
|
|
Lzp
with optimal contexts scheme 1 pseudo code
This is the pseudo coder for it.
Note that for implementing this, we have to save in every context one byte (or whatever we use for match lengths) which is the last match length.
Here there are two graphics, the first is the match lengths outputted
(at its done in an standard lzp model) and the second is the match lengths
which were first differential coded. (note that both of them can be coded
with the same routines, so this doesn't add complexity, and it just takes
a little bit of time)
![]() |
![]() |
Note that the scale is logarithmic adjusted, to make it fit in the pic.We
can learn more seeing the output:
(click here to go to the end of this
table and read conclusions) All the results come from book2.
| Match lenghts | Differential coded |
| 0: 0
1: 52514 2: 24379 3: 13609 4: 8450 5: 5803 6: 4227 7: 3075 8: 2148 9: 1328 10: 952 11: 762 12: 639 13: 478 14: 325 15: 227 16: 232 17: 162 18: 118 19: 115 20: 110 21: 89 22: 83 23: 56 24: 44 25: 39 26: 49 27: 28 28: 41 29: 19 30: 19 31: 14 32: 11 33: 15 34: 13 35: 12 36: 13 37: 10 38: 10 39: 12 40: 9 41: 14 42: 7 43: 5 44: 2 45: 7 46: 9 47: 6 48: 6 49: 7 50: 8 51: 5 52: 5 53: 6 54: 2 55: 2 56: 5 57: 7 58: 2 59: 4 60: 5 61: 4 62: 6 63: 2 64: 3 65: 0 66: 2 67: 1 68: 4 69: 2 70: 1 71: 2 72: 4 73: 0 74: 3 75: 6 76: 1 77: 0 78: 1 79: 2 80: 3 81: 1 82: 2 83: 1 84: 0 85: 0 86: 2 87: 2 88: 4 89: 1 90: 1 91: 3 92: 2 93: 0 94: 8 95: 3 96: 0 97: 0 98: 1 99: 1 100: 0 101: 1 102: 1 103: 0 104: 1 105: 1 106: 0 107: 0 108: 0 109: 0 110: 2 111: 0 112: 0 113: 0 114: 0 115: 0 116: 0 117: 0 118: 1 119: 0 120: 0 121: 0 122: 0 123: 0 124: 0 125: 0 126: 0 127: 0 128: 0 129: 0 130: 0 131: 0 132: 0 133: 1 134: 1 135: 0 136: 0 137: 0 138: 0 139: 0 140: 0 141: 0 142: 0 143: 0 144: 0 145: 0 146: 0 147: 0 148: 0 149: 0 150: 0 151: 0 152: 0 153: 0 154: 0 155: 0 156: 0 157: 0 158: 0 159: 0 160: 0 161: 0 162: 0 163: 0 164: 0 165: 0 166: 0 167: 0 168: 0 169: 0 170: 0 171: 0 172: 0 173: 0 174: 0 175: 0 176: 0 177: 0 178: 0 179: 0 180: 0 181: 0 182: 0 183: 0 184: 0 185: 0 186: 0 187: 0 188: 0 189: 0 190: 0 191: 0 192: 0 193: 0 194: 0 195: 0 196: 0 197: 0 198: 0 199: 0 200: 0 201: 0 202: 0 203: 0 204: 0 205: 0 206: 0 207: 0 208: 0 209: 0 210: 0 211: 0 212: 0 213: 0 214: 0 215: 0 216: 0 217: 0 218: 0 219: 0 220: 0 221: 0 222: 0 223: 0 224: 0 225: 0 226: 0 227: 0 228: 0 229: 0 230: 0 231: 0 232: 0 233: 0 234: 0 235: 0 236: 0 237: 0 238: 0 239: 0 240: 0 241: 0 242: 0 243: 0 244: 0 245: 0 246: 0 247: 0 248: 0 249: 0 250: 0 251: 0 252: 0 253: 0 254: 0 255: 0 |
0: 92562
1: 10169 2: 3922 3: 2087 4: 1430 5: 882 6: 580 7: 355 8: 219 9: 172 10: 131 11: 88 12: 63 13: 52 14: 39 15: 39 16: 30 17: 19 18: 26 19: 16 20: 15 21: 11 22: 17 23: 12 24: 5 25: 18 26: 8 27: 9 28: 3 29: 2 30: 4 31: 8 32: 5 33: 7 34: 4 35: 4 36: 4 37: 2 38: 7 39: 5 40: 2 41: 3 42: 1 43: 1 44: 1 45: 1 46: 1 47: 3 48: 0 49: 1 50: 0 51: 1 52: 1 53: 1 54: 1 55: 0 56: 1 57: 1 58: 2 59: 4 60: 1 61: 1 62: 1 63: 0 64: 0 65: 1 66: 1 67: 1 68: 1 69: 1 70: 0 71: 0 72: 1 73: 0 74: 0 75: 0 76: 2 77: 0 78: 2 79: 0 80: 0 81: 0 82: 0 83: 1 84: 0 85: 0 86: 2 87: 1 88: 3 89: 1 90: 1 91: 0 92: 1 93: 1 94: 0 95: 0 96: 0 97: 0 98: 0 99: 1 100: 0 101: 0 102: 0 103: 0 104: 0 105: 0 106: 1 107: 0 108: 0 109: 0 110: 0 111: 0 112: 0 113: 0 114: 0 115: 0 116: 0 117: 0 118: 0 119: 0 120: 0 121: 0 122: 0 123: 0 124: 0 125: 0 126: 0 127: 0 128: 0 129: 0 130: 0 131: 0 132: 0 133: 0 134: 0 135: 0 136: 0 137: 0 138: 0 139: 0 140: 0 141: 0 142: 0 143: 0 144: 0 145: 0 146: 0 147: 0 148: 0 149: 0 150: 1 151: 0 152: 0 153: 0 154: 0 155: 0 156: 0 157: 0 158: 0 159: 0 160: 0 161: 0 162: 0 163: 0 164: 0 165: 1 166: 0 167: 0 168: 3 169: 1 170: 0 171: 0 172: 0 173: 0 174: 0 175: 1 176: 1 177: 0 178: 0 179: 1 180: 1 181: 0 182: 0 183: 0 184: 0 185: 1 186: 0 187: 0 188: 0 189: 2 190: 0 191: 0 192: 1 193: 0 194: 0 195: 1 196: 0 197: 2 198: 3 199: 0 200: 0 201: 2 202: 0 203: 0 204: 0 205: 0 206: 0 207: 0 208: 0 209: 1 210: 0 211: 1 212: 0 213: 2 214: 1 215: 2 216: 1 217: 1 218: 8 219: 1 220: 0 221: 2 222: 4 223: 7 224: 1 225: 3 226: 2 227: 2 228: 2 229: 2 230: 4 231: 15 232: 5 233: 7 234: 7 235: 3 236: 12 237: 6 238: 13 239: 7 240: 14 241: 16 242: 25 243: 32 244: 35 245: 45 246: 86 247: 133 248: 153 249: 263 250: 459 251: 713 252: 1219 253: 1754 254: 3301 255: 9029 |
One can easily see that in most of the cases the same match length occurs (a 0 differential coded byte) or little changes are done, this does not only helps to have a more skewed probabilities (where some symbols have higher probabilities than others) but also gives a very high probability to one symbol, the 0. In the default lzp, the symbol with highest probability is 1 (meaning a match length of value 1) with a probability of 52514 and we have 120441 match lengths, so its entropy is: ent ( 52514/120441 ) = 1,197 bits while when we differential code the match lengths, the symbol with highest probability, 0, has an entropy of: en ( 92562/120441) = 0,379 bits. One can clearly see why it's worth to use differential coding in match lengths, and it only adds one byte more fore every context.
I haven't checked to discard the offset if the match was below than the previous, so I don't know whatever it works better or worser. But probably it will hurt compression. (I don't think it's a good idea to discard matches)
Charles also proposed me another idea, to output the number without
caring about the sign (I mean all the values are positive, so -1 becomes
1), and then a bit with the sign, this idea has not been tested, however
I think that it will not get better results than the one presented by me,
because positive numbers have higher probabilities, thus mixing them and
output a bit more, will not be better. In case you want to implement this,
I recommend first outputting the number itself, and in case that it's 0
(in most of the cases) don't output the flag after it.
More than one offset
It's clear than lzp in its quest for choosing only one offset, is losing
the chance of more matches, so I thought, that it will be worth to analyze
how lzp performs with more than one offset.
I'll present the results for lzp o2, which keeps track of 2 , 4 or 8
offsets, not the last offsets, but rather the last best offsets, a simple
mtf scheme, can easily do that. This is a greedy approach too, it reads
the offset 0 first and goes till the last one, whenever it gets one which
gives a match, it uses it, without testing if this is the best one. (I
mean the offset which gives the longest match)
| Using one offsets (default mode) | ||||||||||||||||||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||||||||||||||||
| Using two offsets | ||||||||||||||||||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||||||||||||||||
| Using four offsets | ||||||||||||||||||||||||||||||||||||||||||||
|
|
|||||||||||||||||||||||||||||||||||||||||||
| Using height offsets | ||||||||||||||||||||||||||||||||||||||||||||
|
| |||||||||||||||||||||||||||||||||||||||||||