Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved.
Permission is granted to make verbatim copies of this document for private
use only.
Download
Download the whole article zipped.
Table of contents
Introduction
Have you ever implemented a codec from the lz family? If you have done,
I'm sure that you have done greedy parsing (getting the longer phrase at
the current position, wihtout caring about if this the total next phrases
or previous ones) why? may be because you don't know what optimal parsing
is, or you think that it's too slow. But now there's a new scheme called
flexible parsing which improves all the codecs from the lz family. The
speed? well, it isn't greedy parsing, but it's fast enough to be used in
any real-time application. Also you don't have to change the decompressor,
only the compressor. Perhaps you already have tried to read the original
article but you didn't understand anything, it also happened to me. OE-)
But I was lucky, Malcolm Taylor explained it to me. Now I teach it to you.
If something is wrong in this article, don't blame me, instead tell me
what is wrong and I'll fix it.
Improvement
Before explaining it I'll show some results, and you'll be able to
decide whatever is worth to implement it or not. Those are the results
for some of the files from Calgary
Corpus set.
| File | Greedy parsing | Flexible parsing |
| bib
book1 book2 news obj1 obj2 paper1 paper2 progc progl progp trans Total |
73,141
661,229 439,765 258,813 13,371 120,667 36,730 62,291 22,980 33,110 206,68 39,579 1,883,497 |
71,990
658,595 435,254 256,417 13,287 116,946 36,309 61,724 22,550 31,851 19,810 38,119 1,841,922 |
Around 2,2% of improvement. May be it may not seem a lot, but if you are trying to get the best compression possible, then you should learn it. Those test were done with lzp. As long as I know with other models, like lzw or lz77 the improvement may be higher, till 10% or so.
Flexible parsing
The idea is quite simple, you try to get the always the optimal match,
for doing so we only have to care about getting the longest match in the
current match and also in the next match, this will ensure an optimal parsing.
How can we achieve so? what we do is to assign a value for very byte in
a phrase, (match) the value depends both in the length that it will provide
in the current phrase and also in the next match. Let's say we have a match
like that:
Implementation
We could save in an array both the length of the current match and
the next one, but this will not be optimal, there's a better solution in
both memory usage and speed. The way I did it was the following: first
the only modification which flexible parsing is in the length of the current
match. I kept two variables, one with the original length of the match,
and other with the value of the original length, and then for every byte
in the phrase I computed the value, the addition of both the length which
it will do in the current match and also in the next, if the value was
above than the value in the variable, I updated both the final length of
the current match and also the bigger value. Remember that we only enter
the routine of flexible parsing when we already have a match, first let's
define the variables:
Worst case
The speed of flexible parsing is ok for most matches, but it has a
worst case, which is even worse than bwt's
worst case, in our case the weakness is the same: long runs of the
same byte or a repeated pattern. (The first case is represented in the
file pic from the Calgary Corpus set, the second one is more unusual) Btw:
in the case of Pic my compressor which had a maximum length of 256 bytes,
took more than 12 minutes to compress pic, well in fact I think it was
more, but I was tired of waiting, so I stopped compressing it. The way
to overpass this problem is to use rle at the same time than the lz compressor
or before it. The file pic was compressed down to 79,070 bytes with first
an rle pass with byte based output, and then
flexible parsing. (with lzp)
Closing words
Thanks to Malcolm Taylor for his explanation of flexible parsing. If
your results (in improvement or whatever) with flexible parsing were different
than mines, then let me know. If you find any mistake in this article,
email me.
Contacting the
author
You can reach me via email at: arturo@arturocampos.com
Also don't forget to visit my home page http://www.arturocampos.com
where you can find more and better info. See you in the next article!
This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.