"Flexible parsing"
by
Arturo San Emeterio Campos







Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved. Permission is granted to make verbatim copies of this document for private use only.
 

Download
Download the whole article zipped.
 

Table of contents

  • Introduction
  • Improvement
  • Flexible parsing
  • Implementation
  • Worst case
  • Closing words
  • Contacting the author

  •  

     
     
     
     

    Introduction
    Have you ever implemented a codec from the lz family? If you have done, I'm sure that you have done greedy parsing (getting the longer phrase at the current position, wihtout caring about if this the total next phrases or previous ones) why? may be because you don't know what optimal parsing is, or you think that it's too slow. But now there's a new scheme called flexible parsing which improves all the codecs from the lz family. The speed? well, it isn't greedy parsing, but it's fast enough to be used in any real-time application. Also you don't have to change the decompressor, only the compressor. Perhaps you already have tried to read the original article but you didn't understand anything, it also happened to me. OE-) But I was lucky, Malcolm Taylor explained it to me. Now I teach it to you. If something is wrong in this article, don't blame me, instead tell me what is wrong and I'll fix it.

    Improvement
    Before explaining it I'll show some results, and you'll be able to decide whatever is worth to implement it or not. Those are the results for some of the files from Calgary Corpus set.
     
    File Greedy parsing  Flexible parsing 
    bib 
    book1 
    book2 
    news 
    obj1 
    obj2 
    paper1 
    paper2 
    progc 
    progl 
    progp 
    trans 
    Total
     73,141 
     661,229 
     439,765 
     258,813 
     13,371 
     120,667 
     36,730 
     62,291 
     22,980 
     33,110 
     206,68 
     39,579 
     1,883,497
     71,990 
     658,595 
     435,254 
     256,417 
     13,287 
     116,946 
     36,309 
     61,724 
     22,550 
     31,851
     19,810 
     38,119 
     1,841,922

    Around 2,2% of improvement. May be it may not seem a lot, but if you are trying to get the best compression possible, then you should learn it. Those test were done with lzp. As long as I know with other models, like lzw or lz77 the improvement may be higher, till 10% or so.

    Flexible parsing
    The idea is quite simple, you try to get the always the optimal match, for doing so we only have to care about getting the longest match in the current match and also in the next match, this will ensure an optimal parsing. How can we achieve so? what we do is to assign a value for very byte in a phrase, (match) the value depends both in the length that it will provide in the current phrase and also in the next match. Let's say we have a match like that:

  •  "abcd"

  • The value of "a" is equal to 1 (the current length of the match) + the length of the next match. (in this case the bytes to match will be "bcd..." and the length of this match corresponds to the phrase which we may find in the dictionary) We do this for all the bytes and then decide what is the better one, it will be the one with the higher value. Let's say we have this: In this case we'll choose our phrase to be 4 byte long, because it will provide the better match, for both the current phrase and the next. If we had this: We'll choose that our current match (which initially was 4 bytes long) now it should be 3 bytes long. (and thus making the next match to be longer)

    Implementation
    We could save in an array both the length of the current match and the next one, but this will not be optimal, there's a better solution in both memory usage and speed. The way I did it was the following: first the only modification which flexible parsing is in the length of the current match. I kept two variables, one with the original length of the match, and other with the value of the original length, and then for every byte in the phrase I computed the value, the addition of both the length which it will do in the current match and also in the next, if the value was above than the value in the variable, I updated both the final length of the current match and also the bigger value. Remember that we only enter the routine of flexible parsing when we already have a match, first let's define the variables:

    And now the pseudo code: Once this loop is over we have on "final_length" the length for the current match, you only have to code the current match with this length, and you can continue parsing the input. Note that "counter" also keeps track of the position inside of the phrase.
     

    Worst case
    The speed of flexible parsing is ok for most matches, but it has a worst case, which is even worse than bwt's worst case, in our case the weakness is the same: long runs of the same byte or a repeated pattern. (The first case is represented in the file pic from the Calgary Corpus set, the second one is more unusual) Btw: in the case of Pic my compressor which had a maximum length of 256 bytes, took more than 12 minutes to compress pic, well in fact I think it was more, but I was tired of waiting, so I stopped compressing it. The way to overpass this problem is to use rle at the same time than the lz compressor or before it. The file pic was compressed down to 79,070 bytes with first an rle pass with byte based output, and then flexible parsing. (with lzp)
     

    Closing words
    Thanks to Malcolm Taylor for his explanation of flexible parsing. If your results (in improvement or whatever) with flexible parsing were different than mines, then let me know. If you find any mistake in this article, email me.
     

    Contacting the author
    You can reach me via email at: arturo@arturocampos.com  Also don't forget to visit my home page http://www.arturocampos.com where you can find more and better info. See you in the next article!
     
     

    Arturo San Emeterio Campos, Barcelona 24-Jul-1999


    This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.