"Optimizing the putbits"
by
Arturo San Emeterio Campos




Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved. Permission is granted to make verbatim copies of this document for private use only.
 

Download
Download the whole article zipped.
 

Table of contents

  • Introduction
  • The 'proc'
  • First optimization
  • Second optimization
  • The same in pmode
  • Get bits
  • Limiting the buffer
  • Fastest putbits
  • Test of fastest putbits
  • Getbits for fastest putbits
  • Putting one bit
  • Closing words
  • Contacting the author

  •  

     
     
     
     

    Introduction
    This article is aimed to a coder that has already implemented a compressor, and know how ASM works, and how to optimize for pentium. If you don't ever did some of those things then better do them, and when it's done read this. Everybody that has even learned assembler has also 'studied' the putpixel. A very optimized routine. Why? That's for three reasons: first is the way for understanding how the vram works, second it's used always in video routines, and third, everybody showed his putpixel. Now think about the putbits. It's almost the most used routine from compressors. A lot of clocks are spend in your putbits. Do you really think you can optimize you compressor/parser/encoder/... enough? I hope so, but a fast putbits is always needed, so here we talk about it. We will use real mode, and we will optimize for a pentium plain (586), so specific optimizations for others procesors will not be putted there, unless someone really feels that's necessary. And I don't think so E-). This article isn't source-code only, so better you understand how everything works.
     

    The 'proc'
    The routine is simple: everytime it's called it needs a byte of source, the bits to put into the output buffer, and the number of bits to write. Then the routine put the bits, if the byte is full (= we writed 8 bits) we update the output buffer. For putting the bit in the right position we shift it and then 'or' the output byte. Here it is the putbits that I use, at least till now:  ;-) The ascii version is here.


    ; Putbits
    ; Input: cx-> number of bits to write  bl->bits to be written
    ; Output: nothing
    ; data: ptb_byte, ptb_totalbits, output buffer, bytes_com


    put_bits:

     test  cx,cx  ;I used that. Put it ONLY if you know
     jz short @@ptb_fin ;that maybe you will try to write 0 bits
                 ;but better avoid writing 0 bits ;-D
     push ax es di   ;Save only the registers that you know
                 ;that are currently used.
     mov  es,buf_com ;ds it's faster than any other segment
                 ;register. But it's needed for some data
    @@ptb_more_bits_:                       ;in my 'main' loop OE-)
     push  cx  ;We need cx for the shl/cl
     xor  ax,ax           ;put it to 0
     mov  bh,ptb_byte ;get the byte to finally write
     mov   al,bl           ;get the input byte (when the bits are)
     shr  al,1  ;put the first bit in the carry flag
     adc  ah,0  ;Then it to ah. I know there's salc
                 ;but I think it's an inst. undocumented
                 ;from Intel's pentium, so...
     mov  bl,al  ;save the shifted byte
     mov  cl,ptb_totalbits ;how many bits we writed
     shl  ah,cl  ;Put the bit in his position
     or  bh,ah  ;and then update the output byte
     mov  ptb_byte,bh ;save the output byte
     ;now let's see if we must update the output buffer
     inc  ptb_totalbits ;we writed one bit
     cmp  ptb_totalbits,8 ;did we writed the full byte?
     jb short        @@ptb_no_byte   ;90h
     mov  ptb_totalbits,0 ;start again to write the bits
     mov  di,ptr_com ;the pointer to the input buffer
     mov  es:[di],bh ;the output byte
     inc  di  ;next position
     mov  ptr_com,di ;save it for next time
     inc  bytes_com ;the number of written bits
     mov  ptb_byte,0 ;erase the output byte
    @@ptb_no_byte:
     pop  cx  ;the number of bits to write
     loop short @@ptb_more_bits_ ;are we done?
     pop di es ax  ;restore the used registers
    @@ptb_fin:
     ret
     

    Problems? here? a lot. There's no pairing, never. the shl/cl takes 4 clocks always push&pop isn't very funny, of course everything can be worser, the data maybe misaligned, or in the same dword-boundary, ... Let's count the clocks: 6+(16*number of bits)+6(if we update out. buffer) This is aprox. the clocks for that, of course assuming all the data in the level 1 cache (that's like assuming bill gates is documenting every secret of windows ;-) For writing one bit and updating the buffer this will take 28 clocks. I tried it with a program that reads the counters, and it took 65 clocks! Writing 8 bits, so writing the byte to the output buffer too: 252 clocks.
     

    First optimization
    We will try to do more pairing, and loading, and saving the data only when it starts and when ends:


    ; Putbits
    ; Input: dx-> number of bits to write  al->bits to be written
    ; Output: nothing
    ; data: ptb_byte, ptb_totalbits, output buffer, bytes_com


    put_bits:

     mov  bh,ptb_byte
     mov  cl,ptb_totalbits
     mov  es,buf_com ;yes, still using es }E-)
     mov  di,ptr_com
     push  cx di           ;this pairs

    @@ptb_more_bits_:   ;the 'main' loop

     xor  ah,ah           ;1 u
     shr  al,1  ;2 u
     adc  ah,0  ;3 u
     shl  ah,cl  ;4-7 u - the bits writed
     or  bh,ah  ;8 u
     inc  cl  ;1 v

     cmp  cl,8  ;9 u
     jb short @@ptb_no_byte ;9 v

     xor  cx,cx  ;1 u - the count of writed bits
     mov  es:[di],bh ;2 u
     inc  di  ;2 v - next byte
     mov  ptr_com,di ;3 u - save only if incremented
     inc  bytes_com ;3 v
     xor  bh,bh  ;4 u - output byte

    @@ptb_no_byte:

     dec dx              ;10 u - the counter
     jnz short       @@ptb_more_bits_ ;10 v

     mov  ptb_byte,bh
     mov  ptb_totalbits,cl
     pop di cx           ;this pairs too
     ret

    This version has been debbuged at least twice or more. It works. And it's better than the first. 7+(10*number of bits)+4(if we update out. buffer) It's ok, don't you think? we improved the pairing, and sure that writing for example 5 bits it's far better from the first. The real timing for this is 51 clocks per one bit. For 8 bits 145! We have reduced from 252 to 145. Quite good. E-) The bit organization is the following:
     
    eigth bit  ... ... ... ... third bit  second bit  first bit 

    Second optimization

    One of the fucking things in this loop is the 4 clocks shl/cl. How can we avoid it. Easy, everytime we put a new bit we shift the byte to the left. This implies two things: we only spend 1 clock with the shift, BUT the order of the bits in the byte is inverted, so:
     
    first bit  second bit  third bit  ... ... ... ... eigth bit 

    If you choose this putbits, then you MUST change your getbits, but not a lot, your first getbits was shifting to the left, this will be to the right.



    ; Putbits
    ; Input: cl-> number of bits to write  al->bits to be written
    ; Output: nothing
    ; data: ptb_byte, ptb_totalbits, output buffer, bytes_com

    put_bits:

            push            di
            mov             es,buf_com      ;some initalizations
            mov  di,ptr_com
            mov             ah,ptb_byte
            mov             ch,ptb_totalbits

    @@ptb_more_bits_:   ;the 'main' loop

            shr             al,1            ;u 1 - put bit in the carry flag
            adc            ah,0            ;u 2 - put it in output byte
            inc             ch              ;v 2 - the count of writed bits

            cmp             ch,8            ;3 u
            jb short        @@ptb_no_byte   ;3 v

            xor                ch,ch           ;1 u - the count of writed bits
            mov              es:[di],ah      ;2 u
            inc                di              ;2 v
            mov             ptr_com,di      ;3 u
            inc  bytes_com ;3 v
            xor             ah,ah           ;4 u - clear output byte

    @@ptb_no_byte:

            shl             ah,1             ;u 4 - put the bit in his position
            dec             cl               ;5 u - the counter
            jnz short       @@ptb_more_bits_ ;5 v

            mov             ptb_byte,ah
            mov             ptb_totalbits,ch
            pop             di
            ret

    There are somethings to say: Why the 'shl ah,1' is after the comparation, and not just after the 'inc ch'. Easy, if when you have writed 8 bits you shift again to the left, then the bit #7 will go out. In that way they are only shifted if the byte isn't full. Once the updating of the buffer 'shl ah,1' works, but don't worry the contents of ah are 0 so there's no problem with it. As you see the body of the loop is smaller and all the inst. are simples and only ine clock. I will not care more for 'theory' timing, so here it's the real timing: One bit: 41  Eight bits: 110 It improves the previous version, and it's more than 50% better than first. Hey, I'm proud of it. OE-) I must say that the first was very poor coded, so it wasn't a very difficult job. E-)
     

    The same in pmode
    Well, coz now I'm learning pmode I decided to do the same version but in pmode, this is not really very different, anyway here it goes:


    ; Putbits. Pmode.
    ; Input: cl-> number of bits to write  al->bits to be written
    ; Output: nothing
    ; data: ptb_byte, ptb_totalbits, output buffer, bytes_com


    put_bits:

       mov             edi,ptr_com     ;I like pmode E-)
       mov             ah,ptb_byte
       mov             ch,ptb_totalbits
       push            edi

    @@ptb_more_bits_:                         ;the 'main' loop

       shr              al,1            ;u 1 - put bit in the carry flag
       adc             ah,0            ;u 2 - put it in output byte
       inc              ch              ;v 2 - the count of writed bits

       cmp             ch,8            ;3 u
       jb short        @@ptb_no_byte   ;3 v

       xor             ch,ch           ;1 u - the count of writed bits
       mov            [edi],ah        ;2 v
       inc              edi             ;3 u
       inc              bytes_com       ;3 v - I reduced 2 clocks putting
       mov            ptr_com,edi     ;3 u - <- this there
       xor             ah,ah           ;4 u - clear output byte

    @@ptb_no_byte:

       shl               ah,1            ;u 4 - put the bit in his position
       dec             cl              ;5 u - the counter
       jnz short      @@ptb_more_bits_ ;5 v

       mov             ptb_byte,ah
       mov             ptb_totalbits,ch
       pop             edi
       ret
     

    This is, at least till someone publishes on better, the best put_bits out there. Putting one bit: 39 clocks. Putting 8: 105 clocks.

    Btw: if you think that the code is too bad formatted, then I also think so. E-) Just download the zip, with everything formatted ok.
     

    Get bits

    This document isn't a class of "How to properly cut'n'paste" this is a "Learn, and implement", or at least I wanted this in that way, but optimization requires examples, so the source code is there. But there's still the get bits, once you've learned all this stuff write your own. If you get good results, show us your implementation. Did you really thought you will get no homework. X-D Remember that if you use the last version for the getbits you must use shr, and if you use the others the shl. Or something like that. OE-)
     

    Limiting the buffer

    Today I was planning the memory needed for my new compressor (really a lot E-). And coz the memory is limited, you have to choose well how will you distribute it. Let's think. A lot of memory for the input file. More memory for the data structures needed by the compressor... So now we have almost no free mem, and we still need memory for the output buffer. And I arrived to this (simple) solution: The output buffer will be 5k longs or something like that, the put_bits will have a counter with all the bytes writed in this buffer (this is not bytes_com) and when it's 5k (so the buffer is full) we write it to the output file, and start again writing in the 5k buffer. May be 5k is not enough and spends a lot of time writing.
     

    Fastest putbits

    Almost three months have passed since the first version of this article, the time quickly goes by... but this is not the point of this article. E-) Lastly I've been implementing a lot of things, and I've also achieved a better putbits. It started when I was talking with Michael Schindler he telled me about the putbits: "you still loop over the bits? you should write them at once" And then I started to think about, did some pseudo code, implemented it, and it worked perfectly and faster. Here I'll not present code, you should be able to do it yourself, however if you achieve a good implementation, email-me, and let the ppl see it.
    The idea is the following, instead of writing bit by bit, we'll have to manage to do an or and in this way put all the bits. First a bit ordering should be defined:
     
    first bit  second bit  third bit  ... ... ... ... seventh bit 

    Where 0 is the first bit that a getbits must read. If we have a code like 01b it will be inserted that way: 01b, and when properly shifted:
     
     0             

    Once you call it, you have three different cases:

    1. Needed_byte == bits_to_put
    2. Needed_byte > bits_to_put
    3. Needed_byte < bits_to_put
    Rigth? then let's see them:
     

    1. In the first case very everything is perfect, you just or the output with the input, and output the byte. Let's say we have 1101b, and we have to put 0011b, first you shift to the left (input_length times):  11010000b  And or it:  11010011b This byte is ready for output.
     

    2. The second case isn't a real problem. We just have to do the same than the first case, but we don't output the byte.We had 11b, and we have to put 101b. Shift to the left: 11000b And or: 11101b

    3. The third case IS the problem, coz we just can't put the low part of it, and then write in another, we have to put the HIGH part. (remember, from rigth to left) So if we have the code 111111b, and we have to write 0100b. It can't be: 11111100b
    Coz the decoder will read first the 00 bits of the code, it must be: 11111101b And then in the next byte 000b
     

    A pseudo-code for this putbits will look like that:

    Init:

    Then the putbits:

    putbits:

    If we were short of space instead of writing at the end, we could call the putbits again. But it's not really a good idea, coz it will be slower. It would be better to compute (bits_to_put-empty_bits) just once, for speed reasons. Of course you also need an end_putbits, which should flush the buffer of bits and bytes, simply write the remaining bits. Note that any implementation may be fucked up, if the bits wich you don't have to write are 1. If we have to write the value 11b with 2 bits, and instead of 11b we pass it 1111b, it will fuck up, coz we do the ors with previous ands, but this is not a real problem, if we don't pass wrong bits.
     

    Test of fastest putbits
    And here it is a test, so you can see that it works:

    ------ We have to take the last case ---- I just showed the last case, but the other two are easier than this. If you do a good implementation, I'd be glad to see it.
     

    Getbits for fastest putbits
    The way the getbits should get the value is the following: (4 bits value)
     
    0 1 2 3 4 5 6 7
     òòòò
                 òòòò
    0 1 2 3 4 5 6 7

    The body of the loop of the pseudo code will look like that:

    This loops over, the bits, however a decoder is fast, so we don't really need to make a fast getbits. Of course if you do a fast getbits, send it to me, but remember to do it without looping over the bits.
     

    Putting one bit

    Sometimes we only need to write or read one bit, probably when using flags, in this case is a good idea to do an special getbit an putbit, the improvement is that they don't have to care about if there are more bits to write. So there's no loop at all. Even if you are using the fastest putbits it may be a good idea, mainly coz in that case you only have two cases, or the number of its to write (1) is equal, or below the number of bits which you have to write, think about why.
     

    Closing words
    Of course this doc isn't finished, it still need YOUR help. If you improve the put_bits, or do a good get_bits, send them to me, and a new version of that doc will be released, you will be properly credited, let's do a good piece of optimization. Of course other ideas around that will be accepted, making the put_bits getting the source bits from a word, etc. If you read something that doesn't make sense or something, lete me know. If you have any idea, email me: arturo@arturocampos.com Nothing else to say. Well, something else, if I don't get any feedback I will not support more this doc. E-(
     

    Contacting the author
    You can reach me via email at: arturo@arturocampos.com  Also don't forget to visit my home page http://www.arturocampos.com where you can find more and better info. See you in the next article!
     
     

    Arturo San Emeterio Campos, Barcelona 24-Jul-1999


    This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.