Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved.
Permission is granted to make verbatim copies of this document for private
use only.
Download
Download the whole article zipped.
Table of contents
Introduction
This article is aimed to a coder that has already implemented a compressor,
and know how ASM works, and how to optimize for pentium. If you don't ever
did some of those things then better do them, and when it's done read this.
Everybody that has even learned assembler has also 'studied' the putpixel.
A very optimized routine. Why? That's for three reasons: first is the way
for understanding how the vram works, second it's used always in video
routines, and third, everybody showed his putpixel. Now think about the
putbits. It's almost the most used routine from compressors. A lot of clocks
are spend in your putbits. Do you really think you can optimize you compressor/parser/encoder/...
enough? I hope so, but a fast putbits is always needed, so here we talk
about it. We will use real mode, and we will optimize for a pentium plain
(586), so specific optimizations for others procesors will not be putted
there, unless someone really feels that's necessary. And I don't think
so E-). This article isn't source-code only, so better you understand how
everything works.
The 'proc'
The routine is simple: everytime it's called it needs a byte of source,
the bits to put into the output buffer, and the number of bits to write.
Then the routine put the bits, if the byte is full (= we writed 8 bits)
we update the output buffer. For putting the bit in the right position
we shift it and then 'or' the output byte. Here it is the putbits that
I use, at least till now: ;-) The ascii version is here.
test cx,cx ;I used that. Put it ONLY if you know
jz short @@ptb_fin ;that maybe you will try to write 0 bits
;but better avoid writing 0 bits ;-D
push ax es di ;Save only the registers that you know
;that are currently used.
mov es,buf_com ;ds it's faster than any other segment
;register. But it's needed for some data
@@ptb_more_bits_:
;in my 'main' loop OE-)
push cx ;We need cx for the shl/cl
xor ax,ax
;put it to 0
mov bh,ptb_byte ;get the byte to finally write
mov al,bl
;get the input byte (when the bits are)
shr al,1 ;put the first bit in the carry flag
adc ah,0 ;Then it to ah. I know there's salc
;but I think it's an inst. undocumented
;from Intel's pentium, so...
mov bl,al ;save the shifted byte
mov cl,ptb_totalbits ;how many bits we writed
shl ah,cl ;Put the bit in his position
or bh,ah ;and then update the output byte
mov ptb_byte,bh ;save the output byte
;now let's see if we must update the output buffer
inc ptb_totalbits ;we writed one bit
cmp ptb_totalbits,8 ;did we writed the full byte?
jb short @@ptb_no_byte
;90h
mov ptb_totalbits,0 ;start again to write the bits
mov di,ptr_com ;the pointer to the input buffer
mov es:[di],bh ;the output byte
inc di ;next position
mov ptr_com,di ;save it for next time
inc bytes_com ;the number of written bits
mov ptb_byte,0 ;erase the output byte
@@ptb_no_byte:
pop cx ;the number of bits to write
loop short @@ptb_more_bits_ ;are we done?
pop di es ax ;restore the used registers
@@ptb_fin:
ret
Problems? here? a lot. There's no pairing, never. the shl/cl takes 4
clocks always push&pop isn't very funny, of course everything can be
worser, the data maybe misaligned, or in the same dword-boundary, ... Let's
count the clocks: 6+(16*number of bits)+6(if we update out. buffer) This
is aprox. the clocks for that, of course assuming all the data in the level
1 cache (that's like assuming bill gates is documenting every secret of
windows ;-) For writing one bit and updating the buffer this will take
28 clocks. I tried it with a program that reads the counters, and it took
65 clocks! Writing 8 bits, so writing the byte to the output buffer too:
252 clocks.
First optimization
We will try to do more pairing, and loading, and saving the data only
when it starts and when ends:
mov bh,ptb_byte
mov cl,ptb_totalbits
mov es,buf_com ;yes, still using es }E-)
mov di,ptr_com
push cx di
;this pairs
@@ptb_more_bits_: ;the 'main' loop
xor ah,ah
;1 u
shr al,1 ;2 u
adc ah,0 ;3 u
shl ah,cl ;4-7 u - the bits writed
or bh,ah ;8 u
inc cl ;1 v
cmp cl,8 ;9 u
jb short @@ptb_no_byte ;9 v
xor cx,cx ;1 u - the count of writed bits
mov es:[di],bh ;2 u
inc di ;2 v - next byte
mov ptr_com,di ;3 u - save only if incremented
inc bytes_com ;3 v
xor bh,bh ;4 u - output byte
@@ptb_no_byte:
dec dx
;10 u - the counter
jnz short @@ptb_more_bits_
;10 v
mov ptb_byte,bh
mov ptb_totalbits,cl
pop di cx
;this pairs too
ret
This version has been debbuged at least twice or more. It works. And
it's better than the first. 7+(10*number of bits)+4(if we update out. buffer)
It's ok, don't you think? we improved the pairing, and sure that writing
for example 5 bits it's far better from the first. The real timing for
this is 51 clocks per one bit. For 8 bits 145! We have reduced from 252
to 145. Quite good. E-) The bit organization is the following:
| eigth bit | ... | ... | ... | ... | third bit | second bit | first bit |
One of the fucking things in this loop is the 4 clocks shl/cl. How can
we avoid it. Easy, everytime we put a new bit we shift the byte to the
left. This implies two things: we only spend 1 clock with the shift, BUT
the order of the bits in the byte is inverted, so:
| first bit | second bit | third bit | ... | ... | ... | ... | eigth bit |
If you choose this putbits, then you MUST change your getbits, but not a lot, your first getbits was shifting to the left, this will be to the right.
put_bits:
push
di
mov
es,buf_com ;some initalizations
mov di,ptr_com
mov
ah,ptb_byte
mov
ch,ptb_totalbits
@@ptb_more_bits_: ;the 'main' loop
shr
al,1
;u 1 - put bit in the carry flag
adc
ah,0
;u 2 - put it in output byte
inc
ch
;v 2 - the count of writed bits
cmp
ch,8
;3 u
jb short
@@ptb_no_byte ;3 v
xor
ch,ch ;1 u
- the count of writed bits
mov
es:[di],ah ;2 u
inc
di
;2 v
mov
ptr_com,di ;3 u
inc bytes_com ;3 v
xor
ah,ah ;4 u
- clear output byte
@@ptb_no_byte:
shl
ah,1
;u 4 - put the bit in his position
dec
cl
;5 u - the counter
jnz short
@@ptb_more_bits_ ;5 v
mov
ptb_byte,ah
mov
ptb_totalbits,ch
pop
di
ret
There are somethings to say: Why the 'shl ah,1' is after the comparation,
and not just after the 'inc ch'. Easy, if when you have writed 8 bits you
shift again to the left, then the bit #7 will go out. In that way they
are only shifted if the byte isn't full. Once the updating of the buffer
'shl ah,1' works, but don't worry the contents of ah are 0 so there's no
problem with it. As you see the body of the loop is smaller and all the
inst. are simples and only ine clock. I will not care more for 'theory'
timing, so here it's the real timing: One bit: 41 Eight bits: 110
It improves the previous version, and it's more than 50% better than first.
Hey, I'm proud of it. OE-) I must say that the first was very poor coded,
so it wasn't a very difficult job. E-)
The same in pmode
Well, coz now I'm learning pmode I decided to do the same version but
in pmode, this is not really very different, anyway here it goes:
mov
edi,ptr_com ;I like pmode E-)
mov
ah,ptb_byte
mov
ch,ptb_totalbits
push
edi
@@ptb_more_bits_: ;the 'main' loop
shr
al,1
;u 1 - put bit in the carry flag
adc
ah,0
;u 2 - put it in output byte
inc
ch
;v 2 - the count of writed bits
cmp
ch,8
;3 u
jb short @@ptb_no_byte
;3 v
xor
ch,ch ;1 u
- the count of writed bits
mov
[edi],ah ;2 v
inc
edi
;3 u
inc
bytes_com ;3 v - I reduced 2 clocks
putting
mov
ptr_com,edi ;3 u - <- this there
xor
ah,ah ;4 u
- clear output byte
@@ptb_no_byte:
shl
ah,1
;u 4 - put the bit in his position
dec
cl
;5 u - the counter
jnz short @@ptb_more_bits_
;5 v
mov
ptb_byte,ah
mov
ptb_totalbits,ch
pop
edi
ret
This is, at least till someone publishes on better, the best put_bits out there. Putting one bit: 39 clocks. Putting 8: 105 clocks.
Btw: if you think that the code is too bad formatted, then I also think
so. E-) Just download the zip, with everything
formatted ok.
This document isn't a class of "How to properly cut'n'paste" this is
a "Learn, and implement", or at least I wanted this in that way, but optimization
requires examples, so the source code is there. But there's still the get
bits, once you've learned all this stuff write your own. If you get good
results, show us your implementation. Did you really thought you will get
no homework. X-D Remember that if you use the last version for the getbits
you must use shr, and if you use the others the shl. Or something like
that. OE-)
Today I was planning the memory needed for my new compressor (really
a lot E-). And coz the memory is limited, you have to choose well how will
you distribute it. Let's think. A lot of memory for the input file. More
memory for the data structures needed by the compressor... So now we have
almost no free mem, and we still need memory for the output buffer. And
I arrived to this (simple) solution: The output buffer will be 5k longs
or something like that, the put_bits will have a counter with all the bytes
writed in this buffer (this is not bytes_com) and when it's 5k (so the
buffer is full) we write it to the output file, and start again writing
in the 5k buffer. May be 5k is not enough and spends a lot of time writing.
Almost three months have passed since the first version of this article,
the time quickly goes by... but this is not the point of this article.
E-) Lastly I've been implementing a lot of things, and I've also achieved
a better putbits. It started when I was talking with Michael
Schindler he telled me about the putbits: "you still loop over the
bits? you should write them at once" And then I started to think about,
did some pseudo code, implemented it, and it worked perfectly and faster.
Here I'll not present code, you should be able to do it yourself, however
if you achieve a good implementation, email-me,
and let the ppl see it.
The idea is the following, instead of writing bit by bit, we'll have
to manage to do an or and in this way put all the bits. First a bit ordering
should be defined:
| first bit | second bit | third bit | ... | ... | ... | ... | seventh bit |
Where 0 is the first bit that a getbits must read. If we have a code
like 01b it will be inserted that way: 01b, and when properly shifted:
| 0 | 1 |
Once you call it, you have three different cases:
1. In the first case very everything is perfect, you just or the output
with the input, and output the byte. Let's say we have 1101b, and we have
to put 0011b, first you shift to the left (input_length times): 11010000b
And or it: 11010011b This byte is ready for output.
2. The second case isn't a real problem. We just have to do the same than the first case, but we don't output the byte.We had 11b, and we have to put 101b. Shift to the left: 11000b And or: 11101b
3. The third case IS the problem, coz we just can't put the low part
of it, and then write in another, we have to put the HIGH part. (remember,
from rigth to left) So if we have the code 111111b, and we have to write
0100b. It can't be: 11111100b
Coz the decoder will read first the 00 bits of the code, it must be:
11111101b And then in the next byte 000b
A pseudo-code for this putbits will look like that:
Init:
putbits:
Test of fastest
putbits
And here it is a test, so you can see that it works:
Getbits for
fastest putbits
The way the getbits should get the value is the following: (4 bits
value)
|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
The body of the loop of the pseudo code will look like that:
Sometimes we only need to write or read one bit, probably when using
flags, in this case is a good idea to do an special getbit an putbit, the
improvement is that they don't have to care about if there are more bits
to write. So there's no loop at all. Even if you are using the fastest
putbits it may be a good idea, mainly coz in that case you only have two
cases, or the number of its to write (1) is equal, or below the number
of bits which you have to write, think about why.
Closing words
Of course this doc isn't finished, it still need YOUR help. If you
improve the put_bits, or do a good get_bits, send them to me, and a new
version of that doc will be released, you will be properly credited, let's
do a good piece of optimization. Of course other ideas around that will
be accepted, making the put_bits getting the source bits from a word, etc.
If you read something that doesn't make sense or something, lete me know.
If you have any idea, email me: arturo@arturocampos.com
Nothing else to say. Well, something else, if I don't get any feedback
I will not support more this doc. E-(
Contacting the
author
You can reach me via email at: arturo@arturocampos.com
Also don't forget to visit my home page http://www.arturocampos.com
where you can find more and better info. See you in the next article!
This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.