Ravi and James, thank you very much.

I hardly understood anything of your last answers, but now I know what
to look for.

Thank you very much again.

I hope I can find and/or develop what I need. Of course I'll let you
know.

Bye and thank you again!


 Posted Via Usenet.com Premium Usenet Newsgroup Services
----------------------------------------------------------
    ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
----------------------------------------------------------        
                http://www.usenet.com

"Ravi Srikantiah" <ravi.srikantiah@gmail.com> wrote in message
news:d0lva4$9lb$1@alageremail2.agere.com...

> How much advantage you'd derive implementing packed arithmetic using
> bitwise operations is questionable. If you were to pack two bytes on a
> 16-bit processor for example, an ADD would require an XOR, SHIFT, and an
> AND operation for each bit - that would be 8*3 cycles for 2 8-bit adds,
> not considering looping and initialization overheads. Even if you could
> use a processor that has parallel units and pipeline operations, it
> would still be pretty difficult to achieve 2 8-bit adds in 2 cycles (or
> in 1-cycle as a processor that supports packed arithmetic in hardware
> would achieve).

You are thinking at orthogonal trajectories to the optimal
method.  Consider two 8-bit numbers, [a7 a6 a5 a4 a3 a2 a1 a0]
and [b7 b6 b5 b4 b3 b2 b1 b0].  The low bit of the result is
a0 XOR b0, while the carry up to the next bit is c1 = a0 AND b0.
The next bit of the result is a1 XOR b1 XOR c1 while the next
carry would be c2 = (a1 AND b1) OR (c1 AND (a1 XOR b1)).
Similarly for the bits 2:6, bit 7 only needs a7 XOR b7 XOR c7,
so for a total of 2+6*5+2 = 34 operations we could do 64 8-bit
adds with 64-bit registers.  Of course we could pack 7 8-bit
quantities into a 64-bit register with space between each and
do 7 8-bit adds in 2 operations (1 ADD and 1 AND) even if the
architecture didn't offer an 8-way parallel 8-bit addition
operation.  It's left as an exercise to show that you can
do a POPCNT on a large array in less than 6 operations per
word (not counting loads and stores) so this can be efficient
if the word size is large, like 64 bits.  And the original
question was about boolean variables in which case it's
certainly possible to knock off the word size per operation.

-- 
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

>>Well... not unless you have a SIMD processor that is wired to break up
>>its native word into that many little words. Typically only large word
>>sizes are broken up - pentium breaks up a 128-bit word into 4 32-bit
>>words. I'm not aware of any processor that breaks up a word into bits
>>and operates on them independently.
> 
> 
> And I'm not aware of a processor that doesn't have at least
> bitwise logical AND, OR, XOR.  I would routinely process 64
> bits in parallel on my Alpha using these operations, and mmx
> allows you to do so on x86.  The problem for FFTs is that
> overflow happens pretty quickly with 1-bit words.  FFTs mod
> 2 might not be so bad, though.
> 

How much advantage you'd derive implementing packed arithmetic using 
bitwise operations is questionable. If you were to pack two bytes on a 
16-bit processor for example, an ADD would require an XOR, SHIFT, and an 
AND operation for each bit - that would be 8*3 cycles for 2 8-bit adds, 
not considering looping and initialization overheads. Even if you could 
use a processor that has parallel units and pipeline operations, it 
would still be pretty difficult to achieve 2 8-bit adds in 2 cycles (or 
in 1-cycle as a processor that supports packed arithmetic in hardware 
would achieve).

> FFTs mod
> 2 might not be so bad, though.
> 
Unfortunately not for my target. I need all the information.

> I'm not aware of any processor that breaks up a word into bits
> and operates on them independently.

I'm not talking about breaking up, but packing together! This way one
word contains my signal in say 16 (to avoid overflow on a 32 bit
processor) different points. And then the first sums are computed.
Then 8 each word, then 4, so that the speed gain is very high in the
first set of sums, and decreases progressively.

Unfortunately I could hardly write a simple FFT with double, I could
never write anything of the kind I said above and I was wondering
whether anyone already had done it!

Thank you again! The advices you've already given me have been
extremely useful.

               By!


 Posted Via Usenet.com Premium Usenet Newsgroup Services
----------------------------------------------------------
    ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
----------------------------------------------------------        
                http://www.usenet.com

"Ravi Srikantiah" <ravi.srikantiah@gmail.com> wrote in message
news:d0hob7$ras$1@alageremail2.agere.com...

> Well... not unless you have a SIMD processor that is wired to break up
> its native word into that many little words. Typically only large word
> sizes are broken up - pentium breaks up a 128-bit word into 4 32-bit
> words. I'm not aware of any processor that breaks up a word into bits
> and operates on them independently.

And I'm not aware of a processor that doesn't have at least
bitwise logical AND, OR, XOR.  I would routinely process 64
bits in parallel on my Alpha using these operations, and mmx
allows you to do so on x86.  The problem for FFTs is that
overflow happens pretty quickly with 1-bit words.  FFTs mod
2 might not be so bad, though.

-- 
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

Thank you very much Mark!
I've looked for integer fft and single bit fft in fftw documentation,
but not single float! This is a great idea, the speed should become
almost twice what it is now. I'll try it soon!

Thank you again!


 Posted Via Usenet.com Premium Usenet Newsgroup Services
----------------------------------------------------------
    ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
----------------------------------------------------------        
                http://www.usenet.com


ZapoTex wrote:

> Hello everyone!
> 
> I have a simple question: I have a large number of large images, but
> with a "color" depth of just 1 bit: my signal can just be "on" or
> "off". I'm using fftw, that uses double precision floating point
> numbers! Isn't it a waste of cpu time and memory?
> 
> What I need is an FFT subroutine optimised for one bit numbers. Of
> course the result must be float or double.
> 
> I've always been a "black box" programmer (I don't know what a
> processor does, I just care about making my programs work), then I
> don't know whether what I say makes sense. Would I really gain speed?
> Can a processor compute 32 or at least 16 operations at a time if the
> operands are boolean variables?

Well... not unless you have a SIMD processor that is wired to break up 
its native word into that many little words. Typically only large word 
sizes are broken up - pentium breaks up a 128-bit word into 4 32-bit 
words. I'm not aware of any processor that breaks up a word into bits 
and operates on them independently.

> 
> Thank you very much. And I hope my question is useful for someone else
> too!
> 
> Thank you again.
> 
> ZapoTex
> 
> 
>  Posted Via Usenet.com Premium Usenet Newsgroup Services
> ----------------------------------------------------------
>     ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
> ----------------------------------------------------------        
>                 http://www.usenet.com

The default mode of FFTW is double, but you can make it use floats 
instead with 'configure --enable-float'  and some syntactical changes 
(see docs).  That should buy you a little speed.



ZapoTex wrote:
> Hello everyone!
> 
> I have a simple question: I have a large number of large images, but
> with a "color" depth of just 1 bit: my signal can just be "on" or
> "off". I'm using fftw, that uses double precision floating point
> numbers! Isn't it a waste of cpu time and memory?
> 
> What I need is an FFT subroutine optimised for one bit numbers. Of
> course the result must be float or double.
> 
> I've always been a "black box" programmer (I don't know what a
> processor does, I just care about making my programs work), then I
> don't know whether what I say makes sense. Would I really gain speed?

If you are using a processor that has simd extensions in FFTW, the 
answer is "probably not".

FFTW is pretty darned fast on Intel, AMD, PPC cpus.


> Can a processor compute 32 or at least 16 operations at a time if the
> operands are boolean variables?
> 
> Thank you very much. And I hope my question is useful for someone else
> too!
> 
> Thank you again.
> 
> ZapoTex
> 
> 
>  Posted Via Usenet.com Premium Usenet Newsgroup Services
> ----------------------------------------------------------
>     ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
> ----------------------------------------------------------        
>                 http://www.usenet.com

Hello everyone!

I have a simple question: I have a large number of large images, but
with a "color" depth of just 1 bit: my signal can just be "on" or
"off". I'm using fftw, that uses double precision floating point
numbers! Isn't it a waste of cpu time and memory?

What I need is an FFT subroutine optimised for one bit numbers. Of
course the result must be float or double.

I've always been a "black box" programmer (I don't know what a
processor does, I just care about making my programs work), then I
don't know whether what I say makes sense. Would I really gain speed?
Can a processor compute 32 or at least 16 operations at a time if the
operands are boolean variables?

Thank you very much. And I hope my question is useful for someone else
too!

Thank you again.

ZapoTex


 Posted Via Usenet.com Premium Usenet Newsgroup Services
----------------------------------------------------------
    ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
----------------------------------------------------------        
                http://www.usenet.com