Forums

Single Bit Fast Fourier

Started by ZapoTex March 6, 2005
Hello everyone!

I have a simple question: I have a large number of large images, but
with a "color" depth of just 1 bit: my signal can just be "on" or
"off". I'm using fftw, that uses double precision floating point
numbers! Isn't it a waste of cpu time and memory?

What I need is an FFT subroutine optimised for one bit numbers. Of
course the result must be float or double.

I've always been a "black box" programmer (I don't know what a
processor does, I just care about making my programs work), then I
don't know whether what I say makes sense. Would I really gain speed?
Can a processor compute 32 or at least 16 operations at a time if the
operands are boolean variables?

Thank you very much. And I hope my question is useful for someone else
too!

Thank you again.

ZapoTex


 Posted Via Usenet.com Premium Usenet Newsgroup Services
----------------------------------------------------------
    ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
----------------------------------------------------------        
                http://www.usenet.com
The default mode of FFTW is double, but you can make it use floats 
instead with 'configure --enable-float'  and some syntactical changes 
(see docs).  That should buy you a little speed.



ZapoTex wrote:
> Hello everyone! > > I have a simple question: I have a large number of large images, but > with a "color" depth of just 1 bit: my signal can just be "on" or > "off". I'm using fftw, that uses double precision floating point > numbers! Isn't it a waste of cpu time and memory? > > What I need is an FFT subroutine optimised for one bit numbers. Of > course the result must be float or double. > > I've always been a "black box" programmer (I don't know what a > processor does, I just care about making my programs work), then I > don't know whether what I say makes sense. Would I really gain speed?
If you are using a processor that has simd extensions in FFTW, the answer is "probably not". FFTW is pretty darned fast on Intel, AMD, PPC cpus.
> Can a processor compute 32 or at least 16 operations at a time if the > operands are boolean variables? > > Thank you very much. And I hope my question is useful for someone else > too! > > Thank you again. > > ZapoTex > > > Posted Via Usenet.com Premium Usenet Newsgroup Services > ---------------------------------------------------------- > ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY ** > ---------------------------------------------------------- > http://www.usenet.com

ZapoTex wrote:

> Hello everyone! > > I have a simple question: I have a large number of large images, but > with a "color" depth of just 1 bit: my signal can just be "on" or > "off". I'm using fftw, that uses double precision floating point > numbers! Isn't it a waste of cpu time and memory? > > What I need is an FFT subroutine optimised for one bit numbers. Of > course the result must be float or double. > > I've always been a "black box" programmer (I don't know what a > processor does, I just care about making my programs work), then I > don't know whether what I say makes sense. Would I really gain speed? > Can a processor compute 32 or at least 16 operations at a time if the > operands are boolean variables?
Well... not unless you have a SIMD processor that is wired to break up its native word into that many little words. Typically only large word sizes are broken up - pentium breaks up a 128-bit word into 4 32-bit words. I'm not aware of any processor that breaks up a word into bits and operates on them independently.
> > Thank you very much. And I hope my question is useful for someone else > too! > > Thank you again. > > ZapoTex > > > Posted Via Usenet.com Premium Usenet Newsgroup Services > ---------------------------------------------------------- > ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY ** > ---------------------------------------------------------- > http://www.usenet.com
Thank you very much Mark!
I've looked for integer fft and single bit fft in fftw documentation,
but not single float! This is a great idea, the speed should become
almost twice what it is now. I'll try it soon!

Thank you again!


 Posted Via Usenet.com Premium Usenet Newsgroup Services
----------------------------------------------------------
    ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
----------------------------------------------------------        
                http://www.usenet.com
"Ravi Srikantiah" <ravi.srikantiah@gmail.com> wrote in message
news:d0hob7$ras$1@alageremail2.agere.com...

> Well... not unless you have a SIMD processor that is wired to break up > its native word into that many little words. Typically only large word > sizes are broken up - pentium breaks up a 128-bit word into 4 32-bit > words. I'm not aware of any processor that breaks up a word into bits > and operates on them independently.
And I'm not aware of a processor that doesn't have at least bitwise logical AND, OR, XOR. I would routinely process 64 bits in parallel on my Alpha using these operations, and mmx allows you to do so on x86. The problem for FFTs is that overflow happens pretty quickly with 1-bit words. FFTs mod 2 might not be so bad, though. -- write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, & 6.0134700243160014d-154/),(/'x'/)); end
> FFTs mod > 2 might not be so bad, though. >
Unfortunately not for my target. I need all the information.
> I'm not aware of any processor that breaks up a word into bits > and operates on them independently.
I'm not talking about breaking up, but packing together! This way one word contains my signal in say 16 (to avoid overflow on a 32 bit processor) different points. And then the first sums are computed. Then 8 each word, then 4, so that the speed gain is very high in the first set of sums, and decreases progressively. Unfortunately I could hardly write a simple FFT with double, I could never write anything of the kind I said above and I was wondering whether anyone already had done it! Thank you again! The advices you've already given me have been extremely useful. By! Posted Via Usenet.com Premium Usenet Newsgroup Services ---------------------------------------------------------- ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY ** ---------------------------------------------------------- http://www.usenet.com
>>Well... not unless you have a SIMD processor that is wired to break up >>its native word into that many little words. Typically only large word >>sizes are broken up - pentium breaks up a 128-bit word into 4 32-bit >>words. I'm not aware of any processor that breaks up a word into bits >>and operates on them independently. > > > And I'm not aware of a processor that doesn't have at least > bitwise logical AND, OR, XOR. I would routinely process 64 > bits in parallel on my Alpha using these operations, and mmx > allows you to do so on x86. The problem for FFTs is that > overflow happens pretty quickly with 1-bit words. FFTs mod > 2 might not be so bad, though. >
How much advantage you'd derive implementing packed arithmetic using bitwise operations is questionable. If you were to pack two bytes on a 16-bit processor for example, an ADD would require an XOR, SHIFT, and an AND operation for each bit - that would be 8*3 cycles for 2 8-bit adds, not considering looping and initialization overheads. Even if you could use a processor that has parallel units and pipeline operations, it would still be pretty difficult to achieve 2 8-bit adds in 2 cycles (or in 1-cycle as a processor that supports packed arithmetic in hardware would achieve).
"Ravi Srikantiah" <ravi.srikantiah@gmail.com> wrote in message
news:d0lva4$9lb$1@alageremail2.agere.com...

> How much advantage you'd derive implementing packed arithmetic using > bitwise operations is questionable. If you were to pack two bytes on a > 16-bit processor for example, an ADD would require an XOR, SHIFT, and an > AND operation for each bit - that would be 8*3 cycles for 2 8-bit adds, > not considering looping and initialization overheads. Even if you could > use a processor that has parallel units and pipeline operations, it > would still be pretty difficult to achieve 2 8-bit adds in 2 cycles (or > in 1-cycle as a processor that supports packed arithmetic in hardware > would achieve).
You are thinking at orthogonal trajectories to the optimal method. Consider two 8-bit numbers, [a7 a6 a5 a4 a3 a2 a1 a0] and [b7 b6 b5 b4 b3 b2 b1 b0]. The low bit of the result is a0 XOR b0, while the carry up to the next bit is c1 = a0 AND b0. The next bit of the result is a1 XOR b1 XOR c1 while the next carry would be c2 = (a1 AND b1) OR (c1 AND (a1 XOR b1)). Similarly for the bits 2:6, bit 7 only needs a7 XOR b7 XOR c7, so for a total of 2+6*5+2 = 34 operations we could do 64 8-bit adds with 64-bit registers. Of course we could pack 7 8-bit quantities into a 64-bit register with space between each and do 7 8-bit adds in 2 operations (1 ADD and 1 AND) even if the architecture didn't offer an 8-way parallel 8-bit addition operation. It's left as an exercise to show that you can do a POPCNT on a large array in less than 6 operations per word (not counting loads and stores) so this can be efficient if the word size is large, like 64 bits. And the original question was about boolean variables in which case it's certainly possible to knock off the word size per operation. -- write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, & 6.0134700243160014d-154/),(/'x'/)); end
Ravi and James, thank you very much.

I hardly understood anything of your last answers, but now I know what
to look for.

Thank you very much again.

I hope I can find and/or develop what I need. Of course I'll let you
know.

Bye and thank you again!


 Posted Via Usenet.com Premium Usenet Newsgroup Services
----------------------------------------------------------
    ** SPEED ** RETENTION ** COMPLETION ** ANONYMITY **
----------------------------------------------------------        
                http://www.usenet.com