DSPRelated.com
Forums

speed up FFT process

Started by Dinakar July 8, 2004
Hi,
   I'm developing an image processing based application where I need to
compute the 2D (512x512) FFT of several images. But this takes a long
time (around 10 seconds to compute the FFTs for 8 such images) on a P4
computer. Is there any hardware-based solution to speed up the
process?

Thanks, Dinakar P
That sounds awfully slow for 8x a 2d 512pt. FFT - are you sure you
can't optimize your code better? At any rate, that would be the first
thing to consider before adding more hardware.

--smb

dinakarp@tenet.res.in (Dinakar) wrote in message news:<795d1889.0407072058.57085886@posting.google.com>...
> Hi,
>    I'm developing a image processing based application where I need to
> compute 2D(512X512) FFT of several images. But this takes more
> time(around 10 seconds to compute the FFTs for 8 such images) in a P4
> computer. Is there any hardware based solution to speed up the
> process.
>
> Thanks, Dinakar P
Dinakar wrote:
> Hi,
>    I'm developing a image processing based application where I need to
> compute 2D(512X512) FFT of several images. But this takes more
> time(around 10 seconds to compute the FFTs for 8 such images) in a P4
> computer. Is there any hardware based solution to speed up the
> process.
>
> Thanks, Dinakar P
It would appear you have a very slow FFT routine.

I wondered how long KISSFFT would take, so I tested the time to
transform 8 512x512 images on my athlon 2100+. Each 512x512 image
requires an FFT for each column and each row. For 8 images (8192 512pt
FFTs), KISSFFT takes only 0.3 seconds.

KISSFFT can do multi-dimensional, mixed-radix FFTs (check under the
tools/ dir in the distribution):
http://www.sourceforge.net/projects/kissfft/

If this is not fast enough, use FFTW. Try it with --enable-float,
--enable-sse, and wisdom. If you spend the time to get it optimized
for your system, you can expect at least a 2x speed improvement over
KISSFFT.
http://www.fftw.org/

-- Mark
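As a rough sanity check of Mark's numbers, here is a sketch of the same benchmark using numpy's FFT as a stand-in for KISSFFT (numpy is an assumption on my part, not the library Mark timed):

```python
import time
import numpy as np

# 8 complex 512x512 images, as in the benchmark above
images = [np.random.randn(512, 512) + 1j * np.random.randn(512, 512)
          for _ in range(8)]

t0 = time.perf_counter()
for img in images:
    np.fft.fft2(img)  # internally: 512 row FFTs plus 512 column FFTs
elapsed = time.perf_counter() - t0
print(f"8 x 512x512 2D FFTs: {elapsed:.3f} s")
```

On any reasonably modern machine this should come in well under a second, consistent with Mark's 0.3 s figure rather than the 10 s in the original post.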
spam@dspdimension.com (Stephan M. Bernsee) wrote in message news:<42c34ef.0407080139.5cd0838b@posting.google.com>...
> That sounds awfully slow for 8x a 2d 512pt. FFT - are you sure you
> can't optimize your code better? At any rate, that would be the first
> thing to consider before adding more hardware.
>
> --smb
Nah, I don't think this is unreasonable. I'm using a 1.4 GHz thingy, and a simple test with matlab yields
>> a=randn(512,512);
tstart=cputime;
for n=1:8
  A=fft2(a);
end
tstop=cputime-tstart;
disp(['Eight 512x512 2D FFTs took ',num2str(tstop),' seconds'])
Eight 512x512 2D FFTs took 2.547 seconds
>>
Matlab apparently uses a fast FFT library, so a factor of 4 between a
fast library and a home-coded(?) non-optimized(?) routine doesn't sound
all that unreasonable... Of course, I don't know what hardware Dinakar
used, so I can't correct for clock frequency.

Rune
dinakarp@tenet.res.in (Dinakar) wrote...
> I'm developing a image processing based application where I need to
> compute 2D(512X512) FFT of several images. But this takes more
> time(around 10 seconds to compute the FFTs for 8 such images) in a P4
> computer. Is there any hardware based solution to speed up the
> process.
You're doing something wrong; the transform should be much faster on a
P4. Our FFTW library (www.fftw.org), for example, can compute 8
transforms of 512x512 real data on a 2.8GHz P4 in around 0.05s (see
http://www.fftw.org/speed/p4-2.8GHz-new/). You should get a better FFT
implementation.

Cordially,
Steven G. Johnson
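Note that Steven's figure is for real-input data, where Hermitian symmetry roughly halves the work and storage compared to a full complex transform. A hedged numpy sketch of the real-data variant (numpy's rfft2 standing in for FFTW's real-data interface):

```python
import time
import numpy as np

# 8 real-valued 512x512 images, matching the FFTW benchmark's input type
images = [np.random.randn(512, 512) for _ in range(8)]

t0 = time.perf_counter()
spectra = [np.fft.rfft2(img) for img in images]  # real-input 2D FFT
elapsed = time.perf_counter() - t0

# Hermitian symmetry: only 512 x (512/2 + 1) = 512 x 257 unique bins kept
print(spectra[0].shape, f"{elapsed:.3f} s")
```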
Hi Mark,

thanks for the link to the KISSFFT library, it looks like a very
useful tool. I never actually got the FFTW to compile on MacOS X and I
also find it too complicated to set up (certainly very powerful, but
way too complicated for use in a "daily context") so this is certainly
a link I will keep...

Did you try it on MacOS X wrt. performance?

--smb
Mark Borgerding <mark@borgerding.net> wrote in message news:<2PaHc.3263$wX6.1064@fe1.columbus.rr.com>...
> Dinakar wrote:
> > Hi,
> >    I'm developing a image processing based application where I need to
> > compute 2D(512X512) FFT of several images. But this takes more
> > time(around 10 seconds to compute the FFTs for 8 such images) in a P4
> > computer. Is there any hardware based solution to speed up the
> > process.
> >
> > Thanks, Dinakar P
>
> It would appear you have a very slow FFT routine.
>
> I wondered how long KISSFFT would take, so I tested the time to
> transform 8 512x512 images on my athlon 2100+.
>
> Each 512x512 image requires an FFT for each column and each row.
> For 8 images (8192 512pt ffts), KISSFFT takes only 0.3 seconds.
Eh... where did 8192 come from? Wouldn't this be 8*(512)^2 = 2097152
(1D) FFTs? If so, assuming 512*log2(512) = 4608 instructions per FFT,
this would amount to a total of roughly 1e10 instructions for the
whole job. On a 1.5 GHz machine, assuming one instruction per clock
cycle, this means some 6 seconds for the job. Add some pipelining, and
a run-time on the order of 1-3 seconds seems feasible.

So, if you guys find run-times two orders of magnitude lower than that
for the job, my argument above is flawed. Could somebody help me see
what the flaw is?

And while we are at it, I think the current CPUs have some sort of
DSP-ish computational unit added to them, much the same way the 80x87
maths FP unit was attached in the old days. Does somebody here know
how to trick the Borland C++ 6 compiler (Personal version) into taking
advantage of that extra unit?

Rune
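Rune's operation count can be redone in a few lines; the per-FFT cost of N*log2(N) operations is his assumption, carried over here. The flaw is in the FFT count itself: a 2D FFT needs one 1D FFT per row and per column, not one per pixel:

```python
import math

N = 512
n_images = 8

# Count as stated above: one FFT per pixel (this is the mistaken step)
ffts_per_pixel = n_images * N * N   # 2097152

# Correct count: one FFT per row and per column of each image
ffts_row_col = n_images * 2 * N     # 8192

ops_per_fft = N * math.log2(N)      # 512 * 9 = 4608
total_ops = ffts_row_col * ops_per_fft
print(ffts_per_pixel, ffts_row_col, int(ops_per_fft), int(total_ops))
```

With the corrected count the job is roughly 3.8e7 operations rather than 1e10, which is why sub-second run-times are plausible.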
"Rune Allnor" <allnor@tele.ntnu.no> wrote in message
news:f56893ae.0407082248.6529fc25@posting.google.com...

> Eh... where did 8192 come from? Wouldn't this be 8*(512)^2 = 2097152
> (1D) FFTs? If so, assuming 512*log_2(512)=4608 instructions pr FFT,
> this would amount to a total of roughly 10e9 instructions for the
> whole job. On a 1.5 GHz machine, assuming one instruction/clock cycle,
> this means some 6 seconds for the job. Add some pipelining and a
> run-time on the order of 1-3 seconds seems feasible.
No, it's 8*512*2: you transform all rows, then transform all columns.
Since a full complex DFT on 512 inputs requires 15048 flops, this is
15048*8192 = 123273216 flops in 0.3 seconds, giving
123273216/(0.3*1.0e9) = 0.41 Gflops on a processor capable of 3.0
Gflops with its x87 unit, or 6.0 Gflops with 3DNow! single precision.
Of course, you can't attain those kinds of flop rates in FFT code,
because the Athlon requires a balance between adds and multiplies to
reach them. And that is disregarding the fact that real FFT code for
datasets this large will spend most of its time waiting on memory
unless the code is very efficiently organized.
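Spelling out that flop-rate arithmetic (the 15048-flop figure for a 512-point complex FFT is taken from the post above, not re-derived here):

```python
flops_per_fft = 15048      # quoted cost of one 512-pt complex FFT
n_ffts = 8 * 512 * 2       # 8 images, 512 row FFTs + 512 column FFTs each
total_flops = flops_per_fft * n_ffts

seconds = 0.3              # measured KISSFFT time from earlier in the thread
gflops = total_flops / (seconds * 1e9)
print(n_ffts, total_flops, round(gflops, 2))
```

That works out to 8192 FFTs and about 0.41 Gflops sustained, matching the numbers in the post.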
> And while we are at it, I think the current CPUs have some sort of
> DSP-ish computational unit added to it, much the same way the 80x87
> maths FP unit was attached in the old days. Do somebody here know
> how to trick the Borland C++ 6 compiler (Personal version) to take
> advantage of that extra unit?
Quite a trick, considering that the compiler has to be assured in some
way that data will be aligned 0 mod 64 or even 0 mod 128 to be able to
use SIMD parallelism. What languages let you tell the compiler this?

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
  6.0134700243160014d-154/),(/'x'/)); end
"Rune Allnor" schrieb
> And while we are at it, I think the current CPUs
You're talking about Intel 80x86 processors here, right?
> have some sort of DSP-ish computational unit added to it, much the
> same way the 80x87 maths FP unit was attached in the old days. Do
> somebody here know how to trick the Borland C++ 6 compiler (Personal
> version) to take advantage of that extra unit?
IIRC, the Pentium (or its AMD cousin, the Athlon) has the co-processor
of old built right in. If you tell your compiler to generate code for
Pentium or higher, the compiler does take advantage of this unit.
AFAIK.

Martin
Rune Allnor wrote:
> Mark Borgerding <mark@borgerding.net> wrote in message news:<2PaHc.3263$wX6.1064@fe1.columbus.rr.com>...
>
>> Dinakar wrote:
>>
>>> Hi,
>>>    I'm developing a image processing based application where I need to
>>> compute 2D(512X512) FFT of several images. But this takes more
>>> time(around 10 seconds to compute the FFTs for 8 such images) in a P4
>>> computer. Is there any hardware based solution to speed up the
>>> process.
>>>
>>> Thanks, Dinakar P
>>
>> It would appear you have a very slow FFT routine.
>>
>> I wondered how long KISSFFT would take, so I tested the time to
>> transform 8 512x512 images on my athlon 2100+.
>>
>> Each 512x512 image requires an FFT for each column and each row.
>> For 8 images (8192 512pt ffts), KISSFFT takes only 0.3 seconds.
>
> Eh... where did 8192 come from?
A two-dimensional FFT is performed as 1D FFTs columnwise, then rowwise
FFTs on the intermediate results (or vice versa). So an MxN matrix
requires N FFTs of size M plus M FFTs of size N. A 512x512 matrix takes
1024 1D FFTs of size 512.

KISSFFT can perform FFTs on N-dimensional matrices, but I didn't have a
test program handy to time it, so I estimated using 1D FFTs. The times
should track pretty closely with my estimate from the 1D case.

-- Mark
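The row-then-column decomposition described above is easy to verify directly; a small numpy sketch (numpy is an assumption, any FFT library would do):

```python
import numpy as np

x = np.random.randn(64, 64)  # small example image

# 1D FFT of every row, then 1D FFT of every column of the result
rows_done = np.fft.fft(x, axis=1)
both_done = np.fft.fft(rows_done, axis=0)

# The two-pass result matches the library's one-shot 2D FFT
assert np.allclose(both_done, np.fft.fft2(x))
print("row-then-column 1D FFTs == fft2")
```

Because the 2D DFT is separable, the order of the two passes does not matter, and for a 64x64 input this uses 64 + 64 = 128 one-dimensional FFTs rather than one per pixel.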