DSPRelated.com
Forums

Large FFT --- NOT radix-2

Started by Richard Owlett January 1, 2004
Many helpful people wrote:
> Richard Owlett <rowlett@atlascomm.net> wrote in message
> news:<vv9fams78p7438@corp.supernews.com>...
>
>> Any pointers to doing non-radix 2 FFT's.
>> Specifically I wish to do 44100 pt FFT's on 16 bit data.
>> Simplicity of coding OUTWEIGHS speed!
>> [ Scilab is slow but adequate ]
>
> [Snip many educational replies]
My problem was how I was getting the data on which the FFT would be done,
NOT how fast Scilab could do an FFT.  Mind if I don't admit how bad my
code was? ;)
Fred Marshall <fmarshallx@remove_the_x.acm.org> wrote:
> "Richard Owlett" <rowlett@atlascomm.net> wrote in message
> news:vv9fams78p7438@corp.supernews.com...
>
>> Any pointers to doing non-radix 2 FFT's.
>> Specifically I wish to do 44100 pt FFT's on 16 bit data.
>> Simplicity of coding OUTWEIGHS speed!
>> [ Scilab is slow but adequate ]
>
> Doesn't SCILAB just do them?  I tried it and it seems to.  It took 0.421
> seconds on a dual processor Pentium 200MHz.
P200:
octave:4> x=randn(1,44100);tic,w=fft(x);toc
ans = 0.18392

A1600:
octave:5> x=randn(1,44100);tic,w=fft(x);toc
ans = 0.022966
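(For readers following along today: the same quick timing experiment can be sketched in Python with NumPy, whose FFT likewise handles arbitrary, non-power-of-two lengths such as 44100.  Timings will of course vary with the machine.)

```python
import time
import numpy as np

# Same experiment as the Octave one-liner: a 44100-point FFT on random data.
x = np.random.randn(44100)

t0 = time.perf_counter()
w = np.fft.fft(x)          # works for any length, not just powers of two
elapsed = time.perf_counter() - t0

print(f"44100-point FFT: {elapsed:.6f} s, output length {len(w)}")
```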
Alex Gibson wrote:
> "Richard Owlett" <rowlett@atlascomm.net> wrote in message
> news:vvar7epmfqmuee@corp.supernews.com...
>
>> Richard Owlett wrote:
>>
>>> Fred Marshall wrote:
>>>
>>>> "Richard Owlett" <rowlett@atlascomm.net> wrote in message
>>>> news:vv9fams78p7438@corp.supernews.com...
>>>>
>>>>> Any pointers to doing non-radix 2 FFT's.
>>>>> Specifically I wish to do 44100 pt FFT's on 16 bit data.
>>>>> Simplicity of coding OUTWEIGHS speed!
>>>>> [ Scilab is slow but adequate ]
>>>>
>>>> Doesn't SCILAB just do them?  I tried it and it seems to.  It took
>>>> 0.421 seconds on a dual processor Pentium 200MHz.
>>>>
>>>> Fred
>>>
>>> I'm going to have a second look at the framework I was using.  My
>>> Pentium 4 1.6 GHz seems to do it no faster than your machine.  Well, I
>>> did do ~1800 of them in a batch ;}
>>>
>>> BTW, it's already been pointed out to me in another forum that I could
>>> make my life simpler by padding to 64k ;/
>>
>> Just did a timing check using Scilab's timer().  My FFT's usually take
>> .14 seconds, with a few as fast as .125 seconds.  Not the speed
>> improvement I would expect.  What's your OS?  I'm using WinXP Pro SP1.
>>
>> Doing the timing pointed out another problem with my setup.  It takes
>> .94 seconds to move the 44k samples from the input data array to the
>> array on which the FFT is performed.
>
> Well, early P4's (most P4's) aren't very good at floating point
> unless your code is recompiled for SSE2.
>
> Athlons do a lot better with x87 floating point.
>
> (Athlon64 + 1 or 2 GB RAM are looking like a real nice upgrade
> for a lot of things, not just gaming.)
>
> A P3 at 1.3 GHz will beat a P4 1.6 GHz on some things, and they're
> about equal on others.
>
> A P4 should beat a dual Pentium 200 on most things
> (should even beat a dual Pentium Pro 200).
>
> How much memory do you have installed?  With XP of any sort, 256 MB is
> the absolute minimum for any serious sort of work, especially as XP
> itself usually grabs around 128 MB.
1.25 Gig.  Knew I'd be wanting lots of memory when I bought the machine :}
> Also, what else is running in the background?
> Did you kill off fast.exe, which is the background loader for Internet
> Explorer?
It's not shown in Task Manager, and I use Mozilla in any case.  Timing
didn't change when I disabled the firewall and anti-virus software.
> Been a while since I last used / ran Scilab.
>
> Does Scilab use ATLAS?
I don't know.
> If so, is it configured for your machine?
> http://math-atlas.sourceforge.net/
> I know MATLAB uses it for some things.
>
> Or have you recompiled Scilab to tune it for the machine you're on?
> Shouldn't be necessary for most machines.
No. I don't have the tools.
> Alex Gibson
Well, 44100 is 3^2 * 7^2 * 10^2 (i.e., 2^2 * 3^2 * 5^2 * 7^2).  You
could use six FFT stages matching those factors (Winograd, Singleton,
and Rader give 3-, 5-, and 7-point FFTs); however, the code is likely
to be less simple than you want.  If you really don't care about speed,
you could just do a 44100-point DFT.  The code would be dirt simple,
although it may also be dog slow.
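The "dirt simple" direct DFT can be sketched in a few lines of Python/NumPy (a hypothetical helper, not anyone's posted code).  It is O(N^2), so at N = 44100 it is indeed dog slow, and the full N-by-N matrix below would not even fit in memory at that size; there you would compute it row by row instead:

```python
import numpy as np

def naive_dft(x):
    """Direct O(N^2) DFT: X[k] = sum_n x[n] * exp(-2j*pi*k*n/N)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    k = n.reshape((N, 1))
    W = np.exp(-2j * np.pi * k * n / N)   # the N x N DFT matrix
    return W @ x

# Any length works, powers of two or not; keep N small for the demo.
x = np.random.randn(210)
X = naive_dft(x)
print(np.allclose(X, np.fft.fft(x)))   # matches the library FFT
```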

Richard Owlett wrote:

> Any pointers to doing non-radix 2 FFT's.
> Specifically I wish to do 44100 pt FFT's on 16 bit data.
> Simplicity of coding OUTWEIGHS speed!
> [ Scilab is slow but adequate ]
--
Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930  Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little temporary safety
deserve neither liberty nor safety."  -Benjamin Franklin, 1759
He could do a pair of prime-factor 210-point FFTs cascaded using the
mixed-radix algorithm, though (44100 = 210^2).  210 = 2*3*5*7, so for
each 210-point transform you could use the PFA.  The PFA is nice because
it eliminates the phase rotations between the smaller FFT kernels.  The
2-, 3-, 5-, and 7-point FFTs could be any of Winograd, Singleton, or
Rader.  A decent reference is Smith and Smith, "Handbook of Real-Time
Fast Fourier Transforms", IEEE Press.
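The twiddle-free property of the PFA can be illustrated with a small Good-Thomas sketch in Python (hypothetical demo code, using slow direct DFTs in place of Winograd/Singleton/Rader kernels): for N = N1*N2 with gcd(N1, N2) = 1, the index maps turn the length-N DFT into an N1-by-N2 two-dimensional DFT with no phase rotations between the stages.

```python
import cmath
import math

def direct_dft(x):
    """Slow O(N^2) DFT, standing in for the small optimized kernels."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def pfa_dft(x, N1, N2):
    """Good-Thomas prime-factor DFT for N = N1*N2, gcd(N1, N2) = 1."""
    N = N1 * N2
    assert len(x) == N and math.gcd(N1, N2) == 1
    K1 = pow(N2, -1, N1)                 # N2^{-1} mod N1 (Python 3.8+)
    K2 = pow(N1, -1, N2)                 # N1^{-1} mod N2
    # Input remap: n = (N2*n1 + N1*n2) mod N
    grid = [[x[(N2 * n1 + N1 * n2) % N] for n2 in range(N2)]
            for n1 in range(N1)]
    # Row DFTs (length N2), then column DFTs (length N1) -- no twiddles.
    rows = [direct_dft(row) for row in grid]
    cols = [direct_dft([rows[n1][k2] for n1 in range(N1)])
            for k2 in range(N2)]
    # Output remap (Chinese remainder theorem)
    X = [0j] * N
    for k1 in range(N1):
        for k2 in range(N2):
            X[(N2 * K1 * k1 + N1 * K2 * k2) % N] = cols[k2][k1]
    return X

# 441 = 9 * 49: the 7^2 factor cannot be split further by the PFA,
# which is why 44100 = 441 * 100 still needs a mixed-radix cascade.
x = [complex(i % 11, (3 * i) % 7) for i in range(441)]
assert max(abs(a - b) for a, b in zip(pfa_dft(x, 9, 49), direct_dft(x))) < 1e-6
```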

"Steven G. Johnson" wrote:

> r.lyons@REMOVE.ieee.org (Rick Lyons) wrote...
>
>> Richard Owlett <rowlett@atlascomm.net> wrote:
>>
>>> Any pointers to doing non-radix 2 FFT's.
>>> Specifically I wish to do 44100 pt FFT's on 16 bit data.
>>> Simplicity of coding OUTWEIGHS speed!
>>
>> I'm ignorant of these schemes,
>> but ya' might do an Internet search
>> on "Prime Factor" and "Split-radix"
>> FFTs.
>
> The split-radix algorithm is essentially a combination of radices 2
> and 4, and only works for power-of-two sizes.  The Prime-Factor
> Algorithm (PFA) only decomposes a DFT into relatively prime factors
> (so e.g. it cannot further decompose the factor of 7^2 in 44100) and
> is fairly complicated to implement.
>
> What the original poster wants is a "mixed-radix" FFT, which is simply
> the generalized version of the standard Cooley-Tukey algorithm to
> arbitrary factorizations.  Many implementations can be found online
> (see www.fftw.org/links.html).
>
> Of course, for spectral estimation you should also be doing windowing,
> etcetera, in order to reduce edge artifacts.  Ditto if you are doing
> zero-padding.  See e.g. Oppenheim and Schafer, "Digital Signal
> Processing".
>
> Cordially,
> Steven G. Johnson
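The mixed-radix (general Cooley-Tukey) recursion described above can be sketched compactly in Python -- a hypothetical illustration, not FFTW's implementation: split N = p*m on the smallest prime p, take p decimated sub-DFTs of length m, apply twiddle factors, and combine; prime lengths fall back to a direct DFT.

```python
import cmath

def smallest_prime_factor(n):
    i = 2
    while i * i <= n:
        if n % i == 0:
            return i
        i += 1
    return n

def fft_mixed_radix(x):
    """Mixed-radix Cooley-Tukey DFT for an arbitrary length N."""
    N = len(x)
    if N == 1:
        return list(x)
    p = smallest_prime_factor(N)
    if p == N:   # prime length: direct O(N^2) DFT (Rader would be smarter)
        return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                    for n in range(N)) for k in range(N)]
    m = N // p
    # p sub-DFTs of length m on the decimated sequences x[r], x[r+p], ...
    sub = [fft_mixed_radix(x[r::p]) for r in range(p)]
    X = [0j] * N
    for q in range(m):
        for s in range(p):
            k = q + m * s
            # Twiddle factors exp(-2j*pi*r*k/N) glue the stages together;
            # these are the phase rotations the PFA manages to avoid.
            X[k] = sum(sub[r][q] * cmath.exp(-2j * cmath.pi * r * k / N)
                       for r in range(p))
    return X

x = [complex(i % 5, i % 3) for i in range(210)]   # 210 = 2*3*5*7
ref = [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / 210)
           for n in range(210)) for k in range(210)]
assert max(abs(a - b) for a, b in zip(fft_mixed_radix(x), ref)) < 1e-6
```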
"Steven G. Johnson" <stevenj@alum.mit.edu> wrote in message
news:27cfb406.0401021313.4196aa02@posting.google.com...

> The split-radix algorithm is essentially a combination of radices 2
> and 4, and only works for power-of-two sizes.
The following question is only loosely related to this comment.  Last
year you wrote:
> Actually, the minimal operation count (for powers of two) is achieved by
> something called "split-radix" that is a mixture of radix 2 and 4.  At
> each step, it breaks your size-n DFT into three pieces, one of size n/2
> and two of size n/4.
But it seems to me that split-radix requires 912 additions and 248
multiplications for n = 64, and 1160 isn't the minimal number of
operations.  Is there some paper you are thinking of that discusses a
modification of the standard split-radix algorithm that always attains
the minimal number of arithmetic operations (real adds + real
multiplies)?

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end
James Van Buskirk wrote:
> But it seems to me that split-radix requires 912 additions and 248
> multiplications for n = 64, and 1160 isn't the minimal number of
> operations.
What do you think is the lowest known number of operations for an n=64 complex DFT, using what algorithm?
"Steven G. Johnson" <stevenj@alum.mit.edu> wrote in message
news:bunj9a$hts$1@news.fas.harvard.edu...

> James Van Buskirk wrote:
>
>> But it seems to me that split-radix requires 912 additions and 248
>> multiplications for n = 64, and 1160 isn't the minimal number of
>> operations.
>
> What do you think is the lowest known number of operations for an n=64
> complex DFT, using what algorithm?
The best I know of is 912 additions and 240 multiplications; I just
figured out how to achieve this result yesterday.  The algorithm is the
same as used for all my computational kernels (heavy use of
real-half-complex DFTs); I just became aware of another possible trick.
The lowest known by anyone is a different question.  The more you look
at DFTs, the more uncertain you become that what you or anyone else does
is optimal, even if you get to write the criteria of optimality
yourself.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end
James Van Buskirk wrote:
>> James Van Buskirk wrote:
>>
>>> But it seems to me that split-radix requires 912 additions and 248
>>> multiplications for n = 64, and 1160 isn't the minimal number of
>>> operations.
>
> The best I know of is 912 additions and 240 multiplications; just
> figured out how to achieve this result yesterday.
Please do let me know when you publish your results, as I try to keep
abreast of the latest FFT developments.

The arithmetic complexity of FFT algorithms was extensively studied in
the 1980's, when it was still considered to be a good proxy for speed
(sadly no longer the case).  As of 1991 [1], the lowest-known operation
count for an n=64 complex DFT was 1160 flops, a record that had been set
in 1968 by Yavne [2] and was later matched, by a simpler algorithm, with
the split-radix technique.  For N=2^m, this record is:

    flops = 4 N log2(N) - 6 N + 8

(This is asymptotically 20% fewer flops than for vanilla radix 2.  Note
that you can trade off some multiplies for additions by using the
3-adds + 3-multiplies complex-multiply trick.)

Actually, it has supposedly been rigorously proved [1] that split-radix
achieves the minimal number of complex additions.  There is also a
rigorous lower bound [1] on the number of non-trivial complex (or real)
multiplications, which split-radix can achieve up to N=16 (and it is
close for larger N).  For larger N, there are known algorithms that
achieve the minimal number of multiplies, but they do not have fewer
total flops than split-radix, and have more flops for N > 64.  So, the
minimal achievable total flop count has not yet been rigorously proven.

> The algorithm is
> the same as used for all my computational kernels (heavy use of
> real-half-complex DFTs); I just became aware of another possible trick.
(There is the known technique of taking the real-input FFTs of the real
and imaginary parts, and then combining them with (2N-4) additions.
However, if you use the lowest-known operation counts for real-input
FFTs [3], this gives exactly the same known flop count for the complex
DFT: e.g., 1160 flops for n=64 from two 518-flop n=64 real-input FFTs.)

Cordially,
Steven G. Johnson

[1] P. Duhamel and M. Vetterli, "Fast Fourier transforms: A tutorial
review and a state of the art," Signal Processing 19, 259-299 (1990).

[2] R. Yavne, "An economical method for calculating the discrete Fourier
transform," Proc. AFIPS 33, 115-125 (1968).

[3] H. V. Sorensen, D. L. Jones, M. T. Heideman, and C. S. Burrus,
"Real-valued fast Fourier transform algorithms," IEEE Trans. Acoust.
Speech Sig. Processing ASSP-35, 849-863 (1987).
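As a quick sanity check of the record discussed in this post, the split-radix flop count flops = 4 N log2(N) - 6 N + 8 can be tabulated in a few lines of Python (it gives 1160 at N = 64, matching Yavne's 1968 figure):

```python
import math

def split_radix_flops(N):
    """Real-arithmetic count for a size-N complex DFT (N a power of two),
    per the split-radix/Yavne record quoted from Duhamel & Vetterli."""
    m = int(round(math.log2(N)))
    assert 2 ** m == N, "N must be a power of two"
    return 4 * N * m - 6 * N + 8

for N in (16, 32, 64, 128, 1024):
    print(N, split_radix_flops(N))   # N = 64 gives 1160
```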
"Steven G. Johnson" <stevenj@alum.mit.edu> wrote in message
news:4010424d$0$564$b45e6eb0@senator-bedfellow.mit.edu...

> Please do let me know when you publish your results, as I try to keep
> abreast of the latest FFT developments.
Well, I would just post the code for n = 64, but it's a little long --
35937 bytes.  Is this objectionably long for this group?

The trick that made the difference for n = 64 would also improve the
number of multiplies for some of my smaller modules.  I'm not sure at
this point whether I want to rewrite them all (many take a day each) or
try to create a code generator that can do what I do, in the short time
available to me just now to ponder this problem again.
> Actually, it has supposedly been rigorously proved [1] that split-radix
> achieves the minimal number of complex additions.
My code is structured for minimum-add so I am not challenging the above assertion.
> [1] P. Duhamel and M. Vetterli, "Fast Fourier transforms: A tutorial
> review and a state of the art," Signal Processing 19, 259-299 (1990).
> [2] R. Yavne, "An economical method for calculating the discrete Fourier
> transform," Proc. AFIPS 33, 115-125 (1968).
> [3] H. V. Sorensen, D. L. Jones, M. T. Heideman, and C. S. Burrus,
> "Real-valued fast Fourier transform algorithms," IEEE Trans. Acoust.
> Speech Sig. Processing ASSP-35, 849-863 (1987).
Thanks for the comments and the references.  Let's see... I've read [1],
and it is highly recommended.  [2] just happened to be sitting on my
desk as I read this, but I'm not sure if I've seen [3].

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end