Reply by Steven G. Johnson October 12, 2003
James Van Buskirk wrote:
> The increase in resource usage for radix-2 FFTs is not always
> small. The number of cache or memory reads and writes goes
> up significantly compared to, for example, radix-8 FFTs. On
> a processor where these are the limiting resource radix-2 FFTs
> are a bad idea, e.g. Alpha 21164.
I would go further and say that this is true for essentially every modern general-purpose CPU (not including things like DSP chips, as another poster pointed out). I don't recall seeing *any* pure radix-2 implementation that wasn't typically several times slower than the fastest available FFT code for such a CPU (see benchmarks at www.fftw.org/benchfft).

Although if you really care about cache (as you should, especially for transforms of a few thousand points or more), there are a lot of additional tricks besides just using higher radices. [For one thing, we (fftw.org) have argued that the traditional 1965 non-recursive style of implementation is a bad idea.]

(And for small sizes where cache isn't as important, there are other equally compelling reasons not to do radix-2, having to do with the CPU pipeline etc. It's not just a matter of finding a radix that "fits" in the registers. Modern x86 processors have a lot more registers internally than are exposed by the instruction set, and moreover it is possible to some extent to simulate a larger register set with a smaller one thanks to the way most CPU pipelines work.)

Things have changed a lot since the 1980's, when the authors of Numerical Recipes could write that higher radices etc. "can be faster than simpler FFTs by some significant, but not overwhelming, factor, e.g., 20 or 30 percent" (this is only true if speed == # of floating-point operations). Unfortunately, they and many other authors have not updated their books to reflect this.

Cordially,
Steven G. Johnson
Reply by Greg Berchin October 11, 2003
On Sat, 11 Oct 2003 07:09:14 GMT, "James Van Buskirk"
<not_valid@comcast.net> wrote:

>>I don't understand the comment about 16384-point FFTs. Does using
>>a radix-8 FFT to carry out most of the work for such an FFT mean
>>that one is honor-bound never to break up the 14 stages into 4
>>stages of radix-8 and one stage of radix-4, or something?
No, not at all. That is one way to perform a 16384-point FFT using radix-8. My point was addressed to the situation in which only ONE radix is implemented. If you have nothing but radix-8 butterflies in your set of primitives, then 16384 is unreachable: 16384 = 2^14, and since 14 is not a multiple of 3, no whole number of radix-8 stages gets you there.

Greg Berchin
Reply by James Van Buskirk October 11, 2003
"Greg Berchin" <Bank-to-Turn@mchsi.com> wrote in message
news:ot8eovgv6tan2rs6t0i90k1f8skjn55jft@4ax.com...

> Well, the original question had to do with radix-4 vs. radix-2.
> Nonetheless, though a radix-8 FFT may provide better resource
> utilization in a particular implementation, if the user needs a
> 16384-point FFT then they are out of luck.
I don't understand the comment about 16384-point FFTs. Does using a radix-8 FFT to carry out most of the work for such an FFT mean that one is honor-bound never to break up the 14 stages into 4 stages of radix-8 and one stage of radix-4, or something?

In my power-of-2 full complex FFT code, the program tries to do as much work as possible via the radix-8 workhorse, but obviously, to handle all integer powers of 2, it also has radix-4 and radix-2 stages available to it that it utilizes as necessary when the command generator component decides. It works OK for what it was designed for and is more efficient than the CXML FFT for all powers of 2 through 4194304, except for 8192 IIRC, on my 21164.

--
write(*,*) transfer((/17.392111325966148d0,3.6351694777236872d228, &
6.0134700169991705d-154/),(/'x'/)); end
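[The arithmetic behind that kind of stage schedule can be sketched in a few lines of Python. This is only an illustration of the factoring, not the Fortran code described in the post; the name plan_stages is made up.]

def plan_stages(n):
    """Use as many radix-8 stages as possible for a size-n (power-of-two)
    transform, finishing with one radix-4 or radix-2 stage when the
    exponent of 2 is not a multiple of 3."""
    k = n.bit_length() - 1          # n = 2**k
    assert n == 1 << k, "n must be a power of two"
    stages = [8] * (k // 3)         # each radix-8 stage covers 3 bits
    if k % 3 == 2:
        stages.append(4)
    elif k % 3 == 1:
        stages.append(2)
    return stages

# plan_stages(16384) -> [8, 8, 8, 8, 4], since 2**14 = 8**4 * 4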
Reply by Andrew Reilly October 10, 2003
On Fri, 10 Oct 2003 21:33:08 +0000, Greg Berchin wrote:

> On Fri, 10 Oct 2003 19:05:42 GMT, "James Van Buskirk"
> <not_valid@comcast.net> wrote:
>
>>>The increase in resource usage for radix-2 FFTs is not always
>>>small. The number of cache or memory reads and writes goes
>>>up significantly compared to, for example, radix-8 FFTs.
>
> Well, the original question had to do with radix-4 vs. radix-2.
> Nonetheless, though a radix-8 FFT may provide better resource
> utilization in a particular implementation, if the user needs a
> 16384-point FFT then they are out of luck.
>
>>>On a processor where these are the limiting resource radix-2
>>>FFTs are a bad idea, e.g. Alpha 21164.
>
> As I said, "The answer to your question depends upon the
> specific circumstances."
I agree entirely. If you consider the other extreme: a classical DSP with zero-wait-state, multiple-access on-chip memory and a MAC unit, the "fancy" algorithms that avoid multiplies don't help at all. Radix-2 can be the best approach.

--
Andrew
Reply by Ray Andraka October 10, 2003
If for hardware, one possibility is to use a Winograd FFT, which is
basically a different matrix decomposition, designed to minimize the
multiplications. (Winograd FFTs break the DFT into 3 successive matrix
multiplies; the first and last contain only factors of 1 and i, so no
multiplies are needed, and the middle matrix is a diagonal matrix of
constants.) For the Winograd FFT there are decompositions for several
radices up to 16. The Winograd algorithm is the basis of my FFT
library for FPGAs, which happens to be the fastest possible for
single-sample-at-a-time operation in Xilinx FPGAs (speed is limited by
the minimum pulse width on the SRL16 primitives... over 300 MS/sec in
virtexII-P-7 for the 16-point kernel, with a 73 clock latency).

"Steven G. Johnson" wrote:

> Steven G. Johnson wrote:
> > Nope, quite the opposite. First of all, realize that the radix's
> > "primitive" kernel (or "butterfly") is just a DFT of the size of the
> > radix. So, for a radix-4 FFT, you need a (usually hard-coded) size-4
> > DFT.
>
> (In case it wasn't clear, for a radix-r FFT you need a size-r DFT.
> Usually, you compute this size-r DFT in turn via some FFT algorithm
> (e.g. radix-2), but you hard-code it and optimize out things like
> trivial multiplications.)
--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930   Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little temporary safety
deserve neither liberty nor safety."  -Benjamin Franklin, 1759
Reply by Greg Berchin October 10, 2003
On Fri, 10 Oct 2003 19:05:42 GMT, "James Van Buskirk"
<not_valid@comcast.net> wrote:

>>The increase in resource usage for radix-2 FFTs is not always
>>small. The number of cache or memory reads and writes goes
>>up significantly compared to, for example, radix-8 FFTs.
Well, the original question had to do with radix-4 vs. radix-2. Nonetheless, though a radix-8 FFT may provide better resource utilization in a particular implementation, if the user needs a 16384-point FFT then they are out of luck.
>>On a processor where these are the limiting resource radix-2
>>FFTs are a bad idea, e.g. Alpha 21164.
As I said, "The answer to your question depends upon the specific circumstances."

Greg Berchin
Reply by James Van Buskirk October 10, 2003
"Greg Berchin" <Bank-to-Turn@mchsi.com> wrote in message
news:uqa8ov00mbp3vprq2vcu6ujssisrq41qe6@4ax.com...

> As always, the answer to your question depends upon the specific
> circumstances. If one is only implementing a single FFT in a
> unique situation, then it makes sense to use as many radix-4
> stages as possible. However, if one is implementing in
> general-purpose hardware, for example, and several different sizes
> must be accommodated, then using a radix-2 provides greater
> flexibility for just a small increase in resource usage.
The increase in resource usage for radix-2 FFTs is not always small. The number of cache or memory reads and writes goes up significantly compared to, for example, radix-8 FFTs (a radix-2 transform makes log2(n) passes over the data, three times as many as a radix-8 transform). On a processor where these are the limiting resource radix-2 FFTs are a bad idea, e.g. Alpha 21164.

--
write(*,*) transfer((/17.392111325966148d0,3.6351694777236872d228, &
6.0134700169991705d-154/),(/'x'/)); end
Reply by Steven G. Johnson October 9, 2003
Bob wrote:
> Can a radix-4 fft be used to construct any type of radix-2 fft. e.g.
> can I manipulate a 64 pt radix-4 fft to give me a 128 pt radix-2 fft ?
> If so, is the radix-2 fft of any use in dsp as it usually results in
> larger and slower ffts than the radix-4 version.
It's not clear what you mean by "be used to construct", and I suspect you're confusing the algorithm ("radix-r") with the sizes that it can handle by itself (r^n). A radix-r FFT algorithm is best understood as a single step, which *can* be applied recursively but need not be.

The beauty of this class of FFT algorithms (Cooley-Tukey) is that it expresses a DFT of size n as r smaller DFTs of size n/r (plus some combining computations involving DFTs of size r). Because these smaller transforms are also DFTs, you can apply any FFT algorithm you want in order to compute them, either the same radix or different radices.

Steven
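[A concrete illustration of that recursion, as a didactic Python sketch. This is not FFTW's code; the choice of the smallest factor as the radix and the O(n^2) fallback for prime sizes are arbitrary simplifications.]

import cmath

def fft(x):
    """Mixed-radix Cooley-Tukey DFT, decimation in time.

    A size-n transform is split into r interleaved sub-transforms of
    size n//r, which are then combined with twiddle factors.  Any
    factor r of n works; here we simply take the smallest one.
    """
    n = len(x)
    if n == 1:
        return list(x)
    r = next(d for d in range(2, n + 1) if n % d == 0)  # smallest factor
    if r == n:  # n is prime: fall back to the direct O(n^2) DFT
        return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                    for j in range(n)) for k in range(n)]
    m = n // r
    sub = [fft(x[j::r]) for j in range(r)]  # r smaller DFTs of size m
    X = [0j] * n
    for k in range(n):
        # combining step: a size-r DFT of the twiddled sub-results
        X[k] = sum(sub[j][k % m] * cmath.exp(-2j * cmath.pi * j * k / n)
                   for j in range(r))
    return X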
Reply by Steven G. Johnson October 9, 2003
Steven G. Johnson wrote:
> Nope, quite the opposite. First of all, realize that the radix's
> "primitive" kernel (or "butterfly") is just a DFT of the size of the
> radix. So, for a radix-4 FFT, you need a (usually hard-coded) size-4
> DFT.
(In case it wasn't clear, for a radix-r FFT you need a size-r DFT. Usually, you compute this size-r DFT in turn via some FFT algorithm (e.g. radix-2), but you hard-code it and optimize out things like trivial multiplications.)
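[For example, the size-4 kernel with its trivial multiplications optimized out reduces to additions, subtractions, and real/imaginary swaps. A small Python sketch, for illustration only; the name dft4 is made up.]

def dft4(x0, x1, x2, x3):
    """Hard-coded size-4 DFT, the butterfly of a radix-4 FFT.

    The only 'multiplications' are by 1, -1, i, and -i; multiplying by
    +/-i is just a swap of real and imaginary parts with a sign change,
    so no general complex multiplies remain.
    """
    t0 = x0 + x2
    t1 = x0 - x2
    t2 = x1 + x3
    t3 = x1 - x3
    X0 = t0 + t2
    X2 = t0 - t2
    X1 = t1 - 1j * t3  # multiply by -i: trivial
    X3 = t1 + 1j * t3  # multiply by +i: trivial
    return X0, X1, X2, X3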
Reply by Steven G. Johnson October 9, 2003
Rune Allnor wrote:
> Some quite interesting questions... usually one requires that a "radix
> N" FFT is based on N being a prime number.
Nope, the radix can be anything you want (as long as it is a factor of the size to be transformed, of course), at least in the Cooley-Tukey FFT algorithm.

In general, Cooley-Tukey breaks a DFT of size n = n1 * n2 into n1 DFTs of size n2, followed by multiplication by roots of unity ("twiddle factors"), followed by n2 DFTs of size n1. A "radix-r" FFT is just the special case of n1 = r (for decimation in time) or n2 = r (for decimation in frequency), where you use the same factor r recursively.

> Could you define the
> "primitive" ("butterfly") of the radix 4 FFT? While I can't see how to
> construct such a "primitive", I am quite sure you will find that there
> are fewer computations involved in computing the 4 point DFT by the
> radix 2 FFT, than by using a radix 4 "primitive", whatever this may be.
Nope, quite the opposite. First of all, realize that the radix's "primitive" kernel (or "butterfly") is just a DFT of the size of the radix. So, for a radix-4 FFT, you need a (usually hard-coded) size-4 DFT. Because this size-4 DFT includes some multiplications by i, -1, etcetera that can be optimized out, a radix-4 FFT algorithm generally requires fewer operations than a radix-2 algorithm.

Actually, the minimal operation count (for powers of two) is achieved by something called "split-radix", which is a mixture of radix 2 and 4. At each step, it breaks your size-n DFT into three pieces, one of size n/2 and two of size n/4.

However, FFT speed these days is mostly determined by other factors besides strict flop counts; e.g. the FFTW library (www.fftw.org) often uses radices as large as 32 or 64.

Steven
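[A rough illustration of the split-radix decomposition described above, as a didactic Python sketch for power-of-two lengths. It is not FFTW's code and makes no attempt to minimize the actual operation count; it only shows the index structure of the n/2 + n/4 + n/4 split.]

import cmath

def split_radix_fft(x):
    """Split-radix decimation-in-time FFT sketch, power-of-two lengths.

    A size-n DFT is built from one size-n/2 DFT of the even-indexed
    samples plus two size-n/4 DFTs of the samples at indices 1 mod 4
    and 3 mod 4.
    """
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    u = split_radix_fft(x[0::2])    # size n/2: even-indexed samples
    z = split_radix_fft(x[1::4])    # size n/4: indices 1 mod 4
    zp = split_radix_fft(x[3::4])   # size n/4: indices 3 mod 4
    X = [0j] * n
    for k in range(n // 4):
        w1 = cmath.exp(-2j * cmath.pi * k / n)      # twiddle w**k
        w3 = cmath.exp(-2j * cmath.pi * 3 * k / n)  # twiddle w**(3k)
        a = w1 * z[k] + w3 * zp[k]
        b = 1j * (w1 * z[k] - w3 * zp[k])
        X[k] = u[k] + a
        X[k + n // 2] = u[k] - a
        X[k + n // 4] = u[k + n // 4] - b
        X[k + 3 * n // 4] = u[k + n // 4] + b
    return X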