Reply by Steven G. Johnson November 30, 2007
On Nov 29, 11:37 pm, robert bristow-johnson
<r...@audioimagination.com> wrote:
> well, i said "for illustration". in particular i was referring to
> Fig. 9-15 (p. 597) in my 1989 version of O&S. certainly if N=8 you
> can do it with straight-line code. that's beside the point. i was
> trying to infer, from the illustration, what the rhyme or reason there
> is in the particular pointer manipulation from that N=8 illustration.
> it still looks gnarly. for the case of a general N (say, a power of
> 2) and much bigger than 2^3, what is the tight and efficient algorithm
> for determining the order of indexing for the node that you are
> writing to, and once you've determined that, what are the nodes that
> you're reading from? except for the first pass (which is just like
> the first pass of an in-place DIT), i cannot see what that rhyme or
> reason is. i imagine there may be one, but it doesn't look simple
> or elegant.
No, there are certainly simple patterns to the memory access for in-place algorithms without bit-reversal. They are not that complicated to implement (as my 15-line example in another post shows). You absolutely do not have to use a look-up table of indices! (One of the many reasons why I feel that butterfly diagrams are next to useless in truly understanding FFT algorithms.)

(Of course, if you compare a textbook radix-2 algorithm to a highly optimized in-order FFT like FFTW or MKL, the latter will be way faster even if you skip the bit-reversal stage. Highly optimized algorithms are always going to be way more complicated than textbook radix-2, as long as CPUs remain so complex.)
> "In the flow graph of Fig. 9-15, the data are accessed > nonsequentially, the computation is not in place, and a > scheme for indexing the data is considerably more > complicated than in either of the two previous cases [two > in-place DIT structures]. Consequently, this structure > has no apparent advantages."
One reason to be cautious here is that O&S is comparing apples and oranges: FFTs with bit-reversed input or output, and FFTs with normal-order output. Of course, if you can get away with permuted output it will be simpler. That's not the comparison I'm making: I'm comparing only FFTs with normal-order output, that differ in whether you have a separate bit-reversal pass or whether you combine the permutations with the computation.
> i was reverberating that sentiment from the little bit that i know of
> the radix-(2^n) FFT. that information could very well be stale. it's
> also in the Yellow O&S (1975 - p. 300, Fig. 6.13, and they say the
> same thing), so it's more than 3 decades old.
>
> sometimes i think that O&S is timeless, other times it's just dated.
Comments on performance advantages from FFT algorithms that avoid a separate bit-reversal pass date back to the 60's, so in some sense it's not a matter of being out-of-date. On the other hand, it seems to be especially in the last 20 years that floating-point performance of CPUs has totally outstripped memory performance, and that makes non-arithmetic issues so much more relevant now.

The main problem is that textbooks are trying to teach students the basic ideas of FFT algorithms rather than performance hacks, and therefore they almost always limit their discussion to only two algorithms for power-of-two N: radix-2 decimation in time and radix-2 decimation in frequency. If you limit yourself to these two cases, you can't really have a useful discussion of performance optimization on modern CPUs, where performance is limited not by arithmetic counts but rather by memory access and register/pipeline utilization.

Steven
Reply by robert bristow-johnson November 30, 2007
On Nov 28, 5:30 pm, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:

> On Nov 27, 1:46 pm, robert bristow-johnson
> > Steven, what's different (in terms of code) is the pointer
> > arithmetic. doing all sorts of goofy pointer arithmetic might be more
> > costly, in the aggregate throughout all of the FFT passes, than a
> > single bit-reversing pass at the beginning or end.
> As someone who has actually implemented many of the algorithms without
> bit reversal, I can say that (a) the pointer arithmetic is not really
> that complicated, (b) the pointer arithmetic per se is not the
> dominant factor in the time anyway (compared to the memory access
> itself), and (c) it really does seem to be faster on modern
> general-purpose CPUs than the methods involving a separate array pass.
On Nov 28, 9:16 pm, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:
> On Nov 28, 7:11 pm, robert bristow-johnson <r...@audioimagination.com>
> wrote:
>
> > okay, Steven, using O&S for illustration, the pointer changes look
> > pretty gnarly, even for N=8, but even if it's not, it's not in-place,
> > is it? and if it's not in-place, is there not a cache hit? assuming
> > you have the (double) memory and that cost is not a problem, you now
> > have the top and bottom (i'm gonna assume radix-2 for the moment) of
> > the butterfly you're reading (from the previous pass) and the top and
> > bottom of the butterfly you are writing (which would be the same
> > locations if you were doing this in-place). if it is not in-place,
> > then you are ping-ponging and i would expect the number of cache
> > misses to double.
>
> First of all, I have no idea what you are talking about with "gnarly
> pointer changes" for N=8. An N=8 FFT is 60 floating-point operations
> that can be implemented in straight-line code: read into local
> variables (= registers), do calculations, write results.
well, i said "for illustration". in particular i was referring to Fig. 9-15 (p. 597) in my 1989 version of O&S. certainly if N=8 you can do it with straight-line code. that's beside the point. i was trying to infer, from the illustration, what the rhyme or reason there is in the particular pointer manipulation from that N=8 illustration. it still looks gnarly. for the case of a general N (say, a power of 2) and much bigger than 2^3, what is the tight and efficient algorithm for determining the order of indexing for the node that you are writing to, and once you've determined that, what are the nodes that you're reading from? except for the first pass (which is just like the first pass of an in-place DIT), i cannot see what that rhyme or reason is. i'm imagine there may be one, but it doesn't look simple or elegant. so how could this be coded compactly and efficiently for a general N = 2^p? i cannot see how it is done, while i do not deny it can be done. i just don't see it. surely, if you have to look up those node indices from a table (and a different table for every N?), that may be quick, but it's not compact. this is what i mean by "goofy pointer arithmetic". and, it's clear that it is not in-place after the first pass. there *must* be another array space to write to. so, there clearly is a scheme you know about (i never thought otherwise) that is something i hadn't seen. but using the one reference to a normal-in, normal-out structure that i have seen, i'll put it the way O&S did (p. 598): "In the flow graph of Fig. 9-15, the data are accessed nonsequentially, the computation is not in place, and a scheme for indexing the data is considerably more complicated than in either of the two previous cases [two in-place DIT structures]. Consequently, this structure has no apparent advantages." i was reverberating that sentiment from the little bit that i know of the radix-(2^n) FFT. that information could very well be stale. it's also in the Yellow O&S (1975 - p. 300, Fig. 6.13, and they say the same thing), so it's more than 3 decades old. sometimes i think that O&S is timeless, other times it's just dated. r b-j
Reply by Steven G. Johnson November 28, 2007
On Nov 28, 7:11 pm, robert bristow-johnson <r...@audioimagination.com>
wrote:
> okay, Steven, using O&S for illustration, the pointer changes look
> pretty gnarly, even for N=8, but even if it's not, it's not in-place,
> is it? and if it's not in-place, is there not a cache hit? assuming
> you have the (double) memory and that cost is not a problem, you now
> have the top and bottom (i'm gonna assume radix-2 for the moment) of
> the butterfly you're reading (from the previous pass) and the top and
> bottom of the butterfly you are writing (which would be the same
> locations if you were doing this in-place). if it is not in-place,
> then you are ping-ponging and i would expect the number of cache
> misses to double.
First of all, I have no idea what you are talking about with "gnarly pointer changes" for N=8. An N=8 FFT is 60 floating-point operations that can be implemented in straight-line code: read into local variables (= registers), do calculations, write results.

Second, there are in-place algorithms that do not require a separate bit-reversal pass, some of which I pointed out in another message on this thread. Basically, they interleave the transpositions (bit-reversal = product of transpositions) with the computational stages.

Third, an out-of-place algorithm does not require that you "ping-pong" between two arrays (as in the Stockham algorithm). Actually, only one stage of the computation needs to work out-of-place (as in the recursive example code I posted) to write things in the right place; all other stages of the computation can work in-place. And if you look at the FFTW benchmarks, you'll see that properly written out-of-place code can indeed be very competitive.

Fourth, even if you have to do additional data motion or copying stages, sometimes this can be advantageous because it makes the data access during computation more consecutive. e.g. the Stockham ("ping-pong") algorithm was originally designed for vector machines where consecutive access was a huge win, and even now it has locality benefits (although we currently find that other algorithms work better for us). Or e.g. in the radix-sqrt(n) approach (dubbed the "four-step" and "six-step" algorithms by Bailey, although variants date back to Gentleman and Sande) you do explicit transpositions---essentially, a part of the bit-reversal permutation---interleaved with the computation, in order to improve locality, and this is a win for very large N or for very slow memory (e.g. distributed-memory systems).

Fifth, one is crazy to do radix-2 these days. You need to do the work in larger chunks to take full advantage of the CPU. I've never seen a radix-2 FFT that was not several times slower than the fastest code.

Ultimately, however, machines are complicated enough these days that theorizing about performance is not very persuasive; you have to actually try the different algorithms, and sometimes the results are unexpected. In the case of FFTs, people have tried quite a few algorithms and there are many examples of people finding performance benefits from algorithms that avoid separate bit-reversal passes.

Regards,
Steven G. Johnson
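[For concreteness, here is a minimal iterative sketch of the Stockham "ping-pong" pattern mentioned above. It is a paraphrase of the textbook algorithm, not code posted in this thread; the function name stockham_fft and its interface are illustrative assumptions. It assumes C99 complex support and a power-of-two N. Each pass reads one array and writes the other, and the output indexing undoes one digit of the bit-reversal as it goes, so the result comes out in normal order with no separate permutation pass:

#include <complex.h>
#define TWOPI 6.2831853071795864769252867665590057683943388

/* Stockham autosort FFT sketch: x = input/work array, y = scratch,
   both of length N; returns a pointer to whichever array holds the
   result (which one depends on the parity of log2(N)) */
double complex *stockham_fft(int N, double complex *x, double complex *y)
{
    double complex *src = x, *dst = y, *tmp;
    for (int half = N/2, s = 1; half >= 1; half /= 2, s *= 2) {
        /* one pass of radix-2 butterflies over sub-transforms of
           length 2*half, each repeated across s interleaved columns */
        for (int j = 0; j < half; ++j) {
            double complex w = cexp(-I * TWOPI * j / (2*half));
            for (int k = 0; k < s; ++k) {
                double complex a = src[k + s*j];
                double complex b = src[k + s*(j + half)];
                dst[k + s*(2*j)]   = a + b;         /* consecutive writes */
                dst[k + s*(2*j+1)] = (a - b) * w;
            }
        }
        tmp = src; src = dst; dst = tmp;  /* ping-pong the two arrays */
    }
    return src;  /* normal-order result; no bit-reversal pass anywhere */
}

Call it as, e.g., X = stockham_fft(N, in, scratch). The consecutive reads and writes in the inner loop are exactly the property that made this layout attractive on vector machines, per Steven's point above.]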
Reply by Ron N. November 28, 2007
On Nov 28, 4:11 pm, robert bristow-johnson <r...@audioimagination.com>
wrote:
> On Nov 28, 5:30 pm, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:
>
> > On Nov 27, 1:46 pm, robert bristow-johnson
> >
> > > Steven, what's different (in terms of code) is the pointer
> > > arithmetic. doing all sorts of goofy pointer arithmetic might be more
> > > costly, in the aggregate throughout all of the FFT passes, than a
> > > single bit-reversing pass at the beginning or end.
> >
> > As someone who has actually implemented many of the algorithms without
> > bit reversal, I can say that (a) the pointer arithmetic is not really
> > that complicated, (b) the pointer arithmetic per se is not the
> > dominant factor in the time anyway (compared to the memory access
> > itself), and (c) it really does seem to be faster on modern
> > general-purpose CPUs than the methods involving a separate array pass.
>
> okay, Steven, using O&S for illustration, the pointer changes look
> pretty gnarly, even for N=8, but even if it's not, it's not in-place,
> is it? and if it's not in-place, is there not a cache hit? assuming
> you have the (double) memory and that cost is not a problem, you now
> have the top and bottom (i'm gonna assume radix-2 for the moment) of
> the butterfly you're reading (from the previous pass) and the top and
> bottom of the butterfly you are writing (which would be the same
> locations if you were doing this in-place). if it is not in-place,
> then you are ping-ponging and i would expect the number of cache
> misses to double.
There are alternatives to either in-place storage or ping-ponging of all butterfly results. Gnarly software tagging/caching/swapping indeed; but even gnarly stuff could look close to free on systems where a single dcache miss might support the execution of hundred(s) of address-computation instructions underneath it.

IMHO. YMMV.
--
rhn A.T nicholson d.0.t C-o-M
Reply by robert bristow-johnson November 28, 2007
On Nov 28, 5:30 pm, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:
> On Nov 27, 1:46 pm, robert bristow-johnson
>
> > Steven, what's different (in terms of code) is the pointer
> > arithmetic. doing all sorts of goofy pointer arithmetic might be more
> > costly, in the aggregate throughout all of the FFT passes, than a
> > single bit-reversing pass at the beginning or end.
>
> As someone who has actually implemented many of the algorithms without
> bit reversal, I can say that (a) the pointer arithmetic is not really
> that complicated, (b) the pointer arithmetic per se is not the
> dominant factor in the time anyway (compared to the memory access
> itself), and (c) it really does seem to be faster on modern
> general-purpose CPUs than the methods involving a separate array pass.
okay, Steven, using O&S for illustration, the pointer changes look pretty gnarly, even for N=8, but even if it's not, it's not in-place, is it? and if it's not in-place, is there not a cache hit? assuming you have the (double) memory and that cost is not a problem, you now have the top and bottom (i'm gonna assume radix-2 for the moment) of the butterfly you're reading (from the previous pass) and the top and bottom of the butterfly you are writing (which would be the same locations if you were doing this in-place). if it is not in-place, then you are ping-ponging and i would expect the number of cache misses to double.
> > in addition, as i tried to point out earlier, if you're doing
> > something like fast-convolution using the FFT, then if your forward
> > FFT leaves its results in bit-reversed order and your inverse FFT
> > takes its input in bit-reversed order (and leaves the results in
> > normal order), then no bit reversing will be necessary. (the transfer
> > function, besides being pre-computed, will also be pre-bit-reversed.)
>
> It's well known that you don't need normal-order output for
> convolutions. I was concerned with the case where you do want normal
> order (which most users seem to want, even if they could conceivably
> do without it). However, in the case of convolutions, I'm
> convinced that the optimal algorithm will not be to do two completely
> separate FFTs in any case,
the FFT and iFFT kernels are pretty small. so you have two short snippets of code rather than one. i can't imagine that being a cache-miss problem. what else could be the problem?
> regardless of the output/input order; I
> have some ideas on this that I want to explore for FFTW 4 (which will
> have built-in convolution routines).
well, let us know. even though there are some things you're saying here that either i just don't understand (in a context where i think i *should* be understanding) or find a little contentious, everyone here, including me, recognizes you as the FFT guru. but the maths and other pertinent facts don't really give a fig who is the guru or what the guru says. that's the only reason i take some of these issues on.

bestest,
r b-j

P.S. BTW, i think we now work within 15 kilometers of each other. i wouldn't mind hooking up (maybe at some kind of function, but the Boston section AES meetings seem to be pretty lame as of late). anyway, if you know of some signal proc or similar seminar or something (or some FFTW thing you're doing), please lemme know. i'd love to see it and shake your hand.
Reply by Steven G. Johnson November 28, 2007
On Nov 27, 1:46 pm, robert bristow-johnson
> Steven, what's different (in terms of code) is the pointer
> arithmetic. doing all sorts of goofy pointer arithmetic might be more
> costly, in the aggregate throughout all of the FFT passes, than a
> single bit-reversing pass at the beginning or end.
As someone who has actually implemented many of the algorithms without bit reversal, I can say that (a) the pointer arithmetic is not really that complicated, (b) the pointer arithmetic per se is not the dominant factor in the time anyway (compared to the memory access itself), and (c) it really does seem to be faster on modern general-purpose CPUs than the methods involving a separate array pass.
> in addition, as i tried to point out earlier, if you're doing
> something like fast-convolution using the FFT, then if your forward
> FFT leaves its results in bit-reversed order and your inverse FFT
> takes its input in bit-reversed order (and leaves the results in
> normal order), then no bit reversing will be necessary. (the transfer
> function, besides being pre-computed, will also be pre-bit-reversed.)
It's well known that you don't need normal-order output for convolutions. I was concerned with the case where you do want normal order (which most users seem to want, even if they could conceivably do without it). However, in the case of convolutions, I'm convinced that the optimal algorithm will not be to do two completely separate FFTs in any case, regardless of the output/input order; I have some ideas on this that I want to explore for FFTW 4 (which will have built-in convolution routines).

Regards,
Steven G. Johnson
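[As a concrete illustration of the no-reordering convolution scheme rbj describes above: pair a decimation-in-frequency forward transform (normal-order in, bit-reversed out) with a decimation-in-time inverse (bit-reversed in, normal-order out), so the bit-reversed spectra are only ever multiplied pointwise, where element order doesn't matter. A minimal unoptimized C99 sketch for illustration, not code from anyone in this thread; the names fft_dif, ifft_dit, and fastconv are assumptions, and N must be a power of 2:

#include <complex.h>
#define TWOPI 6.2831853071795864769252867665590057683943388

/* decimation in frequency: normal-order input, bit-reversed output,
   in-place (Gentleman-Sande butterflies) */
void fft_dif(int N, double complex *x)
{
    for (int L = N; L >= 2; L /= 2) {
        int h = L/2;
        for (int s = 0; s < N; s += L)
            for (int j = 0; j < h; ++j) {
                double complex w = cexp(-I * TWOPI * j / L);
                double complex a = x[s+j], b = x[s+j+h];
                x[s+j]   = a + b;
                x[s+j+h] = (a - b) * w;   /* twiddle after the butterfly */
            }
    }
}

/* decimation in time with conjugate twiddles: bit-reversed input,
   normal-order output, in-place, unscaled inverse */
void ifft_dit(int N, double complex *x)
{
    for (int L = 2; L <= N; L *= 2) {
        int h = L/2;
        for (int s = 0; s < N; s += L)
            for (int j = 0; j < h; ++j) {
                double complex w = cexp(I * TWOPI * j / L);
                double complex a = x[s+j], b = x[s+j+h] * w;
                x[s+j]   = a + b;
                x[s+j+h] = a - b;
            }
    }
}

/* circular convolution with a fixed filter: H is the fft_dif of the
   impulse response, precomputed once and therefore already stored in
   bit-reversed order -- no bit-reversal pass anywhere */
void fastconv(int N, double complex *x, const double complex *H)
{
    fft_dif(N, x);                 /* bit-reversed spectrum of x */
    for (int i = 0; i < N; ++i)
        x[i] *= H[i] / N;          /* orders match; fold in the 1/N */
    ifft_dit(N, x);                /* normal-order result */
}

Both spectra carry the same permutation, so the pointwise product is unaffected; the permutation cancels between the DIF forward and DIT inverse passes.]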
Reply by Richard Owlett November 28, 2007
Steven G. Johnson wrote:
> On Nov 24, 9:45 am, Richard Owlett <rowl...@atlascomm.net> wrote:
>
>> In "Chapter 12 - The Fast Fourier Transform / FFT Programs" of _The
>> Scientist and Engineer's Guide to Digital Signal Processing_
>> Smith says "There are also FFT routines that completely eliminate the
>> bit reversal sorting."
>>
>> Could someone point me to a description of one, preferably with BASIC or
>> FORTRAN sample code.
>
> A number of references are given in the Wikipedia article on the
> Cooley-Tukey FFT algorithm.
>
> http://en.wikipedia.org/wiki/Cooley-Tukey_FFT_algorithm
One of its references is to a paper titled "The FFT: an algorithm the whole family can use" (http://www.cs.dartmouth.edu/~rockmore/cse-fft.pdf). I've taken a quick look and I may be in the target audience.
Reply by Ron N. November 28, 2007
On Nov 27, 9:24 am, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:
> On Nov 24, 9:45 am, Richard Owlett <rowl...@atlascomm.net> wrote:
>
> > In "Chapter 12 - The Fast Fourier Transform / FFT Programs" of _The
> > Scientist and Engineer's Guide to Digital Signal Processing_
> > Smith says "There are also FFT routines that completely eliminate the
> > bit reversal sorting."
> >
> > Could someone point me to a description of one, preferably with BASIC or
> > FORTRAN sample code.
>
> A number of references are given in the Wikipedia article on the
> Cooley-Tukey FFT algorithm.
>
> http://en.wikipedia.org/wiki/Cooley-Tukey_FFT_algorithm
>
> The Cooley-Tukey FFT (the most common algorithm), along with other FFT
> algorithms for that matter, intrinsically requires some sort of
> re-ordering, in the sense that its operations map consecutive inputs to
> non-consecutive outputs and vice-versa.
>
> The most well-known approach is to do the re-ordering all at once,
> with a single bit-reversal pass (or, more generally, a mixed-base
> digit reversal). Another approach that still works in-place is to
> arrange the sequence of radices so that one can do a sequence of small
> transpositions interleaved with the transform (alternating DIT/DIF
> stages work well here). The simplest approaches work out-of-place.
> e.g. the Stockham algorithm goes back and forth between two arrays
> with each pass, transposing one digit at a time.
>
> The easiest approach, in my mind, is simply to do the algorithm
> recursively out-of-place; if you directly transcribe the Cooley-Tukey
> algorithm from the mathematical description, working recursively, all
> of the outputs are "automatically" put in the right place. (This was
> the approach we used in the original FFTW; the current version of FFTW
> implements a variety of algorithms as described in our Proc. IEEE
> paper.)
>
> For example, here is a very simple 15-line radix-2 FFT, working
> recursively and out-of-place, that does not require any separate
> bit-reversal pass. (To make it reasonably fast, one would want to use a
> larger base case than the size-1 base case used here. It also uses a
> relatively inaccurate trigonometric recurrence. But of course, if you
> were actually interested in such practical issues you should be using
> a pre-existing FFT library.)
>
> #include <complex.h>
> #define TWOPI 6.2831853071795864769252867665590057683943388
> /* call with rec_fft(+1 or -1, N, input, output, 1, 1), where input
>    and output are arrays of length N ...
>    requires C99 complex-number support */
> void rec_fft(int sign, int N, complex double *x, complex double *y,
>              int is, int os)
> {
>      if (N == 1) *y = *x;
>      else {
>           int N2 = N/2;
>           rec_fft(sign, N2, x, y, is*2, os);
>           rec_fft(sign, N2, x+is, y+os*N2, is*2, os);
>           complex double omega = 1.0, omega1 = cexp(sign * I * TWOPI / N);
>           for (int i = 0; i < N2; ++i) {
>                complex double a = y[os*i], b = omega * y[os*(i + N2)];
>                y[os*i] = a + b;
>                y[os*(i + N2)] = a - b;
>                omega *= omega1;
>           }
>      }
> }
>
> Regards,
> Steven G. Johnson
Richard wanted the FFT routines in BASIC, so I converted the above routine into Basic for Chipmunk Basic, changing pointer arithmetic into array access, converting complex data types back to standard float variables, and removing the trig recursion. A Basic implementation that allows recursive subroutines with locally scoped variables is required.

I found that the line in the above C code:
> if (N == 1) *y = *x;
does pretty much the same thing as the final sort of a generic in-place decimation-in-frequency FFT algorithm, except distributing the sort among the butterflies, and calculating the bit-reversed indexes recursively.

rem *** recursive out-of-place radix-2 FFT in Basic ***
rem
rem *** call as: rec_fft(flag, n, xr,xi,0, yr,yi,0, 1,1)
rem     flag = 1 for FFT, -1 for IFFT
rem     n is FFT length
rem     xr,xi are input arrays (must be dimensioned to length >= n)
rem     yr,yi are result arrays (must be dimensioned to length >= n)
rem     (kx,ky,ks,os) are sub-vector parameter state variables
rem     (n2,i,c,s,k1,k2,ar,ai,br,bi) are local variables
rem
sub rec_fft(flag,n,xr,xi,kx,yr,yi,ky,ks,os, n2,i,c,s,k1,k2,ar,ai,br,bi)
  if (n = 1)
    rem ** this does a bit-reversed-index copy and scale **
    rem    print ky,kx to see post (DiF) sorting
    s = 1 : if (flag = -1) then s = 1/ks
    yr(ky) = xr(kx) * s
    yi(ky) = xi(kx) * s
  else
    n2 = n/2
    rec_fft(flag,n2, xr,xi,kx,    yr,yi,ky,       ks*2,os)
    rec_fft(flag,n2, xr,xi,kx+ks, yr,yi,ky+os*n2, ks*2,os)
    for i = 0 to n2-1
      c = cos(i * 2*pi/n)
      s = sin(i * flag * 2*pi/n)
      k1 = ky+os*(i)
      k2 = ky+os*(i+n2)
      ar = yr(k1)
      ai = yi(k1)
      br = c * yr(k2) - s * yi(k2)
      bi = c * yi(k2) + s * yr(k2)
      yr(k1) = ar + br
      yi(k1) = ai + bi
      yr(k2) = ar - br
      yi(k2) = ai - bi
    next i
  endif
end sub : rem end rec_fft

If the above code formats wrong (old-fashioned Basic is line-break sensitive), a copy is here:

http://www.nicholson.com/dsp.fft1.html

IMHO. YMMV.
--
rhn A.T nicholson d.0.t C-o-M
Reply by glen herrmannsfeldt November 27, 2007
Steven G. Johnson wrote:
(snip regarding bit-reversal in FFTs)

> The most well-known approach is to do the re-ordering all at once,
> with a single bit-reversal pass (or, more generally, a mixed-base
> digit reversal). Another approach that still works in-place is to
> arrange the sequence of radices so that one can do a sequence of small
> transpositions interleaved with the transform (alternating DIT/DIF
> stages work well here). The simplest approaches work out-of-place.
> e.g. the Stockham algorithm goes back and forth between two arrays
> with each pass, transposing one digit at a time.
I still remember a Fortran FFT subroutine from long ago that did bit reversal using nested DO loops. Depending on N (the number of points) it would (illegally) GOTO the appropriate nested DO and (maybe) GOTO out at the appropriate point. Not very portable, but it must have worked on at least one machine.

-- glen
Reply by SteveSmith November 27, 2007
Hi Richard,
I haven't seen these alternate FFTs in years, but I can look around if you have a need. However, they are just as cryptic and difficult to understand as the conventional ones. Also, I doubt that you could modify one to use square waves. It sounds like you should be using the "DFT by Correlation" method. See page 157 in Chapter 8 of my on-line book (www.DSPguide.com). This can directly be modified to use square waves. Good luck and let me know if I can help.
Regards,
Steve

Steve.Smith1@SpectrumSDI.com