Reply by Steven G. Johnson November 30, 2007
On Nov 29, 11:37 pm, robert bristow-johnson
<r...@audioimagination.com> wrote:
> well, i said "for illustration". in particular i was referring to
> Fig. 9-15 (p. 597) in my 1989 version of O&S. certainly if N=8 you
> can do it with straight-line code. that's beside the point. i was
> trying to infer, from the illustration, what the rhyme or reason there
> is in the particular pointer manipulation from that N=8 illustration.
> it still looks gnarly. for the case of a general N (say, a power of
> 2) and much bigger than 2^3, what is the tight and efficient algorithm
> for determining the order of indexing for the node that you are
> writing to, and once you've determined that, what are the nodes that
> you're reading from? except for the first pass (which is just like
> the first pass of an in-place DIT), i cannot see what that rhyme or
> reason is. i imagine there may be one, but it doesn't look simple
> or elegant.
No, there are certainly simple patterns to the memory access for in-place algorithms without bit-reversal. They are not that complicated to implement (as my 15-line example in another post shows). You absolutely do not have to use a look-up table of indices! (One of the many reasons why I feel that butterfly diagrams are next to useless in truly understanding FFT algorithms.)

(Of course, if you compare a textbook radix-2 algorithm to a highly optimized in-order FFT like FFTW or MKL, the latter will be way faster even if you skip the bit-reversal stage. Highly optimized algorithms are always going to be way more complicated than textbook radix-2, as long as CPUs remain so complex.)
> "In the flow graph of Fig. 9-15, the data are accessed > nonsequentially, the computation is not in place, and a > scheme for indexing the data is considerably more > complicated than in either of the two previous cases [two > in-place DIT structures]. Consequently, this structure > has no apparent advantages."
One reason to be cautious here is that O&S is comparing apples and oranges: FFTs with bit-reversed input or output, and FFTs with normal-order output. Of course, if you can get away with permuted output it will be simpler. That's not the comparison I'm making: I'm comparing only FFTs with normal-order output, that differ in whether you have a separate bit-reversal pass or whether you combine the permutations with the computation.
> i was reverberating that sentiment from the little bit that i know of
> the radix-(2^n) FFT. that information could very well be stale. it's
> also in the Yellow O&S (1975 - p. 300, Fig. 6.13, and they say the
> same thing), so it's more than 3 decades old.
>
> sometimes i think that O&S is timeless, other times it's just dated.
Comments on performance advantages from FFT algorithms that avoid a separate bit-reversal pass date back to the 60's, so in some sense it's not a matter of being out-of-date. On the other hand, it seems to be especially in the last 20 years that floating-point performance of CPUs has totally outstripped memory performance, and that makes non-arithmetic issues so much more relevant now.

The main problem is that textbooks are trying to teach students the basic ideas of FFT algorithms rather than performance hacks, and therefore they almost always limit their discussion to only two algorithms for power-of-two N: radix-2 decimation in time and radix-2 decimation in frequency. If you limit yourself to these two cases, you can't really have a useful discussion of performance optimization on modern CPUs, where performance is limited not by arithmetic counts but rather by memory access and register/pipeline utilization.

Steven
Reply by robert bristow-johnson November 30, 2007
On Nov 28, 5:30 pm, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:

> On Nov 27, 1:46 pm, robert bristow-johnson
> > Steven, what's different (in terms of code) is the pointer
> > arithmetic. doing all sorts of goofy pointer arithmetic might be more
> > costly, in the aggregate throughout all of the FFT passes, than a
> > single bit-reversing pass at the beginning or end.
> As someone who has actually implemented many of the algorithms without
> bit reversal, I can say that (a) the pointer arithmetic is not really
> that complicated, (b) the pointer arithmetic per se is not the
> dominant factor in the time anyway (compared to the memory access
> itself), and (c) it really does seem to be faster on modern
> general-purpose CPUs than the methods involving a separate array pass.
On Nov 28, 9:16 pm, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:
> On Nov 28, 7:11 pm, robert bristow-johnson <r...@audioimagination.com>
> wrote:
>
> > okay, Steven, using O&S for illustration, the pointer changes look
> > pretty gnarly, even for N=8, but even if it's not, it's not in-place,
> > is it? and if it's not in-place, is there not a cache hit? assuming
> > you have the (double) memory and that cost is not a problem, you now
> > have the top and bottom (i'm gonna assume radix-2 for the moment) of
> > the butterfly you're reading (from the previous pass) and the top and
> > bottom of the butterfly you are writing (which would be the same
> > locations if you were doing this in-place). if it is not in-place,
> > then you are ping-ponging and i would expect the number of cache
> > misses to double.
>
> First of all, I have no idea what you are talking about with "gnarly
> pointer changes" for N=8. An N=8 FFT is 60 floating-point operations
> that can be implemented in straight-line code: read into local
> variables (= registers), do calculations, write results.
well, i said "for illustration". in particular i was referring to Fig. 9-15 (p. 597) in my 1989 version of O&S. certainly if N=8 you can do it with straight-line code. that's beside the point. i was trying to infer, from the illustration, what the rhyme or reason there is in the particular pointer manipulation from that N=8 illustration. it still looks gnarly. for the case of a general N (say, a power of 2) and much bigger than 2^3, what is the tight and efficient algorithm for determining the order of indexing for the node that you are writing to, and once you've determined that, what are the nodes that you're reading from? except for the first pass (which is just like the first pass of an in-place DIT), i cannot see what that rhyme or reason is. i'm imagine there may be one, but it doesn't look simple or elegant. so how could this be coded compactly and efficiently for a general N = 2^p? i cannot see how it is done, while i do not deny it can be done. i just don't see it. surely, if you have to look up those node indices from a table (and a different table for every N?), that may be quick, but it's not compact. this is what i mean by "goofy pointer arithmetic". and, it's clear that it is not in-place after the first pass. there *must* be another array space to write to. so, there clearly is a scheme you know about (i never thought otherwise) that is something i hadn't seen. but using the one reference to a normal-in, normal-out structure that i have seen, i'll put it the way O&S did (p. 598): "In the flow graph of Fig. 9-15, the data are accessed nonsequentially, the computation is not in place, and a scheme for indexing the data is considerably more complicated than in either of the two previous cases [two in-place DIT structures]. Consequently, this structure has no apparent advantages." i was reverberating that sentiment from the little bit that i know of the radix-(2^n) FFT. that information could very well be stale. it's also in the Yellow O&S (1975 - p. 300, Fig. 6.13, and they say the same thing), so it's more than 3 decades old. sometimes i think that O&S is timeless, other times it's just dated. r b-j
Reply by Steven G. Johnson November 28, 2007
On Nov 28, 7:11 pm, robert bristow-johnson <r...@audioimagination.com>
wrote:
> okay, Steven, using O&S for illustration, the pointer changes look
> pretty gnarly, even for N=8, but even if it's not, it's not in-place,
> is it? and if it's not in-place, is there not a cache hit? assuming
> you have the (double) memory and that cost is not a problem, you now
> have the top and bottom (i'm gonna assume radix-2 for the moment) of
> the butterfly you're reading (from the previous pass) and the top and
> bottom of the butterfly you are writing (which would be the same
> locations if you were doing this in-place). if it is not in-place,
> then you are ping-ponging and i would expect the number of cache
> misses to double.
First of all, I have no idea what you are talking about with "gnarly pointer changes" for N=8. An N=8 FFT is 60 floating-point operations that can be implemented in straight-line code: read into local variables (= registers), do calculations, write results.

Second, there are in-place algorithms that do not require a separate bit-reversal pass, some of which I pointed out in another message on this thread. Basically, they interleave the transpositions (bit-reversal = product of transpositions) with the computational stages.

Third, an out-of-place algorithm does not require that you "ping-pong" between two arrays (as in the Stockham algorithm). Actually, only one stage of the computation needs to work out-of-place (as in the recursive example code I posted) to write things in the right place; all other stages of the computation can work in-place. And if you look at the FFTW benchmarks, you'll see that properly written out-of-place code can indeed be very competitive.

Fourth, even if you have to do additional data motion or copying stages, sometimes this can be advantageous because it makes the data access during computation more consecutive. e.g. the Stockham ("ping-pong") algorithm was originally designed for vector machines where consecutive access was a huge win, and even now it has locality benefits (although we currently find that other algorithms work better for us). Or e.g. in the radix-sqrt(n) approach (dubbed the "four-step" and "six-step" algorithms by Bailey, although variants date back to Gentleman and Sande) you do explicit transpositions---essentially, a part of the bit-reversal permutation---interleaved with the computation, in order to improve locality, and this is a win for very large N or for very slow memory (e.g. distributed-memory systems).

Fifth, one is crazy to do radix-2 these days. You need to do the work in larger chunks to take full advantage of the CPU. I've never seen a radix-2 FFT that was not several times slower than the fastest code.

Ultimately, however, machines are complicated enough these days that theorizing about performance is not very persuasive; you have to actually try the different algorithms, and sometimes the results are unexpected. In the case of FFTs, people have tried quite a few algorithms and there are many examples of people finding performance benefits from algorithms that avoid separate bit-reversal passes.

Regards,
Steven G. Johnson
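[For concreteness, here is a minimal iterative sketch of the Stockham "ping-pong" pattern mentioned above. It is a paraphrase of the textbook algorithm, not code posted in this thread; the function name stockham_fft and its interface are illustrative assumptions. It assumes C99 complex support and a power-of-two N. Each pass reads one array and writes the other, and the output indexing undoes one digit of the bit-reversal as it goes, so the result comes out in normal order with no separate permutation pass:

#include <complex.h>
#define TWOPI 6.2831853071795864769252867665590057683943388

/* Stockham autosort FFT sketch: x = input/work array, y = scratch,
   both of length N; returns a pointer to whichever array holds the
   result (which one depends on the parity of log2(N)) */
double complex *stockham_fft(int N, double complex *x, double complex *y)
{
    double complex *src = x, *dst = y, *tmp;
    for (int half = N/2, s = 1; half >= 1; half /= 2, s *= 2) {
        /* one pass of radix-2 butterflies over sub-transforms of
           length 2*half, each repeated across s interleaved columns */
        for (int j = 0; j < half; ++j) {
            double complex w = cexp(-I * TWOPI * j / (2*half));
            for (int k = 0; k < s; ++k) {
                double complex a = src[k + s*j];
                double complex b = src[k + s*(j + half)];
                dst[k + s*(2*j)]   = a + b;         /* consecutive writes */
                dst[k + s*(2*j+1)] = (a - b) * w;
            }
        }
        tmp = src; src = dst; dst = tmp;  /* ping-pong the two arrays */
    }
    return src;  /* normal-order result; no bit-reversal pass anywhere */
}

Call it as, e.g., X = stockham_fft(N, in, scratch). The consecutive reads and writes in the inner loop are exactly the property that made this layout attractive on vector machines, per Steven's point above.]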
Reply by Ron N. November 28, 2007
On Nov 28, 4:11 pm, robert bristow-johnson <r...@audioimagination.com>
wrote:
> On Nov 28, 5:30 pm, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:
>
> > On Nov 27, 1:46 pm, robert bristow-johnson
> >
> > > Steven, what's different (in terms of code) is the pointer
> > > arithmetic. doing all sorts of goofy pointer arithmetic might be more
> > > costly, in the aggregate throughout all of the FFT passes, than a
> > > single bit-reversing pass at the beginning or end.
> >
> > As someone who has actually implemented many of the algorithms without
> > bit reversal, I can say that (a) the pointer arithmetic is not really
> > that complicated, (b) the pointer arithmetic per se is not the
> > dominant factor in the time anyway (compared to the memory access
> > itself), and (c) it really does seem to be faster on modern
> > general-purpose CPUs than the methods involving a separate array pass.
>
> okay, Steven, using O&S for illustration, the pointer changes look
> pretty gnarly, even for N=8, but even if it's not, it's not in-place,
> is it? and if it's not in-place, is there not a cache hit? assuming
> you have the (double) memory and that cost is not a problem, you now
> have the top and bottom (i'm gonna assume radix-2 for the moment) of
> the butterfly you're reading (from the previous pass) and the top and
> bottom of the butterfly you are writing (which would be the same
> locations if you were doing this in-place). if it is not in-place,
> then you are ping-ponging and i would expect the number of cache
> misses to double.
There are alternatives to either in-place storage or ping-ponging of all butterfly results. Gnarly software tagging/caching/swapping indeed; but even gnarly stuff could look close to free on systems where a single dcache miss might support the execution of hundred(s) of address-computation instructions underneath it.

IMHO. YMMV.
--
rhn A.T nicholson d.0.t C-o-M
Reply by robert bristow-johnson November 28, 2007
On Nov 28, 5:30 pm, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:
> On Nov 27, 1:46 pm, robert bristow-johnson
>
> > Steven, what's different (in terms of code) is the pointer
> > arithmetic. doing all sorts of goofy pointer arithmetic might be more
> > costly, in the aggregate throughout all of the FFT passes, than a
> > single bit-reversing pass at the beginning or end.
>
> As someone who has actually implemented many of the algorithms without
> bit reversal, I can say that (a) the pointer arithmetic is not really
> that complicated, (b) the pointer arithmetic per se is not the
> dominant factor in the time anyway (compared to the memory access
> itself), and (c) it really does seem to be faster on modern
> general-purpose CPUs than the methods involving a separate array pass.
okay, Steven, using O&S for illustration, the pointer changes look pretty gnarly, even for N=8, but even if it's not, it's not in-place, is it? and if it's not in-place, is there not a cache hit? assuming you have the (double) memory and that cost is not a problem, you now have the top and bottom (i'm gonna assume radix-2 for the moment) of the butterfly you're reading (from the previous pass) and the top and bottom of the butterfly you are writing (which would be the same locations if you were doing this in-place). if it is not in-place, then you are ping-ponging and i would expect the number of cache misses to double.
> > in addition, as i tried to point out earlier, if you're doing
> > something like fast-convolution using the FFT, then if your forward
> > FFT leaves its results in bit-reversed order and your inverse FFT
> > takes its input in bit-reversed order (and leaves the results in
> > normal order), then no bit reversing will be necessary. (the transfer
> > function, besides being pre-computed, will also be pre-bit-reversed.)
>
> It's well known that you don't need normal-order output for
> convolutions. I was concerned with the case where you do want normal
> order (which most users seem to want, even if they could conceivably
> do without it). However, in the case of convolutions, I'm
> convinced that the optimal algorithm will not be to do two completely
> separate FFTs in any case,
the FFT and iFFT kernels are pretty small. so you have two short snippets of code rather than one. i can't imagine that being a cache-miss problem. what else could be the problem?
> regardless of the output/input order; I
> have some ideas on this that I want to explore for FFTW 4 (which will
> have built-in convolution routines).
well, let us know. even though there are some things you're saying here that either i just don't understand (in a context where i think i *should* be understanding) or find a little contentious, everyone here, including me, recognizes you as the FFT guru. but the maths and other pertinent facts don't really give a fig who is the guru or what the guru says. that's the only reason i take some of these issues on.

bestest,
r b-j

P.S. BTW, i think we now work within 15 kilometers of each other. i wouldn't mind hooking up (maybe at some kind of function, but the Boston section AES meetings seem to be pretty lame as of late). anyway, if you know of some signal proc or similar seminar or something (or some FFTW thing you're doing), please lemme know. i'd love to see it and shake your hand.
Reply by Steven G. Johnson November 28, 2007
On Nov 27, 1:46 pm, robert bristow-johnson
> Steven, what's different (in terms of code) is the pointer
> arithmetic. doing all sorts of goofy pointer arithmetic might be more
> costly, in the aggregate throughout all of the FFT passes, than a
> single bit-reversing pass at the beginning or end.
As someone who has actually implemented many of the algorithms without bit reversal, I can say that (a) the pointer arithmetic is not really that complicated, (b) the pointer arithmetic per se is not the dominant factor in the time anyway (compared to the memory access itself), and (c) it really does seem to be faster on modern general-purpose CPUs than the methods involving a separate array pass.
> in addition, as i tried to point out earlier, if you're doing
> something like fast-convolution using the FFT, then if your forward
> FFT leaves its results in bit-reversed order and your inverse FFT
> takes its input in bit-reversed order (and leaves the results in
> normal order), then no bit reversing will be necessary. (the transfer
> function, besides being pre-computed, will also be pre-bit-reversed.)
It's well known that you don't need normal-order output for convolutions. I was concerned with the case where you do want normal order (which most users seem to want, even if they could conceivably do without it). However, in the case of convolutions, I'm convinced that the optimal algorithm will not be to do two completely separate FFTs in any case, regardless of the output/input order; I have some ideas on this that I want to explore for FFTW 4 (which will have built-in convolution routines).

Regards,
Steven G. Johnson
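[As a concrete illustration of the no-reordering convolution scheme rbj describes above: pair a decimation-in-frequency forward transform (normal-order in, bit-reversed out) with a decimation-in-time inverse (bit-reversed in, normal-order out), so the bit-reversed spectra are only ever multiplied pointwise, where element order doesn't matter. A minimal unoptimized C99 sketch for illustration, not code from anyone in this thread; the names fft_dif, ifft_dit, and fastconv are assumptions, and N must be a power of 2:

#include <complex.h>
#define TWOPI 6.2831853071795864769252867665590057683943388

/* decimation in frequency: normal-order input, bit-reversed output,
   in-place (Gentleman-Sande butterflies) */
void fft_dif(int N, double complex *x)
{
    for (int L = N; L >= 2; L /= 2) {
        int h = L/2;
        for (int s = 0; s < N; s += L)
            for (int j = 0; j < h; ++j) {
                double complex w = cexp(-I * TWOPI * j / L);
                double complex a = x[s+j], b = x[s+j+h];
                x[s+j]   = a + b;
                x[s+j+h] = (a - b) * w;   /* twiddle after the butterfly */
            }
    }
}

/* decimation in time with conjugate twiddles: bit-reversed input,
   normal-order output, in-place, unscaled inverse */
void ifft_dit(int N, double complex *x)
{
    for (int L = 2; L <= N; L *= 2) {
        int h = L/2;
        for (int s = 0; s < N; s += L)
            for (int j = 0; j < h; ++j) {
                double complex w = cexp(I * TWOPI * j / L);
                double complex a = x[s+j], b = x[s+j+h] * w;
                x[s+j]   = a + b;
                x[s+j+h] = a - b;
            }
    }
}

/* circular convolution with a fixed filter: H is the fft_dif of the
   impulse response, precomputed once and therefore already stored in
   bit-reversed order -- no bit-reversal pass anywhere */
void fastconv(int N, double complex *x, const double complex *H)
{
    fft_dif(N, x);                 /* bit-reversed spectrum of x */
    for (int i = 0; i < N; ++i)
        x[i] *= H[i] / N;          /* orders match; fold in the 1/N */
    ifft_dit(N, x);                /* normal-order result */
}

Both spectra carry the same permutation, so the pointwise product is unaffected; the permutation cancels between the DIF forward and DIT inverse passes.]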
Reply by Richard Owlett November 28, 2007
Steven G. Johnson wrote:
> On Nov 24, 9:45 am, Richard Owlett <rowl...@atlascomm.net> wrote:
>
>> In "Chapter 12 - The Fast Fourier Transform / FFT Programs" of _The
>> Scientist and Engineer's Guide to Digital Signal Processing_
>> Smith says "There are also FFT routines that completely eliminate the
>> bit reversal sorting."
>>
>> Could someone point me to a description of one, preferably with BASIC or
>> FORTRAN sample code.
>
> A number of references are given in the Wikipedia article on the
> Cooley-Tukey FFT algorithm.
>
> http://en.wikipedia.org/wiki/Cooley-Tukey_FFT_algorithm
One of its references is to a paper titled "The FFT: an algorithm the whole family can use" (http://www.cs.dartmouth.edu/~rockmore/cse-fft.pdf). I've taken a quick look and I may be in the target audience.
Reply by Ron N. November 28, 2007
On Nov 27, 9:24 am, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:
> On Nov 24, 9:45 am, Richard Owlett <rowl...@atlascomm.net> wrote:
>
> > In "Chapter 12 - The Fast Fourier Transform / FFT Programs" of _The
> > Scientist and Engineer's Guide to Digital Signal Processing_
> > Smith says "There are also FFT routines that completely eliminate the
> > bit reversal sorting."
> >
> > Could someone point me to a description of one, preferably with BASIC or
> > FORTRAN sample code.
>
> A number of references are given in the Wikipedia article on the
> Cooley-Tukey FFT algorithm.
>
> http://en.wikipedia.org/wiki/Cooley-Tukey_FFT_algorithm
>
> The Cooley-Tukey FFT (the most common algorithm), along with other FFT
> algorithms for that matter, intrinsically requires some sort of
> re-ordering, in the sense that its operations map consecutive inputs to
> non-consecutive outputs and vice-versa.
>
> The most well-known approach is to do the re-ordering all at once,
> with a single bit-reversal pass (or, more generally, a mixed-base
> digit reversal). Another approach that still works in-place is to
> arrange the sequence of radices so that one can do a sequence of small
> transpositions interleaved with the transform (alternating DIT/DIF
> stages work well here). The simplest approaches work out-of-place.
> e.g. the Stockham algorithm goes back and forth between two arrays
> with each pass, transposing one digit at a time.
>
> The easiest approach, in my mind, is simply to do the algorithm
> recursively out-of-place; if you directly transcribe the Cooley-Tukey
> algorithm from the mathematical description, working recursively, all
> of the outputs are "automatically" put in the right place. (This was
> the approach we used in the original FFTW; the current version of FFTW
> implements a variety of algorithms as described in our Proc. IEEE
> paper.)
>
> For example, here is a very simple 15-line radix-2 FFT, working
> recursively and out-of-place, that does not require any separate
> bit-reversal pass. (To make it reasonably fast, one would want to use a
> larger base case than the size-1 base case used here. It also uses a
> relatively inaccurate trigonometric recurrence. But of course, if you
> were actually interested in such practical issues you should be using
> a pre-existing FFT library.)
>
> #include <complex.h>
> #define TWOPI 6.2831853071795864769252867665590057683943388
> /* call with rec_fft(+1 or -1, N, input, output, 1, 1), where input
>    and output are arrays of length N ...
>    requires C99 complex-number support */
> void rec_fft(int sign, int N, complex double *x, complex double *y,
>              int is, int os)
> {
>      if (N == 1) *y = *x;
>      else {
>           int N2 = N/2;
>           rec_fft(sign, N2, x, y, is*2, os);
>           rec_fft(sign, N2, x+is, y+os*N2, is*2, os);
>           complex double omega = 1.0, omega1 = cexp(sign * I * TWOPI / N);
>           for (int i = 0; i < N2; ++i) {
>                complex double a = y[os*i], b = omega * y[os*(i + N2)];
>                y[os*i] = a + b;
>                y[os*(i + N2)] = a - b;
>                omega *= omega1;
>           }
>      }
> }
>
> Regards,
> Steven G. Johnson
Richard wanted the FFT routines in BASIC, so I converted the above routine into Basic for Chipmunk Basic, changing pointer arithmetic into array access, converting complex data types back to standard float variables, and removing the trig recursion. A Basic implementation that allows recursive subroutines with locally scoped variables is required.

I found that the line in the above C code:
> if (N == 1) *y = *x;
does pretty much the same thing as the final sort of a generic in-place decimation-in-frequency FFT algorithm, except distributing the sort among the butterflies, and calculating the bit-reversed indexes recursively.

rem *** recursive out-of-place radix-2 FFT in Basic ***
rem
rem *** call as: rec_fft(flag, n, xr,xi,0, yr,yi,0, 1,1)
rem     flag = 1 for FFT, -1 for IFFT
rem     n is FFT length
rem     xr,xi are input arrays (must be dimensioned to length >= n)
rem     yr,yi are result arrays (must be dimensioned to length >= n)
rem     (kx,ky,ks,os) are sub-vector parameter state variables
rem     (n2,i,c,s,k1,k2,ar,ai,br,bi) are local variables
rem
sub rec_fft(flag,n,xr,xi,kx,yr,yi,ky,ks,os, n2,i,c,s,k1,k2,ar,ai,br,bi)
  if (n = 1)
    rem ** this does a bit-reversed-index copy and scale **
    rem    print ky,kx to see post (DiF) sorting
    s = 1 : if (flag = -1) then s = 1/ks
    yr(ky) = xr(kx) * s
    yi(ky) = xi(kx) * s
  else
    n2 = n/2
    rec_fft(flag,n2, xr,xi,kx,    yr,yi,ky,       ks*2,os)
    rec_fft(flag,n2, xr,xi,kx+ks, yr,yi,ky+os*n2, ks*2,os)
    for i = 0 to n2-1
      c = cos(i * 2*pi/n)
      s = sin(i * flag * 2*pi/n)
      k1 = ky+os*(i)
      k2 = ky+os*(i+n2)
      ar = yr(k1)
      ai = yi(k1)
      br = c * yr(k2) - s * yi(k2)
      bi = c * yi(k2) + s * yr(k2)
      yr(k1) = ar + br
      yi(k1) = ai + bi
      yr(k2) = ar - br
      yi(k2) = ai - bi
    next i
  endif
end sub : rem end rec_fft

If the above code formats wrong (old-fashioned Basic is line-break sensitive), a copy is here:

http://www.nicholson.com/dsp.fft1.html

IMHO. YMMV.
--
rhn A.T nicholson d.0.t C-o-M
Reply by glen herrmannsfeldt November 27, 2007
Steven G. Johnson wrote:
(snip regarding bit-reversal in FFTs)

> The most well-known approach is to do the re-ordering all at once,
> with a single bit-reversal pass (or, more generally, a mixed-base
> digit reversal). Another approach that still works in-place is to
> arrange the sequence of radices so that one can do a sequence of small
> transpositions interleaved with the transform (alternating DIT/DIF
> stages work well here). The simplest approaches work out-of-place.
> e.g. the Stockham algorithm goes back and forth between two arrays
> with each pass, transposing one digit at a time.
I still remember a Fortran FFT subroutine from long ago that did bit reversal using nested DO loops. Depending on N (the number of points) it would (illegally) GOTO the appropriate nested DO and (maybe) GOTO out at the appropriate point. Not very portable, but it must have worked on at least one machine.

-- glen
Reply by SteveSmith November 27, 2007
Hi Richard,
I haven't seen these alternate FFTs in years, but I can look around if you have a need. However, they are just as cryptic and difficult to understand as the conventional ones. Also, I doubt that you could modify one to use square waves. It sounds like you should be using the "DFT by Correlation" method. See page 157 in Chapter 8 of my on-line book (www.DSPguide.com). This can directly be modified to use square waves. Good luck and let me know if I can help.
Regards,
Steve

Steve.Smith1@SpectrumSDI.com