Reply by James Van Buskirk February 3, 20062006-02-03
<stevenj@alum.mit.edu> wrote in message
news:1138992155.258156.39210@g47g2000cwa.googlegroups.com...

> That being said, FFTW *will* give you an exact count of the arithmetic > operations if you ask, via the fftw_flops function. (If you want to > force it to use as few arithmetic operations as possible, you can can > plan in FFTW_ESTIMATE mode and edit the estimate_cost function in > kernel/planner.c to not include "ops.other" in the heuristic cost.) For > size 37, our general-n Rader's algorithm gives 1672 flops. Your > routine on http://home.comcast.net/~kmbtib/ lists 1092 flops. For size > 43, our routine reports 2456 flops for Rader. For size 59, it reports > 6740 flops for Rader. The Bluestein and zero-padded Rader in 3.1 do not > reduce the flop counts for these sizes.
Thanks. I was curious about 43 and 59 because of their reduction via the Rader algorithm to Sophie Germain primes 23 and 29. I guess I forgot that I was more curious about 47 than 43 in this context. For 47 and 59 I used incompletely zero-padded Rader in http://home.comcast.net/~kmbtib/codelets.ZIP to obtain 2236 flops and 3442 flops respectively. Also I got 1720 flops for p = 43. The flop counts for 43, 47, and 59 in my program are, I think, not optimal. I was hoping that the extra techniques you were employing would already help some for 47 and 59 because fully recursive Rader seems to be so far off the mark for these cases, but you seem to have shown me that I was wrong in this assumption. Incompletely zero-padded Rader looks like it would be a pain to implement in a code generator because of all the extra choices that are there to be examined. All this prime-order stuff is probably more fun for you and me than the rest of the readership, but for an FFT package implementor it's potentially less effort to provide O(N*log(N)) algorithms for all orders than it might be to answer all the questions from possibly naive users about why the package can't perform an FFT of the order they want. -- write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, & 6.0134700243160014d-154/),(/'x'/)); end
Reply by February 3, 20062006-02-03
Hi James, I suspect that you're barking up the wrong tree.  That
release note referred to generic routines for large prime sizes, which
should not really be comparable to the hard-coded low-arithmetic
kernels for small primes that I know you've been working on.

In particular, we use Bluestein's or Rader's method to express a large
prime-size FFT as a couple of different-size FFTs (a convolution) plus
some extra operations.  In this way, we achieve O(n log n) complexity,
but I doubt we get anything like the absolute theoretical minimum
number of operations (whatever that is) for those prime sizes.  Nor
should we --- the tricks to achieve minimal or near-minimal operation
counts for prime sizes, as far as I can tell, are mostly suitable for
hard-coded transforms and are not practical to perform on the fly.

(For p=43 or p=59, on my machine, FFTW doesn't even use an O(n log n)
algorithm...the O(n^2) algorithm turns out to be faster.  Not because
it has less arithmetic, but because its simplicity and tight loops mean
that its efficiency gains apparently outweigh its arithmetic costs for
small primes in the absence of a hard-coded routine for those sizes.)

That being said, FFTW *will* give you an exact count of the arithmetic
operations if you ask, via the fftw_flops function.  (If you want to
force it to use as few arithmetic operations as possible, you can can
plan in FFTW_ESTIMATE mode and edit the estimate_cost function in
kernel/planner.c to not include "ops.other" in the heuristic cost.) For
size 37, our general-n Rader's algorithm gives 1672 flops.  Your
routine on http://home.comcast.net/~kmbtib/ lists 1092 flops.  For size
43, our routine reports 2456 flops for Rader.  For size 59, it reports
6740 flops for Rader. The Bluestein and zero-padded Rader in 3.1 do not
reduce the flop counts for these sizes.

Steven

PS. Don't enable SSE/SSE2 if you want to get the operation counts,
since it currently counts SIMD operations in that case rather than
scalar operations.

Reply by James Van Buskirk February 3, 20062006-02-03
"Matteo Frigo" <athena@fftw.org> wrote in message
news:85k6cd2jac.fsf@fftw.org...

> * Faster prime-size DFTs: implemented Bluestein's algorithm, as well > as a zero-padded Rader variant to limit recursive use of Rader's
algorithm. Do you have before & after operation counts for e.g. p = 43 or p = 59? -- write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, & 6.0134700243160014d-154/),(/'x'/)); end
Reply by Matteo Frigo February 2, 20062006-02-02
We are pleased to announce the availability of FFTW 3.1, which you can
download from the web page.

       http://fftw.org/download.html

In addition, we have updated our fft benchmark pages with results from
recent machines.

Cordially,
Matteo Frigo
Steven G. Johnson

Changes since FFTW 3.0.1:

* Faster FFTW_ESTIMATE planner.

* New (faster) algorithm for REDFT00/RODFT00 (type-I DCT/DST) of odd size.

* "4-step" algorithm for faster FFTs of very large sizes (> 2^18).

* Faster in-place real-data DFTs (for R2HC and HC2R r2r formats).

* Faster in-place non-square transpositions (FFTW uses these internally
  for in-place FFTs, and you can also perform them explicitly using
  the guru interface).

* Faster prime-size DFTs: implemented Bluestein's algorithm, as well
  as a zero-padded Rader variant to limit recursive use of Rader's algorithm.

* SIMD support for split complex arrays.

* Much faster Altivec/VMX performance.

* New fftw_set_timelimit function to specify a (rough) upper bound to the
  planning time (does not affect ESTIMATE mode).

* Removed --enable-3dnow support; use --enable-k7 instead.

* FMA (fused multiply-add) version is now included in "standard" FFTW,
  and is enabled with --enable-fma (the default on PowerPC and Itanium).

* Automatic detection of native architecture flag for gcc.  New
  configure options: --enable-portable-binary and --with-gcc-arch=<arch>,
  for people distributing compiled binaries of FFTW (see manual).

* Automatic detection of Altivec under Linux with gcc 3.4 (so that
  same binary should work on both Altivec and non-Altivec PowerPCs).

* Compiler-specific tweaks/flags/workarounds for gcc 3.4, xlc, HP/UX,
  Solaris/Intel.

* Various documentation clarifications.

* 64-bit clean.  (Fixes a bug affecting the split guru planner on 
  64-bit machines, reported by David Necas.)

* Fixed Debian bug #259612: inadvertent use of SSE instructions on
  non-SSE machines (causing a crash) for --enable-sse binaries.

* Fixed bug that caused HC2R transforms to destroy the input in
  certain cases, even if the user specified FFTW_PRESERVE_INPUT.

* Fixed bug where wisdom would be lost under rare circumstances,
  causing excessive planning time.

* FAQ notes bug in gcc-3.4.[1-3] that causes FFTW to crash with SSE/SSE2.

* Fixed accidentally exported symbol that prohibited simultaneous
  linking to double/single multithreaded FFTW (thanks to Alessio Massaro).

* Support Win32 threads under MinGW (thanks to Alessio Massaro).

* Fixed problem with building DLL under Cygwin; thanks to Stephane Fillod.

* Fix build failure if no Fortran compiler is found (thanks to Charles
  Radley for the bug report).

* Fixed compilation failure with icc 8.0 and SSE/SSE2.  Automatic
  detection of icc architecture flag (e.g. -xW).

* Fixed compilation with OpenMP on AIX (thanks to Greg Bauer).

* Fixed compilation failure on x86-64 with gcc (thanks to Orion Poplawski).

* Incorporated patch from FreeBSD ports (FreeBSD does not have memalign,
  but its malloc is 16-byte aligned).

* Cycle-counter compilation fixes for Itanium, Alpha, x86-64, Sparc,
  MacOS (thanks to Matt Boman, John Bowman, and James A. Treacy for
  reports/fixes).  Added x86-64 cycle counter for PGI compilers,
  courtesy Cristiano Calonaci.

* Fix compilation problem in test program due to C99 conflict.

* Portability fix for import_system_wisdom with djgpp (thanks to Juan
  Manuel Guerrero).

* Fixed compilation failure on MacOS 10.3 due to getopt conflict.

* Work around Visual C++ (version 6/7) bug in SSE compilation;
  thanks to Eddie Yee for his detailed report.