We are pleased to announce the availability of FFTW 3.1, which you can download from the web page. http://fftw.org/download.html In addition, we have updated our fft benchmark pages with results from recent machines. Cordially, Matteo Frigo Steven G. Johnson Changes since FFTW 3.0.1: * Faster FFTW_ESTIMATE planner. * New (faster) algorithm for REDFT00/RODFT00 (type-I DCT/DST) of odd size. * "4-step" algorithm for faster FFTs of very large sizes (> 2^18). * Faster in-place real-data DFTs (for R2HC and HC2R r2r formats). * Faster in-place non-square transpositions (FFTW uses these internally for in-place FFTs, and you can also perform them explicitly using the guru interface). * Faster prime-size DFTs: implemented Bluestein's algorithm, as well as a zero-padded Rader variant to limit recursive use of Rader's algorithm. * SIMD support for split complex arrays. * Much faster Altivec/VMX performance. * New fftw_set_timelimit function to specify a (rough) upper bound to the planning time (does not affect ESTIMATE mode). * Removed --enable-3dnow support; use --enable-k7 instead. * FMA (fused multiply-add) version is now included in "standard" FFTW, and is enabled with --enable-fma (the default on PowerPC and Itanium). * Automatic detection of native architecture flag for gcc. New configure options: --enable-portable-binary and --with-gcc-arch=<arch>, for people distributing compiled binaries of FFTW (see manual). * Automatic detection of Altivec under Linux with gcc 3.4 (so that same binary should work on both Altivec and non-Altivec PowerPCs). * Compiler-specific tweaks/flags/workarounds for gcc 3.4, xlc, HP/UX, Solaris/Intel. * Various documentation clarifications. * 64-bit clean. (Fixes a bug affecting the split guru planner on 64-bit machines, reported by David Necas.) * Fixed Debian bug #259612: inadvertent use of SSE instructions on non-SSE machines (causing a crash) for --enable-sse binaries. * Fixed bug that caused HC2R transforms to destroy the input in certain cases, even if the user specified FFTW_PRESERVE_INPUT. * Fixed bug where wisdom would be lost under rare circumstances, causing excessive planning time. * FAQ notes bug in gcc-3.4.[1-3] that causes FFTW to crash with SSE/SSE2. * Fixed accidentally exported symbol that prohibited simultaneous linking to double/single multithreaded FFTW (thanks to Alessio Massaro). * Support Win32 threads under MinGW (thanks to Alessio Massaro). * Fixed problem with building DLL under Cygwin; thanks to Stephane Fillod. * Fix build failure if no Fortran compiler is found (thanks to Charles Radley for the bug report). * Fixed compilation failure with icc 8.0 and SSE/SSE2. Automatic detection of icc architecture flag (e.g. -xW). * Fixed compilation with OpenMP on AIX (thanks to Greg Bauer). * Fixed compilation failure on x86-64 with gcc (thanks to Orion Poplawski). * Incorporated patch from FreeBSD ports (FreeBSD does not have memalign, but its malloc is 16-byte aligned). * Cycle-counter compilation fixes for Itanium, Alpha, x86-64, Sparc, MacOS (thanks to Matt Boman, John Bowman, and James A. Treacy for reports/fixes). Added x86-64 cycle counter for PGI compilers, courtesy Cristiano Calonaci. * Fix compilation problem in test program due to C99 conflict. * Portability fix for import_system_wisdom with djgpp (thanks to Juan Manuel Guerrero). * Fixed compilation failure on MacOS 10.3 due to getopt conflict. * Work around Visual C++ (version 6/7) bug in SSE compilation; thanks to Eddie Yee for his detailed report.

# FFTW 3.1 available, new benchmarks

Started by ●February 2, 2006

Reply by ●February 3, 20062006-02-03

"Matteo Frigo" <athena@fftw.org> wrote in message news:85k6cd2jac.fsf@fftw.org...> * Faster prime-size DFTs: implemented Bluestein's algorithm, as well > as a zero-padded Rader variant to limit recursive use of Rader'salgorithm. Do you have before & after operation counts for e.g. p = 43 or p = 59? -- write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, & 6.0134700243160014d-154/),(/'x'/)); end

Reply by ●February 3, 20062006-02-03

Hi James, I suspect that you're barking up the wrong tree. That release note referred to generic routines for large prime sizes, which should not really be comparable to the hard-coded low-arithmetic kernels for small primes that I know you've been working on. In particular, we use Bluestein's or Rader's method to express a large prime-size FFT as a couple of different-size FFTs (a convolution) plus some extra operations. In this way, we achieve O(n log n) complexity, but I doubt we get anything like the absolute theoretical minimum number of operations (whatever that is) for those prime sizes. Nor should we --- the tricks to achieve minimal or near-minimal operation counts for prime sizes, as far as I can tell, are mostly suitable for hard-coded transforms and are not practical to perform on the fly. (For p=43 or p=59, on my machine, FFTW doesn't even use an O(n log n) algorithm...the O(n^2) algorithm turns out to be faster. Not because it has less arithmetic, but because its simplicity and tight loops mean that its efficiency gains apparently outweigh its arithmetic costs for small primes in the absence of a hard-coded routine for those sizes.) That being said, FFTW *will* give you an exact count of the arithmetic operations if you ask, via the fftw_flops function. (If you want to force it to use as few arithmetic operations as possible, you can can plan in FFTW_ESTIMATE mode and edit the estimate_cost function in kernel/planner.c to not include "ops.other" in the heuristic cost.) For size 37, our general-n Rader's algorithm gives 1672 flops. Your routine on http://home.comcast.net/~kmbtib/ lists 1092 flops. For size 43, our routine reports 2456 flops for Rader. For size 59, it reports 6740 flops for Rader. The Bluestein and zero-padded Rader in 3.1 do not reduce the flop counts for these sizes. Steven PS. Don't enable SSE/SSE2 if you want to get the operation counts, since it currently counts SIMD operations in that case rather than scalar operations.

Reply by ●February 3, 20062006-02-03

<stevenj@alum.mit.edu> wrote in message news:1138992155.258156.39210@g47g2000cwa.googlegroups.com...> That being said, FFTW *will* give you an exact count of the arithmetic > operations if you ask, via the fftw_flops function. (If you want to > force it to use as few arithmetic operations as possible, you can can > plan in FFTW_ESTIMATE mode and edit the estimate_cost function in > kernel/planner.c to not include "ops.other" in the heuristic cost.) For > size 37, our general-n Rader's algorithm gives 1672 flops. Your > routine on http://home.comcast.net/~kmbtib/ lists 1092 flops. For size > 43, our routine reports 2456 flops for Rader. For size 59, it reports > 6740 flops for Rader. The Bluestein and zero-padded Rader in 3.1 do not > reduce the flop counts for these sizes.Thanks. I was curious about 43 and 59 because of their reduction via the Rader algorithm to Sophie Germain primes 23 and 29. I guess I forgot that I was more curious about 47 than 43 in this context. For 47 and 59 I used incompletely zero-padded Rader in http://home.comcast.net/~kmbtib/codelets.ZIP to obtain 2236 flops and 3442 flops respectively. Also I got 1720 flops for p = 43. The flop counts for 43, 47, and 59 in my program are, I think, not optimal. I was hoping that the extra techniques you were employing would already help some for 47 and 59 because fully recursive Rader seems to be so far off the mark for these cases, but you seem to have shown me that I was wrong in this assumption. Incompletely zero-padded Rader looks like it would be a pain to implement in a code generator because of all the extra choices that are there to be examined. All this prime-order stuff is probably more fun for you and me than the rest of the readership, but for an FFT package implementor it's potentially less effort to provide O(N*log(N)) algorithms for all orders than it might be to answer all the questions from possibly naive users about why the package can't perform an FFT of the order they want. -- write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, & 6.0134700243160014d-154/),(/'x'/)); end