>I've written a relatively efficient radix 2 fft engine for a commercial
>voice quality measurement product (while I was working) which largely
>modelled the one in Numerical Recipe.

I've never seen a radix-2 FFT code that was efficient...

>I did some research.  Apparently, most people find integrating the fft in
>Intel's library in general does indeed give 5 times of speed enhancement.

It's not just Intel's library.  I've been benchmarking FFTs for seven
years now, and the Numerical Recipes code (along with other textbook
radix-2 codes) has always been on the order of 5 times slower than the
fastest implementations, on essentially every modern CPU.  (See
www.fftw.org/speed ...look at the single-precision graphs to find
Numerical Recipes.)

In order to fully exploit the pipelines, register sets, and caches of
CPUs from the last decade, you need to have a larger kernel (among other
things) than the simple 2-point butterfly in a classic radix-2
algorithm.  Unfortunately, Numerical Recipes' discussion of the benefits
of larger radices, etc., is 20 years out of date (they claim that you
can only get a 20% improvement, which is true only if the speed were
determined entirely by the flop count ... that would make too much sense
for machines these days).

Architecture-specific hacks like SSE2 help Intel's library as well, of
course (as well as FFTW, which also uses SIMD in the latest release),
but this is only another factor of ~2 (in double precision).

>I wonder if anyone see the same amount of speed performance using Intel's
>library with AMD's processor (which is also MMX compatible).

See the above URL for benchmarks on AMD chips.

Cordiall,
Steven G. Johnson