Reply by May 14, 2007
> I seem to recall that the Pentium M's SSE2 support was basically for
> compatibility only ...

Thanks. I had already googled and didn't find much information about this. This is disappointing. I tried another computer, and I now see a significant performance improvement.
Reply by Steven G. Johnson May 14, 2007
On May 14, 6:59 pm, Sven Köhler wrote:
> But there's absolutely
> no difference between -onosimd and without.
>
> Very strange - i start to believe, that my CPU is silly.
>
> model name : Intel(R) Pentium(R) M processor 1.73GHz

I seem to recall that the Pentium M's SSE2 support was basically for compatibility only ... that the SSE2 floating-point instructions in Pentium M take twice as long as a single scalar floating-point instruction, meaning that they have no net performance advantage. If my memory is correct, that would certainly explain what you are seeing. Try it on a Pentium IV or a more recent CPU like a Core Duo, and the results should be better.

Regards,
Steven G. Johnson
Reply by May 14, 2007
> When you measure performance, you should be excluding the planning
> time. You can also use FFTW's tests/bench program, e.g.
> tests/bench -oestimate i1024
> tests/bench -oestimate -onosimd i1024
> will benchmark an in-place N=1024 transform, in FFTW_ESTIMATE mode,
> with and without SIMD. Using tests/bench is a good idea, to make sure
> you didn't accidentally mess up your benchmarking.
OK, so here we go. I compiled FFTW3 manually now. Here are some results:

# ./bench --info-all
(name "fftw3")
(version "fftw-3.1.2-sse2")
(cc "gcc -std=gnu99 -O3 -fomit-frame-pointer -malign-double -fstrict-aliasing -ffast-math -march=pentium-m")
(codelet-optim "-O -fno-schedule-insns -fno-web -fno-loop-optimize --param inline-unit-growth=1000 --param large-function-growth=1000")
(benchmark-precision "double")

So SSE2 is enabled.

# ./bench i1024
Problem: i1024, setup: 259.48 ms, time: 31.00 us, ``mflops'': 1651.6
# ./bench -onosimd i1024
Problem: i1024, setup: 156.76 ms, time: 32.25 us, ``mflops'': 1587.6
# ./bench -oestimate i1024
Problem: i1024, setup: 46.00 us, time: 37.25 us, ``mflops'': 1374.5
# ./bench -oestimate -onosimd i1024
Problem: i1024, setup: 46.00 us, time: 36.75 us, ``mflops'': 1393.2

So what about these results? Without -oestimate, FFTW is faster. That's OK. But there's absolutely no difference between -onosimd and without. Very strange - I'm starting to believe that my CPU is silly. Any ideas now?

# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 13
model name : Intel(R) Pentium(R) M processor 1.73GHz
stepping : 8
cpu MHz : 1733.000
cache size : 2048 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss tm pbe nx est tm2
bogomips : 3460.37
clflush size : 64
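To put a number on "no difference", the bench output above can be checked by hand. The sketch below assumes the benchFFT convention that the ``mflops'' figure for a complex transform is 5 N log2(N) divided by the time in microseconds; the function names are hypothetical helpers for illustration, not part of FFTW, and the size is assumed to be a power of two so no math library is needed.

```c
#include <assert.h>

/* Integer log2 for a power-of-two size (assumption: n is a power of two). */
unsigned ilog2_pow2(unsigned n)
{
    unsigned k = 0;
    while (n > 1) {
        n >>= 1;
        ++k;
    }
    return k;
}

/* benchFFT-style "mflops" for a complex transform of size n:
 * 5 * n * log2(n) / (time in microseconds). This is a scaled
 * speed figure, not an exact count of floating-point operations. */
double fft_mflops(unsigned n, double time_us)
{
    assert(n >= 2 && (n & (n - 1)) == 0);  /* power of two only */
    assert(time_us > 0.0);
    return 5.0 * (double)n * (double)ilog2_pow2(n) / time_us;
}

/* Speedup of the SIMD run relative to the -onosimd run (1.0 = no gain). */
double simd_speedup(double time_simd_us, double time_nosimd_us)
{
    assert(time_simd_us > 0.0 && time_nosimd_us > 0.0);
    return time_nosimd_us / time_simd_us;
}
```

With the timings posted above, fft_mflops(1024, 31.00) reproduces the 1651.6 figure, and simd_speedup(31.00, 32.25) comes out at about 1.04, i.e. a gain of roughly 4% on this Pentium M in the measured mode.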
Reply by May 14, 2007
> Is FFTW_ESTIMATE important? Because i don't pass it when i create my
> plans. I thought, it would only be necessary to avoid overwriting of the
> arrays. But i don't care if they are overwritten at the time i create my
> plans.

OK. I didn't pass FFTW_MEASURE or FFTW_ESTIMATE, and FFTW_MEASURE seems to be the default. Now that I'm passing FFTW_ESTIMATE, creating plans is much faster.
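The flags values printed with the plans in this post (131136 and 64) can be decoded as bit masks. The sketch below assumes the values FFTW_MEASURE = 0, FFTW_ESTIMATE = 1<<6 and FFTW_NO_SIMD = 1<<17 as found in fftw3.h; the constants are mirrored under hypothetical SKETCH_ names so the snippet compiles without the FFTW headers, so please check them against your installed header.

```c
#include <assert.h>

/* Assumed to mirror fftw3.h; verify against your installed header. */
#define SKETCH_FFTW_MEASURE  0u          /* passing 0 means FFTW_MEASURE */
#define SKETCH_FFTW_ESTIMATE (1u << 6)   /* 64 */
#define SKETCH_FFTW_NO_SIMD  (1u << 17)  /* 131072 */

/* Nonzero if the planner was asked to estimate instead of measure. */
int flags_request_estimate(unsigned flags)
{
    return (flags & SKETCH_FFTW_ESTIMATE) != 0;
}

/* Nonzero if SIMD codelets were explicitly disabled. */
int flags_disable_simd(unsigned flags)
{
    return (flags & SKETCH_FFTW_NO_SIMD) != 0;
}

/* Combined flag word for an estimated plan, with or without SIMD. */
unsigned make_flags(int estimate, int no_simd)
{
    unsigned flags = SKETCH_FFTW_MEASURE;
    assert((estimate == 0 || estimate == 1) && (no_simd == 0 || no_simd == 1));
    if (estimate)
        flags |= SKETCH_FFTW_ESTIMATE;
    if (no_simd)
        flags |= SKETCH_FFTW_NO_SIMD;
    return flags;
}
```

Under these assumptions, 131136 is FFTW_NO_SIMD | FFTW_ESTIMATE and 64 is FFTW_ESTIMATE alone, which matches the two plans being compared. It also explains why passing 0 as the flags argument behaves like FFTW_MEASURE.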
>> You can also try calling fftw_print_plan, which will output the
>> algorithm being used. For SIMD, the codelet names (in quotation
>> marks) should have a "v" in them (for "vectorized"). For example, for
>> N=1024 in-place transforms on my machine I get:
>> (dft-ct-dit/32
>>   (dftw-direct-32/8 "t3fv_32")
>>   (dft-directbuf/34-32-x32 "n2fv_32"))
>> Note the "v" in "t3fv_32" (a radix-32 codelet) and "n2fv_32" (a
>> size-32 hard-coded FFT).
>
> Interesting! fftw_print_plan is very nice for debugging purposes!
> Exactly what i was looking for!

So here are some plans:

creating dft plan for N=65536, flags=131136
(dft-ct-dit/8
  (dftw-direct-8/6 "t2_8")
  (dft-ctsq-dif/8 "q1_8"
    (dft-vrank>=1-x8/1
      (dft-vrank>=1-x8/1
        (dft-ct-dit/32
          (dftw-direct-32/8 "t2_32")
          (dft-directbuf/34-32-x32 "n1_32"))))))

creating dft plan for N=65536, flags=64
(dft-ct-dit/8
  (dftw-direct-8/6 "t3fv_8")
  (dft-ctsq-dif/8 "q1fv_8"
    (dft-vrank>=1-x8/1
      (dft-vrank>=1-x8/1
        (dft-ct-dit/32
          (dftw-direct-32/8 "t3fv_32")
          (dft-directbuf/34-32-x32 "n1fv_32"))))))

As you can see, everything seems to work correctly. The first plan uses routines without a "v", the second one uses routines with a "v". The only thing that's not correct is the runtime :-) SSE2 still doesn't make a difference - very strange.
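The check done by eye above can also be automated on the text that fftw_print_plan writes. The following is only a heuristic sketch based on the rule quoted in this thread (a quoted codelet name with a "v" before the underscore is a SIMD codelet); the function names are hypothetical and the plan text is passed in as a plain string, so nothing here calls into FFTW.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if a codelet name such as t3fv_32 has a 'v' before the '_'. */
int codelet_is_vectorized(const char *name)
{
    assert(name != NULL);
    for (; *name != '\0' && *name != '_'; ++name) {
        if (*name == 'v')
            return 1;
    }
    return 0;
}

/* Walks the quoted names in a printed plan. Counts all of them, or only
 * the vectorized ones when only_vectorized is nonzero. Unquoted solver
 * names (e.g. dft-vrank>=1-x8/1) are ignored. */
int count_codelets(const char *plan, int only_vectorized)
{
    int count = 0;
    assert(plan != NULL);
    while ((plan = strchr(plan, '"')) != NULL) {
        const char *start = plan + 1;
        const char *end = strchr(start, '"');
        char name[64];
        size_t len;
        if (end == NULL)
            break;                      /* unbalanced quote: stop scanning */
        len = (size_t)(end - start);
        if (len < sizeof name) {        /* skip names that do not fit */
            memcpy(name, start, len);
            name[len] = '\0';
            if (!only_vectorized || codelet_is_vectorized(name))
                ++count;
        }
        plan = end + 1;
    }
    return count;
}
```

For the two plans above, the first one would report 4 codelets with 0 vectorized and the second one 4 codelets with 4 vectorized, which confirms that the SIMD codelets are selected even though the timings do not improve.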
Reply by May 14, 2007
>> I'm using FFTW. But i cannot measure any difference between runs with
>> FFTW_NO_SIMD and without.
>>
>> I'm trying to do in-place DFT, with plans created like this:
>>
>> fftw_complex *buf0 = fftw_malloc(...)
>> fftw_plan_dft_1d(N, buf0, buf0, FFTW_FORWARD, 0)
>>
>> FFTW has been compiled with SSE and SSE2 support. I'm using
>> double-precision.
>
> Note that you cannot compile FFTW with SSE *and* SSE2, since the
> former is only for single precision and the latter is only for double
> precision. Given an FFTW properly compiled for SSE2, and a CPU that
> supports SSE2 (i.e. Pentium IV or later), you should see a significant
> speedup from SIMD for almost all N, for both in and out-of-place
> transforms.

OK, but on my system I have all 3 FFTW variants:

libfftw3f.so for single precision (float)
libfftw3.so for double precision (double)
libfftw3l.so for long double precision

As far as I can see, libfftw3f.so is compiled with SSE support and libfftw3.so is compiled with SSE2 support. And /proc/cpuinfo states that my CPU supports SSE2. (It's a Centrino Pentium M notebook.)

> This leads me to suspect that you miscompiled FFTW somehow. What
> operating system are you running? Did you use the standard configure/
> Makefile scripts? i.e.
I'm using Gentoo Linux's FFTW package.
> ./configure --enable-sse2
> This should result in a config.h file that includes the line
> #define HAVE_SSE2 1
> unless your compiler is so old that it doesn't support SSE2
> intrinsics.
I looked at the parameters passed to FFTW's configure. They build all 3 variants of FFTW, and they pass "--enable-sse2" to FFTW's configure when compiling the double-precision library.
> When you measure performance, you should be excluding the planning
> time. You can also use FFTW's tests/bench program, e.g.
> tests/bench -oestimate i1024
> tests/bench -oestimate -onosimd i1024
> will benchmark an in-place N=1024 transform, in FFTW_ESTIMATE mode,
> with and without SIMD. Using tests/bench is a good idea, to make sure
> you didn't accidentally mess up your benchmarking.

I will try this. Is FFTW_ESTIMATE important? I don't pass it when I create my plans. I thought it would only be necessary to avoid overwriting the arrays, but I don't care if they are overwritten at the time I create my plans.

> You can also try calling fftw_print_plan, which will output the
> algorithm being used. For SIMD, the codelet names (in quotation
> marks) should have a "v" in them (for "vectorized"). For example, for
> N=1024 in-place transforms on my machine I get:
> (dft-ct-dit/32
>   (dftw-direct-32/8 "t3fv_32")
>   (dft-directbuf/34-32-x32 "n2fv_32"))
> Note the "v" in "t3fv_32" (a radix-32 codelet) and "n2fv_32" (a
> size-32 hard-coded FFT).

Interesting! fftw_print_plan is very nice for debugging purposes! Exactly what I was looking for! Thanks for your help!

Regards,
Sven
Reply by Steven G. Johnson May 14, 2007
On May 13, 9:56 pm, Sven Köhler <remove-for-no-spam-skoeh...@upb.de>
wrote:
> I'm using FFTW. But i cannot measure any difference between runs with
> FFTW_NO_SIMD and without.
>
> I'm trying to do in-place DFT, with plans created like this:
>
> fftw_complex *buf0 = fftw_malloc(...)
> fftw_plan_dft_1d(N, buf0, buf0, FFTW_FORWARD, 0)
>
> FFTW has been compiled with SSE and SSE2 support. I'm using
> double-precision.

Note that you cannot compile FFTW with SSE *and* SSE2, since the former is only for single precision and the latter is only for double precision. Given an FFTW properly compiled for SSE2, and a CPU that supports SSE2 (i.e. Pentium IV or later), you should see a significant speedup from SIMD for almost all N, for both in and out-of-place transforms.

This leads me to suspect that you miscompiled FFTW somehow. What operating system are you running? Did you use the standard configure/Makefile scripts? i.e.

./configure --enable-sse2

This should result in a config.h file that includes the line

#define HAVE_SSE2 1

unless your compiler is so old that it doesn't support SSE2 intrinsics. If you are using Windows, I strongly recommend using one of the precompiled libraries from www.fftw.org/windows.html

When you measure performance, you should be excluding the planning time. You can also use FFTW's tests/bench program, e.g.

tests/bench -oestimate i1024
tests/bench -oestimate -onosimd i1024

will benchmark an in-place N=1024 transform, in FFTW_ESTIMATE mode, with and without SIMD. Using tests/bench is a good idea, to make sure you didn't accidentally mess up your benchmarking.

You can also try calling fftw_print_plan, which will output the algorithm being used. For SIMD, the codelet names (in quotation marks) should have a "v" in them (for "vectorized"). For example, for N=1024 in-place transforms on my machine I get:

(dft-ct-dit/32
  (dftw-direct-32/8 "t3fv_32")
  (dft-directbuf/34-32-x32 "n2fv_32"))

Note the "v" in "t3fv_32" (a radix-32 codelet) and "n2fv_32" (a size-32 hard-coded FFT).

Regards,
Steven G. Johnson
Reply by May 14, 2007
Hi,

Sven Köhler wrote:
> any FFTW experts reading this? :-)
>
> I'm using FFTW. But i cannot measure any difference between runs with
> FFTW_NO_SIMD and without.

Most likely SIMD is not used anyway. Maybe your function is not yet implemented with SSE2, or your CPU does not support SSE2.
> I'm trying to do in-place DFT, with plans created like this:
In-place may be part of the problem.
> FFTW has been compiled with SSE and SSE2 support. I'm using
> double-precision.
Double precision may be part of the problem, too.
> Any ideas how can i can track the problem down?
- Test without in-place.
- Test with single precision.
- Check whether your CPU supports SSE2.
- Check the release notes of FFTW for information on x86 assembly language implementations and their restrictions.
- Look at the FFTW source to see whether there is an SSE2 implementation of your core function.

Marcel
Reply by May 13, 2007
Hi,

any FFTW experts reading this? :-)

I'm using FFTW. But I cannot measure any difference between runs with
FFTW_NO_SIMD and without.

I'm trying to do in-place DFT, with plans created like this:

fftw_complex *buf0 = fftw_malloc(...)
fftw_plan_dft_1d(N, buf0, buf0, FFTW_FORWARD, 0)


FFTW has been compiled with SSE and SSE2 support. I'm using
double-precision.

Any ideas how I can track the problem down?


Regards,
  Sven