> I seem to recall that the Pentium M's SSE2 support was basically for
> compatibility only ...
Thanks.
I already googled, and didn't find much information about this.
This is disappointing.
I tried another computer, and I now see a significant performance
improvement.
Reply by Steven G. Johnson ● May 14, 2007
On May 14, 6:59 pm, Sven Köhler wrote:
> But there's absolutely
> no difference between -onosimd and without.
>
> Very strange - I'm starting to believe that my CPU is silly.
>
> model name : Intel(R) Pentium(R) M processor 1.73GHz
I seem to recall that the Pentium M's SSE2 support was basically for
compatibility only ... that the SSE2 floating-point instructions in
Pentium M take twice as long as a single scalar floating-point
instruction, meaning that they have no net performance advantage. If
my memory is correct, that would certainly explain what you are
seeing.
Try it on a Pentium IV or a more recent CPU like a Core Duo, and the
results should be better.
Regards,
Steven G. Johnson
Reply by ● May 14, 2007
> When you measure performance, you should be excluding the planning
> time. You can also use FFTW's tests/bench program, e.g.
> tests/bench -oestimate i1024
> tests/bench -oestimate -onosimd i1024
> will benchmark an in-place N=1024 transform, in FFTW_ESTIMATE mode,
> with and without SIMD. Using tests/bench is a good idea, to make sure
> you didn't accidentally mess up your benchmarking.
OK, so here we go.
I have now compiled FFTW3 manually. Here are some results:
# ./bench --info-all
(name "fftw3")
(version "fftw-3.1.2-sse2")
(cc "gcc -std=gnu99 -O3 -fomit-frame-pointer -malign-double
-fstrict-aliasing -ffast-math -march=pentium-m")
(codelet-optim "-O -fno-schedule-insns -fno-web -fno-loop-optimize
--param inline-unit-growth=1000 --param large-function-growth=1000")
(benchmark-precision "double")
So SSE2 is enabled.
# ./bench i1024
Problem: i1024, setup: 259.48 ms, time: 31.00 us, ``mflops'': 1651.6
# ./bench -onosimd i1024
Problem: i1024, setup: 156.76 ms, time: 32.25 us, ``mflops'': 1587.6
# ./bench -oestimate i1024
Problem: i1024, setup: 46.00 us, time: 37.25 us, ``mflops'': 1374.5
# ./bench -oestimate -onosimd i1024
Problem: i1024, setup: 46.00 us, time: 36.75 us, ``mflops'': 1393.2
So what about these results?
Without -oestimate, FFTW is faster. That's OK. But there's absolutely
no difference between -onosimd and without.
Very strange - I'm starting to believe that my CPU is silly.
Any ideas now?
# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 13
model name : Intel(R) Pentium(R) M processor 1.73GHz
stepping : 8
cpu MHz : 1733.000
cache size : 2048 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss tm pbe nx est tm2
bogomips : 3460.37
clflush size : 64
Reply by ● May 14, 2007
> Is FFTW_ESTIMATE important? Because i don't pass it when i create my
> plans. I thought, it would only be necessary to avoid overwriting of the
> arrays. But i don't care if they are overwritten at the time i create my
> plans.
OK. I didn't pass FFTW_MEASURE or FFTW_ESTIMATE, and FFTW_MEASURE seems
to be the default.
Now that I'm passing FFTW_ESTIMATE, creating plans is much faster.
>> You can also try calling fftw_print_plan, which will output the
>> algorithm being used. For SIMD, the codelet names (in quotation
>> marks) should have a "v" in them (for "vectorized"). For example, for
>> N=1024 in-place transforms on my machine I get:
>> (dft-ct-dit/32
>> (dftw-direct-32/8 "t3fv_32")
>> (dft-directbuf/34-32-x32 "n2fv_32"))
>> Note the "v" in "t3fv_32" (a radix-32 codelet) and "n2fv_32" (a
>> size-32 hard-coded FFT).
>
> Interesting! fftw_print_plan is very nice for debugging purposes!
> Exactly what I was looking for!
So here are some plans:
creating dft plan for N=65536, flags=131136
(dft-ct-dit/8
(dftw-direct-8/6 "t2_8")
(dft-ctsq-dif/8 "q1_8"
(dft-vrank>=1-x8/1
(dft-vrank>=1-x8/1
(dft-ct-dit/32
(dftw-direct-32/8 "t2_32")
(dft-directbuf/34-32-x32 "n1_32"))))))
creating dft plan for N=65536, flags=64
(dft-ct-dit/8
(dftw-direct-8/6 "t3fv_8")
(dft-ctsq-dif/8 "q1fv_8"
(dft-vrank>=1-x8/1
(dft-vrank>=1-x8/1
(dft-ct-dit/32
(dftw-direct-32/8 "t3fv_32")
(dft-directbuf/34-32-x32 "n1fv_32"))))))
As you can see, everything seems to work correctly.
The first plan uses routines without v, the second one with v.
The only thing that's not correct is the runtime :-)
SSE2 still doesn't make a difference - very strange.
Reply by ● May 14, 2007
>> I'm using FFTW. But i cannot measure any difference between runs with
>> FFTW_NO_SIMD and without.
>>
>> I'm trying to do in-place DFT, with plans created like this:
>>
>> fftw_complex *buf0 = fftw_malloc(...)
>> fftw_plan_dft_1d(N, buf0, buf0, FFTW_FORWARD, 0)
>>
>> FFTW has been compiled with SSE and SSE2 support. I'm using
>> double-precision.
>
> Note that you cannot compile FFTW with SSE *and* SSE2, since the
> former is only for single precision and the latter is only for double
> precision. Given an FFTW properly compiled for SSE2, and a CPU that
> supports SSE2 (i.e. Pentium IV or later), you should see a significant
> speedup from SIMD for almost all N, for both in and out-of-place
> transforms.
OK, but on my system i have all 3 FFTW variants:
libfftw3f.so for single precision (float)
libfftw3.so for double precision (double)
libfftw3l.so for long double precision
As far as I can see, libfftw3f.so is compiled with SSE support
and libfftw3.so is compiled with SSE2 support.
And /proc/cpuinfo states that my CPU supports SSE2.
(It's a Centrino Pentium M notebook.)
> This leads me to suspect that you miscompiled FFTW somehow. What
> operating system are you running? Did you use the standard configure/
> Makefile scripts? i.e.
I'm using Gentoo Linux's FFTW package.
> ./configure --enable-sse2
> This should result in a config.h file that includes the line
> #define HAVE_SSE2 1
> unless your compiler is so old that it doesn't support SSE2
> intrinsics.
I looked at the parameters passed to FFTW's configure.
They build all 3 variants of FFTW, and they pass "--enable-sse2" to
FFTW's configure when compiling the double-precision library.
> When you measure performance, you should be excluding the planning
> time. You can also use FFTW's tests/bench program, e.g.
> tests/bench -oestimate i1024
> tests/bench -oestimate -onosimd i1024
> will benchmark an in-place N=1024 transform, in FFTW_ESTIMATE mode,
> with and without SIMD. Using tests/bench is a good idea, to make sure
> you didn't accidentally mess up your benchmarking.
I will try this.
Is FFTW_ESTIMATE important? Because i don't pass it when i create my
plans. I thought, it would only be necessary to avoid overwriting of the
arrays. But i don't care if they are overwritten at the time i create my
plans.
> You can also try calling fftw_print_plan, which will output the
> algorithm being used. For SIMD, the codelet names (in quotation
> marks) should have a "v" in them (for "vectorized"). For example, for
> N=1024 in-place transforms on my machine I get:
> (dft-ct-dit/32
> (dftw-direct-32/8 "t3fv_32")
> (dft-directbuf/34-32-x32 "n2fv_32"))
> Note the "v" in "t3fv_32" (a radix-32 codelet) and "n2fv_32" (a
> size-32 hard-coded FFT).
Interesting! fftw_print_plan is very nice for debugging purposes!
Exactly what I was looking for!
Thanks for your help!
Regards,
Sven
Reply by Steven G. Johnson ● May 14, 2007
On May 13, 9:56 pm, Sven Köhler <remove-for-no-spam-skoeh...@upb.de>
wrote:
> I'm using FFTW. But i cannot measure any difference between runs with
> FFTW_NO_SIMD and without.
>
> I'm trying to do in-place DFT, with plans created like this:
>
> fftw_complex *buf0 = fftw_malloc(...)
> fftw_plan_dft_1d(N, buf0, buf0, FFTW_FORWARD, 0)
>
> FFTW has been compiled with SSE and SSE2 support. I'm using
> double-precision.
Note that you cannot compile FFTW with SSE *and* SSE2, since the
former is only for single precision and the latter is only for double
precision. Given an FFTW properly compiled for SSE2, and a CPU that
supports SSE2 (i.e. Pentium IV or later), you should see a significant
speedup from SIMD for almost all N, for both in and out-of-place
transforms.
This leads me to suspect that you miscompiled FFTW somehow. What
operating system are you running? Did you use the standard configure/
Makefile scripts? i.e.
./configure --enable-sse2
This should result in a config.h file that includes the line
#define HAVE_SSE2 1
unless your compiler is so old that it doesn't support SSE2
intrinsics.
If you are using Windows, I strongly recommend using one of the
precompiled libraries from www.fftw.org/windows.html
When you measure performance, you should be excluding the planning
time. You can also use FFTW's tests/bench program, e.g.
tests/bench -oestimate i1024
tests/bench -oestimate -onosimd i1024
will benchmark an in-place N=1024 transform, in FFTW_ESTIMATE mode,
with and without SIMD. Using tests/bench is a good idea, to make sure
you didn't accidentally mess up your benchmarking.
You can also try calling fftw_print_plan, which will output the
algorithm being used. For SIMD, the codelet names (in quotation
marks) should have a "v" in them (for "vectorized"). For example, for
N=1024 in-place transforms on my machine I get:
(dft-ct-dit/32
(dftw-direct-32/8 "t3fv_32")
(dft-directbuf/34-32-x32 "n2fv_32"))
Note the "v" in "t3fv_32" (a radix-32 codelet) and "n2fv_32" (a
size-32 hard-coded FFT).
Regards,
Steven G. Johnson
Reply by ● May 14, 2007
Hi,
Sven Köhler wrote:
> any FFTW experts reading this? :-)
>
> I'm using FFTW. But i cannot measure any difference between runs with
> FFTW_NO_SIMD and without.
Most likely SIMD is not being used at all.
Maybe your function is not yet implemented with SSE2, or your CPU does
not support SSE2.
> I'm trying to do in-place DFT, with plans created like this:
In-place may be part of the problem.
> FFTW has been compiled with SSE and SSE2 support. I'm using
> double-precision.
Double precision, too.
> Any ideas how I can track the problem down?
- Test without in-place.
- Test with single precision.
- Check whether your CPU supports SSE2.
- Check the FFTW release notes for information on the x86 assembly
language implementations and their restrictions.
- Check the FFTW source to see whether there is an SSE2 implementation
of your core function.
Marcel
Reply by ● May 13, 2007
Hi,
any FFTW experts reading this? :-)
I'm using FFTW. But i cannot measure any difference between runs with
FFTW_NO_SIMD and without.
I'm trying to do in-place DFT, with plans created like this:
fftw_complex *buf0 = fftw_malloc(...)
fftw_plan_dft_1d(N, buf0, buf0, FFTW_FORWARD, 0)
FFTW has been compiled with SSE and SSE2 support. I'm using
double-precision.
Any ideas how I can track the problem down?
Regards,
Sven