On May 14, 6:59 pm, Sven Köhler wrote:
> But there's absolutely
> no difference between -onosimd and without.
>
> Very strange - I'm starting to believe that my CPU is silly.
>
> model name      : Intel(R) Pentium(R) M processor 1.73GHz

I seem to recall that the Pentium M's SSE2 support was basically for
compatibility only ... that the SSE2 floating-point instructions in
Pentium M take twice as long as a single scalar floating-point
instruction, meaning that they have no net performance advantage.  If
my memory is correct, that would certainly explain what you are
seeing.

Try it on a Pentium IV or a more recent CPU like a Core Duo, and the
results should be better.

Regards,
Steven G. Johnson

> When you measure performance, you should be excluding the planning
> time.  You can also use FFTW's tests/bench program, e.g.
>        tests/bench -oestimate i1024
>        tests/bench -oestimate -onosimd i1024
> will benchmark an in-place N=1024 transform, in FFTW_ESTIMATE mode,
> with and without SIMD.  Using tests/bench is a good idea, to make sure
> you didn't accidentally mess up your benchmarking.

OK, so here we go.

I compiled FFTW3 manually now. Here are some results:

# ./bench --info-all
(name "fftw3")
(version "fftw-3.1.2-sse2")
(cc "gcc -std=gnu99 -O3 -fomit-frame-pointer -malign-double
-fstrict-aliasing -ffast-math -march=pentium-m")
(codelet-optim "-O -fno-schedule-insns -fno-web -fno-loop-optimize
--param inline-unit-growth=1000 --param large-function-growth=1000")
(benchmark-precision "double")

So SSE2 is enabled.

# ./bench i1024
Problem: i1024, setup: 259.48 ms, time: 31.00 us, ``mflops'': 1651.6
# ./bench -onosimd i1024
Problem: i1024, setup: 156.76 ms, time: 32.25 us, ``mflops'': 1587.6
# ./bench -oestimate i1024
Problem: i1024, setup: 46.00 us, time: 37.25 us, ``mflops'': 1374.5
# ./bench -oestimate -onosimd i1024
Problem: i1024, setup: 46.00 us, time: 36.75 us, ``mflops'': 1393.2

So what about these results?

Without -oestimate, FFTW is faster. That's OK. But there's absolutely
no difference between -onosimd and without.

Very strange - I'm starting to believe that my CPU is silly.


Any ideas now?




# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 13
model name      : Intel(R) Pentium(R) M processor 1.73GHz
stepping        : 8
cpu MHz         : 1733.000
cache size      : 2048 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss tm pbe nx est tm2
bogomips        : 3460.37
clflush size    : 64

> Is FFTW_ESTIMATE important? Because I don't pass it when I create my
> plans. I thought it would only be necessary to avoid overwriting the
> arrays. But I don't care if they are overwritten at the time I create my
> plans.

OK. I didn't pass FFTW_MEASURE or FFTW_ESTIMATE, and FFTW_MEASURE seems
to be the default.

Now that I'm passing FFTW_ESTIMATE, creating plans is much faster.

>> You can also try calling fftw_print_plan, which will output the
>> algorithm being used.  For SIMD, the codelet names (in quotation
>> marks) should have a "v" in them (for "vectorized").  For example, for
>> N=1024 in-place transforms on my machine I get:
>>    (dft-ct-dit/32
>>      (dftw-direct-32/8 "t3fv_32")
>>      (dft-directbuf/34-32-x32 "n2fv_32"))
>> Note the "v" in "t3fv_32" (a radix-32 codelet) and "n2fv_32" (a
>> size-32 hard-coded FFT).
> 
> Interesting! fftw_print_plan is very nice for debugging purposes!
> Exactly what I was looking for!

So here are some plans:


creating dft plan for N=65536, flags=131136
(dft-ct-dit/8
  (dftw-direct-8/6 "t2_8")
  (dft-ctsq-dif/8 "q1_8"
    (dft-vrank>=1-x8/1
      (dft-vrank>=1-x8/1
        (dft-ct-dit/32
          (dftw-direct-32/8 "t2_32")
          (dft-directbuf/34-32-x32 "n1_32"))))))

creating dft plan for N=65536, flags=64
(dft-ct-dit/8
  (dftw-direct-8/6 "t3fv_8")
  (dft-ctsq-dif/8 "q1fv_8"
    (dft-vrank>=1-x8/1
      (dft-vrank>=1-x8/1
        (dft-ct-dit/32
          (dftw-direct-32/8 "t3fv_32")
          (dft-directbuf/34-32-x32 "n1fv_32"))))))


As you can see, everything seems to work correctly.

The first plan uses routines without v, the second one with v.

The only thing that's not correct, is the runtime :-)
SSE2 still doesn't make a difference - very strange.

>> I'm using FFTW. But I cannot measure any difference between runs with
>> FFTW_NO_SIMD and without.
>>
>> I'm trying to do in-place DFT, with plans created like this:
>>
>> fftw_complex *buf0 = fftw_malloc(...)
>> fftw_plan_dft_1d(N, buf0, buf0, FFTW_FORWARD, 0)
>>
>> FFTW has been compiled with SSE and SSE2 support. I'm using
>> double-precision.
> 
> Note that you cannot compile FFTW with SSE *and* SSE2, since the
> former is only for single precision and the latter is only for double
> precision.  Given an FFTW properly compiled for SSE2, and a CPU that
> supports SSE2 (i.e. Pentium IV or later), you should see a significant
> speedup from SIMD for almost all N, for both in and out-of-place
> transforms.

OK, but on my system I have all 3 FFTW variants:
libfftw3f.so for single precision (float)
libfftw3.so for double precision (double)
libfftw3l.so for long double precision

As far as I can see, libfftw3f.so is compiled with SSE support
and libfftw3.so is compiled with SSE2 support.

And /proc/cpuinfo states that my CPU supports SSE2.
(It's a Centrino Pentium M notebook.)

> This leads me to suspect that you miscompiled FFTW somehow.  What
> operating system are you running?  Did you use the standard configure/
> Makefile scripts?  i.e.

I'm using Gentoo Linux's FFTW package.

>        ./configure --enable-sse2
> This should result in a config.h file that includes the line
>        #define HAVE_SSE2 1
> unless your compiler is so old that it doesn't support SSE2
> intrinsics.

I looked at the parameters passed to FFTW's configure.
They build all 3 variants of FFTW, and they pass "--enable-sse2" to
FFTW's configure when compiling the double-precision library.

> When you measure performance, you should be excluding the planning
> time.  You can also use FFTW's tests/bench program, e.g.
>        tests/bench -oestimate i1024
>        tests/bench -oestimate -onosimd i1024
> will benchmark an in-place N=1024 transform, in FFTW_ESTIMATE mode,
> with and without SIMD.  Using tests/bench is a good idea, to make sure
> you didn't accidentally mess up your benchmarking.

I will try this.

Is FFTW_ESTIMATE important? Because I don't pass it when I create my
plans. I thought it would only be necessary to avoid overwriting the
arrays. But I don't care if they are overwritten at the time I create my
plans.

> You can also try calling fftw_print_plan, which will output the
> algorithm being used.  For SIMD, the codelet names (in quotation
> marks) should have a "v" in them (for "vectorized").  For example, for
> N=1024 in-place transforms on my machine I get:
>    (dft-ct-dit/32
>      (dftw-direct-32/8 "t3fv_32")
>      (dft-directbuf/34-32-x32 "n2fv_32"))
> Note the "v" in "t3fv_32" (a radix-32 codelet) and "n2fv_32" (a
> size-32 hard-coded FFT).

Interesting! fftw_print_plan is very nice for debugging purposes!
Exactly what I was looking for!


Thanks for your help!


Regards,
  Sven

On May 13, 9:56 pm, Sven Köhler <remove-for-no-spam-skoeh...@upb.de>
wrote:
> I'm using FFTW. But I cannot measure any difference between runs with
> FFTW_NO_SIMD and without.
>
> I'm trying to do in-place DFT, with plans created like this:
>
> fftw_complex *buf0 = fftw_malloc(...)
> fftw_plan_dft_1d(N, buf0, buf0, FFTW_FORWARD, 0)
>
> FFTW has been compiled with SSE and SSE2 support. I'm using
> double-precision.

Note that you cannot compile FFTW with SSE *and* SSE2, since the
former is only for single precision and the latter is only for double
precision.  Given an FFTW properly compiled for SSE2, and a CPU that
supports SSE2 (i.e. Pentium IV or later), you should see a significant
speedup from SIMD for almost all N, for both in and out-of-place
transforms.

This leads me to suspect that you miscompiled FFTW somehow.  What
operating system are you running?  Did you use the standard configure/
Makefile scripts?  i.e.
       ./configure --enable-sse2
This should result in a config.h file that includes the line
       #define HAVE_SSE2 1
unless your compiler is so old that it doesn't support SSE2
intrinsics.

If you are using Windows, I strongly recommend using one of the
precompiled libraries from www.fftw.org/windows.html

When you measure performance, you should be excluding the planning
time.  You can also use FFTW's tests/bench program, e.g.
       tests/bench -oestimate i1024
       tests/bench -oestimate -onosimd i1024
will benchmark an in-place N=1024 transform, in FFTW_ESTIMATE mode,
with and without SIMD.  Using tests/bench is a good idea, to make sure
you didn't accidentally mess up your benchmarking.

You can also try calling fftw_print_plan, which will output the
algorithm being used.  For SIMD, the codelet names (in quotation
marks) should have a "v" in them (for "vectorized").  For example, for
N=1024 in-place transforms on my machine I get:
   (dft-ct-dit/32
     (dftw-direct-32/8 "t3fv_32")
     (dft-directbuf/34-32-x32 "n2fv_32"))
Note the "v" in "t3fv_32" (a radix-32 codelet) and "n2fv_32" (a
size-32 hard-coded FFT).

Regards,
Steven G. Johnson

Hi,

Sven Köhler wrote:
> any FFTW experts reading this? :-)
> 
> I'm using FFTW. But I cannot measure any difference between runs with
> FFTW_NO_SIMD and without.

most likely SIMD is not used anyway.

Maybe your function is not yet implemented with SSE2 or your CPU does 
not support SSE2.

> I'm trying to do in-place DFT, with plans created like this:

In-place may be part of the problem.


> FFTW has been compiled with SSE and SSE2 support. I'm using
> double-precision.

Double precision, too.


> Any ideas how I can track the problem down?

- Test without in-place.
- Test with single precision.
- Check whether your CPU supports SSE2.
- Check the release notes of FFTW for information on x86 assembler
language implementations and their restrictions.
- Look at the FFTW source to see whether there is an SSE2 implementation
of your core function.


Marcel