Reply by Andreas Huennebeck January 8, 2008
Rune Allnor wrote:

> Below is a test program I wrote to check the relative
> performance of templated code and "standard" code for
> computing quasi-Chebyshev polynomials ('quasi' because
> there are some 1/2^N scaling factors missing).
>
> /**********************************************************/
[..]
> /**********************************************************/
>
> When optimized for speed, the output on my computer is
>
> Templates : 961
> Standard : 1803
>
> which means that the template version is twice as fast as
> the naive code.
I tried this on a Linux PC with gcc 3.4.6, with different optimizer flags (none/-O1/-O2/-O3/-O4/-O5). Each version was called 4 times in a row:
>./template-speed
Templates : 2380000 Standard : 2280000
>./template-speed
Templates : 2690000 Standard : 2480000
>./template-speed
Templates : 2330000 Standard : 2370000
>./template-speed
Templates : 2270000 Standard : 2510000
>./template-speed-O1
Templates : 2210000 Standard : 1950000
>./template-speed-O1
Templates : 1600000 Standard : 1910000
>./template-speed-O1
Templates : 1770000 Standard : 2220000
>./template-speed-O1
Templates : 2260000 Standard : 2230000
>./template-speed-O2
Templates : 1670000 Standard : 1410000
>./template-speed-O2
Templates : 1650000 Standard : 1300000
>./template-speed-O2
Templates : 1660000 Standard : 1360000
>./template-speed-O2
Templates : 1660000 Standard : 1690000
>./template-speed-O3
Templates : 990000 Standard : 1660000
>./template-speed-O3
Templates : 650000 Standard : 1490000
>./template-speed-O3
Templates : 890000 Standard : 1550000
>./template-speed-O3
Templates : 630000 Standard : 1210000
>./template-speed-O4
Templates : 890000 Standard : 1470000
>./template-speed-O4
Templates : 870000 Standard : 1430000
>./template-speed-O4
Templates : 670000 Standard : 1540000
>./template-speed-O4
Templates : 890000 Standard : 1300000
>./template-speed-O5
Templates : 890000 Standard : 1550000
>./template-speed-O5
Templates : 650000 Standard : 1540000
>./template-speed-O5
Templates : 660000 Standard : 1260000
>./template-speed-O5
Templates : 890000 Standard : 1650000

bye
Andreas
--
Andreas Hünnebeck | email: acmh@gmx.de
----- privat ---- | www : http://www.huennebeck-online.de
Fax/Anrufbeantworter: 0721/151-284301
GPG-Key: http://www.huennebeck-online.de/public_keys/andreas.asc
PGP-Key: http://www.huennebeck-online.de/public_keys/pgp_andreas.asc
Reply by Rune Allnor January 7, 2008
On 5 Jan, 04:58, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:
> Also, it's not at all clear to me how
> you think templates help you gain performance in the above code vs. a
> C-style explicit implementation
The rationale for the question is that templates display all the code to the compiler at once. In principle, nothing is compiled until user code instantiates a templated function. Since the compiler sees all the source code (including the context where the user calls the library), the compiler can - presumably - do a better job of optimizing.

Below is a test program I wrote to check the relative performance of templated code and "standard" code for computing quasi-Chebyshev polynomials ('quasi' because there are some 1/2^N scaling factors missing).

/**********************************************************/
#include <iostream>
#include <ctime>
#include <cstdlib>
#include <vector>

//////////////////////////////////////////////////////////////////////
//
// Template functions

template<unsigned long int N>      // Forward declaration of
double U(const double);            // Chebyshev function of 2nd kind

template<unsigned long int N>      // Chebyshev function of 1st kind
double T(const double x)
{
    return x*T<N-1>(x) - (1-x*x)*U<N-2>(x);
}

template<unsigned long int N>      // Chebyshev function of 2nd kind
double U(const double x)
{
    return x*U<N-1>(x) + T<N>(x);
}

template<>                         // Initial function recurrence
double T<0>(const double x)
{
    return 1;
}

template<>
double T<1>(const double x)
{
    return x;
}

template<>
double U<0>(const double x)
{
    return 1;
}

//////////////////////////////////////////////////////////////////////
//
// Standard recursive functions

double u(const long unsigned int, const double); // Forward declaration of
                                                 // Chebyshev function of 2nd kind

double t(const long unsigned int N, const double x)
{
    if (N>1)
    {
        return x*t(N-1,x) - (1-x*x)*u(N-2,x);
    }
    if (N==1)
    {
        return x;
    }
    return 1;
}

double u(const long unsigned int N, const double x)
{
    if (N>0)
    {
        return x*u(N-1,x) + t(N,x);
    }
    return 1;
}

int main()
{
    const unsigned long int N = 100000;

    std::vector<double> x;
    std::vector<double> y;

    x.resize(N);    // resize, not reserve: the elements are indexed below
    y.resize(N);

    long unsigned int n;
    for (n=0;n<N;++n)
    {
        x[n] = (double)rand()/(double)RAND_MAX;
    }

    clock_t t1,t2,d1,d2;

    t1 = clock();
    for (n=0;n<N;++n)
    {
        y[n] = T<10>(x[n]);
    }
    d1 = clock() - t1;

    t2 = clock();
    for (n=0;n<N;++n)
    {
        y[n] = t(10,x[n]);
    }
    d2 = clock() - t2;

    std::cout << "Templates : " << d1 << std::endl;
    std::cout << "Standard  : " << d2 << std::endl;

    return 0;
}
/**********************************************************/

When optimized for speed, the output on my computer is

Templates : 961
Standard : 1803

which means that the template version is twice as fast as the naive code.

Rune
Reply by Steven G. Johnson January 4, 2008
On Jan 4, 2:15 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
> Vladimir caught me on the terminology - I wasn't specific
> enough. While both <vector> and <complex> are templates,
> I was thinking about the 'butterflies'. It's decades since
> I played with FFT source code, so don't be too harsh on the
> outline...
>
> Anyway, suppose one wants an N-point FFT where N is decomposed
> into primes as N=a*b*c*...*d. Instead of hardcoding the
> loop over the primes, use a template expression based on
>
> template<int N>
> vector<float> myclass::butterfly()
> {
>   if (/* M is non-prime and N > 1*/){
>     int M = /* lowest prime factor of N */
>     return this->butterfly<M>();
>   }
>   else{
>     /* Implement 'butterfly' for prime N*/
>   }
> }
>
> If one can do something like that, one might achieve
> reasonable performance without doing all the detailed
> analyses you people do. Might make a difference for
> the hobby programmer...
Doing explicit recursion via dividing by the smallest prime factor at each step, until you reach a prime size, will be horrendously slow. Note also that what you wrote down is not really the outline of an FFT: at intermediate steps where N is divided by a factor M, you have to do M recursive transforms of size N/M and also N/M butterflies of size M. (Also, the leaves of the recursion are substantially different from the butterflies, in that the latter have twiddle factors and the former do not.)

FFTW uses explicit recursion, but the key is that the butterflies it uses are almost never prime, nor are they small. Instead, the building blocks (esp. the leaves of the recursion) should be big (e.g. FFTW regularly uses radix 32 or 64 steps), both to amortize the recursive function-call overhead and also to take advantage of the large register set and other properties of CPUs these days. And generating these large building blocks (FFTW's "codelets") is nontrivial.

Nevertheless, even the outline that you propose is substantially different from a textbook radix-2 code that makes log2(n) non-recursive breadth-first passes over the array; it was precisely my point that such textbook code is not a useful starting point anymore for performance optimization. Also, it's not at all clear to me how you think templates help you gain performance in the above code vs. a C-style explicit implementation (of course, some people may prefer templates stylistically).

Regards, Steven G. Johnson
Reply by Rune Allnor January 4, 2008
On 4 Jan, 20:15, Rune Allnor <all...@tele.ntnu.no> wrote:

Typo corrections in pseudo code:

> template<int N>
> vector<float> myclass::butterfly()
> {
>   if (/* M is non-prime and M > 1*/){

    if (/* N is non-prime and N > 1*/){

>   int M = /* lowest prime factor of N */
>   return this->butterfly<M>();
>   }
>   else{
>   /* Implement 'butterfly' for prime M*/

    /* Implement 'butterfly' for prime N*/

>   }
> }
Reply by Rune Allnor January 4, 2008
On 4 Jan, 18:18, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:
> On Jan 4, 10:58 am, Rune Allnor <all...@tele.ntnu.no> wrote:
>
> > Have anybody tested what can be gained by switching
> > to C++-style[*] template programming (as opposed to
> > regular code)?
>
> What do you suggest using the templates for in the case of an FFT?
>
> For type polymorphism, it's a coding convenience rather than a
> performance issue. The complex type is also more a question of
> convenience.
Vladimir caught me on the terminology - I wasn't specific enough. While both <vector> and <complex> are templates, I was thinking about the 'butterflies'. It's decades since I played with FFT source code, so don't be too harsh on the outline...

Anyway, suppose one wants an N-point FFT where N is decomposed into primes as N=a*b*c*...*d. Instead of hardcoding the loop over the primes, use a template expression based on

template<int N>
vector<float> myclass::butterfly()
{
  if (/* M is non-prime and M > 1*/){
    int M = /* lowest prime factor of N */
    return this->butterfly<M>();
  }
  else{
    /* Implement 'butterfly' for prime M*/
  }
}

If one can do something like that, one might achieve reasonable performance without doing all the detailed analyses you people do. Might make a difference for the hobby programmer...

Rune
Reply by Steven G. Johnson January 4, 2008
On Jan 4, 10:58 am, Rune Allnor <all...@tele.ntnu.no> wrote:
> Has anybody tested what can be gained by switching
> to C++-style[*] template programming (as opposed to
> regular code)?
What do you suggest using the templates for in the case of an FFT?

For type polymorphism, it's a coding convenience rather than a performance issue. The complex type is also more a question of convenience.

For STL container templates, the vendor-optimized parts aren't too useful here. The fastest FFT algorithms don't really consist of a sequence of BLAS-like operations or similar collective functions where the optimized code in STL vector templates will help you (performance-wise) as far as I can tell; for random access and operations on individual elements, it is going to be equivalent to ordinary arrays/pointers.

For Blitz++ style metaprogramming, they have tried it for FFT (http://www.oonumerics.org/blitz/examples/fft.html). Basically, you can use templates to write a code generator for hard-coded FFTs of small sizes. (Generating a hard-coded FFT for sizes > 128 or so is impractical due to instruction-cache issues.) However, there's much more to generating fast hard-coded FFTs than just partial evaluation of a radix-2 FFT, which is what Blitz++ did IIRC; one has to worry about code-scheduling issues to minimize register spills (the compiler can't do it for you because it is NP-hard in general, but there are specialized solutions that work well for FFTs), one would like to simplify operations involving trivial powers of roots of unity (which give 1, i, [1+i]/sqrt(2), etcetera, that lead to simplified complex multiplications), etcetera... see our papers on FFTW's code generator.

Of course, in principle this could all be done with template metaprogramming, since it is Turing-complete (if I recall correctly), but it doesn't seem like the right language for this sort of thing. In FFTW, we wrote our code generator in Caml, which is more suited for writing compiler-like programs (built-in support for pattern matching, symbolic-expression datatypes, etc).
Anyway, all of this is insufficient for FFTs > 128 or so, as I mentioned; at best, it gives you building blocks for larger FFTs, as in FFTW. Regards, Steven G. Johnson
Reply by Vladimir Vassilevsky January 4, 2008

Rune Allnor wrote:

>>> I am trying to implement a fast 256 to 2048 (in 2's) algorithm in ASM
>>> (DLL) not to compete with FFTW for versatility but SPEED in specialised
>>> medias.
> Has anybody tested what can be gained by switching
> to C++-style[*] template programming (as opposed to
> regular code)?
Here is the real FFT code using <vector> and <complex>. It is somewhat faster than the direct implementation.

Disclaimer: the code was submitted by a third party for my consideration. I don't know who the author is.

/***************************************************/
#ifndef _FFT_H

#include <vector>
#include <math.h>
#include "complex.h"

int fft_bah(vector<double>* ps, vector<Complex>* pS)
{
    int N = ps->size();

    // Verify that N is a power of two
    int Z = N;
    while (Z > 1)
    {
        int Z1 = Z >> 1;
        float Z2 = (float)(Z)/2;
        if (float(Z1) == Z2) { Z = Z1; }
        else { return 1; }
    }

    pS->clear();
    Complex c;
    vector<Complex> St;
    St.clear();

    // step = log2(N)
    int step, N1;
    step = 0;
    for (N1 = N; N1 > 1;)
    {
        step = step + 1;
        N1 = N1 >> 1;
    }

    int k;
    k = 1;
    for (int i = 0; i < ps->size(); i++)
    {
        c.r = ps->at(i);
        c.i = 0.0;
        St.push_back(c);
    }

    for (int z = 0; z < step; z++)
    {
        k = k*2;
        for (int j = 0; j < k; j++)
        {
            if (j%2 == 0)
                for (int n = 0; n < N/2; n++)
                {
                    c = St.at(n + N*j/2) + St.at(N/2 + n + N*j/2);
                    pS->push_back(c);
                }
            else
                for (int n = 0; n < N/2; n++)
                {
                    c = (St.at(N*(j-1)/2 + n) - St.at(N/2 + n + N*(j-1)/2))
                        * exp(1.0, -2*M_PI*n/N);
                    pS->push_back(c);
                }
        }
        N = N/2;
        St.clear();
        for (int i = 0; i < pS->size(); i++)
        {
            c = pS->at(i);
            St.push_back(c);
        }
        pS->clear();
    }

    // Bit-reversal reordering of the output
    int z;
    N = St.size();
    for (int i = 0; i < N; i++)
    {
        int j = 0;
        int n1 = 0;
        int m = 1;
        j = i;
        for (int k1 = N; k1 > 1;)
        {
            m = m*2;
            z = int(j%(k1/2));
            n1 = n1 + (j - z)*m/k1;
            k1 = k1/2;
            j = z;
        }
        c = St.at(n1);
        pS->push_back(c);
    }
    return 0;
};

#define _FFT_H
#endif
/**********************************************/

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
Reply by Rune Allnor January 4, 2008
On 3 Jan, 23:46, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:
> On Jan 1, 10:06 pm, "gebe" <g...@telkomsa.net> wrote:
>
> > I am trying to implement a fast 256 to 2048 (in 2's) algorithm in ASM
> > (DLL) not to compete with FFTW for versatility but SPEED in specialised
> > medias.
> > I am at the moment (using the best I could find) getting 36 usecs (1024
> > pts) on a 1800Mhz clocked AMD 32 bits and 78 uSec for 2048 points
>
> That's not bad at all for a first try, but you still have a fairly
> long way to go.
Has anybody tested what can be gained by switching to C++-style[*] template programming (as opposed to regular code)? I gained a factor of 2 on a naive Chebyshev polynomial computation by switching from regular code to template programming, using the same naive algorithm in both cases.

Rune

[*] As far as I know, template programming is only available in C++, but I wouldn't be surprised if it has become available elsewhere.
Reply by Steven G. Johnson January 3, 2008
On Jan 1, 10:06 pm, "gebe" <g...@telkomsa.net> wrote:
> I am trying to implement a fast 256 to 2048 (in 2's) algorithm in ASM
> (DLL) not to compete with FFTW for versatility but SPEED in specialised
> medias.
> I am at the moment (using the best I could find) getting 36 usecs (1024
> pts) on a 1800Mhz clocked AMD 32 bits and 78 uSec for 2048 points
That's not bad at all for a first try, but you still have a fairly long way to go.

Let me assume you are using single precision, which should be more than sufficient for most applications involving such small transforms. In that case, according to the benchmark numbers on our web site (http://www.fftw.org/speed/), FFTW can do a 1024-point complex-data transform in about 10usecs on a 2.4GHz AMD Opteron 275 (32-bit mode), and a 2048-point transform in about 23usecs; naively scaling by the clock speed to 1.8GHz gives 13 and 31usecs, respectively. On a newer machine, a 3GHz Core Duo in 64-bit mode, the times are a bit under 4usecs and 10usecs, respectively. And if your data are purely real, then you can gain an additional factor of 1.5 to 2.

(It's fairly hard to get anywhere close to optimal performance on a general-purpose CPU these days if you start with a textbook radix-2 FFT algorithm and try to optimize it, if that is what you are doing.)

Regards, Steven G. Johnson
Reply by Steven G. Johnson January 3, 2008
On Jan 2, 1:25 am, "gebe" <g...@telkomsa.net> wrote:
> I know FFTW has a way to rate speed as MFLOPS but I don't understand that
> way
As clearly defined at the top of our benchmarks page, the "mflops" number we report is simply a convenient rescaling of the speed ( = 5 N log2 N / [time in microseconds]): http://www.fftw.org/speed/

(You can also download data files with the raw timing numbers if you prefer, as described on our web page.)

Regards, Steven G. Johnson