Reply by Andreas Huennebeck●January 8, 2008
Rune Allnor wrote:
> Below is a test program I wrote to check the relative
> performance of templated code and "standard" code for
> computing quasi-Chebyshev polynomials ('quasi' because
> there are some 1/2^N scaling factors missing).
>
> /**********************************************************/
[..]
> /**********************************************************/
>
> When optimized for speed, the output on my computer is
>
> Templates : 961
> Standard : 1803
>
> which means that the template version is twice as fast as
> the naive code.
I tried this on a Linux PC with gcc 3.4.6, with different optimizer flags
(none/-O1/-O2/-O3/-O4/-O5). Each version was called 4 times in a row:
Reply by Rune Allnor
On 5 Jan, 04:58, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:
> Also, it's not at all clear to me how
> you think templates help you gain performance in the above code vs. a
> C-style explicit implementation
The rationale for the question is that templates expose
all the code to the compiler at once. In principle, nothing
is compiled until user code instantiates a templated
function. Since the compiler sees all the source
code (including the context where the user calls the library),
it can - presumably - do a better job of optimizing.
Below is a test program I wrote to check the relative
performance of templated code and "standard" code for
computing quasi-Chebyshev polynomials ('quasi' because
there are some 1/2^N scaling factors missing).
/**********************************************************/
#include <iostream>
#include <ctime>
#include <vector>
#include <cstdlib>
//////////////////////////////////////////////////////////////////////
//
// Template functions
template<unsigned long int N> // Forward declaration of
double U(const double); // Chebyshev function of 2nd kind
template<unsigned long int N> // Chebyshev function of 1st kind
double T(const double x)
{
return x*T<N-1>(x)-(1-x*x)*U<N-2>(x);
}
template<unsigned long int N> // Chebyshev function of 2nd kind
double U(const double x)
{
return x*U<N-1>(x) + T<N>(x);
}
template<> // Initial function recurrence
double T<0>(const double x)
{
return 1;
}
template<>
double T<1>(const double x)
{
return x;
}
template<>
double U<0>(const double x)
{
return 1;
}
//////////////////////////////////////////////////////////////////////
//
// Standard recursive functions
double u(const long unsigned int, const double); // Forward declaration of
                                                 // Chebyshev function of 2nd kind
double t(const long unsigned int N,const double x)
{
if (N>1)
{
return x*t(N-1,x) - (1-x*x)*u(N-2,x);
}
if (N==1)
{
return x;
}
return 1;
}
double u(const long unsigned int N, const double x)
{
if (N>0)
{
return x*u(N-1,x) + t(N,x);
}
return 1;
}
int main()
{
const unsigned long int N = 100000;
std::vector<double> x;
std::vector<double> y;
x.resize(N); // resize, not reserve: x[n] below requires the elements to exist
y.resize(N);
long unsigned int n;
for (n=0;n<N;++n)
{
x[n] = (double)rand()/(double)RAND_MAX;
}
clock_t t1,t2,d1,d2;
t1 = clock();
for (n=0;n<N;++n)
{
y[n] = T<10>(x[n]);
}
d1 = clock() - t1;
t2 = clock();
for (n=0;n<N;++n)
{
y[n] = t(10,x[n]);
}
d2 = clock() - t2;
std::cout << "Templates : " << d1 << std::endl;
std::cout << "Standard : " << d2 << std::endl;
return 0;
}
/**********************************************************/
When optimized for speed, the output on my computer is
Templates : 961
Standard : 1803
which means that the template version is twice as fast as
the naive code.
Rune
Reply by Steven G. Johnson●January 4, 2008
On Jan 4, 2:15 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
> Vladimir caught me on the terminology - I wasn't specific
> enough. While both <vector> and <complex> are templates,
> I was thinking about the 'butterflies'. It's decades since
> I played with FFT source code, so don't be too harsh on the
> outline...
>
> Anyway, suppose one wants an N-point FFT where N is decomposed
> into primes as N=a*b*c*...*d. Instead of hardcoding the
> loop over the primes, use a template expression based on
>
> template<int N>
> vector<float> myclass::butterfly()
> {
> if (/* M is non-prime and N > 1*/){
> int M = /* lowest prime factor of N */
> return this->butterfly<M>();
> }
> else{
> /* Implement 'butterfly' for prime N*/
> }
>
> }
>
> If one can do something like that, one might achieve
> reasonable performance without doing all the detailed
> analyses you people do. Might make a difference for
> the hobby programmer...
Doing explicit recursion via dividing by the smallest prime factor at
each step, until you reach a prime size, will be horrendously slow.
Note also that what you wrote down is not really the outline of an
FFT---at intermediate steps where N is divided by a factor M, you have
to do M recursive transforms of size N/M and also N/M butterflies of
size M. (Also, the leaves of the recursion are substantially different
from the butterflies, in that the latter have twiddle factors and the
former do not.)
FFTW uses explicit recursion, but the key is that the butterflies it
uses are almost never prime, nor are they small. Instead, the
building blocks (esp. the leaves of the recursion) should be big (e.g.
FFTW regularly uses radix 32 or 64 steps), both to amortize the
recursive function-call overhead and also to take advantage of the
large register set and other properties of CPUs these days. And
generating these large building blocks (FFTW's "codelets") is
nontrivial.
Nevertheless, even the outline that you propose is substantially
different from a textbook radix-2 code that makes log2(n) non-
recursive breadth-first passes over the array; it was precisely my
point that such textbook code is not a useful starting point anymore
for performance optimization. Also, it's not at all clear to me how
you think templates help you gain performance in the above code vs. a
C-style explicit implementation (of course, some people may prefer
templates stylistically).
Regards,
Steven G. Johnson
Reply by Rune Allnor●January 4, 2008
On 4 Jan, 20:15, Rune Allnor <all...@tele.ntnu.no> wrote:
Typo corrections in the pseudo code:
>    int M = /* lowest prime factor of N */
>    return this->butterfly<M>();
>    }
>    else{
>    /* Implement 'butterfly' for prime M*/
/* Implement 'butterfly' for prime N*/
>    }
>
> }
Reply by Rune Allnor●January 4, 2008
On 4 Jan, 18:18, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:
> On Jan 4, 10:58 am, Rune Allnor <all...@tele.ntnu.no> wrote:
>
> > Has anybody tested what can be gained by switching
> > to C++-style[*] template programming (as opposed to
> > regular code)?
>
> What do you suggest using the templates for in the case of an FFT?
>
> For type polymorphism, it's a coding convenience rather than a
> performance issue. �The complex type is also more a question of
> convenience.
Vladimir caught me on the terminology - I wasn't specific
enough. While both <vector> and <complex> are templates,
I was thinking about the 'butterflies'. It's decades since
I played with FFT source code, so don't be too harsh on the
outline...
Anyway, suppose one wants an N-point FFT where N is decomposed
into primes as N=a*b*c*...*d. Instead of hardcoding the
loop over the primes, use a template expression based on
template<int N>
vector<float> myclass::butterfly()
{
if (/* M is non-prime and M > 1*/){
int M = /* lowest prime factor of N */
return this->butterfly<M>();
}
else{
/* Implement 'butterfly' for prime M*/
}
}
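A compilable version of that dispatch might look as follows. This is only a sketch: spf and factor_steps are made-up names, each stage just records its factor instead of performing a transform, and C++11 constexpr is used for brevity where 2008-era code would need recursive class templates:

```cpp
#include <vector>

// Hypothetical compile-time smallest-prime-factor helper.
constexpr int spf(int n, int p = 2) {
    return p * p > n ? n : (n % p == 0 ? p : spf(n, p + 1));
}

// Skeleton of the butterfly<N> dispatch: peel off the smallest prime
// factor of N at compile time until the remaining size is prime.
template<int N>
void factor_steps(std::vector<int>& out) {
    constexpr int M = spf(N);
    if (M < N) {                  // N is composite
        out.push_back(M);
        factor_steps<N / M>(out); // recursion resolved at compile time
    } else {
        out.push_back(N);         // base case: N is prime
    }
}
```

The sketch only illustrates the compile-time dispatch mechanism, not an actual FFT decomposition.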
If one can do something like that, one might achieve
reasonable performance without doing all the detailed
analyses you people do. Might make a difference for
the hobby programmer...
Rune
Reply by Steven G. Johnson●January 4, 2008
On Jan 4, 10:58 am, Rune Allnor <all...@tele.ntnu.no> wrote:
> Has anybody tested what can be gained by switching
> to C++-style[*] template programming (as opposed to
> regular code)?
What do you suggest using the templates for in the case of an FFT?
For type polymorphism, it's a coding convenience rather than a
performance issue. The complex type is also more a question of
convenience.
For STL container templates, the vendor-optimized parts aren't too
useful here. The fastest FFT algorithms don't really consist of a
sequence of BLAS-like operations or similar collective functions where
the optimized code in STL vector templates will help you (performance-
wise) as far as I can tell; for random access and operations on
individual elements, it is going to be equivalent to ordinary arrays/
pointers.
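That equivalence can be seen in a minimal sketch (sum_via_vector and sum_via_pointer are made-up names; data() is C++11, older code would take &v[0]):

```cpp
#include <cstddef>
#include <vector>

// Element access through std::vector's operator[] versus a raw pointer
// into the same storage: an optimizing compiler emits the same loop for
// both, so an FFT inner loop loses nothing by using vector.
double sum_via_vector(const std::vector<double>& v) {
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i)
        s += v[i];                  // unchecked access, like p[i]
    return s;
}

double sum_via_pointer(const std::vector<double>& v) {
    const double* p = v.data();     // contiguous storage (C++11 data())
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i)
        s += p[i];
    return s;
}
```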
For Blitz++ style metaprogramming, they have tried it for FFT
(http://www.oonumerics.org/blitz/examples/fft.html). Basically, you can use
templates to write a code generator for hard-coded FFTs of small
sizes. (Generating a hard-coded FFT for sizes > 128 or so is
impractical due to instruction-cache issues.) However, there's much
more to generating fast hard-coded FFTs than just partial evaluation
of a radix-2 FFT, which is what Blitz++ did IIRC; one has to worry
about code-scheduling issues to minimize register spills (the compiler
can't do it for you because it is NP-hard in general, but there are
specialized solutions that work well for FFTs), one would like to
simplify operations involving trivial powers of roots of unity (which
give 1, i, [1+i]/sqrt(2), etcetera, that lead to simplified complex
multiplications), etcetera....see our papers on FFTW's code
generator. Of course, in principle this could all be done with
template metaprogramming, since it is Turing-complete (if I recall
correctly), but it doesn't seem like the right language for this sort
of thing. In FFTW, we wrote our code generator in Caml, which is more
suited for writing compiler-like programs (built-in support for
pattern matching, symbolic-expression datatypes, etc). Anyway, all of
this is insufficient for FFTs > 128 or so, as I mentioned; at best, it
gives you building blocks for larger FFTs, as in FFTW.
Regards,
Steven G. Johnson
Reply by Vladimir Vassilevsky●January 4, 2008
Rune Allnor wrote:
>>>I am trying to implement a fast 256 to 2048 (in 2's)algorithm in ASM
>>>(DLL)not to compete with FFTW for versatility but SPEED in specialised
>>>medias .
> Has anybody tested what can be gained by switching
> to C++-style[*] template programming (as opposed to
> regular code)?
Here is the real FFT code using <vector> and <complex>
It is somewhat faster than the direct implementation.
Disclaimer: the code was submitted by a third party for my
consideration. I don't know who the author is.
/***************************************************/
#ifndef _FFT_H
#define _FFT_H

#include <vector>
#include <math.h>
#include "complex.h"

using std::vector;

int fft_bah(vector<double>* ps, vector<Complex>* pS)
{
    int N = ps->size();

    // Bail out unless N is a power of two.
    int Z = N;
    while (Z > 1)
    {
        int Z1 = Z >> 1;
        float Z2 = (float)Z / 2;
        if (float(Z1) == Z2)
            Z = Z1;
        else
            return 1;
    }

    pS->clear();
    Complex c;
    vector<Complex> St;

    // Number of decimation stages: log2(N).
    int step = 0;
    for (int N1 = N; N1 > 1; N1 >>= 1)
        step = step + 1;

    // Load the real input into a complex working buffer.
    for (int i = 0; i < (int)ps->size(); i++)
    {
        c.r = ps->at(i);
        c.i = 0.0;
        St.push_back(c);
    }

    // Decimation passes.
    int k = 1;
    for (int z = 0; z < step; z++)
    {
        k = k * 2;
        for (int j = 0; j < k; j++)
        {
            if (j % 2 == 0)
                for (int n = 0; n < N/2; n++)
                {
                    c = St.at(n + N*j/2) + St.at(N/2 + n + N*j/2);
                    pS->push_back(c);
                }
            else
                for (int n = 0; n < N/2; n++)
                {
                    c = (St.at(N*(j-1)/2 + n) - St.at(N/2 + n + N*(j-1)/2))
                        * exp(1.0, -2*M_PI*n/N);
                    pS->push_back(c);
                }
        }
        N = N / 2;
        St = *pS;
        pS->clear();
    }

    // Bit-reversal reordering of the result.
    N = St.size();
    for (int i = 0; i < N; i++)
    {
        int n1 = 0;
        int m = 1;
        int j = i;
        for (int k1 = N; k1 > 1; k1 /= 2)
        {
            m = m * 2;
            int z = j % (k1 / 2);
            n1 = n1 + (j - z) * m / k1;
            j = z;
        }
        pS->push_back(St.at(n1));
    }
    return 0;
}
#endif /* _FFT_H */
/**********************************************/
Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
Reply by Rune Allnor●January 4, 2008
On 3 Jan, 23:46, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:
> On Jan 1, 10:06 pm, "gebe" <g...@telkomsa.net> wrote:
>
> > I am trying to implement a fast 256 to 2048 (in 2's)algorithm in ASM
> > (DLL)not to compete with FFTW for versatility but SPEED in specialised
> > medias .
> > I am at the moment (using the best I could find) getting 36 usecs(1024
> > pts) on a 1800Mhz clocked AMD 32 bits and 78 uSec for 2048 points
>
> That's not bad at all for a first try, but you still have a fairly
> long way to go.
Has anybody tested what can be gained by switching
to C++-style[*] template programming (as opposed to
regular code)?
I gained a factor of 2 on a naive Chebyshev polynomial
computation by switching from regular code to template
programming, using the same naive algorithm in
both cases.
Rune
[*] As far as I know, template programming is only
available in C++, but I wouldn't be surprised
if it has become available elsewhere.
Reply by Steven G. Johnson●January 3, 2008
On Jan 1, 10:06 pm, "gebe" <g...@telkomsa.net> wrote:
> I am trying to implement a fast 256 to 2048 (in 2's)algorithm in ASM
> (DLL)not to compete with FFTW for versatility but SPEED in specialised
> medias .
> I am at the moment (using the best I could find) getting 36 usecs(1024
> pts) on a 1800Mhz clocked AMD 32 bits and 78 uSec for 2048 points
That's not bad at all for a first try, but you still have a fairly
long way to go.
Let me assume you are using single precision, which should be more
than sufficient for most applications involving such small
transforms. In that case, according to the benchmark numbers on our
web site (http://www.fftw.org/speed/), FFTW can do a 1024-point
complex-data transform in about 10usecs on a 2.4GHz AMD Opteron 275
(32-bit mode), and a 2048-point transform in about 23usecs; naively
scaling by the clock speed to 1.8GHz gives 13 and 31usecs,
respectively. On a newer machine, a 3GHz Core Duo in 64-bit mode, the
times are a bit under 4usecs and 10usecs, respectively. And if your
data are purely real, then you can gain an additional factor of 1.5 to
2.
(It's fairly hard to get anywhere close to optimal performance on a
general-purpose CPU these days if you start with a textbook radix-2
FFT algorithm and try to optimize it, if that is what you are doing.)
Regards,
Steven G. Johnson
Reply by Steven G. Johnson●January 3, 2008
On Jan 2, 1:25 am, "gebe" <g...@telkomsa.net> wrote:
> I know FFTW has a way to rate speed as MFLOPS but I don't understand that
> way
As clearly defined at the top of our benchmarks page, the "mflops"
number we report is simply a convenient rescaling of the speed ( = 5 N
log2 N / [time in microseconds]):
http://www.fftw.org/speed/
(You can also download data files with the raw timing numbers if you
prefer, as described on our web page.)
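Plugging the timings quoted earlier in the thread into that formula (fftw_mflops is a made-up name used here only for illustration):

```cpp
#include <cmath>

// FFTW's "mflops" convention as defined above: a rescaling of speed,
// 5 N log2(N) / (time in microseconds).
double fftw_mflops(double n, double usec) {
    return 5.0 * n * std::log2(n) / usec;
}
// For the figures upthread: 1024 points in 36 usec gives about 1422
// "mflops"; 2048 points in 78 usec gives about 1444.
```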
Regards,
Steven G. Johnson