Forums

Highest possible speed algo in Vb or C or Pseudocode

Started by gebe January 1, 2008
On 4 Jan, 18:18, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:
> On Jan 4, 10:58 am, Rune Allnor <all...@tele.ntnu.no> wrote: > > > Have anybody tested what can be gained by switching > > to C++-style[*] template programming (as opposed to > > regular code)? > > What do you suggest using the templates for in the case of an FFT? > > For type polymorphism, it's a coding convenience rather than a > performance issue. &#2013266080;The complex type is also more a question of > convenience.
Vladimir caught me on the terminology - I wasn't specific enough. While both <vector> and <complex> are templates, I was thinking about the 'butterflies'. It's decades since I played with FFT source code, so don't be too harsh on the outline... Anyway, suppose one wants an N-point FFT where N is decomposed into primes as N=a*b*c*...*d. Instead of hardcoding the loop over the primes, use a template expression based on template<int N> vector<float> myclass::butterfly() { if (/* M is non-prime and M > 1*/){ int M = /* lowest prime factor of N */ return this->butterfly<M>(); } else{ /* Implement 'butterfly' for prime M*/ } } If one can do something like that, one might achieve reasonable performance without doing all the detailed analyses you people do. Might make a difference for the hobby programmer... Rune
On 4 Jan, 20:15, Rune Allnor <all...@tele.ntnu.no> wrote:

Typo correctons in pseudo code:

> template<int N> > vector<float> myclass::butterfly() > { > &#2013266080; &#2013266080;--if--(/*-M-is-non=prime-and-M->-1*/)---{
if (/* N is non-prime and N > 1*/){
> &#2013266080; &#2013266080;int M = /* lowest prime factor of N */ > &#2013266080; &#2013266080;return this->butterfly<M>(); > &#2013266080; &#2013266080;} > &#2013266080; &#2013266080;else{ > &#2013266080; &#2013266080;/*-Implement-'butterfly'-for-prime-M*/
/* Implement 'butterfly' for prime N*/
> &#2013266080; &#2013266080;} > > }
On Jan 4, 2:15 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
> Vladimir caught me on the terminology - I wasn't specific > enough. While both <vector> and <complex> are templates, > I was thinking about the 'butterflies'. It's decades since > I played with FFT source code, so don't be too harsh on the > outline... > > Anyway, suppose one wants an N-point FFT where N is decomposed > into primes as N=a*b*c*...*d. Instead of hardcoding the > loop over the primes, use a template expression based on > > template<int N> > vector<float> myclass::butterfly() > { > if (/* M is non-prime and N > 1*/){ > int M = /* lowest prime factor of N */ > return this->butterfly<M>(); > } > else{ > /* Implement 'butterfly' for prime N*/ > } > > } > > If one can do something like that, one might achieve > reasonable performance without doing all the detailed > analyses you people do. Might make a difference for > the hobby programmer...
Doing explicit recursion via dividing by the smallest prime factor at each step, until you reach a prime size, will be horrendously slow. Note also that what you wrote down is not really the outline of an FFT---at intermediate steps where N is divided by a factor M, you have to do M recursive transforms of size N/M and also N/M butterflies of size M. (Also, the leaves of the recursion are substantially different from the butterflies, in that the latter have twiddle factors and the former do not.) FFTW uses explicit recursion, but the key is that the butterflies it uses are almost never prime, nor are they small. Instead, the building blocks (esp. the leaves of the recursion) should be big (e.g. FFTW regularly uses radix 32 or 64 steps), both to amortize the recursive function-call overhead and also to take advantage of the large register set and other properties of CPUs these days. And generating these large building blocks (FFTW's "codelets") is nontrivial. Nevertheless, even the outline that you propose is substantially different from a textbook radix-2 code that makes log2(n) non- recursive breadth-first passes over the array; it was precisely my point that such textbook code is not a useful starting point anymore for performance optimization. Also, it's not at all clear to me how you think templates help you gain performance in the above code vs. a C-style explicit implementation (of course, some people may prefer templates stylistically). Regards, Steen G. Johnson
On 5 Jan, 04:58, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:
> Also, it's not at all clear to me how > you think templates help you gain performance in the above code vs. a > C-style explicit implementation
The rationale for the question is that templates displays all the code to the compiler at once. In principle, nothing is compiled until user code instantiates an instance of a templated function. Since the compiler sees all the source code (including the context where the user calls the library) the compiler can - presumably - do a better job at optimizing. Below is a test program I wrote to check the relative performance of templated code and "standard" code for computing quasi-Chebyshev polynomials ('quasi' because there are some 1/2^N scaling factors missing). /**********************************************************/ #include <iostream.h> #include <time.h> #include <vector> #include <stdlib.h> ////////////////////////////////////////////////////////////////////// // // Template functions template<unsigned long int N> // Forward declaration of double U(const double); // Chebyshev function of 2nd kind template<unsigned long int N> // Chebyshev function of 1st kind double T(const double x) { return x*T<N-1>(x)-(1-x*x)*U<N-2>(x); } template<unsigned long int N> // Chebyshev function of 2nd kind double U(const double x) { return x*U<N-1>(x) + T<N>(x); } template<> // Initial function recurrence double T<0>(const double x) { return 1; } template<> double T<1>(const double x) { return x; } template<> double U<0>(const double x) { return 1; } ////////////////////////////////////////////////////////////////////// // // Standard recursive functions double u(const long unsigned int,const double); // Forward declaration of // Chebyshev function of 2nd kind double t(const long unsigned int N,const double x) { if (N>1) { return x*t(N-1,x) - (1-x*x)*u(N-2,x); } if (N==1) { return x; } return 1; } double u(const long unsigned int N, const double x) { if (N>0) { return x*u(N-1,x) + t(N,x); } return 1; } int main() { const unsigned long int N = 100000; std::vector<double> x; std::vector<double> y; x.reserve(N); y.reserve(N); long unsigned int n; for (n=0;n<N;++n) { x[n] = (double)rand()/(double)RAND_MAX; } clock_t t1,t2,d1,d2; t1 = clock(); for (n=0;n<N;++n) { y[n] = T<10>(x[n]); } d1 = clock() - t1; t2 = clock(); for (n=0;n<N;++n) { y[n] = t(10,x[n]); } d2 = clock() - t2; std::cout << "Templates : " << d1 << std::endl; std::cout << "Standard : " << d2 << std::endl; return 0; } /**********************************************************/ When optimized for speed, the output on my computer is Templates : 961 Standard : 1803 which means that the template version is twice as fast as the naive code. Rune
Rune Allnor wrote:

> Below is a test program I wrote to check the relative > performance of templated code and "standard" code for > computing quasi-Chebyshev polynomials ('quasi' because > there are some 1/2^N scaling factors missing). > > /**********************************************************/
[..]
> /**********************************************************/ > > When optimized for speed, the output on my computer is > > Templates : 961 > Standard : 1803 > > which means that the template version is twice as fast as > the naive code.
I tried this on a Linux PC with gcc 3.4.6, with different optimizer flags (none/-O1/-O2/-O3/-O4/-O5). Each version was called 4 times in a row:
>./template-speed
Templates : 2380000 Standard : 2280000
>./template-speed
Templates : 2690000 Standard : 2480000
>./template-speed
Templates : 2330000 Standard : 2370000
>./template-speed
Templates : 2270000 Standard : 2510000
>./template-speed-O1
Templates : 2210000 Standard : 1950000
>./template-speed-O1
Templates : 1600000 Standard : 1910000
>./template-speed-O1
Templates : 1770000 Standard : 2220000
>./template-speed-O1
Templates : 2260000 Standard : 2230000
>./template-speed-O2
Templates : 1670000 Standard : 1410000
>./template-speed-O2
Templates : 1650000 Standard : 1300000
>./template-speed-O2
Templates : 1660000 Standard : 1360000
>./template-speed-O2
Templates : 1660000 Standard : 1690000
>./template-speed-O3
Templates : 990000 Standard : 1660000
>./template-speed-O3
Templates : 650000 Standard : 1490000
>./template-speed-O3
Templates : 890000 Standard : 1550000
>./template-speed-O3
Templates : 630000 Standard : 1210000
>./template-speed-O4
Templates : 890000 Standard : 1470000
>./template-speed-O4
Templates : 870000 Standard : 1430000
>./template-speed-O4
Templates : 670000 Standard : 1540000
>./template-speed-O4
Templates : 890000 Standard : 1300000
>./template-speed-O5
Templates : 890000 Standard : 1550000
>./template-speed-O5
Templates : 650000 Standard : 1540000
>./template-speed-O5
Templates : 660000 Standard : 1260000
>./template-speed-O5
Templates : 890000 Standard : 1650000 bye Andreas -- Andreas H&#2013266172;nnebeck | email: acmh@gmx.de ----- privat ---- | www : http://www.huennebeck-online.de Fax/Anrufbeantworter: 0721/151-284301 GPG-Key: http://www.huennebeck-online.de/public_keys/andreas.asc PGP-Key: http://www.huennebeck-online.de/public_keys/pgp_andreas.asc